Role Summary
We are looking for a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud-native infrastructure. The ideal candidate will bring strong hands-on experience in AWS, Kubernetes, Docker, CI/CD pipelines, monitoring, and automation using Python, and will work closely with development and operations teams to build resilient, highly available systems.

Key Responsibilities
- Design, deploy, and maintain highly available and scalable systems on AWS
- Manage and operate containerized applications using Docker and Kubernetes (EKS)
- Build, maintain, and optimize CI/CD pipelines using Jenkins
- Automate operational workflows and routine tasks using Python scripting
- Implement and manage monitoring, alerting, and observability using Grafana and Prometheus
- Ensure system reliability, performance, uptime, and scalability
- Participate in incident response, root cause analysis (RCA), and post-incident reviews
- Implement Infrastructure as Code (IaC) and automation best practices
- Collaborate with development teams to improve system architecture and deployment strategies
- Enforce security, compliance, and operational best practices in cloud environments
- Continuously improve system efficiency through automation, tooling, and process optimization

Required Skills & Qualifications
- Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, EKS, etc.)
- Solid experience with Kubernetes (EKS) and Docker
- Proficiency in Python scripting for automation and monitoring
- Experience designing and managing CI/CD pipelines using Jenkins
- Strong understanding of DevOps principles and CI/CD best practices
- Hands-on experience with Grafana and Prometheus for monitoring and alerting
- Strong knowledge of Linux systems and networking fundamentals
- Experience with Git or other version control systems
- Understanding of microservices architecture

Good to Have
- Experience with Terraform or CloudFormation
- Knowledge of Helm, ArgoCD, or similar deployment tools
- Familiarity with log management tools (ELK / EFK stack)
- Understanding of SRE practices such as SLIs, SLOs, SLAs, and error budgets
- AWS and/or Kubernetes certifications (CKA / CKAD)

Job Title
Site Reliability Engineer