We are looking for a Site Reliability Engineer (SRE) with a strong focus on infrastructure reliability, scalability, and automation. This role emphasizes building resilient cloud platforms, improving system availability, and reducing operational toil through automation. You will work closely with platform and engineering teams to ensure high availability, performance, and observability of systems running on AWS and Kubernetes.

Key Responsibilities
Design and operate highly available, scalable infrastructure on AWS
Manage and optimize Kubernetes clusters (EKS or self-managed)
Define and enforce SLIs, SLOs, and error budgets
Build and maintain Infrastructure as Code (IaC) using Terraform
Develop and manage CI/CD pipelines using GitHub Actions
Automate infrastructure and operational workflows using Python
Improve system reliability, latency, and performance
Implement observability solutions (metrics, logs, traces)
Lead incident response, root cause analysis (RCA), and postmortems
Reduce toil through automation and continuous improvement
Ensure security, compliance, and cost efficiency of infrastructure

Required Skills & Qualifications
Strong hands-on experience with AWS (EKS)
Deep understanding of Kubernetes (cluster operations, scaling, networking)
Experience with Terraform for infrastructure provisioning
Proficiency in Docker and container ecosystems
Hands-on experience with GitHub Actions (CI/CD pipelines)
Strong scripting skills in Python for automation
Experience with monitoring tools like Prometheus, Grafana, Splunk
Solid understanding of Linux systems, networking, and distributed systems
Experience with incident management and on-call processes

Core SRE Focus Areas
Reliability Engineering & Production Readiness
Monitoring, Alerting & Observability
Incident Management & Postmortems
Capacity Planning & Performance Tuning
Infrastructure Automation & Self-Healing Systems

Job Title
Site Reliability Engineer