Job Title: Site Reliability Engineer (SRE)

Job Overview
We are looking for a skilled Site Reliability Engineer (SRE) to join our team and help build, scale, and maintain highly reliable and resilient cloud-based systems. The ideal candidate will have a strong foundation in cloud infrastructure, automation, observability, and incident management, with a focus on improving system reliability and performance.

Key Responsibilities
Design, build, and maintain highly available and scalable cloud infrastructure (primarily on Azure).
Implement and manage Infrastructure as Code (IaC) using tools like Terraform, Helm, or Ansible.
Develop and optimize CI/CD pipelines with integrated security and quality checks.
Deploy, manage, and orchestrate containerized applications using Kubernetes and Docker.
Establish and enhance observability practices including monitoring, logging, tracing, and alerting.
Collaborate with development teams to define SLIs and SLOs and to implement effective alerting strategies.
Participate in on-call rotations, respond to production incidents, and perform root cause analysis (RCA).
Continuously improve system reliability, availability, and performance through automation and best practices.
Drive a culture of reliability, scalability, and operational excellence.

Required Qualifications
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
3–8 years of experience as a Site Reliability Engineer or in a similar role.
Hands-on experience with cloud platforms such as Azure (preferred) or AWS.
Strong experience with Infrastructure as Code (Terraform preferred).
Proficiency in scripting languages such as Python, Bash, or PowerShell.
Experience with CI/CD tools and automation pipelines.
Solid understanding of containerization and orchestration (Docker, Kubernetes).
Experience in incident management, on-call support, and root cause analysis.

Preferred Qualifications
Experience with observability tools such as Grafana, Prometheus, or the ELK Stack.
Familiarity with on-call and incident management tools like PagerDuty or Zenduty.
Experience defining and managing SLIs, SLOs, and SLAs.
Knowledge of security best practices in CI/CD and cloud environments.

Key Skills
System Reliability & High Availability
Observability & Monitoring
Incident Response & Troubleshooting
Automation & Infrastructure as Code
Cloud & Container Technologies
Collaboration & Communication

Work Model
Hybrid (Onsite + Remote)