Job Description

RESPONSIBILITIES
- Operate and optimize Kubernetes-based infrastructure using Helm/Kustomize for deployment and configuration management.
- Build and maintain CI/CD pipelines for infrastructure and application deployments.
- Manage and monitor cloud infrastructure on AWS (EKS, EC2, S3, IAM, VPC, etc.) and on-premises infrastructure.
- Ensure observability through logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, CloudWatch, Datadog).
- Implement and enforce security best practices across infrastructure components.
- Participate in on-call rotations, incident response, and root cause analysis.
- Support scaling of systems to meet demand while maintaining reliability.
- Collaborate with engineering and security teams on architecture and deployment strategies.
- Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.

MUST HAVE SKILLS
- 3-6+ years of hands-on experience in SRE roles
- 2-4+ years of managing production Kubernetes environments
- Currently operating production EKS clusters (hands-on, not observational)
- Deep expertise in Kubernetes (EKS or self-managed) and Helm
- Strong understanding of networking fundamentals: TCP/IP, DNS, VPNs, firewalls, load balancing
- Practical experience with AWS services: EKS, EC2, IAM, S3, CloudWatch, VPC
- Solid exposure to containerization (Docker) and CI/CD pipelines (e.g., Bitbucket Pipelines, GitHub Actions, ArgoCD, Flux CD)
- Proven experience handling production systems, on-call rotations, and real-time incident response
- Proficiency in at least one programming language (Python or Go preferred)
- Clear understanding of the Software Development Life Cycle (SDLC)
- Strong automation mindset with a bias toward eliminating manual toil
- Ability to build and maintain Grafana dashboards using PromQL (or equivalent)
- Strong grasp of SRE principles: SLIs, SLOs, error budgets, incident and post-incident management

NICE TO HAVE
- Experience in regulated industries (healthcare, fintech).
- Experience with incident management and disaster recovery.

QUALIFICATIONS/EXPERIENCE
- Minimum of 3 years of experience, with 2+ years of SRE experience.
- BTech/BE/BS or MTech/MCA/ME/MS
- 2+ years of work experience with Amazon Web Services (AWS)
- 2+ years of work experience with Kubernetes
- 2+ years of work experience with Site Reliability Engineering
- Working in a hybrid setting

WHAT DAY TO DAY LOOKS LIKE
- Monitoring Service-Level Indicators (SLIs)
- Setting Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs)
- Responding to Incidents
- Writing Postmortems
- Automating System Tasks
- Cross-Department Collaboration
- Building Software for DevOps, SRE, and Support Teams
- Fixing Support Escalation Issues
- Optimizing On-Call Rotations and Processes
- Documenting "Tribal" Knowledge
- Conducting Post-Incident Reviews

Job Title

Company : Tricog Health

Location : Bengaluru, Karnataka

Created : 2026-02-24

Job Type : Full Time