Job Title : Principal SRE


Company : Hydrolix


Location : Mumbai city, Maharashtra


Created : 2026-02-11


Job Type : Full Time


Job Description

We are looking for a Principal Site Reliability Engineer to join our dynamic Services team. In this role, you will contribute to the reliability and scalability of our cutting-edge platform, ensuring exceptional solutions tailored to our customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.

Key Responsibilities:

Reliability Engineering: Design and build automated systems that ensure the reliability and scalability of our Kubernetes clusters and Hydrolix deployments across multiple cloud platforms, eliminating manual operational tasks.

Automation and Efficiency: Identify, quantify, and systematically eliminate repetitive manual work (toil) through automation and improved tooling, freeing the team to focus on high-value work.

Observability Infrastructure: Build and enhance comprehensive observability systems that provide deep visibility into system behavior, enable debugging and troubleshooting, and support data-driven reliability decisions.

CI/CD and Deployment Automation: Design and build robust CI/CD pipelines and deployment automation that enable safe, frequent releases with minimal human intervention.

Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.

Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.

Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.

Collaboration and Customer Engagement:

Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.

Knowledge Sharing: Document systems, create runbooks, and share knowledge across the organization to build collective expertise in reliability engineering.

Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.

Reliability Systems: Build and maintain centralized reliability platforms, tools, and services that empower all engineering teams to operate their systems effectively.

Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support and continuous improvement of our reliability posture.

Customer-Facing Reliability: Work with customers to understand reliability requirements and ensure our platform meets their operational needs.

Qualifications and Skills:

SRE Expertise
- A minimum of 10 years of proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, supporting large-scale, complex distributed systems in production.
- Demonstrated ability to operate at a principal level by setting reliability direction, defining standards, and influencing system design across multiple teams.

Architecture, Performance & Scalability
- Deep experience designing and evolving system architectures with reliability, scalability, and operability as first-class concerns.
- In-depth experience in application and infrastructure performance tuning and scaling to handle heavy workloads under varying traffic patterns and failure scenarios.
- Ability to identify systemic bottlenecks, capacity risks, and inefficiencies, and drive long-term architectural improvements.

Automation, Platform & Infrastructure Engineering
- Exceptional track record of eliminating toil through automation, including building internal platforms or frameworks that enable safe, scalable self-service.
- In-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Pulumi, and Ansible for provisioning and managing infrastructure consistently across environments.

Observability & Reliability Engineering
- Deep expertise in observability tools and practices, with the ability to design end-to-end monitoring strategies aligned with business outcomes.
- Strong understanding of core reliability concepts, including SLIs, SLOs, SLAs, error budgets, golden signals, and quality gates.
- Hands-on experience with distributed tracing, synthetic monitoring, end-user monitoring, performance testing, and chaos engineering.
- Proven experience driving blameless postmortems and ensuring learnings result in measurable reliability improvements.

Kubernetes & Distributed Systems
- Deep understanding of Kubernetes architecture, operations, failure modes, and ecosystem tooling.
- Experience designing and operating multi-cluster and/or multi-region Kubernetes platforms at scale.

Cloud & Multi-Cloud Expertise
- Demonstrated proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode), with experience building cloud-native systems.
- Familiarity with multi-cloud or hybrid architectures and the operational trade-offs involved.

Networking, Security & Traffic Management
- Experience with network load balancing, traffic management, and capacity planning at scale.
- Strong understanding of security technology stacks, Transport Layer Security (TLS), certificate management, and standard networking protocols and configurations.

Data & Storage Systems
- Experience working with SQL databases; familiarity with PostgreSQL is a plus.
- Ability to reason about performance, availability, and scaling characteristics of data-intensive systems.

Programming & Systems Engineering
- Strong programming ability in Go, Python, or Rust, with a proven ability to build and maintain production-quality tools, services, and automation.
- Comfortable reviewing, shaping, and influencing code across multiple teams and services.

Linux & Infrastructure Fundamentals
- Deep experience with Linux systems, including performance tuning, capacity planning, and low-level system troubleshooting.

Incident Management & Operational Excellence
- Extensive experience leading high-severity incidents, managing cross-team response, and driving post-incident reviews.
- Ability to translate incident learnings into systemic fixes, architectural changes, and improved operational standards.

We look forward to seeing how you can make an impact at Hydrolix.