Location: Hyderabad-L&T Metro-flr 1-9,11&12

Job Title: Jr. Site Reliability Engineer (SRE) – Azure Storage

Role Overview

We are seeking a Site Reliability Engineer (SRE) to support Azure Storage deployments and operations across public, sovereign, and pre‑production environments. The role focuses on deployment reliability, incident response, infrastructure health, automation, and data‑driven operational insights.

Key Responsibilities

Reliability, Deployments & Operations
- Execute Azure Storage (Classic / XPF / Direct Drive) tenant and infrastructure deployments across public, sovereign, and pre‑production environments.
- Monitor and maintain server uptime, tenant stability, and overall environment health.
- Track and reduce offline capacity and long‑running (long‑tail) deployments to improve deployment completion times.
- Manage end‑to‑end release tracking for storage components and ensure deployment compliance.

Incident Management & Troubleshooting
- Acknowledge, triage, and resolve deployment‑related incidents and operational alerts.
- Apply technical mitigations (including node recovery) to unblock critical deployments.
- Lead Severity‑2 bridge calls, coordinating with engineering, partner, and vendor teams through resolution.
- Manually create and manage Incident Communication Management (ICM) records when required.

Root Cause Analysis & Stability Improvements
- Perform root cause analysis (RCA) for hardware, infrastructure, and release‑related failures.
- Analyze recurring deployment faults and failure trends; file defects with actionable remediation details.
- Investigate and correct incorrect fault‑bucket assignments to improve diagnostic accuracy.
- Collect and analyze hardware logs; deliver structured reports to engineering and vendor teams.

Process, Automation & Documentation
- Identify and drive automation opportunities for repetitive or high‑risk operational tasks.
- Develop, maintain, and publish SOPs, TSGs, troubleshooting playbooks, and KB articles.
- Improve workflows through automation, procedural updates, and process optimizations.

Reporting & Stakeholder Communication
- Publish daily operational status reports and defect summaries.
- Deliver weekly dashboards and quality reports covering deployment health, reliability metrics, and SLO adherence.
- Provide regular status updates to stakeholders and participate in daily syncs with on‑call teams.

Required Skills & Experience
- Strong experience in Azure cloud operations, SRE, or large‑scale infrastructure support.
- Hands‑on experience with incident triage, RCA, and production support.
- Solid understanding of storage systems, hardware failures, and deployment pipelines.
- Experience working in 24x7 on‑call / shift‑based operational environments.

Good‑to‑Have / Preferred Skills
- Hyper‑V: Virtualization troubleshooting and host‑level diagnostics.
- Azure DevOps: CI/CD pipelines, release tracking, automation, and operational workflows.
- Kusto / Azure Data Explorer (ADX):
  - Writing KQL queries for operational insights
  - Building dashboards for deployment health, defects, capacity, and reliability metrics
- Experience with automation scripting (PowerShell, Python, or similar).

Work Model
Hybrid: 3 days work from office, 2 days work from home.

Note:
- Resources are multi-parked, so allocation will be on an FCFS basis only.
- CI blocking is valid for 3 days; if no CI feedback is received within 3 days, the resource will automatically be made available for other requirements.
- If there is no response on proposed profiles within 3 days, the RR will be marked on hold.
Job Title
Site Reliability Engineer