Role : GenAI Site Reliability Engineer
Level : Senior Associate
Tower : AI Operations & Platform Support (AI Managed Services)
Experience : 5-10 years
Key Skills : Monitoring & Alerting; Incident Investigation; Troubleshooting; Automation/Scripting; Cloud Operations; GenAI Platform Operations
Educational Qualification : Bachelor’s degree in Computer Science/IT or relevant field (Master’s or relevant certifications preferred)
Work Location : Bangalore / Hyderabad

Job Description
As an AC Senior Associate GenAI Site Reliability Engineer, you will operate and improve monitoring for in-scope GenAI services and AI workloads, investigate incidents, and implement reliability improvements. You will build dashboards, tune alerts, document runbooks, and automate repetitive operational tasks to improve stability and reduce time to restore.

Key Responsibilities:
1. Monitoring, Alerting & Service Health: Build and maintain dashboards and alerts for availability, latency, error rates, and overall service health for in-scope GenAI services. Tune thresholds and alert routing to reduce noise and improve actionable detection, improving MTTA and MTTR.
2. Incident Triage, Investigation & Restoration: Triage incidents, gather evidence, and perform structured troubleshooting using logs/metrics/traces and documented runbooks. Execute restoration steps and coordinate escalations to platform owners, engineering teams, or vendors for complex issues. Provide clear technical updates during live events and document resolution details for future reference and trend analysis.
3. Problem Prevention & Reliability Improvements: Contribute to root-cause investigations and implement corrective actions (monitoring improvements, configuration changes, resilience enhancements). Identify recurring failure modes and propose fixes that reduce repeat incidents and improve overall service stability. Support verification of corrective actions by monitoring outcomes and validating that improvements reduce incident recurrence.
4. Performance Troubleshooting Support: Assist with latency and error investigations by gathering diagnostics, isolating contributing factors, and proposing mitigations. Partner with engineering teams to validate fixes and monitor post-deployment impact on service health and performance.
5. Automation & Scripting: Automate diagnostics and routine operational tasks to reduce manual effort and improve consistency (scripts, repeatable checks, standardized steps). Maintain and document operational scripts and ensure they are usable and supportable by the broader team.
6. Documentation & Knowledge Management: Maintain runbooks, troubleshooting guides, and knowledge articles for frequent scenarios and standard operating procedures. Document known issues, standard resolutions, and escalation paths to improve first-time fix rate and onboarding efficiency.
7. Change Readiness & Post-Change Validation: Support operational readiness for changes by validating monitoring readiness, runbook updates, and post-change verification steps. Execute post-change checks and report regressions or unexpected behavior promptly to ensure rapid remediation.
8. Continuous Improvement & Service Reporting Inputs: Identify operational pain points and recommend improvements to monitoring, alerting, runbooks, and support workflows. Provide inputs to service reporting on incident trends, recurring issues, and improvement opportunities related to GenAI reliability.
9. Quality, Controls & Operational Discipline: Follow defined operational processes (incident, request, change) and maintain high-quality ticket hygiene and documentation discipline. Comply with security and access controls for supported tools and environments; proactively raise operational risks or control gaps for mitigation.
10. Collaboration & Team Support: Collaborate with peers and leads to coordinate workload, share knowledge, and support consistent execution standards across the pod. Support onboarding and knowledge transfer by maintaining clear documentation and participating in team enablement activities.

Required Skills:
Hands-on experience supporting production services in a cloud environment, including monitoring, troubleshooting, and incident response.
Experience building dashboards and alerts and working with logs/metrics/traces to diagnose issues and reduce time to restore.
Strong analytical problem-solving skills and ability to implement reliability improvements and corrective actions in a controlled manner.
Experience working within ITIL-aligned processes (incident, request, change) and maintaining runbooks/knowledge articles.
Preferred: experience with ITSM and observability tooling (e.g., client ITSM and monitoring tools; ServiceNow, CloudWatch, Datadog, Splunk, New Relic).
Familiarity with GenAI services (AWS Bedrock, OpenAI/ChatGPT Enterprise) is desirable.
AWS certifications are highly preferred.

Managed Services - AI Services
At PwC, we relentlessly focus on working with our clients to bring the power of technology and humans together and create simple yet powerful solutions. We imagine a day when our clients can simply focus on their business, knowing that they have a trusted partner for their IT needs. Every day, we are motivated and passionate about making our clients better.

Within our Managed Services platform, PwC delivers integrated services and solutions that are grounded in deep industry experience and powered by the talent that you would expect from the PwC brand. The PwC Managed Services platform delivers scalable solutions that add greater value to our clients’ enterprises through technology and human-enabled experiences. Our team of highly skilled and trained global professionals, combined with the latest advancements in technology and process, allows us to provide effective and efficient outcomes.
With PwC’s Managed Services, our clients can focus on accelerating their priorities, including optimizing operations and accelerating outcomes. PwC brings a consultative-first approach to operations, leveraging our deep industry insights, world-class talent, and assets to enable transformational journeys that drive sustained client outcomes. Our clients need flexible access to world-class business and technology capabilities that keep pace with today’s dynamic business environment.

Within our global Managed Services platform, we provide AI Managed Services, where we focus on the evolution of our clients’ AI portfolio. Our focus is to empower our clients to navigate and capture the value of their application portfolio while cost-effectively operating and protecting their solutions. We do this so that our clients can focus on what matters most to their business: accelerating dynamic, efficient and cost-effective growth.

For our AI Managed Services team, we are looking for candidates who thrive in a fast-paced work environment and can work on a mix of critical Application Evolution Service offerings and engagements, including help desk support, enhancement and optimization work, and strategic roadmap and advisory-level work. It will also be critical to lend experience and effort in helping win and support customer engagements from both a technical and a relationship perspective.
Job Title : Gen AI Site Reliability Engineer (SRE)