Job Title


Site Reliability Engineer III [T500-24447]


Company : McDonald's Global Office in India


Location : Agra, Uttar Pradesh


Created : 2026-03-19


Job Type : Full Time


Job Description

About McDonald’s:
One of the world’s largest employers with locations in more than 100 countries, McDonald’s Corporation has corporate opportunities in Hyderabad. Our global offices serve as dynamic innovation and operations hubs, designed to expand McDonald’s global talent base and in-house expertise. Our new office in Hyderabad will bring together knowledge across business, technology, analytics, and AI, accelerating our ability to deliver impactful solutions for the business and our customers across the globe.

Job Description: Site Reliability Engineer (SRE) – RealTime CDP

Position Summary:
We are seeking a highly skilled Site Reliability Engineer (SRE) to support the RealTime Customer Data Platform (CDP). This role will ensure platform availability, performance, scalability, and operational excellence across real-time streaming, identity resolution, audience services, and activation APIs. This engineer will work closely with Data Engineering, DevOps, Architecture, and Product partners to maintain a reliable, resilient, and secure global customer data platform. In this role, the SRE will play a critical part in monitoring, incident response, CI/CD optimization, change management, and observability best practices, aligned to SRE workflows and operational patterns used across teams such as C360 and to global SRE standards.

Primary Responsibilities:

Reliability, Monitoring & Incident Management
- Own reliability across RealTime CDP components, focusing on availability, latency, throughput, and error rate SLIs, consistent with internal SRE standards from the SRE Framework.
- Configure and maintain monitoring, alerting, dashboards, and service health indicators using industry tools (e.g., Grafana, Prometheus, New Relic, OpenTelemetry, ELK).
- Participate in daily production system calls, triage issues, and serve as liaison across domain teams, following expectations documented in the C360 SRE Guide.
- Lead or support P1/P2 production incidents, joining production bridges and coordinating with engineering teams until resolution, as described in SRE Assignments.
- Drive root cause analysis (RCA), error budget burn analysis, and long-term corrective actions.

Observability & Performance Engineering
- Implement and maintain end-to-end observability, including logs, traces, metrics, synthetic checks, and API health probes.
- Develop SLOs, SLIs, and error budgets, aligned with internal standards from the SRE Framework (error budget management, burn rate monitoring).
- Conduct performance tuning, load/capacity testing, and failure mode analysis of real-time streaming systems.

Change, Deployment & Automation
- Review and validate deployments via CI/CD pipelines, ensuring change safety consistent with SRE expectations defined in C360 processes (e.g., CI/CD alignment and minimizing human error).
- Implement and improve canary releases, staged rollouts, and rollback strategies for real-time services.
- Identify gaps in deployment processes and automate manual tasks to reduce operational overhead.

Platform Operations (GCP Focused)
- Operate and support cloud infrastructure across GCP services such as Pub/Sub, Dataflow, Cloud Run, BigQuery, GKE, Memorystore, and Cloud Logging/Monitoring.
- Ensure platform readiness, high availability, cost optimization, and compliance for regional deployments.
- Support real-time streaming pipelines, API services, audience engines, and identity systems.

Cross Team Collaboration
- Work closely with Data Engineering squads to ensure new features follow operational and reliability standards.
- Collaborate with Data Governance, Architecture, Security, and Global Support teams to align with enterprise policies.
- Participate in sprint planning, refinement, and story acceptance criteria (per SRE expectations).
- Provide SRE requirements for new features and ensure monitoring/alerting is designed into all releases.

Who We’re Looking For:
An operationally excellent, automation-driven SRE who elevates reliability, reduces toil, and strengthens observability for a mission-critical RealTime CDP powering global personalization and engagement.
- Strong experience with a major cloud platform; GCP is highly preferred, especially Pub/Sub, Dataflow/Beam, Cloud Run, GKE, BigQuery, and Cloud Monitoring.
- Hands-on expertise with Kafka, Flink, Spark Streaming, or similar real-time frameworks.
- Proficiency with Python, Bash, YAML, and operational scripting.
- Experience configuring observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry.
- Strong knowledge of CI/CD pipelines, GitOps, deployment automation, and release engineering.
- Familiarity with SLOs, SLIs, error budgets, incident management protocols, and RCA processes.
- Strong debugging skills across distributed systems, streaming pipelines, APIs, and containerized services.
- Experience with capacity planning, performance tuning, load testing, and disaster recovery design.
- Ability to assess recurring product issues and drive long-term fixes, consistent with SRE role descriptions.
- 5–10 years of experience in SRE, DevOps, Platform Engineering, or Reliability Engineering.
- 3+ years working with GCP or another major cloud provider.
- Experience supporting high-volume, low-latency real-time systems.
- Experience participating in or leading production incident calls, as described in C360 SRE operations.
- Experience with CDPs or MarTech ecosystems (e.g., mParticle, Braze, Adobe, Tealium).
- Familiarity with customer identity systems, audience engines, or real-time activation systems.
- Experience supporting global, multi-region cloud deployments.
- Understanding of privacy, consent, and enterprise data compliance patterns.