Staff Site Reliability Engineer (Staff SRE)

At Walt Disney Animation Studios, world-class filmmakers, artists, and technical collaborators create the magic of animation. Bring your unique talents, passion, and ideas to our team and prepare to play in a creative, artist-friendly environment.

We are seeking a Staff SRE with expertise in systems administration on Linux platforms, software development (Python, Go, Java, Node), CI pipeline tools (Jenkins), Git source management, cloud hosting (AWS, GCP, Azure), container computing (Docker, OCI), and web technologies. The ideal candidate will enjoy the diversity and challenges of working at various levels in the foundational deployment stack, from configuration management to developing CI/CD infrastructure and processes.

This role resides within the Platform and Infrastructure team at Walt Disney Animation Studios (WDAS). We build the tools and manage the infrastructure that artists use daily to create our celebrated animated content. The SRE team focuses on optimizing service deployments and improving the availability, latency, performance, efficiency, and observability of systems at WDAS. Projects aim for simple, performant solutions to complex problems using Agile and DevOps methodologies.

Critical to success in this role is an aptitude for working collaboratively with a technical team. You will develop and drive requirements and strategies while supporting services and core services infrastructure. Our studio thrives on a variety of technical backgrounds and experiences, so we encourage applicants even if their experience is not specified below.

Responsibilities

As a Staff SRE, you will translate ideas into tangible products that shape experiences by focusing on automation, resiliency, efficiency, stability, security, performance, capacity management, and documentation. You will serve as a subject-matter expert in multiple areas and be the go-to individual for SRE principles and best practices.
You will continuously improve reliability aspects for our services, with a focus on SLIs and SLOs, raising reliability for large-scale user-facing and internal services.
You will maintain a strong understanding of stakeholder workflows and translate targeted solutions into end-to-end architectural designs.
Support on-premises and cloud deployments using infrastructure-as-code, self-healing, and security automation patterns.
Deploy and manage deployments across environments.
Develop telemetry, alerts, and automated responses to reduce MTTR.
Collaborate and provide technical excellence within and across teams.
Consult on best practices and develop tools to enable smooth adoption of service reliability practices and methods.
Identify improvement areas in reliability, efficiency, and operations.
Build tools to help the SRE team quickly pinpoint, isolate, and resolve infrastructure, platform, and application issues.
Refine monitoring processes, configurations, and thresholds.
Promote sustainable incident response and blameless postmortems.
Develop runbooks and tools to streamline processes and shorten problem resolution time.
Write code that improves scalability, performance, maintainability, and security.
Maintain alert configurations and documentation as needed.
Improve CI/CD processes to increase release cadence and success.
Apply Chaos Engineering principles and methodologies to test under real-world conditions.
Mentor SREs, sysadmins, and systems engineers in technical and non-technical SRE responsibilities.

Required Education

BS in Computer Science, Computer Engineering, Electrical Engineering, or a related field.

Key Qualifications

7+ years of experience in SRE, DevOps, technical operations, systems engineering, software engineering, or a related discipline.
Proficient, collaborative, and experienced in building reliable, scalable enterprise systems.
Excellent communication skills, both verbal and written.
Passionate and curious about leveraging technology while continuously learning.
Skilled in container orchestration (Docker, Kubernetes, Rancher, AWS ECS/EKS) in production environments.
Experience with configuration management and infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, Puppet).
Comfortable with one or more programming languages (Python, Java, Scala, Go, Rust, Ruby, or similar).
Skilled in Cloud/PaaS/SaaS environments (AWS, Azure, Google Cloud).
Hands-on experience using source control (Git, GitHub) and feature branching strategies.
Experience with CI tools (Jenkins, GitLab CI/CD, AWS CodeBuild, CodeDeploy, Spinnaker).
Knowledge of best practices and IT operations for always-up, always-available services.
Expertise in scalable testing, automation, continuous integration frameworks, and best practices.
Experience in SDLC, distributed systems, networking, hardware, logistics, operations, or capacity planning.
UNIX/Linux administration, troubleshooting, performance tuning, and security.
DevOps and/or SRE experience.
Experience with monitoring and observability tooling (Datadog, Prometheus, Grafana).
Experience automating infrastructure, deployment, and testing using CloudFormation, Ansible, or Terraform.
Experience with Service Level Objectives and Error Budgets.
Understanding of Chaos Engineering principles and methodologies.

Bonus Qualifications

Expertise in web server administration.

The Walt Disney Company is an Equal Opportunity Employer.

The hiring range for this position in British Columbia, Canada is C$124,200 to C$166,700 CAD per year. The base pay offered will take into account internal equity and may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience, among other factors. A full range of medical, financial, and/or other variable pay or benefits may be offered at the level and position offered.