As a Cloud SRE, your role will combine software engineering and systems engineering disciplines to ensure that software systems are available, scalable, and maintainable. Within the Cloud SRE team, you will work at a unique intersection of SRE and Software Development, building and enabling adoption of our global monitoring capabilities.

Description:
Enterprise Technology plays a critical part in shaping the future of mobility. If you’re looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people’s lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create vehicles that are as smart as you are.

This company is seeking an experienced Site Reliability Engineer (SRE) to join our team and lead the development, enhancement, and extension of our global monitoring and observability platform.

Our Site Reliability Engineering (SRE) team builds applications and tools and provides best practices to ensure the uptime of our critical cloud services.

Responsibilities:
- Write, configure, and deploy code that improves service reliability for existing or new systems; set standards for others with respect to code quality.
- Provide helpful and actionable feedback and review for code or production changes.
- Drive repair and optimization of complex systems with consideration of a wide range of contributing factors.
- Lead debugging, troubleshooting, and analysis of service architecture and design.
- Participate in on-call rotation.
- Write documentation: design, system analysis, runbooks, playbooks.
- Provide design feedback and uplevel the design skills of others.
- Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry.
- Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
- Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand.
- Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
- Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
- Troubleshoot and resolve issues in our dev, test, and production environments.
- Participate in postmortem analysis and create preventative measures for future incidents.

Qualifications:
- Bachelor’s degree in Computer Science, Engineering, Mathematics or equivalent experience.
- 3+ years of experience as an SRE, DevOps Engineer, Software Engineer or similar role.
- Strong experience with Golang development; familiarity with Terraform Provider development is desired.
- Proficient with monitoring and observability tools, particularly OpenTelemetry or similar tools.
- Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.
- Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.
- Experience with relational and document databases.
- Ability to debug, optimize code, and automate routine tasks.
- Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.
- Excellent verbal and written communication skills.

Same Posting Description for Internal and External Candidates
Job Title
Site Reliability Engineer - New [T500-17559]