Job Description
We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a strong background in DevOps and SRE practices, with at least 5 years of hands-on experience in designing, implementing, and maintaining scalable, reliable, and secure infrastructure for cloud-native applications.

You will report directly to the Sr Software Engineering Manager and work out of our Atlanta, GA location on a hybrid work schedule. For the first 90 days, New Hires must be prepared to work 100% onsite M-F.

Responsibilities
Design, build, and maintain scalable infrastructure on cloud platforms (GCP, AWS, Azure).
Develop and implement CI/CD pipelines for automated deployment and testing.
Monitor, troubleshoot, and optimize system performance, reliability, and availability.
Lead incident response, root cause analysis, and post-mortem reviews.
Implement and manage infrastructure as code (IaC) using tools such as Terraform, Ansible, or CloudFormation.
Develop and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, ELK, Datadog, etc.
Collaborate with development teams to ensure best practices in application reliability, scalability, and security.
Automate operational tasks and improve system efficiency through scripting and tooling.
Mentor and guide junior engineers in SRE and DevOps practices.
Ensure compliance with security standards and participate in audits.

Qualifications
YOU MUST HAVE
Bachelor's or Master's degree in Computer Science, Engineering, or related field.
5+ years of software engineering experience, with 3+ years in ML Ops, agentic AI, Databricks, data lake, or cloud platforms.
Minimum 5 years of experience in DevOps, SRE, or related roles.
Strong expertise in cloud platforms (GCP, AWS, Azure).
Proficient
Job Title
Sr Advanced Software Engineer - (DevOps, SRE & AI)