The Safe Enablement team, a subset of the AI Platform Team, has a mission to build site reliability practices and guardrails into the platforms the AI Platform Team builds and the Analytics Communities use. We accomplish this by collaborating with Enterprise partners to identify requirements and best practices, by building self-service tools that equip DS&A LOBs to build resilience, and by building automation that can measure and enforce guardrails before assets hit production.

The Hartford’s Safe Enablement team is seeking a Site Reliability Engineer to support resilience planning, execution, and postmortem activities. This role will work closely with the AI Platform Team, Data Science and AI development teams, MLEs, Business Teams, and Enterprise governors to understand and implement key resilience, security, compliance, reliability, and disaster recovery capabilities.

Key Responsibilities
- Partner with Enterprise governors to ascertain key reliability, security, and resilience requirements set by The Hartford, and bring those requirements into the Platform team for implementation.
- Turn resilience capabilities into reusable patterns: tools, services, and products that customers can use to ensure users of the Platform build fault-tolerant systems.
- Develop governing frameworks to ensure each release is compliant with the standards we expect.
- Liaise with key business and technical customers to understand predictive applications and their infrastructure. Through this consultation, you will work with them to build resilience and reliability capabilities into their applications.
- Drive IM, cloud ops, and RE efforts across the platform by applying industry best practices and maturing the existing problem management lifecycle: building standards and contributing to runbooks, standard operating procedures, and the incident management lifecycle.
- Conduct performance engineering of deployed analytics and AI solutions across the portfolio to identify enhancement opportunities.

Required Skills & Experience
- 4+ years of experience programming in Python to build automation tools, operational scripts, and platform support capabilities, including infrastructure and reliability automation.
- 4+ years of experience using Infrastructure as Code to provision and manage cloud environments, including Terraform and/or CloudFormation, with a focus on repeatability, security, and scalability.
- 2–3 years of experience deploying and operating systems on public cloud platforms such as AWS and/or Google Cloud Platform, including familiarity with serverless architectures, multi-region deployments, and recoverability strategies.
- 4+ years of experience designing and operationalizing resilience, reliability, and disaster recovery capabilities for distributed systems and ML/AI platforms, including performance engineering and fault-tolerant system design.
- 2+ years of experience building and maintaining CI/CD pipelines using tools such as GitHub and Jenkins, including embedding security checks, compliance gates, and automated validation into deployment workflows.
- 4+ years of experience applying core reliability engineering concepts, including authoring runbooks, operational guides, and automation to support resilient platform operations.
- 3+ years of experience designing observability solutions, including logging, monitoring, and alerting using tools such as Splunk, with dashboards and metrics that surface service health, SLO/SLA adherence, and early-warning signals for ML and data workloads.
- 4+ years of hands-on experience with incident and problem management practices, including ITIL-based processes, postmortems, and blameless root cause analysis, as well as disaster recovery planning, failover testing, and resilience frameworks such as FMEA.
- Foundational knowledge of networking fundamentals and operations architecture to support IT service management (ITSM) automation and distributed system reliability.
- Familiarity with relational databases such as Snowflake or other RDBMS platforms, with an understanding of data reliability, availability, and consistency requirements in analytics and ML environments.

Nice to Have
- Knowledge of NIST 800-171.
- Working in Agile / a consultative mindset – we function as an agile team, which places responsibility and accountability on developers to take ownership of their work. We agree on acceptance criteria as a goal and work to find the best way to implement a solution that meets those criteria. Our stories seldom define precisely how to build a solution, which gives engineers space to design together and implement the best possible solutions.
- A mind for resilience – not necessarily a security engineer, but someone keen on establishing best practices, following those practices, and fostering their adoption at all levels.
- Exceptional communication and collaboration skills – we never work alone, and often work with customers across DS&A and the Enterprise.

What We Offer
- Collaborative work environment with global teams.
- Competitive salary and comprehensive benefits.
- Continuous learning and professional development opportunities.

Job Title
Staff Site Reliability Engineer