Our Site Reliability Engineering (SRE) team is responsible for architecting, engineering, operating, and securing the cutting-edge, multi-cloud iEnergy SaaS platform used to deliver our own and our partners' industry-leading cloud solutions. Contribute to an environment where science, engineering, large-scale cloud computing, and modern DevSecOps intersect. As a member of the SRE team, you will be entrusted with big challenges, working closely with development engineering in the DevSecOps organization, regional technical engineers, Information Technology engineers, and architects. Bring your expertise and curiosity to an environment offering unparalleled depth and breadth of exposure to today's most relevant cloud technologies, DevSecOps practices, and petrotechnical science.

Overall responsibilities
SRE is a critical and visible role, central to running multi-tiered cloud infrastructure, applications, and workloads across public, private, and hybrid cloud environments. SREs are required to have in-depth knowledge of cloud technologies. SREs collaborate with development engineers, architects, technical leads, and IT engineers to ensure uptime for cloud applications. SREs are expected to build and use tooling, automation, scripting, and the latest best practices to ensure services remain up and running, performant, resilient, and secure.

Responsibilities:
Deploy and configure new Public Cloud tenants via automation (Jenkins, Terraform, Ansible, GitLab CI, Atlantis, AWX/Ansible Tower)
Understand and troubleshoot Kubernetes and Docker
Provide day-to-day support to existing customers and ensure that the team consistently exceeds their expectations
Develop system health metrics for both real-time monitoring and usability recommendations
Enforce best practices for security and reliability
Participate in security initiatives, including access control and vulnerability testing
Maintain documentation of the infrastructure and suggest areas for improvement
Assist in maintaining platform availability to defined levels
Troubleshoot and address infrastructure issues as necessary
Collaborate with the Cloud Automation team on shared objectives for the future desired state
Assist in the validation of new automations in Azure and AWS if the need arises
Investigate new technologies and methodologies to better support the product
Coach and mentor other team members as needed
Participate in an on-call rotation as required

Requirements:
A bachelor's degree in a technical field and 12+ years of professional work experience across IT and Public Cloud operations or customer-oriented environments (at least 4 years of cloud experience)
Experience with Public Cloud PaaS/SaaS solutions such as App Services, App Insights, Storage Accounts, Resource Groups, monitoring tools, EC2, S3, Route 53, and IAM
Good debugging and troubleshooting skills in distributed systems
Ability to understand and develop CI/CD pipelines for automations
Excellent interpersonal and communication skills
General knowledge of designing and implementing GUI-based web pages/dashboards that gather and present information to account, sales, and other stakeholder teams
Strong system administration skills: Windows and Linux. Ability to troubleshoot and solve problems.
Strong coding skills: Jenkins, Ansible, Terraform, VBA, Bash, PowerShell, JavaScript
Strong container skills: Docker and Kubernetes administration. Ability to troubleshoot and solve problems.
Strong ID management skills: Okta, VIDM, Horizon View, federation, user/group management, directory integration, MFA, SSO, monitoring, automation workflows, profile management, application integration. Ability to troubleshoot and solve problems in all specified skills.
Excellent Git skills
Strong experience with AWS/Azure resources and associated managed services

Job Title
Site Reliability Engineer