About the Role

As a Senior Site Reliability Engineer at WSO2, you'll be instrumental in both supporting our existing customers with their managed or private cloud deployments and initiating new deployments across leading cloud platforms such as Azure, AWS, and GCP. Your mission will include ensuring the seamless operation, scalability, and security of WSO2 cloud services, alongside automating processes to boost both efficiency and reliability.

Your Key Responsibilities

Deployment Setup and Management
Lead the design and implementation of new cloud deployments, tailoring solutions to meet stakeholder requirements on platforms like Azure, AWS, GCP, and Kubernetes.
Optimize cloud architectures for scalability and cost-effectiveness, adhering to best practices for networking, security, and access controls.
Gain and maintain deep knowledge of cloud infrastructure providers to create robust solutions.
Proactively introduce continuous improvements and cost-optimized solutions to enhance infrastructure adaptability and streamline deployment processes.

Automation and CI/CD
Craft and manage automation scripts and infrastructure as code (IaC) using tools such as Terraform, Ansible, or CloudFormation.
Deploy CI/CD pipelines to streamline software delivery, testing, and deployment processes, ensuring efficient version control and configuration management.

Managed Cloud Support
Ensure the availability of services by configuring system monitors and alerts and attending to critical alerts in a timely manner.
Offer continuous support and maintenance for existing deployments, monitoring system performance and swiftly resolving issues to maintain high availability and reliability.
Implement strategies for performance optimization and failure prevention, conducting thorough root cause analyses to avoid future issues.
Demonstrate strong ownership during critical incident scenarios, ensuring smooth operations under pressure by delivering timely resolutions. Implement effective workarounds and conduct thorough root cause analysis (RCA).

Monitoring and Security
Establish comprehensive monitoring and alerting systems to oversee customer deployments, setting thresholds for incident response.
Conduct regular security assessments and stay abreast of the latest threats and trends to fortify cloud environments against risks.

Collaboration and Knowledge Sharing
Foster a collaborative environment with product developers, operations, and QA teams to enhance workflows and product quality.
Share knowledge and best practices, contributing to the team’s collective expertise through documentation, training, and mentorship.

Qualifications and Skills
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of hands-on experience as a Site Reliability Engineer, managing and improving production systems at scale.
Strong collaboration and leadership skills, with a proven ability to drive cross-functional initiatives and align team efforts.
Expertise in cloud platforms such as Azure, AWS, and GCP.
Expertise in Linux and in virtualization and containerization technologies such as Docker and Kubernetes.
A solid understanding of networking, security principles, and compliance frameworks.
Proficiency in IaC tools (Terraform, CloudFormation), configuration management (Puppet, Chef, Helm), and scripting languages (Python, Bash, PowerShell).
Experience with CI/CD tools (GitHub Actions, Jenkins) and monitoring/logging tools (Prometheus, ELK stack, Splunk).
Exceptional problem-solving, analytical, and troubleshooting skills, coupled with a proactive, customer-centric mindset.
Strong communication skills and the ability to collaborate effectively in a team environment.

Job Title
Senior Site Reliability Engineer