We are seeking a Senior Site Reliability Engineer to lead reliability efforts across our application stack, focusing on high availability, performance, and scalability. This role will own the health and uptime of our mission-critical application, cloud infrastructure, database systems, and monitoring infrastructure.

About Us

At BQE, our mission is to transform the operational landscape of professional services firms, empowering them to achieve more and serve their customers better. These firms play a crucial role in building infrastructure that significantly impacts global progress. BQE CORE serves as the operational backbone for these firms, providing an all-in-one SaaS solution. Our platform enables them to manage projects efficiently, improve budget tracking and profitability, and streamline processes through automation. With a robust customer base, we are on a trajectory of continuous growth, constantly innovating to meet the evolving needs of our customers and the industries they influence.
Why Join Us
- Work with a modern tech stack in a high-impact reliability role.
- Be a key part of our CloudOps and App Reliability strategy.
- Join a collaborative and supportive engineering culture.

Responsibilities:
- Ensure application uptime, performance, and scalability.
- Own incident management, including on-call rotations, root cause analysis, and incident reviews.
- Manage and monitor MS SQL Server clusters and high-availability configurations.
- Set up and improve monitoring, alerting, and observability using New Relic, Logz.io, CloudWatch, and other tools.
- Proactively identify system bottlenecks and improve system reliability and automation.
- Define and improve SLOs/SLAs across services.
- Drive disaster recovery testing and availability simulations.
- Collaborate with CloudOps and DevOps on infrastructure automation and enhancements.
- Work with Jira and Jira Service Management (JSM) to manage operational tasks, incidents, and changes.

Qualifications & Experience:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 5-8 years of experience in Site Reliability Engineering, CloudOps, DevOps, or related roles.

Must Have Skills:
- Certifications in AWS, Microsoft, Windows, SQL Server, or SRE disciplines.
- Exposure to New Relic APM and IaC automation is a plus.
- Experience working in a 24x7 on-call rotation.
- Strong knowledge of the Windows OS ecosystem, IIS, and MS SQL Server administration, clustering, performance tuning, and failover.
- Deep experience with monitoring and logging tools such as New Relic, Logz.io, and AWS CloudWatch.
- Experience with AWS (EC2, ASG, CloudWatch, CloudTrail, VPC) and infrastructure management.
- Good understanding of networking, DNS, load balancing, and security principles.
- Proficiency in scripting languages such as PowerShell and Python.
- Strong understanding of incident response, change management, and postmortem culture.
- Experience using Jira and Jira Service Management for operational workflows.
- Ability to work independently and drive technical initiatives.
Job Title
Senior Site Reliability Engineer