Are you an experienced Site Reliability Engineer looking for a new challenge? We're looking for a Staff Site Reliability Engineer (SRE) to join us at Thinkific. As a Staff Site Reliability Engineer, you will help us scale and secure the infrastructure that powers thousands of online course creators around the world. You'll play a critical part in improving the performance, reliability, and security of our platform. You'll work cross-functionally with engineers, product managers, and stakeholders to drive forward reliability-focused initiatives, build scalable systems, and mentor others. You'll also help shape our technical strategy, lead major infrastructure projects, and act as a domain expert in modern cloud-native practices, with a specific emphasis on Kubernetes, cloud infrastructure (AWS), observability, and service reliability. Your goal will be to help guide and execute on projects related to your technical domain.

Here's how you'll accomplish this:
- Own one or more technical domains across our infrastructure with accountability for system reliability, performance, scalability, and security
- Lead projects to evolve our Kubernetes-based platform, ensuring alignment with SLOs, security best practices, and long-term maintainability
- Contribute to the design and evolution of our infrastructure using Terraform, Helm, and cloud-native tools, with an emphasis on modularity, reuse, and automation
- Partner with engineering teams to design robust deployment pipelines, ensure operational readiness, and build secure-by-default patterns for new services
- Lead incident response efforts and participate in on-call rotation, driving a culture of blameless postmortems and learning
- Write infrastructure and application code in Ruby, Node.js, Python, or Bash to automate operations and improve developer experience
- Serve as a mentor and multiplier, raising the technical bar through coaching, knowledge sharing, and technical leadership
- Actively promote observability, testing, and continuous improvement in everything you build and advocate for within your team
- Participate in our on-call rotation and incident response processes to help maintain a high level of service reliability

The person we have in mind likely:
- Has 6+ years of experience in software or infrastructure engineering, including 4+ years working with Kubernetes in production environments
- Holds a CKA certification or equivalent hands-on Kubernetes expertise (bonus for experience managing multi-tenant clusters or complex networking in K8s)
- Has deep knowledge of TLS, certificates, ciphers, and encryption protocols, and can explain how they secure communications in a distributed system
- Has production experience with AWS infrastructure and services (EKS, RDS, IAM, ALB, S3, etc.)
- Writes infrastructure-as-code using Terraform, and has built scalable and secure infrastructure following modular and reusable patterns
- Is comfortable with monitoring and observability tooling (e.g., New Relic, Datadog, Prometheus, Grafana, Sentry) and building alerting based on meaningful SLOs
- Has experience supporting distributed systems with relational and non-relational databases (PostgreSQL, AWS Aurora), message queues (Sidekiq, SNS/SQS), and asynchronous architectures
- Enjoys collaborating across teams and helping shape engineering roadmaps and architectural direction
- Brings a strong ownership mentality, cares deeply about developer experience and operational excellence, and thrives in a fast-paced environment
- Loves to learn and grow. They've found (and keep looking for) ways to level up their skills in this field, whether that's through formal education, gaining professional experience, or maybe even building their own business

These things would also be nice, but we think you could learn them on the job:
- Experience with Database Administration (DBA) practices, including performance tuning, replication strategies, backup and recovery planning, and operational support for PostgreSQL or AWS Aurora environments
- Experience working with Ruby on Rails and/or Node.js applications in production
- Familiarity with Cloudflare, load balancing strategies, and CDN configuration
- Experience improving CI/CD pipelines and secure software supply chains

We're committed to fair and transparent pay that reflects both where you're at and where you can grow. This role has a salary range of $132,900 - $166,100 - $182,900 in Canada, designed to capture the full journey from developing skills to excelling in the position. Most new hires start between the minimum and midpoint, which aligns with being fully capable in the role.
Salaries above the midpoint are typically reserved for team members who have demonstrated strong, consistent performance, deep expertise, and a significant positive impact within the role. For high-demand or hard-to-fill positions like this one, we may hire above midpoint for candidates who bring exceptional experience, skills, or impact potential.

Diversity, Equity, Inclusion and Belonging & Accessibility

This is just our initial idea of who we're looking for! At Thinkific, we know that people have unique career journeys. If your experience is close to what we've described but you feel that you might be missing a few of the requirements, please still apply! We believe in equal opportunity and are committed to diversity, equity, inclusion, and belonging across every facet of our business. We're also committed to providing a comfortable and accessible interview experience for every candidate. If there are any accommodations our team can make throughout our hiring process (big or small), please let us know.

Job Title: Staff Site Reliability Engineer