Job Title


Senior Service Reliability Engineer


Company : IREN


Location : Vancouver, British Columbia


Created : 2026-03-10


Job Type : Full Time


Job Description

IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. IREN's vertically integrated platform is underpinned by its expansive portfolio of grid-connected land and data centers in renewable-rich regions across the U.S. and Canada. With 100% renewable energy, we build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be done in the right way: responsibly, sustainably, and with a positive impact on the communities we operate in.

The Platform Infrastructure Division builds and operates the foundational systems that power IREN's GPU-enabled, multi-tenant compute platform.

Senior Service Reliability Engineer

This role owns the design, scalability, and operational excellence of the observability platform. You will transform high-volume metrics, logs, events, and traces into actionable intelligence that improves reliability, performance, and operational efficiency. You will be on the bleeding edge of the AI Revolution, building monitoring systems for thousands of GPUs and integrating reliability principles into the heart of our operations.

Job Requirements

Technical Skills

- 7+ years in Site Reliability Engineering, DevOps, Infrastructure Engineering, or similar roles.
- 3+ years owning observability platforms at meaningful scale.
- Deep understanding of distributed systems and production operations.
- Strong hands-on experience with Prometheus and Grafana in large-scale environments.
- Experience with tracing and logging ecosystems including OpenTelemetry, Jaeger, Tempo, Loki, or Elasticsearch.
- Strong Linux systems engineering background including performance analysis and troubleshooting.
- Experience operating Kubernetes in production environments.
- Strong networking fundamentals including TCP/IP, DNS, and service-to-service communication patterns.
- Proficiency in Go, Python, or similar modern programming languages.
- Experience building automation and internal reliability tooling.
- Experience managing high-volume telemetry ingestion and time-series storage systems.

Soft Skills & Competencies

- Strong analytical and troubleshooting capabilities across complex distributed systems.
- Ownership mindset with a strong sense of responsibility toward production systems.
- Effective communicator able to collaborate across engineering, infrastructure, and leadership teams.
- Pragmatic problem solver focused on reliability, scalability, and operational excellence.

Nice-to-Have

- Experience operating GPU-dense environments or high-performance compute clusters.
- Experience integrating GPU telemetry and hardware health signals into observability systems.
- Familiarity with InfiniBand, RoCE, or advanced data center networking fabrics.
- Experience integrating out-of-band management telemetry such as Redfish or BMC event streams.
- Experience supporting AI training infrastructure or research compute environments.

Job Responsibilities

Observability Architecture & Telemetry Strategy

- Design and own end-to-end observability architecture across metrics, logs, traces, and event streams.
- Define telemetry standards and enforce consistent metadata across services and infrastructure domains.
- Establish and operationalize service level indicators and service level objectives across critical systems.
- Implement error-budget-driven alerting strategies that prioritize signal over noise.
- Architect highly available, scalable, and cost-efficient telemetry ingestion and storage systems.
- Develop executive and engineering-level dashboards that surface reliability posture and system health trends.

Incident Management & Operational Excellence

- Own and evolve the full incident lifecycle across detection, triage, mitigation, resolution, and recovery.
- Design severity models, escalation paths, and response playbooks across software and infrastructure domains.
- Lead complex cross-functional incident response efforts involving distributed systems and GPU infrastructure.
- Conduct structured, blameless post-incident reviews and drive long-term systemic improvements.
- Track and improve key operational metrics including mean time to detect, mean time to recover, and change failure rate.
- Partner with engineering teams to eliminate recurring incidents through automation and architectural improvements.

Software & Infrastructure Observability

- Partner with engineering teams to standardize instrumentation across applications and services.
- Drive adoption of distributed tracing and structured logging best practices.
- Build investigative workflows that connect application-level symptoms to infrastructure and hardware signals.
- Correlate GPU health events and hardware telemetry with application performance and reliability metrics.
- Create topology-aware views of large-scale systems to accelerate incident diagnosis and root cause analysis.

Observability Platform Engineering

- Design and operate Prometheus at scale including federation, recording rules, and alert optimization.
- Build and maintain Grafana dashboards, alerting strategies, and role-based access models.
- Operate log aggregation and indexing platforms such as Loki or Elasticsearch.
- Implement distributed tracing systems using OpenTelemetry and compatible backends.
- Manage telemetry ingestion pipelines, retention strategies, and storage tiering policies.
- Optimize metric cardinality, labeling standards, and cost-performance trade-offs at scale.

Job Benefits

Compensation

- The expected base salary for this role ranges from CAD$135,000 to CAD$150,000 per annum. Actual compensation will be determined based on factors such as experience, qualifications, and market data for the region.
- The total compensation package may include an annual incentive bonus and equity (long-term incentive).

Health & Wellness

- Medical, dental, and vision insurance coverage, 100% company-paid for employees and dependents
- Company-paid life and disability insurance
- Voluntary life and critical illness coverage available
- Employee Assistance Program and virtual health care platform

Financial Well-Being

- RRSP with company match
- Voluntary TFSA

Time Off & Flexibility

- 3 weeks annually for vacation and paid holidays

Growth & Development

- Opportunities for advancement and internal mobility
- Training and personal development opportunities

Lifestyle & Culture

- Company events and team-building activities

We value diverse perspectives and believe that skills can be developed. If you're passionate about this role, we want to hear from you, whether or not you meet every criterion. Your unique experiences might be exactly what we need!

Podtech Data Centers Inc., the employing entity and proud member of the IREN Group, is an equal opportunity employer that is committed to creating an inclusive workplace. We evaluate qualified applicants without regard to race, colour, religion, age, sex, sexual orientation, gender identity, genetic information, national origin, disability, veteran status, and other legally protected characteristics.

This job will remain posted until filled. While we appreciate all applications we receive, we are only able to contact candidates under consideration.

By applying for this position and submitting your resume and application materials, you consent to the processing of your personal information in accordance with our Job Applicant Privacy Statement available on our website at www.iren.com.