Site Reliability Engineer
Location: Remote, Canada

Our client is a fast-growing provider of AI-driven edge-computing platforms that keep industrial operations safe, smart, and always on. Their distributed hardware and software suite processes high-volume video and sensor data at the edge, delivering real-time insight for customers who cannot afford downtime. As they scale internationally, they are building a dedicated Site Reliability team to strengthen observability, automation, and uptime across a fleet of remote devices.

In this high-visibility role you will be the guardian of system reliability, owning incident response and long-term reliability engineering for mission-critical edge deployments. Your work will directly enable factories, energy sites, and transportation hubs to run with confidence around the clock.

Key Responsibilities
- Act as first responder during the 24x7 on-call rotation, triaging and resolving production incidents across Linux-based edge devices and cloud services.
- Lead root-cause analysis and deliver durable fixes that eliminate classes of failures.
- Build and tune dashboards, alerts, and health checks using Prometheus, Grafana, and log aggregation tools for real-time fleet visibility.
- Automate operational tasks with Python or Bash to reduce toil and improve response times.
- Evolve CI/CD pipelines, configuration management, and infrastructure-as-code to support reliable, repeatable deployments.
- Run load tests, network validation, and hardware burn-in to surface issues pre-production.
- Create concise SOPs, runbooks, and post-incident reports that raise the bar for operational excellence.
- Partner with software, hardware, and customer-success teams to embed reliability best practices early in the development lifecycle.

What You'll Need to Succeed
- Strong hands-on Linux administration experience (Ubuntu or embedded distributions) and comfort working with ARM-based systems.
- Proficiency in a scripting language such as Python or Bash for automation and diagnostics.
- Solid networking fundamentals (TCP/IP, routing, DNS, VPNs, VLANs, firewalls) and familiarity with tools like tcpdump or nmap.
- Experience operating modern observability stacks (Prometheus, Grafana, ELK/EFK, or Loki) and container technologies such as Docker.
- Proven ability to troubleshoot distributed systems under pressure and communicate findings clearly to technical and non-technical stakeholders.
- Willingness to share on-call responsibilities that span evenings, weekends, and holidays on a rotational basis.

Nice-to-Have Extras
- Exposure to GPU-accelerated, computer-vision, or machine-learning workloads.
- Familiarity with embedded edge hardware platforms and industrial automation protocols.
- Prior SRE, DevOps, or Systems Engineering experience supporting always-on, customer-facing solutions.
- Experience writing customer-facing operational documentation or SOPs.

Work Environment & Schedule
- 100 percent remote within Canada.
- Core coverage needs are 9:00 a.m. to 9:00 p.m. Eastern Time; on-call rotation is shared globally for true 24x7 support.
- Standard 40-hour workweek with flexibility to swap shifts inside the team.

Compensation & Benefits
- Competitive salary plus bonus eligibility.
- Choice of full-time employment or contract engagement, with comprehensive health benefits available through our employer-of-record partner.
- Expense coverage for approved home-office and professional-development costs.
- Opportunity to work with cutting-edge AI and edge-computing technology in a high-impact role.

Why Join
You will be the reliability champion for a product that makes real-world industrial sites safer and smarter every day. If you love digging into complex systems, writing clean automation, and seeing your work translate into measurable uptime for customers, we would love to meet you.

About Blue Signal
Blue Signal is an award-winning executive search firm serving a range of specialties. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS