Job Description

Grafana Labs is a remote-first, open-source powerhouse. There are more than 20M users of Grafana, the open source visualization tool, around the globe, monitoring everything from beehives to climate change in the Alps. The instantly recognizable dashboards have been spotted everywhere from a NASA launch and Minecraft HQ to Wimbledon and the Tour de France. Grafana Labs also helps more than 3,000 companies, including Bloomberg, JPMorgan Chase and eBay, manage their observability strategies with the Grafana LGTM Stack, which can be run fully managed with Grafana Cloud or self-managed with the Grafana Enterprise Stack, both featuring scalable metrics (Grafana Mimir), logs (Grafana Loki) and traces (Grafana Tempo).

We're scaling fast and staying true to what makes us different: an open-source legacy, a global collaborative culture, and a passion for meaningful work. Our team thrives in an innovation-driven environment where transparency, autonomy and trust fuel everything we do.

You may not meet every requirement, and that's okay. If this role excites you, we'd love you to raise your hand for what could be a truly career-defining opportunity.

This is a remote opportunity and we would be interested in applicants from Canadian time zones only at this time.

Staff Software Engineer - Grafana Databases, Managed Services

The Opportunity

The Managed Services team is a newly formed squad within the Databases department. It owns and operates shared, production-critical infrastructure that powers Grafana Cloud's next-generation database products (Mimir, Loki and Tempo). Today, this includes operating 100+ WarpStream clusters across multiple cloud providers and regions, with continued growth anticipated for the future. WarpStream acts as the streaming backbone for ingestion and read/write decoupling across databases. It sits directly on the hot path for metrics, logs and traces, handling high-throughput, multi-consumer workloads at massive scale.
In addition to streaming infrastructure, the team works closely with high-volume analytical and storage systems that power query-heavy and aggregation-heavy workloads, where latency, compression behaviour, storage layout and scaling characteristics matter deeply.

What You'll Be Doing

- Operate and evolve 100+ multi-cloud streaming clusters and related database infrastructure
- Diagnose and eliminate cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, query performance regressions, etc.)
- Design safe upgrade and rollout strategies at scale
- Improve observability, automation and operational ergonomics
- Partner closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out and query performance
- Work directly with distributed systems behaviour, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
- Serve as a primary escalation point and on-call for relevant incidents
- Own the relationship with all system vendors, including WarpStream Labs and others.
At the Staff Level, Your Scope Includes

Advanced Systems Ownership and Informal Technical Leadership

- Help define and evolve the technical direction for operating WarpStream and adjacent shared database systems at scale
- Lead complex initiatives such as migrations, rollout improvements and reliability investments
- Establish best practices around SLOs, scaling limits, failure isolation and change safety
- Investigate and drive resolution of multi-layer incidents spanning storage, compute, networking and control-plane dependencies
- Identify systemic risks across 100+ clusters and contribute architectural improvements that reduce recurring issues
- Improve systems toil and operational ergonomics with automation
- Partner with database and platform teams to align on strategy and long-term scalability
- Mentor and support engineers as the team matures

As we are remote-first and our engineering organisation is largely remote, we provide guidance and meet regularly using video calls, so an independent attitude and good communication skills are a must.

Of course, there is an on-call component to this role and one that we take seriously. As a company, we hire globally (remote-first) to ensure our on-call remains healthy and aligned to approximately 12 daylight hours per day. You will work closely with counterparts in other regions to provide balanced coverage and shared ownership.

This role blends deep distributed systems work with the opportunity to influence how the team approaches reliability, scaling and operational excellence.

We invest heavily in developer productivity. You can use modern AI coding assistants as part of your daily workflow (your choice of tools, within security guidelines), backed by a company-funded usage budget so you can iterate quickly without unnecessary friction. We encourage pragmatic AI-assisted development: faster prototyping, test generation, refactors, documentation and incident follow-ups, always paired with strong code review and quality standards.
You'll also have access to frontier models (e.g., GPT-Codex 5/3, Claude Opus 4.6, Gemini 3 Pro).

What Makes You a Great Fit

- Regular 1:1s with your manager and close collaboration with teammates across regions, helping shape how the team operates and matures
- Defining and evolving SLO strategy for shared database infrastructure, identifying systemic reliability gaps and driving long-term error budget improvements
- Setting standards for diagnosability across core streaming and database systems in production
- Leading complex initiatives across high-throughput, multi-cloud infrastructure
- Designing and promoting fault-tolerant architectural patterns that address distributed system realities such as storage latency, partition imbalance, noisy neighbors and control-plane dependencies
- Defining rollout, migration and upgrade safety practices used across dozens of production clusters
- Partnering with database and platform engineering leaders to influence architecture decisions, roadmap prioritisation and long-term scalability strategy
- Leading design discussions and reviewing PRs with a focus on reducing operational risk and increasing system resilience
- Raising the bar for practices across teams by mentoring engineers and sharing distributed systems knowledge
- Playing a key role in high-impact incident response, guiding investigation, driving root cause analysis and ensuring durable remediation through strong post-incident reviews

Requirements

- 8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering or distributed systems roles
- Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure (e.g., Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake or Cassandra)
- Strong Kubernetes experience in AWS, GCP or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.)
- Experience leading or driving complex technical efforts, even without formal management responsibilities
- Ability to influence technical direction and align teams around reliability improvements
- Strong understanding of distributed systems failure modes in multi-cloud environments
- Proficiency in at least one systems-oriented language (Go preferred, but not required)
- Working knowledge of Linux internals, networking, cloud storage and performance/scaling behaviour
- Experience participating in blameless incident response and writing high-quality post-incident reviews
- Clear communicator who can collaborate across teams and work autonomously
- Intellectually curious, transparent, action-oriented and kind (this is important!)

Compensation & Rewards

In Canada, the base compensation range for this role is CAD 186,368 to CAD 223,642. Actual compensation may vary based on level, experience and skillset as assessed in the interview process. Benefits include equity, bonus (if applicable) and other benefits listed here.

All of our roles include Restricted Stock Units (RSUs), giving every team member ownership in Grafana Labs' success. We believe in shared outcomes: RSUs help us stay aligned and invested as we scale globally.

Compensation ranges are country specific. If you are applying for this role from a different location than listed above, your recruiter will discuss your specific market's defined pay range & benefits at the beginning of the process.

Why You'll Thrive At Grafana Labs

- 100% Remote, Global Culture: As a remote-only company, we bring together talent from around the world, united by a culture of collaboration and shared purpose.
- Scaling Organization: Tackle meaningful work in a high-growth, ever-evolving environment.
- Transparent Communication: Expect open decision-making and regular company-wide updates.
- Innovation-Driven: Autonomy and support to ship great work and try new things.
- Open Source Roots: Built on community-driven values that shape how we work.
- Empowered Teams: High trust, low ego culture that values outcomes over optics.
- Career Growth Pathways: Defined opportunities to grow and develop your career.
- Approachable Leadership: Transparent execs who are involved, visible and human.
- Passionate People: Join a team of smart, supportive folks who care deeply about what they do.
- In-Person Onboarding: We want you to thrive from day 1 with your fellow new Grafanistas to learn all about what we do and how we do it.
- Balance is Key: We operate a global annual leave policy of 30 days per annum. Three days of your annual leave entitlement are reserved for Grafana Shutdown Days to allow the team to really disconnect.

*We will comply with local legislation where applicable.

Equal Opportunity Employer

We will recruit, train, compensate and promote regardless of race, religion, colour, national origin, gender, disability, age, veteran status and all the other fascinating characteristics that make us different and unique. We believe that equality and diversity build a strong organisation and we're working hard to make sure that's the foundation of our organisation as we grow.

Grafana Labs may utilize AI tools in its recruitment process to assist in matching information provided in CVs to job postings. The recruitment team will continue to review inbound CVs manually to identify alignment with current openings.

For information about how your personal data is used once you've applied to a job, check out our privacy policy.

Job Title

Company : Grafana Labs

Location : Toronto, Ontario

Created : 2026-03-07

Job Type : Full Time