Job Description

Responsibilities:
- Build and evolve telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets.
- Design and implement cost attribution frameworks for our multi-tenant infrastructure, enabling teams to understand and optimize their resource consumption.
- Identify and resolve performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale.
- Partner closely with cloud service providers and internal stakeholders to optimize cluster configurations, workload placement, and resource utilization across AI training and inference workloads, including large-scale clusters spanning thousands to hundreds of thousands of machines.
- Develop and champion engineering practices around efficiency, driving a culture of performance awareness and cost-conscious design across Anthropic.
- Collaborate with research and product teams to deeply understand their infrastructure needs, and design solutions that balance performance with cost efficiency.
- Drive architectural improvements and code-level optimizations across multiple services and platforms to deliver measurable utilization and performance gains.

You may be a good fit if you:
- Have 6+ years of relevant industry experience, including 1+ year leading large-scale, complex projects or teams as an engineer or tech lead.
- Have deep expertise in distributed systems at scale, with a strong focus on infrastructure reliability, scalability, and continuous improvement.
- Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java).
- Have hands-on experience with cloud infrastructure, including Kubernetes, Infrastructure as Code, and major cloud providers such as AWS or GCP.
- Have experience optimizing end-to-end performance of distributed systems, including workload right-sizing and resource utilization tuning.
- Possess a deep curiosity for how things work under the hood and have a proven ability to work independently to solve opaque performance issues.
- Have experience designing or working with performance and utilization monitoring tools in large-scale, distributed environments.
- Have strong problem-solving skills with the ability to work independently and navigate ambiguity.
- Have excellent communication and collaboration skills; you will work closely with internal and external stakeholders to build consensus and drive projects forward.

Strong candidates may have:
- Experience with machine learning infrastructure workloads as well as associated networking technologies like NCCL.
- Low-level systems experience, for example Linux kernel tuning and eBPF.
- The ability to quickly understand systems design tradeoffs and keep track of rapidly evolving software systems.
- Published work in performance optimization and scaling distributed systems.

The annual compensation range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings (

Job Title

Company : MSCCN

Location : San Francisco, CA

Created : 2026-02-11

Job Type : Full Time