
Job Title


AI Systems Engineer – AI Model (Training & Inference)


Company : AMD


Location : Markham,


Created : 2026-04-27


Job Type : Full Time


Job Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE / PERSON

The AMD AI Group is looking for a Senior Software Development Engineer to own the end-to-end model execution stack on AMD Instinct GPUs, spanning training infrastructure at scale and high-performance inference serving. This role demands someone who has shipped LLMs on real hardware, written GPU kernels that moved production metrics, and built the systems infrastructure (orchestration, storage, monitoring) that keeps thousands of GPUs productive. You will be instrumental in ensuring AMD GPUs are first-class citizens for frontier model training and inference across current and next-generation Instinct accelerators.

TRAINING INFRASTRUCTURE & ENABLEMENT

- Enable and optimize large-scale model training (LLMs, VLMs, MoE architectures) on AMD Instinct GPU clusters, ensuring correctness, reproducibility, and competitive throughput.
- Build and maintain training infrastructure: job orchestration, distributed checkpointing, data loading pipelines, and storage optimization for multi-thousand GPU clusters on Kubernetes.
- Debug and resolve training-specific issues including gradient norm explosions, non-deterministic behavior across GPU generations, and compute-communication overlap in distributed training (FSDP, DeepSpeed, Megatron-LM).
- Optimize RCCL collective communication patterns for training workloads, including all-reduce, all-gather, and reduce-scatter across multi-node topologies.
- Develop monitoring, alerting, and compliance infrastructure to ensure training cluster health, data security, and SLA adherence at scale.
- Design and build end-to-end validation and testing infrastructure using proxy workloads, synthetic benchmarks, and configurable workload generators to systematically validate platform readiness across AMD Instinct GPU generations.

INFERENCE OPTIMIZATION & SERVING

- Write and optimize high-performance GPU kernels (GEMM, attention, quantized matmul, GPTQ/AWQ) in HIP, Triton, and MLIR targeting AMD Instinct architectures, with demonstrated ability to outperform open-source baselines.
- Drive end-to-end inference enablement on new AMD GPU silicon: be among the first to get frontier models running on each new Instinct generation, creating reproducible guides and reference implementations.
- Optimize inference serving frameworks (vLLM, SGLang, TorchServe) for AMD GPUs: batching strategies, KV-cache management, speculative decoding, and continuous batching for production throughput/latency targets.
- Develop novel approaches to inference acceleration, including bio-inspired algorithms, SLM-assisted batching, and custom scheduling strategies that exploit AMD hardware characteristics.
- Build quantization pipelines (FP8, FP6, FP4, GPTQ, AWQ) for production model deployment, ensuring quality-performance trade-offs are well characterized across AMD GPU generations.

CROSS-CUTTING

- Collaborate with AMD silicon architecture and pre-silicon teams to provide software feedback and validate software stack integration on next-generation Instinct GPU designs for both training and inference workloads.
- Build observability and automated analysis tooling: log analysis pipelines, anomaly detection, performance baselining, regression detection, and diagnostic workflows for large-scale GPU clusters.
- Contribute to the open ROCm ecosystem and AMD's developer experience: SDKs, CI dashboards, documentation, and developer cloud enablement.

REQUIRED EXPERIENCE

- Industry experience shipping production AI/ML infrastructure, with hands-on work spanning both training and inference.

PREFERRED EXPERIENCE

- Direct experience enabling frontier models (GPT-4 class) on AMD Instinct hardware end-to-end.
- Background in building anomaly detection, log analysis, or observability systems for large-scale distributed GPU infrastructure.
- Familiarity with AMD Instinct MI-series architectures (MI300X, MI350X, MI355X) and the RCCL communication library.
- Contributions to open-source AI frameworks (PyTorch, vLLM, SGLang, DeepSpeed, Megatron-LM).
- Experience designing validation frameworks, proxy benchmarks, or synthetic workload suites for GPU infrastructure at scale.
- Experience with pre-silicon software validation or hardware-software co-verification workflows.
- Publications or patents in HPC, ML systems, or GPU kernel optimization.

PREFERRED ACADEMIC CREDENTIALS

Bachelor's or Master's degree or Ph.D. in Computer/Software Engineering, Computer Science, or a related technical discipline.

This role is not eligible for visa sponsorship. Benefits offered are described: AMD benefits at a glance. AMD does not accept unsolicited resumes from headhunters, recruitment agencies or fee-based recruitment services. AMD and its subsidiaries are equal-opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.