Role Title: Senior HPC Engineer
Reports To: Domain Architect - AI Compute
Location: Remote, India (must align with client time zone)
Employment Type: Full-Time

About The Role

The Senior HPC Engineer is the "Foreman" of the "AI Factory". While the Domain Architect defines the architectural vision, you are responsible for the hands-on build and deployment. You act as the Technical Squad Lead for our offshore engineering teams, bridging the gap between the onshore architectural vision and hands-on execution.

As a System Integrator, we thrive on velocity and precision. You will not just "maintain" clusters; you will lead the automated deployment of NVIDIA SuperPOD and BasePOD infrastructure for global enterprise clients. You are the "Lieutenant" to the Domain Architect, translating High-Level Designs (HLDs) into executable Ansible playbooks and ensuring your squad of HPC Engineers delivers defect-free infrastructure.

In this role, you are 100% delivery-focused, split between Technical Leadership (40%) and Hands-on Engineering (60%). You are the escalation point for complex kernel panics, the guardian of our Infrastructure-as-Code (IaC) repository, and the mentor who unblocks junior engineers when a Slurm job fails to schedule.

CRITICAL REQUIREMENT: This role typically operates on shift hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities

Hands-on Engineering & Automation (60%)

Cluster Provisioning Factory:
- Lead the deployment of NVIDIA Base Command Manager (BCM) (formerly Bright Cluster Manager) to provision bare-metal DGX/HGX nodes at scale.
- Develop and maintain the Ansible / Terraform library used to configure OS settings, user authentication (LDAP/AD), and storage mounts across hundreds of nodes.
- Execute HPL (High-Performance Linpack) and NCCL tests to validate cluster performance, tuning BIOS and OS parameters to hit "Gold Standard" benchmarks.
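The benchmark-validation step above is typically automated so every new cluster is held to the same bar. As a minimal sketch, the Python check below parses the trailing summary that nccl-tests prints and compares it against a target; the 230 GB/s threshold and the sample output lines are illustrative assumptions, not a real cluster's gold standard:

```python
import re

# Hypothetical "Gold Standard" threshold (GB/s); the real target depends on
# GPU count and fabric topology and is set per cluster design.
GOLD_STANDARD_BUSBW_GBPS = 230.0

def check_nccl_busbw(report: str, threshold: float = GOLD_STANDARD_BUSBW_GBPS) -> bool:
    """Return True if the 'Avg bus bandwidth' in nccl-tests output meets the threshold."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", report)
    if not match:
        raise ValueError("no 'Avg bus bandwidth' line found in nccl-tests output")
    return float(match.group(1)) >= threshold

# Abridged sample of the summary lines printed by all_reduce_perf.
sample = """\
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 235.17
"""
print(check_nccl_busbw(sample))  # True: 235.17 >= 230.0
```

In practice a wrapper like this would gate the handover pipeline: a node set that fails the threshold is flagged for BIOS/OS re-tuning rather than released to the client.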
Scheduler & Workload Orchestration:
- Configure complex Slurm Workload Manager policies, including Fair Share, Preemption, and GPU partitioning (MIG).
- Integrate Kubernetes-based orchestrators (e.g., NVIDIA Base Command, Run:AI, or Red Hat OpenShift) with the underlying HPC hardware.

Deep-Dive Troubleshooting:
- Debug "Silent Data Corruption" and "Xid Errors" on GPUs, analysing nvidia-smi logs and kernel message buffers (dmesg).
- Diagnose fabric-related performance drops (e.g., identifying a specific flapping link causing global slowdowns) in collaboration with the Network Squad.

Squad Leadership & Quality Assurance (40%)

Technical Direction (The "Foreman"):
- Translate the Low-Level Design (LLD) provided by the Domain Architect into granular Jira tasks for your squad of HPC Engineers.
- Conduct daily stand-ups to unblock engineers, clarifying requirements and making technical decisions on the fly (e.g., "Use Ansible roles, not shell scripts, for this").

Code Quality & Governance:
- Act as the primary gatekeeper for the code repository. Perform mandatory code reviews on all Pull Requests (PRs) to ensure idempotency and error handling.
- Enforce "Config-as-Code" discipline, ensuring no manual changes are made to production clusters without a committed playbook.

Mentorship:
- Guide mid-level and junior engineers on best practices for Linux systems administration and HPC environments.

Technical Competencies

Essential Skills

HPC & AI Infrastructure:
- Expert-level knowledge of NVIDIA Base Command Manager (BCM) or Metal-as-a-Service (MAAS) provisioning tools.
- Deep understanding of Slurm configuration (cgroups, plugin development, accounting).
- Proficiency with NVIDIA DGX/HGX hardware architecture and the associated software stack (Drivers, CUDA, DCGM).

Linux & Automation (DevOps for Hardware):
- Mastery of Red Hat Enterprise Linux (RHEL) / Ubuntu internals (systemd, kernel tuning, hugepages).
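The Xid-error triage described under Deep-Dive Troubleshooting lends itself to the same automation discipline. A minimal Python sketch is below; the hint text is paraphrased from NVIDIA's public Xid documentation and the sample dmesg line is an assumed illustration, so verify both against the docs for your driver branch:

```python
import re

# Abridged map of NVIDIA Xid event codes to a first-response action.
XID_HINTS = {
    31: "GPU memory page fault - usually an application bug, not hardware",
    48: "Double-bit ECC error - drain the node and run DCGM diagnostics",
    79: "GPU has fallen off the bus - check power/PCIe; node needs attention",
}

def triage_xids(dmesg_output: str) -> list:
    """Extract Xid event codes from dmesg text and pair each with a triage hint."""
    codes = [int(m) for m in re.findall(r"NVRM: Xid .*?: (\d+),", dmesg_output)]
    return [(c, XID_HINTS.get(c, "unknown Xid - consult the NVIDIA Xid catalogue"))
            for c in codes]

# Sample dmesg line in the usual NVRM format (PCI address/details abridged).
sample = "[12345.6] NVRM: Xid (PCI:0000:0f:00): 79, pid=1234, GPU has fallen off the bus.\n"
for code, hint in triage_xids(sample):
    print(code, hint)
```

A check like this, run fleet-wide from the head node, lets the on-shift engineer separate "drain and RMA" nodes from application-level faults before escalating to the Domain Architect.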
- Advanced proficiency in Ansible (writing custom modules/roles) and Python (automating admin tasks).
- Experience with Git workflows (branching, PRs, CI/CD).

Containerisation:
- Hands-on experience with Docker, Singularity/Apptainer, and Kubernetes (specifically the NVIDIA GPU Operator and NVIDIA Network Operator).

Desirable Experience

Network Awareness:
- Ability to troubleshoot basic InfiniBand/RoCEv2 issues (ibstat, perfquery) to distinguish between a "Node Issue" and a "Network Issue".

Storage Integration:
- Experience mounting high-performance parallel file systems (VAST/Lustre/WEKA/GPFS) and tuning client-side performance.

Certifications:
- NVIDIA Certified Associate - AI in the Data Center
- Red Hat Certified Engineer (RHCE)
- CKA (Certified Kubernetes Administrator)

Success Metrics (KPIs)

- Deployment Velocity: reduction in "Time-to-Hello-World" (time from power-on to running the first successful GPU job) for new clusters.
- Code Quality: >95% of Pull Requests pass automated linting and require fewer than two review cycles before merge.
- Stability: zero "Configuration Drift" incidents in production (e.g., manual changes breaking the cluster), achieved through strict IaC enforcement.
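The node-vs-network distinction mentioned under Network Awareness can be sketched as a small Python triage over `ibstat` output. The field layout below is assumed from common Mellanox/NVIDIA tooling and the 400 Gb/s expected rate is an illustrative NDR target, so check both against your installed stack:

```python
def parse_ibstat(output: str) -> dict:
    """Pull the State, Physical state, and Rate fields from ibstat text output."""
    fields = {}
    for line in output.splitlines():
        line = line.strip()
        for key in ("State", "Physical state", "Rate"):
            if line.startswith(key + ":"):
                fields[key] = line.split(":", 1)[1].strip()
    return fields

def classify_port(fields: dict, expected_rate: str = "400") -> str:
    # A port that never reaches LinkUp points at the HCA or cabling on that
    # node; a port that is up but negotiated below its expected rate is the
    # classic signature of a degraded cable or switch port (a network issue).
    if fields.get("Physical state") != "LinkUp":
        return "node issue: link not up - check HCA, cable seating, firmware"
    if fields.get("Rate") != expected_rate:
        return "network issue: link up but degraded rate - check cable/switch port"
    return "healthy"

# Abridged sample of ibstat output for one port (layout assumed).
sample = """\
CA 'mlx5_0'
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
"""
print(classify_port(parse_ibstat(sample)))  # flags the degraded 200 Gb/s link
```

Running this across all nodes before blaming the scheduler or the application is exactly the kind of first-pass evidence the Network Squad expects in an escalation.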