Job Title


Infrastructure Engineer: GPU Fleet (HPC)


Company : Alpha Compute


Location : Dartmouth, Nova Scotia


Created : 2026-05-07


Job Type : Full Time


Job Description

About the Company

Alpha Compute Corp. (NASDAQ: ALP), formerly AlphaTON Capital Corp. (NASDAQ: ATON), is a technology leader in AI GPU-as-a-service (GPUaaS) and AI Confidential Compute. Alpha Compute builds and operates businesses at the intersection of confidential compute, artificial intelligence, and digital assets. The Company's GPU assets deliver privacy-preserving computation to partners and applications including Telegram, Animoca Brands, GAMEE, and Midnight Network.

About the Role

Alpha Compute is scaling the next generation of AI infrastructure. We are seeking a Lead GPU Infrastructure Engineer to architect and own the lifecycle of our high-density GPU fleet (H200, B200, and B300). You will not be inheriting legacy systems; you will be building the software-defined systems that deliver enterprise-grade availability for massive production AI training workloads.

Visit https://www.alphacompute.ai/

Core Responsibilities

Fleet Architecture & Lifecycle: Own the end-to-end health of our H200, B200, and B300 nodes. You are responsible for the Day 0 to Day N lifecycle, from firmware validation and bare-metal provisioning to decommissioning.

Thermal & Power Management: Lead the operational oversight of high-density liquid-cooled environments. Monitor CDU (Coolant Distribution Unit) health and secondary loop telemetry alongside GPU thermals for extreme 120kW+ racks.

Auto-Remediation & Observability: Architect a telemetry stack using Prometheus, Grafana, and NVIDIA DCGM that doesn't just alert you to issues, but actively triggers automated remediation (e.g., automated node draining, reboots, and health validation) for common hardware regressions.

NetBox Integration: Own the migration of our inventory to NetBox DCIM. Build the API integrations that make NetBox the undisputed, authoritative source of truth for asset tracking, IPAM, and cabling for our compliance audits.

Vendor & Operator Authority: Serve as the primary technical interface for third-party facility operators and MSPs. Set the bar for SLA/KPI compliance, lead technical post-mortems, and manage escalations for cluster-level outages.

Commercial Support: Serve as the technical authority on enterprise deal cycles, supporting the Sales team with capacity planning, infrastructure deep-dives, and technical reviews for top-tier clients.

On-Call Leadership: Participate in a 24/7 on-call rotation. This role carries primary accountability for fleet availability and incident response.

Technical Requirements

HPC & GPU Pedigree: Extensive experience managing large-scale HPC environments or production GPU fleets at a hyperscaler, neocloud, or top-tier research facility.

Hopper & Blackwell Mastery: Deep, hands-on experience with H200, B200, or B300 systems. You must intimately understand the unique power, thermal, and networking demands of Blackwell-class hardware.

Fabric & Interconnects: Expert knowledge of 400G/800G InfiniBand (ConnectX-7 NDR / ConnectX-8 XDR), NVLink, and NVSwitch architectures.

Engineering Mindset: Strong Linux internals and proven proficiency in building bulletproof infrastructure automation using Python or Go.

Observability: Deep experience deploying and scaling DCGM-based telemetry and SNMP-based environmental monitoring.

Strong Plus

Liquid Cooling Experience: Direct experience with Direct-to-Chip (DLC) systems, coolant chemistry management, or immersion cooling.

NVIDIA Mission Control: Familiarity with NVIDIA Mission Control for Blackwell-class cluster management.

Confidential Compute: Expertise in Intel TDX or NVIDIA RIM attestation flows.

Early-Stage Growth: Prior experience as an initial infrastructure hire responsible for building standards from the ground up.

Type: Full-time

Location: Remote, North America (Core working hours must overlap with EST/PST business hours)

Alpha Compute Corp. is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.