Job Description

We're hiring NCP-Certified Engineers! Join us as a Network (AIN), Deployment (AII), or Operations (AIO) Engineer and help power next-gen AI infrastructure with NVIDIA H100 racks. Apply now to be part of cutting-edge AI deployments and scalable data center innovation!

1. Network Design & Installation Engineer (NCP-AIN Certified)

Location: India (Remote)
Duration: Long-Term Contract

Overview: We are seeking a certified Network Design & Installation Engineer with deep expertise in InfiniBand and Ethernet-based networking solutions. This role is pivotal in architecting and deploying robust, high-performance network fabrics for NVIDIA H100 GPU-powered AI racks.

Key Responsibilities:
Design and implement scalable InfiniBand/Ethernet networks to support large-scale H100 GPU clusters.
Configure Spectrum-X switches, BlueField DPUs, and Cumulus Linux-based environments.
Integrate networking architecture with existing data center infrastructure.
Perform on-site installations, including racking, cable management, and connectivity validation.
Utilize tools such as UFM and IBDiagnet to run diagnostics and optimize network performance.
Collaborate with infrastructure and operations teams to ensure seamless deployment and expansion.

Qualifications:
NCP-AIN certification (required) or strong equivalent hands-on experience.
In-depth knowledge of InfiniBand, RoCE v2, Spectrum switches, BlueField DPUs, and Cumulus Linux.
Proven experience in designing and deploying high-performance or HPC network environments.
Willingness to travel for on-site deployments and hands-on hardware installation.
Experience with telemetry, diagnostics, and fabric tuning tools.

2. AI Infrastructure Deployment Engineer (NCP-AII Certified)

Location: India (Remote)
Duration: Long-Term Contract

Overview: We are hiring an experienced AI Infrastructure Deployment Engineer to lead the deployment of full-stack AI infrastructure powered by NVIDIA H100 GPUs. This role focuses on validating and configuring the entire stack, from bare-metal systems to orchestration platforms, ensuring production-ready AI environments.

Key Responsibilities:
Lead end-to-end deployment of AI racks, including servers, GPUs, switches, and interconnects.
Validate bare-metal hardware, Spectrum-X switches, routers, and storage systems.
Configure multi-tenant GPU environments using MIG, MPS, and virtualization tools.
Deploy NVIDIA Base Command, DGX OS, and associated AI/ML software stacks.
Integrate systems with Kubernetes, Helm, and other orchestration platforms.
Implement monitoring and telemetry using DCGM, UFM, and performance benchmarking tools.

Qualifications:
NCP-AII certification (required) or equivalent hands-on infrastructure experience.
Expertise in GPU server configurations, MIG/MPS, Base Command, and virtualization (K8s, vSphere).
Experience with BIOS/firmware updates, system burn-in, and power/cooling validation.
Strong understanding of data center infrastructure and AI workload requirements.
Experience integrating AI infrastructure with cloud-native tools and container environments.

3. AI Infrastructure Operations Engineer (NCP-AIO Certified)

Location: India (Remote)
Duration: Long-Term Contract

Overview: We are looking for a proactive and skilled AI Infrastructure Operations Engineer to manage and optimize large-scale AI clusters built with NVIDIA H100 GPUs. This role focuses on post-deployment operations: ensuring performance, reliability, and maintainability of AI infrastructure environments.

Key Responsibilities:
Manage day-to-day operations of GPU clusters, networking fabric, and server infrastructure.
Monitor and maintain the health of InfiniBand/Ethernet networks and DGX/H100 nodes.
Apply firmware upgrades and OS patches, and handle infrastructure lifecycle management.
Troubleshoot hardware, network, and container-level failures using telemetry tools like UFM and DCGM.
Create and maintain operational runbooks, automate workflows, and improve incident response.
Support infrastructure scaling and upgrades, and collaborate with deployment teams.

Qualifications:
NCP-AIO certification (required) or comparable operational experience in large-scale AI environments.
Strong troubleshooting skills across compute, network, and storage domains.
Experience with monitoring and telemetry tools (Prometheus, Grafana, DCGM, UFM).
Familiarity with log aggregation and alerting systems.
Background in data center operations, capacity planning, and support automation.

How These Roles Collaborate

NCP-AIN (Design & Install): Builds and installs the high-speed network fabric that powers AI workloads.
NCP-AII (Deploy): Deploys and validates the full AI infrastructure stack, including hardware and software integration.
NCP-AIO (Operate): Ensures continuous, reliable, and optimized operations of deployed AI environments.

Job Title

Company : Scubyt

Location : Belgaum, Karnataka

Created : 2025-08-01

Job Type : Full Time