Job Description

Job Title: Senior Engineer-HPC
Department: Production & Support
Location: Faridabad

Position Summary:
We are looking for an accomplished HPC Systems Engineer with 8–10 years of enterprise Linux administration and over 5 years of hands-on experience managing large-scale HPC clusters exceeding 500 cores and multi-petabyte storage environments. The role calls for proven expertise in designing, implementing, and optimizing HPC infrastructure, including compute, storage, and high-speed networking, to deliver maximum performance for demanding workloads.

Key Responsibilities:

HPC Cluster Management & Optimization
- Design, implement, and maintain HPC environments, including compute, storage, and network components.
- Configure and optimize Slurm, PBS Pro, or other workload managers/schedulers for efficient job scheduling and resource allocation.
- Implement performance tuning for CPU, GPU, memory, I/O, and network subsystems to meet workload demands.
- Manage HPC filesystem solutions such as Lustre, BeeGFS, or GPFS/Spectrum Scale.

Linux Administration
- Administer enterprise-grade Linux distributions (RHEL, CentOS, Rocky, Ubuntu) in large-scale compute environments.
- Manage kernel upgrades, patching, and security hardening.
- Troubleshoot kernel-level and system-level issues for performance and stability.

Automation & Configuration Management
- Develop and maintain Ansible playbooks/roles for automated provisioning, configuration, and patching of HPC systems.
- Integrate Ansible with CI/CD pipelines for infrastructure as code (IaC) practices.
- Automate cluster deployment and maintain environment consistency across hundreds of nodes.

Monitoring, Troubleshooting & Support
- Implement and maintain monitoring tools (e.g., Grafana, Prometheus, Nagios, Ganglia).
- Troubleshoot complex HPC workloads, MPI communication issues, and application performance bottlenecks.
- Provide Tier-3 escalation support for Linux/HPC-related incidents.

Collaboration & Documentation
- Work closely with research teams, DevOps engineers, and system architects to deliver high-performance solutions.
- Document architecture, SOPs, troubleshooting guides, and performance tuning methodologies.

Requirements

Required Skills & Experience
- 8–10 years of hands-on Linux system administration experience in production environments.
- 5+ years managing HPC clusters at scale (500+ cores / multiple petabytes of storage).
- Strong Ansible automation skills (complex playbooks, roles, variables, templates).
- Deep understanding of MPI, OpenMP, and GPU/accelerator integration in HPC workloads.
- Proficiency with HPC job schedulers (Slurm, PBS Pro, LSF).
- Experience with HPC storage (Lustre, BeeGFS, GPFS).
- Strong knowledge of TCP/IP networking, InfiniBand, and RDMA technologies.
- Experience with performance tuning and benchmarking tools (perf, HPCToolkit, Intel VTune, iperf, fio).
- Scripting proficiency in Bash, Python, or Perl for automation and tooling.

Preferred Qualifications
- Experience with containerized HPC (Singularity, Apptainer, or Podman).
- Familiarity with cloud-HPC integration (AWS ParallelCluster, Azure CycleCloud, GCP HPC).
- Knowledge of security compliance standards (CIS benchmarks, STIG).
- Contributions to HPC community tools or open-source projects.

Soft Skills
- Strong problem-solving and analytical thinking.
- Ability to mentor junior engineers and collaborate across teams.
- Excellent communication skills for technical and non-technical stakeholders.

Job Title

Company : Netweb Technologies India Ltd.

Location : Faridabad, Haryana

Created : 2025-12-18

Job Type : Full Time