Job Description – HPC Engineer (HPC with SLURM, CPU & GPU Clusters)

Position Overview
We are seeking a skilled HPC Engineer to design, deploy, manage, and optimize our on-premises High Performance Computing (HPC) environment, consisting of SLURM-managed CPU and GPU clusters. The ideal candidate will have a strong understanding of HPC architecture, Linux systems, job scheduling, and cluster operations. Experience with parallel file systems and enterprise storage solutions such as WekaFS or Scality is preferred but optional.

Key Responsibilities

1. HPC Infrastructure & Operations
• Manage day-to-day operations of on-prem HPC clusters, including CPU and GPU compute nodes.
• Monitor cluster health, performance, and utilization, ensuring high availability and efficiency.
• Implement and maintain best practices for HPC operations, user management, and resource administration.
• Troubleshoot cluster-related issues, including networking, node failures, job failures, and performance bottlenecks.
• Support users with job submissions, resource usage, and HPC workflows.

2. SLURM Workload Manager (Mandatory)
• Install, configure, and manage the SLURM workload manager across multiple clusters.
• Handle queue creation, partition configuration, node allocation, fair-share policies, and job prioritization.
• Perform SLURM upgrades, migrations, and service maintenance with hands-on expertise.
• Work with SLURM APIs and integrations to support automation and custom workflows.
• Optimize scheduling policies for mixed CPU/GPU workloads.

3. Linux System Administration
• Manage Linux-based compute nodes, head nodes, and administration servers.
• Perform OS updates, package installations, security patching, and system tuning.
• Write shell scripts (Bash/Python) for automation and HPC tooling workflows.

4. Parallel Computing & Cluster Architecture
• Understanding of parallel computing concepts: MPI, OpenMP, distributed execution.
• Familiarity with HPC building blocks: interconnect networks (InfiniBand/100G), storage tiers, resource managers, and monitoring tools.
• Ability to analyze and troubleshoot performance issues in parallel workloads.

5. Storage (Optional but Preferred)

A. WEKA (WekaFS) – Optional
• Knowledge of parallel file systems and performance tuning.
• Diagnose and resolve WekaFS issues with minimal downtime.
• Guide internal teams on WekaFS usage and best practices.
• Stay current with advancements in the Weka ecosystem and propose improvements.

B. Scality – Optional
• Troubleshoot and maintain Scality RING and ARTESCA environments.
• Monitor, tune, and optimize Scality-based storage for high availability and reliability.
• Create and maintain documentation for Scality configuration and SOPs.
• Recommend performance improvements based on new Scality enhancements.

Qualifications & Skills

Mandatory Skills
• Experience managing SLURM-based HPC clusters in production environments.
• Solid understanding of Linux (RHEL) administration.
• Knowledge of parallel computing concepts and HPC architecture.
• Strong troubleshooting and diagnostic skills.
• Ability to work in complex, multi-node distributed environments.

Preferred/Optional Skills
• Experience with WekaFS, Scality RING, or other parallel/distributed file systems.
• Exposure to GPU computing (CUDA, NVIDIA drivers, GPU scheduling).
• Familiarity with monitoring tools (Grafana, Prometheus).
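For illustration, the partition configuration and fair-share scheduling work described under "SLURM Workload Manager" is typically expressed in slurm.conf. The sketch below shows the general shape of such a configuration; the node names, partition names, memory sizes, and GPU types (cpu[001-064], gpu[01-08], a100) are hypothetical placeholders, not a production setup.

```ini
# slurm.conf excerpt – hypothetical node/partition names, for illustration only

# CPU and GPU compute node definitions
NodeName=cpu[001-064] CPUs=64 RealMemory=256000 State=UNKNOWN
NodeName=gpu[01-08]   CPUs=32 RealMemory=512000 Gres=gpu:a100:4 State=UNKNOWN

# Separate partitions (queues) for CPU and GPU workloads
PartitionName=cpu Nodes=cpu[001-064] Default=YES MaxTime=3-00:00:00 State=UP
PartitionName=gpu Nodes=gpu[01-08]   MaxTime=1-00:00:00 State=UP

# Multifactor priority plugin enabling fair-share job prioritization
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
```

In practice, changes like these are rolled out with `scontrol reconfigure` and validated against live queues, which is where the hands-on upgrade and maintenance experience listed above comes in.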