We are seeking a highly skilled Senior Linux Administrator to join our team, focusing on the implementation and management of on-premises Linux servers optimized for AI/ML workloads. The ideal candidate will have deep expertise in core Linux system administration, with a strong foundation in configuring and optimizing servers for high-performance computing tasks. Responsibilities include deploying and maintaining robust Linux environments, automating system processes, and ensuring security and stability for AI/ML pipelines. While training on NVIDIA technologies will be provided, the candidate must demonstrate proficiency in Linux ecosystem tools, scripting, and troubleshooting complex on-premises systems. This role demands a proactive problem-solver capable of delivering reliable, high-performance infrastructure to support cutting-edge AI/ML initiatives.

Key Responsibilities
• Support deployment and maintenance of NVIDIA GPU-accelerated systems.
• Deploy and support Kubernetes clusters across various environments and distributions (e.g., RKE, OpenShift, AKS, EKS, GKE).
• Perform day-to-day system administration across compute, storage, and networking layers.
• Automate infrastructure tasks using shell scripts, Ansible, or similar tools.
• Collaborate with DevOps, data science, and engineering teams to ensure scalable, resilient infrastructure for AI/ML workloads.
• Monitor infrastructure health and performance; participate in troubleshooting and root cause analysis.

Required Experience
• Extensive experience managing, configuring, and troubleshooting Linux-based systems (e.g., RHEL, Ubuntu, CentOS, Debian) in enterprise environments, including kernel tuning, system monitoring, and performance optimization.
• Hands-on experience deploying and configuring Linux servers for AI/ML applications, including setup of GPU-accelerated environments, storage optimization for large datasets (e.g., using RAID, LVM), and ensuring system stability under intensive computational loads; training on NVIDIA technologies will be provided.
• Expertise in tuning Linux systems for performance, including CPU/GPU resource allocation, memory management, and I/O optimization, tailored to on-premises setups handling AI/ML training and inference workloads.
• Proven ability to diagnose and resolve intricate problems in Linux environments, such as hardware failures, network bottlenecks, or software conflicts, with an emphasis on minimizing downtime in mission-critical on-premises AI/ML systems.

Qualifications
• Minimum 7 years of experience in systems engineering or enterprise infrastructure roles.
• Understanding of enterprise storage, networking, and system monitoring tools.
• Scripting and automation experience (e.g., Bash, Python, Ansible).
• Strong communication, documentation, and troubleshooting skills.
• Comfortable working independently in a remote environment.
Job Title
Sr Systems Engineer, Linux – AI Infrastructure