Job Description

[Up to c. $425k Comp Package (or equivalent) | Hybrid Working]

We're hiring on behalf of a top-tier technology-driven trading firm known for its world-class infrastructure and scientific approach to real-time systems. As part of a specialist engineering team, you'll help scale and optimise massive distributed GPU environments powering AI, research, and quantitative strategies. This is a rare chance to take ownership of petabyte-scale infrastructure across global data centres - shaping the future of how data-intensive workloads are run and accelerated at scale.

Key Responsibilities

Design, deploy, and tune large-scale GPU-based compute environments used for AI and quant research workloads
Benchmark, analyse, and eliminate performance bottlenecks across compute, storage, and network layers
Automate system configuration, monitoring, and diagnostics across thousands of high-density nodes
Partner with researchers and engineers to align infrastructure improvements with evolving model and data demands
Manage end-to-end rollout of new hardware and software solutions, including hands-on testing and vendor coordination
Troubleshoot complex distributed systems across the full stack: hardware, OS, drivers, and container orchestration
Own critical projects that enhance performance, reliability, and observability at the fleet level

What You Bring

4-8 years' experience managing large-scale Linux infrastructure in high-performance, distributed, or AI-centric environments
Deep technical fluency with GPU architecture, deployment, and tuning (e.g. memory management, driver compatibility, hardware diagnostics)
Strong scripting and automation skills, especially in Python, with an infrastructure-as-code mindset
Hands-on experience resolving GPU workload issues across compute clusters and supporting technologies
Familiarity with performance tooling and debugging in live production environments
Practical experience with CUDA or systems-level programming in C/C++
Experience with config management frameworks like Salt, Ansible, or Puppet
(Preferred) Experience with GPU communication and interconnect technologies (e.g. collective communication libraries such as NCCL, low-latency solutions like GPUDirect RDMA, or high-speed GPU interconnects including NVLink)

Job Title

Company : Techfellow Limited

Location : London

Created : 2025-06-18

Job Type : Full Time