Job Title : Deployment Engineer – GenAI Infrastructure


Company : OnFinance AI


Location : Mumbai, Maharashtra


Created : 2025-06-25


Job Type : Full Time


Job Description

Job Title: Deployment Engineer – GenAI Infrastructure
Location: Bengaluru / Mumbai / Hybrid
Department: Engineering
Reports to: CTO (Priyesh Srivastava)
Apply here:

About OnFinance AI

OnFinance AI is building agentic AI infrastructure to automate compliance, surveillance, and governance for the BFSI sector. Our stack includes proprietary LLMs (NeoGPT), investigative agents, audit vetting tools, voice compliance workflows, and real-time regulatory engines used by premier Indian BFSI clients.

Role Overview

We're seeking a Deployment Engineer who deeply understands GenAI model infrastructure and can manage cloud-native deployments across LLMs, embeddings, and audio models. The ideal candidate will help us operationalize and scale inference pipelines, GPU workloads, and model-serving endpoints with reliability, observability, and cost-awareness.

Key Responsibilities

- Design and manage containerized deployments (Docker + Kubernetes) for our four-model architecture:
  - Embedding model
  - Core LLM (NeoGPT / OpenAI / Claude)
  - Audio transcription + diarization (2 models)
- Own GPU provisioning, RAM/VRAM estimation, and autoscaling configurations for model servers.
- Optimize rate limits, concurrency, and model ID usage, especially across the OpenAI, Groq, and Claude APIs (a retry sketch appears at the end of this posting).
- Calculate and manage infra requirements (CPU, GPU, memory) based on token throughput and user concurrency (a sizing sketch appears at the end of this posting).
- Maintain compatibility with the NVIDIA driver + CUDA ecosystem and relevant AI runtime dependencies.
- Interface with DevOps and cloud vendors to ensure cost-effective deployments on AWS/GCP/Azure/Oracle.
- Monitor server health, model response times, and latency with dashboards (e.g., Prometheus, Grafana; a metrics sketch appears at the end of this posting).
- Support continuous delivery of models with whitelisted URLs and secure artifact updates for BFSI clients.
- Coordinate with engineering and client success teams to ensure high uptime in client-facing deployments.

Must-Have Skills

- Strong understanding of LLM infrastructure, including rate limits, model IDs, and tokenized cost breakdowns.
- Experience with GPU workloads, the CUDA stack, NVIDIA driver dependencies, and memory bottlenecks.
- Proficiency in Kubernetes, Docker, and CI/CD pipelines for model deployment.
- Hands-on experience with OpenAI, Azure OpenAI, or similar model APIs, including rate-limit strategy and retries.
- Strong infra planning skills: the ability to calculate GPU/RAM needs from batch sizes, sequence lengths, and model types.
- Willingness to travel to client offices in Mumbai.

Good to Have

- Prior experience with audio model inference (e.g., Whisper, diarization pipelines).
- Familiarity with LLM evaluation tools, quantization, or low-latency inference (vLLM, TGI).
- Exposure to the BFSI or compliance domain is a strong plus.

Why Join Us?

You'll be part of the core engineering team scaling real-world AI agents used by India's largest financial institutions. This is an opportunity to build robust infra systems at the intersection of compliance, cloud, and cutting-edge AI.
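
As a rough illustration of the sizing arithmetic this role calls for (estimating GPU/VRAM needs from model size, batch size, and sequence length), here is a minimal Python sketch. All dimensions below are hypothetical placeholders for a generic 7B-class decoder model, not the actual configuration of NeoGPT or any client deployment.

```python
# Back-of-envelope VRAM estimate for serving a decoder-only LLM.
# All model dimensions are illustrative (roughly 7B-class), not a
# description of NeoGPT or any OnFinance AI deployment.

def estimate_vram_gb(
    n_params: float = 7e9,      # total parameters
    bytes_per_param: int = 2,   # fp16/bf16 weights
    n_layers: int = 32,
    n_kv_heads: int = 32,
    head_dim: int = 128,
    kv_bytes: int = 2,          # fp16 KV cache entries
    batch_size: int = 8,        # concurrent sequences
    seq_len: int = 4096,        # max context per sequence
    overhead: float = 1.2,      # activations, fragmentation, runtime buffers
) -> float:
    weights = n_params * bytes_per_param
    # KV cache: 2 tensors (K and V) per token, per layer, per KV head.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = batch_size * seq_len * kv_per_token
    return (weights + kv_cache) * overhead / 1e9

if __name__ == "__main__":
    print(f"Estimated VRAM: {estimate_vram_gb():.1f} GB")
```

With these placeholder numbers the estimate lands around 37 GB, which illustrates why batch size and context length, not parameter count alone, drive GPU selection.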
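The rate-limit and retry strategy mentioned under Key Responsibilities usually reduces to exponential backoff with jitter. A minimal, provider-agnostic sketch follows; `RateLimitError` and `request_fn` here are stand-ins (the openai Python SDK, for instance, raises its own `openai.RateLimitError`), not a specific vendor integration.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception (hypothetical)."""

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, cap=30.0):
    """Retry a model API call on rate-limit errors using exponential
    backoff with jitter. `request_fn` is any zero-argument callable that
    performs one request against OpenAI, Groq, Claude, etc."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            # Double the delay each attempt, cap it, and add jitter so
            # concurrent workers do not retry in lockstep.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))
    raise RuntimeError("Rate limit retries exhausted")
```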
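For the monitoring responsibility, below is a minimal sketch of exporting inference latency to Prometheus with the `prometheus_client` library; the metric name, port, and bucket bounds are illustrative choices, not a known OnFinance AI schema. Grafana can then chart percentiles from the resulting series via `histogram_quantile()`.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency histogram for model inference requests. Bucket bounds are
# illustrative and should be tuned to the model's real latency profile.
INFERENCE_LATENCY = Histogram(
    "model_inference_seconds",
    "Wall-clock latency of model inference requests",
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10),
)

def handle_request() -> None:
    with INFERENCE_LATENCY.time():  # records the duration on exit
        time.sleep(random.uniform(0.05, 0.5))  # stand-in for real inference

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```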