Job Title : Senior LLM Engineer – RLHF & Alignment


Company : Innodata India Private Limited


Location : Kanpur, Uttar Pradesh


Created : 2026-04-24


Job Type : Full Time


Job Description

Role : Senior LLM Engineer – RLHF & Alignment
Experience : 5-8 years
Job mode : Hybrid (Noida)

Job Description:

- Own and drive the full RLHF pipeline: data collection, reward model training, and RL fine-tuning using PPO, DPO, GRPO, and RLAIF
- Design and run Supervised Fine-Tuning (SFT) pipelines on open-weight models (LLaMA, Mistral, Qwen) as the foundation for RLHF
- Build and train reward models that accurately capture human preferences from annotation data
- Design human feedback collection pipelines: labeling rubrics, annotator calibration, and preference dataset curation
- Implement Constitutional AI and RLAIF techniques to reduce reliance on costly human annotation
- Red-team models post-training, probing for jailbreaks, regressions, unsafe outputs, and alignment failures
- Design and maintain evaluation benchmarks to measure alignment, safety, and capability before and after RL training
- Optimize inference pipelines and runtimes (llama.cpp, vLLM, TensorRT) to serve aligned models efficiently at scale
- Implement quantization strategies (INT4/INT8/FP8, LoRA, QLoRA) to deploy fine-tuned models on target hardware
- Write and tune low-level C/C++ and Rust code for inference performance where Python cannot reach
- Diagnose and resolve training instabilities, reward hacking, and production inference bugs under pressure
- Stay at the frontier: read alignment and RL papers weekly and translate findings into working experiments

Core Requirements and Technical Skills

- Hands-on experience implementing RLHF end-to-end: not just using libraries, but understanding the mechanics
- Deep familiarity with policy gradient methods: PPO stability, KL divergence constraints, reward shaping
- Experience with Direct Preference Optimization (DPO) and its variants as an RLHF alternative (a minimal sketch of the DPO objective follows this list)
- Understanding of reward hacking, Goodhart’s Law, and mitigation strategies in RL training
- Familiarity with RLAIF (RL from AI Feedback) and Constitutional AI approaches
- Ability to design preference datasets and annotation rubrics that produce a high-quality reward signal
- Experience diagnosing training instabilities: reward collapse, mode collapse, and KL divergence blow-up
- Python as the primary language for all training, fine-tuning, and evaluation pipelines
- Strong mathematical foundation in RL theory, probability, linear algebra, and optimization, deep enough to derive loss functions and debug training dynamics
- C and C++ for systems-level inference work, runtime contributions, and performance-critical paths
- Experience using Rust for ML tooling
- Familiarity with transformer architecture, attention, tokenization, and how post-training interacts with pretraining
- Experience with distributed training frameworks for large-scale fine-tuning
- Experience with vector databases such as FAISS or Milvus
- Familiarity with retrieval-augmented generation (RAG) pipelines
- Experience integrating LLMs with external tools, APIs, and agent-based systems
- Exposure to Rapid Application Development (RAD) approaches for building and iterating AI solutions efficiently
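For context on the DPO requirement above, the sketch below shows the core of the DPO objective as it is commonly implemented in PyTorch. It is an illustrative example rather than code from this role: the function name, tensor names, and the assumption that per-response log-probabilities have already been computed under both the policy and a frozen reference model are all hypothetical.

```python
# Minimal sketch of the DPO objective (assumes PyTorch and precomputed
# per-response log-probabilities under the policy and a frozen reference model).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for a
    whole response; beta scales the implicit KL penalty that keeps the
    policy close to the reference model.
    """
    # Log-ratios of policy vs. reference for the preferred and rejected responses.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # The policy is rewarded for widening the margin between chosen and rejected.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probs for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

The beta term plays the same role as the KL divergence constraint named in the policy-gradient requirement: larger values keep the fine-tuned policy closer to the reference model, while smaller values let the preference margin dominate.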