Job Title


Senior QA Engineer – AI Model Evaluation


Company : MillionLogics


Location : Nashik, Maharashtra


Created : 2026-04-29


Job Type : Full Time


Job Description

Company Description

As a trusted Oracle Partner, MillionLogics is a global IT solutions leader with a strong presence in London, UK, and a development hub in Hyderabad, India. Our mission is to empower enterprises through scalable, future-ready IT solutions, including Data & AI, cloud migrations, and enterprise application optimization. Expertly combining business acumen with advanced technical skills, our team of 50+ Oracle specialists is dedicated to delivering tailored, results-driven solutions. Led by visionary leadership, MillionLogics is committed to driving digital transformation with a user-focused approach. Learn more about our work and leadership: MillionLogics.

Role Overview

We are looking for strong, detail-oriented software practitioners to help evaluate and improve datasets for agentic coding models. This role involves working with realistic coding tasks in an agentic coding harness, reviewing model trajectories, verifying solutions, and producing high-quality annotations.

Depending on the assignment, the work may include:
- Online evaluations: manually interacting with blinded models on predefined tasks, then ranking and grading the resulting trajectories
- Offline evaluations: designing realistic coding tasks, calibrating them through user simulation, writing task-specific rubrics, and grading generated trajectories

This is not a basic annotation role. Candidates are expected to read and debug code, validate behavior, follow detailed process rules, and make consistent judgment calls across model runs. We are specifically looking for candidates with enough engineering maturity to work independently on realistic software tasks, not just toy problems or shallow code-review exercises.

Offer Details
- Pay: INR 90,000-1,00,000 LPA (net/take-home)
- Mode of work: fully remote

What Does a Day-to-Day Look Like?
- Execute realistic coding tasks within the assigned agentic coding harness while maintaining model blindness and session independence
- Follow task instructions, milestones, planned interactions, and evaluation guardrails consistently across runs
- Verify model outputs by reading code, running commands, checking logs, and inspecting generated artifacts
- Perform targeted validation of outputs using tests, scripts, and manual checks
- Write clear, specific, evidence-based rationales for trajectory rankings and assessments
- Design multi-step, realistic coding tasks (offline work), including user intent and milestone structure
- Create and refine task-specific rubrics and binary evaluation criteria
- Review completed work for quality, completeness, consistency, and schema compliance
- Identify and escalate broken environments, unclear instructions, or process gaps with clear supporting evidence

Requirements

Software Engineering Fluency (Mandatory)
- 5+ years of experience in software engineering, QA, developer tooling, data/ML engineering, or similar code-heavy roles
- Strong hands-on experience in at least 1-2 programming languages or ecosystems
- Representative languages include: Python, JavaScript/TypeScript, Rust, Java, C/C++, Bash/CLI environments, Haskell, Swift, SQL, or other production-relevant ecosystems
- Ability to:
  - Read and understand unfamiliar codebases
  - Run and interpret tests, scripts, and CLI tools
  - Debug issues and reason about edge cases or partial fixes
  - Evaluate whether an implementation is functionally correct

Terminal & Tooling Skills (Mandatory)
- Comfortable working in Linux/Ubuntu-like environments
- Proficient with: terminal workflows, Git basics, code editors or IDEs, package managers and test runners, JSON, YAML, and Markdown
- Familiarity with Docker and reproducible environments (strong plus, especially for offline work)

Coding-Agent Workflow Familiarity (Mandatory)
- Comfortable working with, or quickly adapting to, agentic coding environments such as OpenCode, Claude Code, Cursor, or similar coding-agent tools

Quality Judgment & Annotation Accuracy (Mandatory)
- Ability to:
  - Compare multiple model trajectories and identify meaningful differences
  - Distinguish correctness from style, communication quality, and agent behavior
  - Evaluate solutions consistently using defined rubrics
  - Follow detailed process instructions without deviation
  - Maintain consistency across repeated or similar evaluations
  - Write concise, evidence-based rationales (not generic summaries)

Work Style
- Highly detail-oriented and process-driven
- Comfortable with repetitive, high-precision evaluation work
- Able to maintain consistency across long tasks and multiple model runs
- Proactively flags ambiguity instead of making assumptions
- Balances realism with strict evaluation consistency

Additional Preferred Qualifications
- Strong Docker skills and experience building/debugging reproducible environments
- Experience working in large, complex repositories (not just small or greenfield projects)
- Demonstrated originality and sound engineering judgment in defining technical problems
- Ability to design realistic, non-trivial tasks that go beyond tutorials, README flows, or simple bug fixes

Offer Details
- Commitment required: 8 hours per day, with a 4-hour overlap with PST
- Employment type: contractor position (note: this role does not include medical or paid leave)
- Duration of contract: 5 weeks; expected start date is next week

How to Apply?
Please send us your updated CV with job ID 75232 in the email subject line.