Machine Learning Engineer – Generative AI & Agent Evaluation
Location: Remote
Company: Pilotcrew AI
Type: Full-Time
Experience: 2–6 Years

About Pilotcrew AI
Pilotcrew AI builds infrastructure for AI agent evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing. Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview
We are hiring a Machine Learning Engineer with strong Generative AI expertise to design and build scalable evaluation infrastructure for LLMs and AI agents. You will architect distributed inference pipelines, structured trace logging systems, tool-call validation frameworks, and automated grading engines. The role involves benchmarking proprietary and open-weight LLMs, implementing robustness metrics, building adversarial stress-testing pipelines, and analyzing agent failure modes under real-world conditions. This is a systems-heavy, production-focused GenAI role requiring strong ML fundamentals and engineering rigor.
Key Responsibilities
- Design and implement distributed LLM inference pipelines
- Build automated benchmarking systems for reasoning, planning, and tool use
- Implement reliability metrics, variance analysis, and statistical confidence evaluation
- Develop adversarial testing frameworks for stress-testing agents
- Create structured evaluation pipelines (rule-based and model-based graders)
- Build trace capture, logging, and telemetry systems for multi-step agent workflows
- Validate tool calls and sandboxed execution environments
- Optimize inference for latency, cost, and throughput
- Manage dataset versioning and reproducible benchmark pipelines
- Deploy and monitor GenAI systems in production (AWS/GCP/Azure)

Required Skills
- Strong Python programming and system design skills
- Hands-on experience with Generative AI systems and LLM APIs
- Experience with PyTorch or TensorFlow
- Experience building production ML or GenAI systems
- Strong understanding of decoding strategies, temperature effects, and sampling variance
- Familiarity with async processing, distributed task execution, or job scheduling
- Experience with Docker and cloud deployment
- Strong debugging, observability, and reliability engineering mindset

Preferred Skills
- Experience with AI agent architectures (ReAct, tool-calling, planner-executor loops)
- Experience with reward modeling or evaluation science
- Knowledge of RLHF or alignment pipelines
- Familiarity with vector databases (FAISS, Pinecone, Weaviate)
- Experience with distributed systems (Ray, Celery, Kubernetes)
- Experience building internal benchmarking platforms

What We Value
- Ownership and bias toward execution
- Systems thinking and failure-mode analysis
- Comfort working with non-deterministic model behavior
- Ability to design measurable, reproducible evaluation pipelines
- Clear technical communication

Why Join Pilotcrew AI
- Work on cutting-edge AI agent evaluation infrastructure
- Solve real-world GenAI reliability challenges
- High technical ownership and autonomy
- Opportunity to shape how AI agents are benchmarked at scale