**Job Title**

Data Scientist

---

**About Caliper Lab**

Caliper Lab is an independent AI evaluation institution building the standard for AI capability measurement in financial services and professional services. We benchmark AI products against frontier models, develop evaluation frameworks in collaboration with a consortium of AI vendors and domain experts, and publish structured intelligence that PE funds, consulting firms, and enterprise buyers use to make decisions. We are early stage, moving fast, and building something that does not yet exist.

---

**The Role**

We are looking for a data scientist who is technically sharp, intellectually curious, and comfortable working in ambiguity. This is a foundational role: you will help build the evaluation infrastructure, design benchmark pipelines, and contribute directly to the research outputs the Lab publishes. You will work closely with the founding team across both the technical build and the research agenda.

What you will do:

- Design and run LLM evaluation pipelines on financial services and professional services AI tasks
- Build and maintain benchmark datasets and ground-truth frameworks
- Develop the processing layer that translates raw model outputs into structured capability intelligence
- Contribute to the design of the Lab's evaluation standard: the taxonomy, task construction rules, and scoring methodology
- Produce structured analysis that feeds into published capability findings

What we are looking for:

- Hands-on experience with LLM evaluation: you have built and run eval pipelines, not just read about them
- Comfort with Python, API calls, and working with structured and unstructured data
- Strong quantitative and analytical thinking: you can design a scoring rubric and defend the methodology
- Self-direction: you figure things out from documentation and first principles without needing constant guidance
- Genuine intellectual curiosity about AI capability measurement as a problem
- Top-tier academic pedigree (IITs, NITs, BITS, etc.)

Nice to have:

- Familiarity with Braintrust, LangSmith, or equivalent eval tooling
- Exposure to financial services or professional services workflows
- Published work or open-source projects in the AI eval space

---

**To Apply**

Send your resume and a short write-up (half a page maximum) on what you find genuinely interesting, and genuinely underdeveloped, in the AI evaluation space today. We are not looking for a polished answer; we are looking for an honest one.

Applications without the write-up will not be reviewed.