AI Evaluation Engineer

Join Rational Exponent to evaluate, stress-test, and improve trustworthy AI systems for enterprise clients.

Rational Exponent is seeking a highly skilled engineer to join our AI Evaluations team. This role sits at the intersection of infrastructure engineering, AI/LLM integration, and product development. You'll build and maintain the evaluation infrastructure that ensures our AI cognitive functions meet quality standards before reaching production, while also contributing to core product development and bridging the evaluation and development teams.

Tech Stack

Python, LangChain, LlamaIndex, Langfuse, Svelte/SvelteKit/TypeScript, MongoDB, Qdrant, MinIO, RabbitMQ, FastAPI, Kubernetes, Terraform, AWS (EKS, Lambda, S3, Bedrock, etc.), Azure Cognitive Services, REST, GraphQL, OpenAI and Hugging Face APIs.

Key Responsibilities

  • Design, build, and maintain evaluation automation frameworks for testing cognitive functions
  • Create standardized testing harnesses that allow isolated testing of AI components
  • Develop APIs and interfaces for the evaluation team to run experiments systematically
  • Build experiment tracking and logging systems to capture all evaluation runs with full configuration details
  • Create real-time dashboards for visualizing evaluation results, historical trends, and A/B comparisons
  • Implement alerting systems for regressions or evaluation failures
  • Ensure evaluation infrastructure is highly reliable with minimal downtime
  • Configure automated evaluation runs on every relevant code change
  • Monitor and optimize evaluation run times and resource utilization
  • Serve as a bridge between the evaluation and development teams
  • Support developers with AI/LLM integration questions and best practices
  • Ensure evaluation results are easily consumable and actionable for developers
  • Help developers interpret and act on evaluation feedback
  • Participate in weekly evaluation standups to coordinate priorities
  • Provide technical consultation on prompt engineering and LLM optimization

Qualifications

You will have:

  • Experience building and implementing LLM evaluation frameworks and experiment tracking platforms (Langfuse, MLflow, Weights & Biases, or similar)
  • Experience with experiment design, including systematic parameter sweeping and A/B testing
  • A track record of improving LLM system performance

Additionally:

  • Proven track record building reliable automation and testing infrastructure
  • Strong Python development skills with experience building production systems
  • Hands-on experience working with large language models (OpenAI, Anthropic Claude, or similar)
  • Experience with LLM evaluation methodologies, metrics, and benchmarking frameworks
  • Proficiency building APIs and testing harnesses to integrate with cognitive functions (FastAPI, Flask)
  • Understanding of prompt engineering principles and optimization techniques
  • Familiarity with LLM frameworks and agentic systems (LangChain, LlamaIndex, or similar)
  • Knowledge of data pipeline orchestration and batch processing (Airflow, Prefect, or similar)
  • Strong scripting and automation skills for evaluation workflows
  • Experience with CI/CD tools and practices (GitHub Actions, Jenkins, CircleCI)
  • Understanding of token economics, rate limiting, and cost optimization for LLM systems
  • Proficiency with version control systems and Git workflows
  • Experience with testing frameworks and test automation
  • Working knowledge of databases for storing evaluation results and experiment metadata
  • Strong problem-solving and analytical skills
  • Excellent communication and collaboration abilities
  • Ability to bridge technical and non-technical stakeholders
  • Self-motivated with ability to work independently and prioritize effectively
  • Passion for developer experience and tooling
  • 4+ years of software engineering experience
  • 2+ years of experience working with machine learning systems or AI/LLM integrations
  • Bachelor’s degree in Computer Science, Software Engineering, or related field

ABOUT RATIONAL EXPONENT

Rational Exponent is a new company founded by an experienced team with a track record of successfully building, scaling, and exiting enterprise software and services companies, with a particular focus on finance, banking, capital markets, healthcare, and other highly regulated industries. Our mission is to provide the tools and controls that allow complex, regulated entities to confidently deploy applications based on state-of-the-art foundation model AI. We develop systems and frameworks that enable our customers to create trustworthy cognitive applications that control risk, evidence compliance, and deliver business value.

COMPREHENSIVE BENEFITS THAT SUPPORT YOU AND YOUR FAMILY

We provide a competitive benefits package designed to support your health, financial future, and work-life balance. It includes multiple national medical, dental (with a $0 employee option), and vision plans with company premium contributions; tax-advantaged accounts such as an HSA, Healthcare FSA, and Dependent Care FSA; and life and disability coverage, all within a flexible, remote-first work environment.

Job Details

  • Job Type: Full-time
  • Salary: Competitive + equity
  • Vacancy: 1 Person
  • Years of Experience: 5+
  • Education: Bachelor's Degree
  • Deadline: 2026-12-31