AI Evaluation Engineer

Join Rational Exponent to evaluate, stress-test, and improve trustworthy AI systems for enterprise clients.

Rational Exponent is seeking a highly skilled engineer to join our AI Evaluations team. This role sits at the intersection of infrastructure engineering, AI/LLM integration, and product development. You'll build and maintain the evaluation infrastructure that ensures our AI cognitive functions meet quality standards before reaching production, while also contributing to core product development and bridging the evaluation and development teams.

Tech Stack

Python, LangChain, LlamaIndex, Langfuse, Svelte/SvelteKit/TypeScript, MongoDB, Qdrant, MinIO, RabbitMQ, FastAPI, Kubernetes, Terraform, AWS (EKS, Lambda, S3, Bedrock, etc.), Azure Cognitive Services, REST, GraphQL, OpenAI and Hugging Face APIs.

Key Responsibilities

  • Design, build, and maintain evaluation automation frameworks for testing cognitive functions
  • Create standardized testing harnesses that allow isolated testing of AI components
  • Develop APIs and interfaces for the evaluation team to run experiments systematically
  • Build experiment tracking and logging systems to capture all evaluation runs with full configuration details
  • Create real-time dashboards for visualizing evaluation results, historical trends, and A/B comparisons
  • Implement alerting systems for regressions or evaluation failures
  • Ensure evaluation infrastructure is highly reliable with minimal downtime
  • Configure automated evaluation runs on every relevant code change
  • Monitor and optimize evaluation run times and resource utilization
  • Serve as a bridge between the evaluation and development teams
  • Support developers with AI/LLM integration questions and best practices
  • Ensure evaluation results are easily consumable and actionable for developers
  • Help developers interpret and act on evaluation feedback
  • Participate in weekly evaluation standups to coordinate priorities
  • Provide technical consultation on prompt engineering and LLM optimization

Qualifications

You will have:

  • Experience building and implementing LLM evaluation frameworks and experiment tracking platforms (Langfuse, MLflow, Weights & Biases, or similar)
  • Experience with experiment design, including systematic parameter sweeping and A/B testing
  • A track record of improving LLM system performance

Additionally:

  • Proven track record building reliable automation and testing infrastructure
  • Strong Python development skills with experience building production systems
  • Hands-on experience working with large language models (OpenAI, Anthropic Claude, or similar)
  • Experience with LLM evaluation methodologies, metrics, and benchmarking frameworks
  • Proficiency building APIs and testing harnesses to integrate with cognitive functions (FastAPI, Flask)
  • Understanding of prompt engineering principles and optimization techniques
  • Familiarity with LLM frameworks and agentic systems (LangChain, LlamaIndex, or similar)
  • Knowledge of data pipeline orchestration and batch processing (Airflow, Prefect, or similar)
  • Strong scripting and automation skills for evaluation workflows
  • Experience with CI/CD tools and practices (GitHub Actions, Jenkins, CircleCI)
  • Understanding of token economics, rate limiting, and cost optimization for LLM systems
  • Proficiency with version control systems and Git workflows
  • Experience with testing frameworks and test automation
  • Working knowledge of databases for storing evaluation results and experiment metadata
  • Strong problem-solving and analytical skills
  • Excellent communication and collaboration abilities
  • Ability to bridge technical and non-technical stakeholders
  • Self-motivated with ability to work independently and prioritize effectively
  • Passion for developer experience and tooling
  • 4+ years of software engineering experience
  • 2+ years of experience working with machine learning systems or AI/LLM integrations
  • Bachelor’s degree in Computer Science, Software Engineering, or related field

ABOUT RATIONAL EXPONENT

Rational Exponent is a new company founded by an experienced team with a track record of successfully building, scaling, and exiting enterprise software and services companies, with a particular focus on finance, banking, capital markets, healthcare, and other highly regulated industries. Our mission is to provide the tools and controls that allow complex, regulated entities to confidently deploy applications based on state-of-the-art foundation model AI. We develop systems and frameworks that enable our customers to create trustworthy cognitive applications that control risk, evidence compliance, and deliver business value.

COMPREHENSIVE BENEFITS THAT SUPPORT YOU AND YOUR FAMILY

We provide a competitive benefits package designed to support your health, financial future, and work-life balance. It includes multiple national medical, dental (with a $0 employee option), and vision plans with company premium contributions; tax-advantaged accounts such as an HSA, Healthcare FSA, and Dependent Care FSA; and life and disability coverage, all within a flexible, remote-first work environment.

Job Details

  • Job Type: Full-time
  • Salary: Competitive + equity
  • Vacancy: 1 Person
  • Years of Experience: 5+
  • Education: Bachelor's Degree
  • Deadline: 2026-12-31