1.205 LLM Evaluation & Testing Frameworks#
Explainer
LLM Evaluation: Domain Explainer#
Universal Analogies#
Quality Control for AI Outputs#
Analogy: Factory Quality Inspector vs AI Quality Inspector
Traditional software testing is like inspecting widgets on an assembly line:
- Widget either fits the spec (pass) or doesn’t (fail)
- Same input → same output, every time
- Clear pass/fail criteria
LLM evaluation is like judging creative writing:
- Many “correct” answers exist for the same prompt
- Same prompt → different outputs each time
- Quality is subjective and context-dependent
Example:
Prompt: "Summarize this article in 3 sentences"
Valid Summary A: "The study found X. Researchers discovered Y. This suggests Z."
Valid Summary B: "Research shows X is correlated with Y. The implications are Z."
Both are correct, but:
- Different phrasing
- Different emphasis
- Different completeness
An LLM evaluator must understand semantic equivalence (these mean the same thing despite different words) rather than just exact matching.
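A toy sketch of why exact matching fails here, using Jaccard word overlap as a crude stand-in for the embedding or LLM-judge similarity a real evaluator would use (function names are illustrative):

```python
# Exact-match scoring rejects valid paraphrases outright, while even a
# crude token-overlap score recognizes shared content between two
# differently worded summaries.

def exact_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

summary_a = "The study found X. Researchers discovered Y. This suggests Z."
summary_b = "Research shows X is correlated with Y. The implications are Z."

print(exact_match(summary_a, summary_b))    # False: rejects a valid paraphrase
print(token_overlap(summary_a, summary_b))  # > 0: sees the shared content
```

Production systems replace `token_overlap` with embedding cosine similarity or an LLM judge; the point is only that the scorer must tolerate surface variation.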
The Restaurant Review Problem#
Analogy: How do you know if a restaurant is good?
Option 1: Count stars (like BLEU/ROUGE scores)
- Fast, cheap, scalable
- But: Doesn’t explain WHY it got 3 stars
- Misses context: “3 stars for fine dining” ≠ “3 stars for pizza”
Option 2: Professional food critic (like LLM-as-Judge)
- Understands nuance, context, and subjective quality
- Provides detailed explanations
- But: Expensive, has personal biases
Option 3: Health inspection (like Programmatic checks)
- Binary checks: Does food have correct temperature? Is kitchen clean?
- Catches specific, predictable problems
- But: Doesn’t evaluate taste, creativity, or overall quality
Best practice: Use all three. Health inspection for safety (programmatic), star rating for quick filtering (metrics), and food critic for nuanced evaluation (LLM-as-judge).
The RAG Triad: Research Paper Analogy#
Retrieval-Augmented Generation (RAG) is like writing a research paper:
Your Question → Library Search → Retrieved Books → Your Essay
(Query) (Vector Search) (Context) (Answer)
Three quality checks:
1. Context Relevance = “Did you check out the right books?”
- Question: “How does photosynthesis work?”
- Good retrieval: Botany textbooks, plant biology papers
- Bad retrieval: Economics journals, cooking recipes
2. Faithfulness/Groundedness = “Did you cite your sources correctly?”
- Good: “According to Smith (2020), photosynthesis converts light into energy”
- Bad: “Photosynthesis was invented in 1872” (not in any source)
- Problem: Hallucination = making up citations or facts
3. Answer Relevance = “Did you actually answer the question?”
- Question: “How does photosynthesis work?”
- Good: Explains the process step-by-step
- Bad: “Photosynthesis is important” (true but doesn’t answer HOW)
Debugging with the Triad:
- Low context relevance → Fix your search/embeddings
- Low faithfulness → Model is hallucinating, prompt engineering needed
- Low answer relevance → Prompt doesn’t guide model well
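The triad can be sketched in plain Python, with word overlap standing in for the embedding- or LLM-based scoring that real frameworks use (all function names and the example strings are illustrative):

```python
# Toy RAG-triad checks. Each function mirrors one leg of the triad:
# context relevance, faithfulness, answer relevance.

def _overlap(a: str, b: str) -> float:
    """Fraction of a's words that also appear in b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def context_relevance(question: str, context: str) -> float:
    # Did retrieval fetch documents related to the question?
    return _overlap(question, context)

def faithfulness(answer: str, context: str) -> float:
    # Is the answer's content present in the retrieved context?
    return _overlap(answer, context)

def answer_relevance(question: str, answer: str) -> float:
    # Does the answer address the question that was asked?
    return _overlap(question, answer)

question = "how does photosynthesis work"
context = "photosynthesis converts light into chemical energy in plants"
answer = "photosynthesis converts light into chemical energy"

scores = {
    "context_relevance": context_relevance(question, context),
    "faithfulness": faithfulness(answer, context),
    "answer_relevance": answer_relevance(question, answer),
}
print(scores)
```

Here faithfulness comes out at 1.0 (every answer word is grounded in the context), while the two relevance scores are low because the toy overlap cannot see that "how does ... work" is answered by a process description; this is exactly the gap that embedding similarity or an LLM judge closes.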
What Problem Does LLM Evaluation Solve?#
The Scale Problem#
Scenario: You’re building a customer support chatbot answering 10,000 questions/day.
Without evaluation:
- How do you know if answers are accurate?
- Manual review = 10,000 answers × 2 min/review = 333 hours/day (impossible)
- Launch blind, hope for the best, fix angry customer complaints
With evaluation:
- Automated metrics score every answer
- Flag low-scoring answers for human review
- Catch quality regressions before customers do
- Measure improvement over time
Real example:
Answer A: "Your order ships in 3-5 business days"
Answer B: "I don't have access to shipping information"
Answer C: "Your package left our facility yesterday and should arrive Tuesday"
Evaluation metrics:
- Relevance: Does it answer the question? (A=90%, B=40%, C=95%)
- Faithfulness: Is it grounded in retrieved data? (A=80%, B=90%, C=95%)
- Completeness: Did it address all aspects? (A=70%, B=30%, C=90%)
Verdict: C is best (most complete, most accurate, most helpful)
The Drift Problem#
Analogy: Software bit rot, but for AI
Traditional software: Code doesn’t change → same input = same output forever
LLMs drift over time:
- Model updates (GPT-4 → GPT-4.5)
- Prompt modifications
- Context window changes
- Training data shifts
Without continuous evaluation:
- You update your prompt to fix one edge case
- Accidentally break 15 other cases
- No one notices until production breaks
With continuous evaluation:
- Test suite runs on every prompt change
- Regression detected immediately: “New prompt scores 15% lower on accuracy”
- Roll back or iterate before deploying
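The regression gate a CI pipeline runs on every prompt change can be as small as a score comparison against a stored baseline. A sketch (metric names and the 5-point tolerance are illustrative):

```python
# Compare a candidate prompt's evaluation scores against a stored baseline
# and report any metric that dropped more than the tolerance.

TOLERANCE = 0.05  # allow up to a 5-point absolute drop before failing

def regression_check(baseline: dict, candidate: dict, tol: float = TOLERANCE):
    failures = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric, 0.0)
        if base_score - cand_score > tol:
            failures.append(f"{metric}: {base_score:.2f} -> {cand_score:.2f}")
    return failures

baseline = {"accuracy": 0.95, "faithfulness": 0.88}
candidate = {"accuracy": 0.80, "faithfulness": 0.89}  # accuracy regressed 15 points

failures = regression_check(baseline, candidate)
print(failures)  # flags the accuracy drop; faithfulness passes
```

A CI job would fail the build when `failures` is non-empty, forcing a roll-back or another iteration before deploy.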
Key Concepts#
LLM-as-Judge: Using AI to Grade AI#
How it works:
Evaluator LLM receives:
- User question: "What is photosynthesis?"
- Model answer: "Photosynthesis is when plants make energy from sunlight"
- Rubric: "Score 1-5 for accuracy, completeness, clarity"
Evaluator outputs:
- Accuracy: 4/5 (correct but simplified)
- Completeness: 3/5 (missing chlorophyll, chemical equation)
- Clarity: 5/5 (very clear for a beginner)
Advantages:
- Understands paraphrases: “automobile” = “car” = “vehicle”
- Scales to thousands of evaluations
- Can judge subjective qualities (tone, helpfulness)
Limitations:
- Costs money (API calls)
- Judge has biases (prefers certain writing styles)
- Can be fooled: Outputs that “sound good” but are factually wrong
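The judge mechanics above can be sketched without any particular vendor SDK. This only assembles the judge prompt and parses a structured reply; `build_judge_prompt` and `parse_judge_reply` are illustrative names, and a real system would send the prompt to a judge model instead of using a mock reply:

```python
# Assemble an LLM-as-judge request and validate the structured response.
import json

RUBRIC = "Score 1-5 for accuracy, completeness, clarity. Reply as JSON."

def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    return (
        "You are an impartial grader.\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        f"Rubric: {rubric}"
    )

def parse_judge_reply(reply: str) -> dict:
    scores = json.loads(reply)
    assert all(1 <= v <= 5 for v in scores.values()), "scores out of range"
    return scores

prompt = build_judge_prompt(
    "What is photosynthesis?",
    "Photosynthesis is when plants make energy from sunlight",
)
# A real system would send `prompt` to the judge model; we parse a mock reply.
mock_reply = '{"accuracy": 4, "completeness": 3, "clarity": 5}'
print(parse_judge_reply(mock_reply))
```

Validating the judge's reply (parseable JSON, scores in range) matters in practice: judges occasionally return prose instead of the requested format.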
Example of judge bias:
Answer A: "Photosynthesis converts CO₂ and H₂O into glucose using light"
(Accurate but dry)
Answer B: "Plants are nature's solar panels, transforming sunlight into
delicious energy that fuels all life on Earth!"
(Engaging but imprecise)
Some judges prefer A (precision), others prefer B (engagement)
Self-Explaining Metrics#
Problem with black-box scores:
Faithfulness: 0.4
Why 0.4? Which claims weren’t grounded? Unclear.
Self-explaining metrics:
Faithfulness: 0.4
Reason: The response claims "revenue increased 50%" but the retrieved
context only states "revenue showed growth." The specific percentage
is not supported by the documents. This is a hallucination.
Recommendation: Remove unsupported statistics or retrieve quarterly
reports with exact figures.
Analogy: Teacher grading essays
Bad feedback: “C+ See me after class”
Good feedback: “C+ Your thesis is unclear (see paragraph 1), and you didn’t cite sources for your main claim (paragraph 3). Strengthen these for a B.”
Self-explaining metrics are like good teachers—they show you exactly what’s wrong and how to fix it.
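A self-explaining metric can be modeled as a (score, reason) pair. A toy sketch, using strict word containment as a stand-in for real claim verification (the dataclass and function names are illustrative):

```python
# A self-explaining faithfulness check: any sentence whose words are not
# all present in the context is reported as unsupported, and the reason
# names the offending claims instead of returning a bare number.
from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float
    reason: str

def explained_faithfulness(answer: str, context: str) -> MetricResult:
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    unsupported = [
        s for s in sentences if not set(s.lower().split()) <= ctx_words
    ]
    score = 1.0 - len(unsupported) / max(len(sentences), 1)
    reason = (
        "All claims overlap the context."
        if not unsupported
        else f"Unsupported claims: {unsupported}"
    )
    return MetricResult(score, reason)

result = explained_faithfulness(
    "Revenue increased 50%. Revenue showed growth",
    "the quarterly report states revenue showed growth",
)
print(result.score, result.reason)  # flags the unsupported "50%" claim
```

The reason string is the actionable part: it points directly at "Revenue increased 50%", mirroring the hallucinated-percentage example above.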
Evaluation vs Observability#
Analogy: Car dashboard
Observability = Speed, fuel, engine temperature
- “Is the car running?”
- Real-time monitoring
- Alerts when something breaks
Evaluation = Crash test ratings, fuel efficiency tests
- “Is the car safe and efficient?”
- Quality benchmarks
- Regression testing before new model releases
| Aspect | Evaluation | Observability |
|---|---|---|
| Question | “Is the output good?” | “Is the system healthy?” |
| When | Development, CI/CD, batch | Production, real-time |
| Metrics | Faithfulness, relevance, accuracy | Latency, errors, cost |
| Tools | DeepEval, Ragas | LangSmith, Datadog |
You need both:
- Observability catches outages: “API is down, 500 errors”
- Evaluation catches quality degradation: “Accuracy dropped 20% after prompt change”
Common Patterns#
The Test Suite Pattern#
Analogy: Regression testing in traditional software
Create a “golden dataset” of curated test cases:
Test Case 1:
Input: "What is the capital of France?"
Expected: "Paris"
Test Case 2:
Input: "Who was the first president of the United States?"
Expected: "George Washington"
Test Case 3 (Edge case):
Input: "What is the capital of the moon?"
Expected: "The moon has no capital" or "No government on the moon"
Run on every change:
Before prompt change: 95% accuracy
After prompt change: 92% accuracy
Investigate: Which 3% broke? Why?
Coverage types:
- Happy path: Normal questions that should always work
- Edge cases: Unusual questions, ambiguous phrasing
- Adversarial: Trick questions, jailbreak attempts
- Domain-specific: Industry jargon, technical terms
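The pattern can be expressed as an ordinary pytest-style test. A sketch: `fake_model` stands in for the real system under test, and the substring grader is a deliberately crude stand-in for semantic scoring:

```python
# Golden-dataset regression suite: each case pairs an input with one or
# more acceptable answers, and the grader accepts any of them.

GOLDEN = [
    ("What is the capital of France?", ["Paris"]),
    ("Who was the first president of the United States?",
     ["George Washington"]),
    # Edge case: multiple acceptable refusals
    ("What is the capital of the moon?",
     ["The moon has no capital", "No government on the moon"]),
]

def fake_model(question: str) -> str:
    # Stand-in for the real LLM call.
    canned = {
        "What is the capital of France?": "Paris",
        "Who was the first president of the United States?":
            "George Washington",
        "What is the capital of the moon?": "The moon has no capital",
    }
    return canned[question]

def accuracy(model) -> float:
    hits = sum(
        any(expected.lower() in model(q).lower() for expected in answers)
        for q, answers in GOLDEN
    )
    return hits / len(GOLDEN)

def test_no_regression():
    assert accuracy(fake_model) >= 0.95  # gate: fail CI below 95%

test_no_regression()
print(accuracy(fake_model))
```

Swapping `fake_model` for the real pipeline and the substring check for an LLM judge gives the production version of the same suite.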
The A/B Testing Pattern#
Analogy: Website A/B testing, but for prompts
Scenario: Which prompt is better?
Prompt A: "Answer this question concisely: {question}"
Prompt B: "You are a helpful assistant. Provide a detailed answer to: {question}"
Test on 100 questions:
- Prompt A: Conciseness 95%, Completeness 70%
- Prompt B: Conciseness 60%, Completeness 90%
Decision: Use A for quick lookups, B for research questions
Metrics to compare:
- Accuracy
- Relevance
- Latency (response time)
- Cost (tokens used)
- User satisfaction (if you have feedback)
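Aggregating the per-question scores into a comparison table is straightforward. A sketch, where `results` would come from running both prompts over the same question set (the numbers are illustrative):

```python
# Summarize per-prompt metric scores for an A/B comparison.
from statistics import mean

results = {
    "prompt_a": [{"conciseness": 0.95, "completeness": 0.70},
                 {"conciseness": 0.93, "completeness": 0.72}],
    "prompt_b": [{"conciseness": 0.60, "completeness": 0.90},
                 {"conciseness": 0.62, "completeness": 0.88}],
}

def summarize(runs: list) -> dict:
    """Mean of each metric across all test questions, rounded for display."""
    metrics = runs[0].keys()
    return {m: round(mean(r[m] for r in runs), 2) for m in metrics}

for prompt, runs in results.items():
    print(prompt, summarize(runs))
```

In practice you would also track latency and token cost per run, since a prompt that scores slightly higher but costs twice as much may still lose the comparison.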
The Continuous Monitoring Pattern#
Analogy: Canary deployment + health checks
In production:
- Sample 1% of traffic for evaluation
- Run evaluations asynchronously (don’t slow down responses)
- Alert if scores drop below threshold
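A minimal sketch of this sample-and-alert loop (names like `record` and `maybe_evaluate` are illustrative, and a real system would push scores to a metrics backend and pager rather than print):

```python
# Evaluate ~1% of production traffic and alert when the rolling average
# of a quality metric drops below a threshold.
import random
from collections import deque

SAMPLE_RATE = 0.01   # evaluate 1% of traffic
THRESHOLD = 0.75     # alert below this rolling faithfulness
window = deque(maxlen=500)  # rolling window of recent sampled scores

def record(score: float) -> bool:
    """Add one sampled score; True means the rolling average breached."""
    window.append(score)
    return sum(window) / len(window) < THRESHOLD

def maybe_evaluate(response: str, score_fn) -> None:
    # Sampling keeps cost low; real systems also run this asynchronously
    # so evaluation never slows down user-facing responses.
    if random.random() < SAMPLE_RATE:
        if record(score_fn(response)):
            print(f"ALERT: rolling faithfulness below {THRESHOLD}")
```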
Example:
Day 1-10: Average faithfulness = 0.85
Day 11: Average faithfulness = 0.65
Alert: "Faithfulness dropped 20%. Recent changes:
- Model updated from GPT-4 to GPT-4-turbo
- New embedding model deployed
Investigation needed."
Common Misconceptions#
“Evaluation is just testing”#
Traditional testing: Input → Code → Output (deterministic)
- Test: add(2, 3) == 5 (exact match)
LLM evaluation: Input → LLM → Output (probabilistic)
- Test: summarize(article) ≈ “good summary” (fuzzy match)
- Multiple correct answers
- Subjective quality judgments
“Higher scores always mean better quality”#
Counterexample: Optimizing for the wrong metric
Prompt optimized for BLEU score:
"The cat sat on the mat. The mat was sat on by the cat."
(Repetitive, awkward, but high n-gram overlap)
Prompt optimized for relevance:
"The cat rested on the mat."
(Natural, concise, lower BLEU but better quality)
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
Use multiple metrics to avoid gaming single metrics.
“One tool does everything”#
Reality: Most teams use 2-3 tools
- DeepEval: 60+ metrics, self-explaining, general-purpose
- Ragas: RAG-specific (retrieval quality)
- PromptFoo: Red teaming, security testing
- LangSmith: Observability + basic evaluation
Analogy: Software development uses multiple tools:
- Jest (unit tests)
- Cypress (E2E tests)
- Datadog (monitoring)
- Sentry (error tracking)
LLM evaluation is the same—different tools for different needs.
When to Invest in Evaluation#
Minimal (just starting)#
- Manual spot checks (review 10-20 outputs)
- Basic programmatic checks (JSON validity, length limits)
- ~20-50 test cases
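The "basic programmatic checks" mentioned above can be plain code run on every output before anything reaches a human or an LLM judge. A sketch (the specific checks and the 2000-character limit are illustrative):

```python
# Cheap deterministic checks: valid JSON, length limits, no leaked
# boilerplate. These run in microseconds and cost nothing per call.
import json

def programmatic_checks(output: str, max_chars: int = 2000) -> list:
    problems = []
    if len(output) > max_chars:
        problems.append("too long")
    try:
        json.loads(output)
    except json.JSONDecodeError:
        problems.append("invalid JSON")
    if "As an AI language model" in output:
        problems.append("boilerplate disclaimer")
    return problems

print(programmatic_checks('{"answer": "Your order ships in 3-5 business days"}'))  # []
print(programmatic_checks("not json at all"))  # ['invalid JSON']
```

Outputs with a non-empty problem list get flagged for review; everything else proceeds, which is the "health inspection" layer from the restaurant analogy.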
Moderate (production app)#
- Automated test suite in CI/CD
- LLM-as-judge for key metrics
- 100-500 test cases
- Basic dashboard
Comprehensive (critical application)#
- Continuous production evaluation (sample traffic)
- Multiple metric coverage (accuracy, safety, quality)
- 1,000+ test cases
- Regression alerts
- Human-in-the-loop for edge cases
Budget guidance:
| Scale | Evaluations/month | Tool cost | Engineering time |
|---|---|---|---|
| Minimal | <1,000 | $0-50 | 1 week setup |
| Moderate | 1,000-50,000 | $50-500 | 2-4 weeks |
| Comprehensive | >50,000 | $500-2,000 | Ongoing investment |
Cost Considerations#
Evaluation has direct costs:
Example: Customer support chatbot, 10,000 questions/day
Option 1: Human review
10,000 × 2 min/review = 333 hours/day
333 hours × $30/hour = $10,000/day
Annual cost: $3.6M
Option 2: LLM-as-Judge (GPT-4)
10,000 × $0.03/eval = $300/day
Annual cost: $110K (97% savings)
Option 3: LLM-as-Judge (GPT-3.5)
10,000 × $0.002/eval = $20/day
Annual cost: $7.3K (99.8% savings)
Option 4: Programmatic + sampling
Programmatic checks: Free (CPU only)
Human review of flagged 1%: 100 × 2 min × $30/hour = $100/day
Annual cost: $36K (99% savings)
Cost optimization:
- Use cheap models (GPT-3.5) for initial filtering
- Use expensive models (GPT-4) for edge cases
- Cache evaluation results for identical outputs
- Sample production traffic instead of evaluating everything
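A back-of-envelope cost model for the tiered approach: a cheap judge scores everything and a fraction escalates to a stronger judge. The per-eval prices are the illustrative figures from this section, and the 5% escalation rate is an assumption:

```python
# Tiered evaluation cost: cheap judge on all outputs, strong judge only
# on the fraction that gets flagged for escalation.

def daily_cost(volume: int,
               cheap_price: float = 0.002,    # e.g. GPT-3.5-class judge
               strong_price: float = 0.03,    # e.g. GPT-4-class judge
               escalation_rate: float = 0.05) -> float:
    cheap = volume * cheap_price
    strong = volume * escalation_rate * strong_price
    return cheap + strong

# 10k evals/day: $20 on the cheap tier + $15 on escalations
print(f"${daily_cost(10_000):.2f}/day")
```

Caching results for identical outputs and sampling traffic reduce `volume` directly, which is why those two optimizations usually come first.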
Glossary#
Chain-of-Thought (CoT): Prompting technique where the model shows reasoning steps before answering. Improves both generation and evaluation accuracy.
Faithfulness/Groundedness: Is the answer supported by the provided context? (Not hallucinated)
Hallucination: When an LLM generates information not present in its context or training data.
Golden Dataset: Curated test cases with human-verified expected outputs or scores.
RAG (Retrieval-Augmented Generation): Pattern where an LLM is given retrieved documents as context before generating an answer.
RAG Triad: Three core metrics for RAG evaluation: context relevance, faithfulness, answer relevance.
Rubric: Scoring criteria provided to an LLM judge explaining how to rate responses.
Self-Explaining Metric: Evaluation metric that provides both a score AND an explanation of why that score was given.
Synthetic Data: Machine-generated test cases, often used to expand test coverage cheaply.
Related Research:
- 1.111 (State Management)
- 1.113 (UI Components)
- 3.205 (Pronunciation Assessment - similar evaluation challenges)
Last Updated: 2026-02-02
S1: Rapid Discovery
S1 Synthesis: LLM Evaluation & Testing Frameworks#
Executive Summary#
The LLM evaluation landscape has matured significantly, with clear tool differentiation by use case. DeepEval emerges as the most comprehensive open-source option, while Ragas leads for RAG-specific evaluation. PromptFoo excels at quick iterations and security testing, LangSmith dominates observability, and TruLens offers OpenTelemetry-native tracing.
Comparison Matrix#
| Tool | Focus | Metrics | Pricing | Best For |
|---|---|---|---|---|
| DeepEval | Comprehensive | 60+ | Free + Enterprise | CI/CD, full coverage |
| PromptFoo | Prompt testing | Basic | Free | Quick iterations, red team |
| LangSmith | Observability | Custom | $39/seat+ | LangChain users, tracing |
| Ragas | RAG-specific | 5 core | Free | RAG pipelines |
| TruLens | Feedback functions | Extensible | Free | OTel users, custom evals |
Decision Framework#
Choose DeepEval when:#
- Need comprehensive metric coverage (RAG, agents, safety, multimodal)
- Want CI/CD integration with pytest-style tests
- Require self-explaining metrics for debugging
- Building production systems with regression detection
Choose PromptFoo when:#
- Doing rapid prompt engineering iterations
- Need security/red team testing
- Prefer YAML config over code
- Want lightweight CLI tool without SDK
Choose LangSmith when:#
- Using LangChain/LangGraph
- Need production observability + evaluation
- Want unified tracing and testing platform
- Have budget for commercial tooling
Choose Ragas when:#
- Evaluating RAG systems specifically
- Want lower-cost RAG metrics (vs LLM-as-judge)
- Need quick integration, pandas-like workflow
- Don’t need general LLM evaluation
Choose TruLens when:#
- Already using OpenTelemetry
- In Snowflake ecosystem
- Need custom feedback functions
- Want extensible evaluation framework
Common Stack Patterns#
Full Coverage Stack#
DeepEval + Ragas + PromptFoo
- DeepEval: Comprehensive metrics, CI/CD backbone
- Ragas: RAG-specific depth when retrieval quality matters
- PromptFoo: Security validation, red teaming
Lightweight Stack#
Ragas + PromptFoo
- Lower overhead for RAG-focused applications
- Good for teams not needing 60+ metrics
Enterprise Stack#
LangSmith + DeepEval
- Observability + comprehensive evaluation
- Best for LangChain-based production systems
Key Insights#
- No single tool covers everything - Most teams combine 2-3 tools
- DeepEval has widest metric coverage (60+) with self-explanation
- Ragas popularized RAG-Triad-style metrics - still best for retrieval-focused eval
- PromptFoo leads red teaming - best for security testing
- LangSmith = observability-first - evaluation is secondary
- TruLens differentiator - OpenTelemetry native, extensible
Cost Considerations#
| Tool | Free Tier | Paid Trigger |
|---|---|---|
| DeepEval | Full OSS | Enterprise features |
| PromptFoo | Full OSS | Hosted dashboard |
| LangSmith | Limited | Team collaboration |
| Ragas | Full OSS | N/A |
| TruLens | Full OSS | N/A |
Sources#
- DeepEval Alternatives Compared
- LLM Evaluation Frameworks Comparison
- Top LLM Evaluation Tools 2025
- TruLens Documentation
- PromptFoo Docs
- LangSmith Docs
DeepEval#
Overview#
- Type: Open-source Python framework
- License: Apache 2.0
- Adoption: 400k+ monthly downloads
- Focus: Comprehensive LLM evaluation (“Pytest for LLMs”)
Key Features#
- 60+ metrics: Prompt, RAG, chatbot, safety, multimodal
- Self-explaining metrics: Tells you WHY scores are low
- Pytest integration: Familiar unit-test interface
- CI/CD native: Built for continuous deployment workflows
- Safety testing: Red teaming, toxicity detection
Metric Categories#
- RAG: Faithfulness, contextual relevancy, answer relevancy
- Conversational: Coherence, engagement, knowledge retention
- Safety: Bias, toxicity, PII leakage, jailbreak detection
- Agentic: Tool use, task completion, reasoning
Enterprise Platform (Confident AI)#
- Cloud dashboard for team collaboration
- Dataset curation and annotation
- Production monitoring
- Regression detection
Limitations#
- Python-only (no JS/CLI-first option)
- Enterprise features require Confident AI platform
- Can be overkill for simple prompt testing
Best For#
- Teams needing comprehensive evaluation coverage
- CI/CD integration with automated testing
- Production monitoring and regression detection
- Multi-pattern evaluation (RAG, agents, chatbots)
Installation#
pip install deepeval
Pricing#
- Open-source: Free
- Confident AI: Free tier + paid plans for enterprise
LangSmith#
Overview#
- Type: Commercial SaaS platform
- Company: LangChain Inc.
- Focus: Tracing, observability, and evaluation for LLM apps
Key Features#
- Detailed tracing: Visibility into every execution step
- Dataset management: Create/organize test data
- Multiple evaluator types: Code-based, LLM-as-judge, composite
- Experiment tracking: Compare results across test runs
- Framework agnostic: Works with or without LangChain
Integration#
- Seamless with LangChain and LangGraph
- Python and TypeScript SDKs
- REST API for custom integrations
- No LangChain dependency required
Evaluation Capabilities#
- Custom evaluation logic
- Prebuilt assessment tools
- Quality tracking over time
- Consistency validation
Limitations#
- Commercial product (not fully open-source)
- Tracing-first, evaluation second
- Tighter integration with LangChain ecosystem
- Pricing can scale with usage
Best For#
- LangChain/LangGraph users
- Teams needing production observability
- Debugging complex multi-step chains
- Organizations wanting unified tracing + eval
Pricing#
- Developer: Free tier with limits
- Plus: $39/seat/month
- Enterprise: Custom pricing
PromptFoo#
Overview#
- Type: Open-source CLI and library
- License: MIT
- Community: 51,000+ developers
- Focus: Prompt testing, A/B testing, red teaming
Key Features#
- CLI-first: Simple command-line interface, no cloud required
- YAML configuration: Declarative test case definition
- Side-by-side comparison: Diff views for prompt variations
- Red teaming: Automated security testing (injections, toxic content)
- CI/CD ready: Integrates into deployment pipelines
Supported Providers#
- OpenAI, Anthropic, Azure, Google, HuggingFace
- Open-source models (Llama, etc.)
- Custom API providers
Evaluation Capabilities#
- Basic RAG metrics
- Safety/security testing
- LLM-as-judge evaluations
- Custom assertion logic
Limitations#
- Limited metric set compared to DeepEval (basic RAG + safety only)
- YAML-heavy workflow harder to customize at scale
- Less comprehensive than code-first alternatives
Best For#
- Quick prompt iterations
- Security/red team testing
- Teams preferring declarative config over code
- Lightweight experimentation without SDK dependencies
Installation#
npm install -g promptfoo
# or
npx promptfoo@latest
Pricing#
- Open-source: Free, self-hosted
- Cloud: Optional hosted dashboard
Ragas (Retrieval-Augmented Generation Assessment)#
Overview#
- Type: Open-source Python library
- License: Apache 2.0
- Focus: RAG-specific evaluation metrics
Key Features#
- RAG Triad: Structured evaluation framework
- Lightweight: Easy integration, pandas-like workflow
- Reference-free: No ground truth required
- Benchmarked: Against LLM-AggreFact, TREC-DL, HotPotQA
Core Metrics (first three form the RAG Triad)#
- Faithfulness: How accurately the answer reflects the retrieved evidence
- Context Relevancy: How relevant the retrieved docs are to the query
- Answer Relevancy: How relevant the answer is to the user question
- Context Recall: Coverage of relevant information
- Context Precision: Signal-to-noise in retrieved context
Extended Capabilities#
- Agentic workflow metrics
- Tool use evaluation
- SQL evaluation
- Multimodal faithfulness
- Noise sensitivity testing
Limitations#
- Metrics somewhat opaque (not self-explanatory)
- RAG-focused, not general LLM evaluation
- Need to combine with other tools for full coverage
- Lower metric count than DeepEval
Best For#
- RAG pipeline evaluation specifically
- Teams wanting targeted retrieval metrics
- Lower-cost alternative to LLM-as-judge for RAG
- Quick integration with existing RAG systems
Installation#
pip install ragas
Pricing#
- Open-source: Free
TruLens#
Overview#
- Type: Open-source Python library
- License: MIT
- Maintainer: Snowflake (acquired TruEra)
- Focus: Feedback functions and tracing for LLM apps
Key Features#
- Feedback functions: Programmatic evaluation without ground truth
- RAG Triad pioneer: Original structured RAG evaluation framework
- OpenTelemetry support: Interoperable observability
- Extensible: Custom feedback function framework
- Provider integrations: OpenAI, HuggingFace, LiteLLM, LangChain
Feedback Function Types#
- Generation-based: LLM-as-judge with rubrics
- Custom logic: Tailored evaluation tasks
- Chain-of-thought: Optional reasoning traces
- Few-shot: Example-guided evaluation
Tracing Capabilities#
- OpenTelemetry (OTel) native
- Integrates with existing observability stack
- Detailed execution traces
- Performance monitoring
Supported Use Cases#
- Question-answering
- Summarization
- RAG systems
- Agent-based applications
Limitations#
- Snowflake acquisition may affect roadmap
- Overlaps with Ragas on RAG evaluation
- Less comprehensive metrics than DeepEval
- Community-driven, less commercial support
Best For#
- Teams already using OpenTelemetry
- Snowflake ecosystem users
- Custom feedback function needs
- RAG evaluation with extensibility
Installation#
pip install trulens
Pricing#
- Open-source: Free