1.205 LLM Evaluation & Testing Frameworks#
Explainer
LLM Evaluation: Domain Explainer#
Universal Analogies#
Quality Control for AI Outputs#
Analogy: Factory Quality Inspector vs AI Quality Inspector
Traditional software testing is like inspecting widgets on an assembly line:
- Widget either fits the spec (pass) or doesn’t (fail)
- Same input → same output, every time
- Clear pass/fail criteria
LLM evaluation is like judging creative writing:
- Many “correct” answers exist for the same prompt
- Same prompt → different outputs each time
- Quality is subjective and context-dependent
Example:
Prompt: "Summarize this article in 3 sentences"
Valid Summary A: "The study found X. Researchers discovered Y. This suggests Z."
Valid Summary B: "Research shows X is correlated with Y. The implications are Z."
Both are correct, but:
- Different phrasing
- Different emphasis
- Different completeness
An LLM evaluator must understand semantic equivalence (these mean the same thing despite different words) rather than just exact matching.
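A toy sketch of why exact matching fails here, using Jaccard word overlap as a crude stand-in for the embedding or LLM-judge similarity a real evaluator would use (function names are illustrative):

```python
# Exact-match scoring rejects valid paraphrases outright, while even a
# crude token-overlap score recognizes shared content between two
# differently worded summaries.

def exact_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

summary_a = "The study found X. Researchers discovered Y. This suggests Z."
summary_b = "Research shows X is correlated with Y. The implications are Z."

print(exact_match(summary_a, summary_b))    # False: rejects a valid paraphrase
print(token_overlap(summary_a, summary_b))  # > 0: sees the shared content
```

Production systems replace `token_overlap` with embedding cosine similarity or an LLM judge; the point is only that the scorer must tolerate surface variation.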
The Restaurant Review Problem#
Analogy: How do you know if a restaurant is good?
Option 1: Count stars (like BLEU/ROUGE scores)
- Fast, cheap, scalable
- But: Doesn’t explain WHY it got 3 stars
- Misses context: “3 stars for fine dining” ≠ “3 stars for pizza”
Option 2: Professional food critic (like LLM-as-Judge)
- Understands nuance, context, and subjective quality
- Provides detailed explanations
- But: Expensive, has personal biases
Option 3: Health inspection (like Programmatic checks)
- Binary checks: Does food have correct temperature? Is kitchen clean?
- Catches specific, predictable problems
- But: Doesn’t evaluate taste, creativity, or overall quality
Best practice: Use all three. Health inspection for safety (programmatic), star rating for quick filtering (metrics), and food critic for nuanced evaluation (LLM-as-judge).
The RAG Triad: Research Paper Analogy#
Retrieval-Augmented Generation (RAG) is like writing a research paper:
Your Question → Library Search → Retrieved Books → Your Essay
(Query) (Vector Search) (Context) (Answer)
Three quality checks:
1. Context Relevance = “Did you check out the right books?”
- Question: “How does photosynthesis work?”
- Good retrieval: Botany textbooks, plant biology papers
- Bad retrieval: Economics journals, cooking recipes
2. Faithfulness/Groundedness = “Did you cite your sources correctly?”
- Good: “According to Smith (2020), photosynthesis converts light into energy”
- Bad: “Photosynthesis was invented in 1872” (not in any source)
- Problem: Hallucination = making up citations or facts
3. Answer Relevance = “Did you actually answer the question?”
- Question: “How does photosynthesis work?”
- Good: Explains the process step-by-step
- Bad: “Photosynthesis is important” (true but doesn’t answer HOW)
Debugging with the Triad:
- Low context relevance → Fix your search/embeddings
- Low faithfulness → Model is hallucinating, prompt engineering needed
- Low answer relevance → Prompt doesn’t guide model well
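The triad can be sketched in plain Python, with word overlap standing in for the embedding- or LLM-based scoring that real frameworks use (all function names and the example strings are illustrative):

```python
# Toy RAG-triad checks. Each function mirrors one leg of the triad:
# context relevance, faithfulness, answer relevance.

def _overlap(a: str, b: str) -> float:
    """Fraction of a's words that also appear in b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def context_relevance(question: str, context: str) -> float:
    # Did retrieval fetch documents related to the question?
    return _overlap(question, context)

def faithfulness(answer: str, context: str) -> float:
    # Is the answer's content present in the retrieved context?
    return _overlap(answer, context)

def answer_relevance(question: str, answer: str) -> float:
    # Does the answer address the question that was asked?
    return _overlap(question, answer)

question = "how does photosynthesis work"
context = "photosynthesis converts light into chemical energy in plants"
answer = "photosynthesis converts light into chemical energy"

scores = {
    "context_relevance": context_relevance(question, context),
    "faithfulness": faithfulness(answer, context),
    "answer_relevance": answer_relevance(question, answer),
}
print(scores)
```

Here faithfulness comes out at 1.0 (every answer word is grounded in the context), while the two relevance scores are low because the toy overlap cannot see that "how does ... work" is answered by a process description; this is exactly the gap that embedding similarity or an LLM judge closes.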
What Problem Does LLM Evaluation Solve?#
The Scale Problem#
Scenario: You’re building a customer support chatbot answering 10,000 questions/day.
Without evaluation:
- How do you know if answers are accurate?
- Manual review = 10,000 answers × 2 min/review = 333 hours/day (impossible)
- Launch blind, hope for the best, fix angry customer complaints
With evaluation:
- Automated metrics score every answer
- Flag low-scoring answers for human review
- Catch quality regressions before customers do
- Measure improvement over time
Real example:
Answer A: "Your order ships in 3-5 business days"
Answer B: "I don't have access to shipping information"
Answer C: "Your package left our facility yesterday and should arrive Tuesday"
Evaluation metrics:
- Relevance: Does it answer the question? (A=90%, B=40%, C=95%)
- Faithfulness: Is it grounded in retrieved data? (A=80%, B=90%, C=95%)
- Completeness: Did it address all aspects? (A=70%, B=30%, C=90%)
Verdict: C is best (most complete, most accurate, most helpful)
The Drift Problem#
Analogy: Software bit rot, but for AI
Traditional software: Code doesn’t change → same input = same output forever
LLMs drift over time:
- Model updates (GPT-4 → GPT-4.5)
- Prompt modifications
- Context window changes
- Training data shifts
Without continuous evaluation:
- You update your prompt to fix one edge case
- Accidentally break 15 other cases
- No one notices until production breaks
With continuous evaluation:
- Test suite runs on every prompt change
- Regression detected immediately: “New prompt scores 15% lower on accuracy”
- Roll back or iterate before deploying
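The regression gate a CI pipeline runs on every prompt change can be as small as a score comparison against a stored baseline. A sketch (metric names and the 5-point tolerance are illustrative):

```python
# Compare a candidate prompt's evaluation scores against a stored baseline
# and report any metric that dropped more than the tolerance.

TOLERANCE = 0.05  # allow up to a 5-point absolute drop before failing

def regression_check(baseline: dict, candidate: dict, tol: float = TOLERANCE):
    failures = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric, 0.0)
        if base_score - cand_score > tol:
            failures.append(f"{metric}: {base_score:.2f} -> {cand_score:.2f}")
    return failures

baseline = {"accuracy": 0.95, "faithfulness": 0.88}
candidate = {"accuracy": 0.80, "faithfulness": 0.89}  # accuracy regressed 15 points

failures = regression_check(baseline, candidate)
print(failures)  # flags the accuracy drop; faithfulness passes
```

A CI job would fail the build when `failures` is non-empty, forcing a roll-back or another iteration before deploy.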
Key Concepts#
LLM-as-Judge: Using AI to Grade AI#
How it works:
Evaluator LLM receives:
- User question: "What is photosynthesis?"
- Model answer: "Photosynthesis is when plants make energy from sunlight"
- Rubric: "Score 1-5 for accuracy, completeness, clarity"
Evaluator outputs:
- Accuracy: 4/5 (correct but simplified)
- Completeness: 3/5 (missing chlorophyll, chemical equation)
- Clarity: 5/5 (very clear for a beginner)
Advantages:
- Understands paraphrases: “automobile” = “car” = “vehicle”
- Scales to thousands of evaluations
- Can judge subjective qualities (tone, helpfulness)
Limitations:
- Costs money (API calls)
- Judge has biases (prefers certain writing styles)
- Can be fooled: Outputs that “sound good” but are factually wrong
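The judge mechanics above can be sketched without any particular vendor SDK. This only assembles the judge prompt and parses a structured reply; `build_judge_prompt` and `parse_judge_reply` are illustrative names, and a real system would send the prompt to a judge model instead of using a mock reply:

```python
# Assemble an LLM-as-judge request and validate the structured response.
import json

RUBRIC = "Score 1-5 for accuracy, completeness, clarity. Reply as JSON."

def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    return (
        "You are an impartial grader.\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        f"Rubric: {rubric}"
    )

def parse_judge_reply(reply: str) -> dict:
    scores = json.loads(reply)
    assert all(1 <= v <= 5 for v in scores.values()), "scores out of range"
    return scores

prompt = build_judge_prompt(
    "What is photosynthesis?",
    "Photosynthesis is when plants make energy from sunlight",
)
# A real system would send `prompt` to the judge model; we parse a mock reply.
mock_reply = '{"accuracy": 4, "completeness": 3, "clarity": 5}'
print(parse_judge_reply(mock_reply))
```

Validating the judge's reply (parseable JSON, scores in range) matters in practice: judges occasionally return prose instead of the requested format.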
Example of judge bias:
Answer A: "Photosynthesis converts CO₂ and H₂O into glucose using light"
(Accurate but dry)
Answer B: "Plants are nature's solar panels, transforming sunlight into
delicious energy that fuels all life on Earth!"
(Engaging but imprecise)
Some judges prefer A (precision), others prefer B (engagement)
Self-Explaining Metrics#
Problem with black-box scores:
Faithfulness: 0.4
Why 0.4? Which claims weren’t grounded? Unclear.
Self-explaining metrics:
Faithfulness: 0.4
Reason: The response claims "revenue increased 50%" but the retrieved
context only states "revenue showed growth." The specific percentage
is not supported by the documents. This is a hallucination.
Recommendation: Remove unsupported statistics or retrieve quarterly
reports with exact figures.
Analogy: Teacher grading essays
Bad feedback: “C+ See me after class”
Good feedback: “C+ Your thesis is unclear (see paragraph 1), and you didn’t cite sources for your main claim (paragraph 3). Strengthen these for a B.”
Self-explaining metrics are like good teachers—they show you exactly what’s wrong and how to fix it.
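A self-explaining metric can be modeled as a (score, reason) pair. A toy sketch, using strict word containment as a stand-in for real claim verification (the dataclass and function names are illustrative):

```python
# A self-explaining faithfulness check: any sentence whose words are not
# all present in the context is reported as unsupported, and the reason
# names the offending claims instead of returning a bare number.
from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float
    reason: str

def explained_faithfulness(answer: str, context: str) -> MetricResult:
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    unsupported = [
        s for s in sentences if not set(s.lower().split()) <= ctx_words
    ]
    score = 1.0 - len(unsupported) / max(len(sentences), 1)
    reason = (
        "All claims overlap the context."
        if not unsupported
        else f"Unsupported claims: {unsupported}"
    )
    return MetricResult(score, reason)

result = explained_faithfulness(
    "Revenue increased 50%. Revenue showed growth",
    "the quarterly report states revenue showed growth",
)
print(result.score, result.reason)  # flags the unsupported "50%" claim
```

The reason string is the actionable part: it points directly at "Revenue increased 50%", mirroring the hallucinated-percentage example above.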
Evaluation vs Observability#
Analogy: Car dashboard
Observability = Speed, fuel, engine temperature
- “Is the car running?”
- Real-time monitoring
- Alerts when something breaks
Evaluation = Crash test ratings, fuel efficiency tests
- “Is the car safe and efficient?”
- Quality benchmarks
- Regression testing before new model releases
| Aspect | Evaluation | Observability |
|---|---|---|
| Question | “Is the output good?” | “Is the system healthy?” |
| When | Development, CI/CD, batch | Production, real-time |
| Metrics | Faithfulness, relevance, accuracy | Latency, errors, cost |
| Tools | DeepEval, Ragas | LangSmith, Datadog |
You need both:
- Observability catches outages: “API is down, 500 errors”
- Evaluation catches quality degradation: “Accuracy dropped 20% after prompt change”
Common Patterns#
The Test Suite Pattern#
Analogy: Regression testing in traditional software
Create a “golden dataset” of curated test cases:
Test Case 1:
Input: "What is the capital of France?"
Expected: "Paris"
Test Case 2:
Input: "Who was the first president of the United States?"
Expected: "George Washington"
Test Case 3 (Edge case):
Input: "What is the capital of the moon?"
Expected: "The moon has no capital" or "No government on the moon"
Run on every change:
Before prompt change: 95% accuracy
After prompt change: 92% accuracy
Investigate: Which 3% broke? Why?
Coverage types:
- Happy path: Normal questions that should always work
- Edge cases: Unusual questions, ambiguous phrasing
- Adversarial: Trick questions, jailbreak attempts
- Domain-specific: Industry jargon, technical terms
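The pattern can be expressed as an ordinary pytest-style test. A sketch: `fake_model` stands in for the real system under test, and the substring grader is a deliberately crude stand-in for semantic scoring:

```python
# Golden-dataset regression suite: each case pairs an input with one or
# more acceptable answers, and the grader accepts any of them.

GOLDEN = [
    ("What is the capital of France?", ["Paris"]),
    ("Who was the first president of the United States?",
     ["George Washington"]),
    # Edge case: multiple acceptable refusals
    ("What is the capital of the moon?",
     ["The moon has no capital", "No government on the moon"]),
]

def fake_model(question: str) -> str:
    # Stand-in for the real LLM call.
    canned = {
        "What is the capital of France?": "Paris",
        "Who was the first president of the United States?":
            "George Washington",
        "What is the capital of the moon?": "The moon has no capital",
    }
    return canned[question]

def accuracy(model) -> float:
    hits = sum(
        any(expected.lower() in model(q).lower() for expected in answers)
        for q, answers in GOLDEN
    )
    return hits / len(GOLDEN)

def test_no_regression():
    assert accuracy(fake_model) >= 0.95  # gate: fail CI below 95%

test_no_regression()
print(accuracy(fake_model))
```

Swapping `fake_model` for the real pipeline and the substring check for an LLM judge gives the production version of the same suite.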
The A/B Testing Pattern#
Analogy: Website A/B testing, but for prompts
Scenario: Which prompt is better?
Prompt A: "Answer this question concisely: {question}"
Prompt B: "You are a helpful assistant. Provide a detailed answer to: {question}"
Test on 100 questions:
- Prompt A: Conciseness 95%, Completeness 70%
- Prompt B: Conciseness 60%, Completeness 90%
Decision: Use A for quick lookups, B for research questions
Metrics to compare:
- Accuracy
- Relevance
- Latency (response time)
- Cost (tokens used)
- User satisfaction (if you have feedback)
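Aggregating the per-question scores into a comparison table is straightforward. A sketch, where `results` would come from running both prompts over the same question set (the numbers are illustrative):

```python
# Summarize per-prompt metric scores for an A/B comparison.
from statistics import mean

results = {
    "prompt_a": [{"conciseness": 0.95, "completeness": 0.70},
                 {"conciseness": 0.93, "completeness": 0.72}],
    "prompt_b": [{"conciseness": 0.60, "completeness": 0.90},
                 {"conciseness": 0.62, "completeness": 0.88}],
}

def summarize(runs: list) -> dict:
    """Mean of each metric across all test questions, rounded for display."""
    metrics = runs[0].keys()
    return {m: round(mean(r[m] for r in runs), 2) for m in metrics}

for prompt, runs in results.items():
    print(prompt, summarize(runs))
```

In practice you would also track latency and token cost per run, since a prompt that scores slightly higher but costs twice as much may still lose the comparison.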
The Continuous Monitoring Pattern#
Analogy: Canary deployment + health checks
In production:
- Sample 1% of traffic for evaluation
- Run evaluations asynchronously (don’t slow down responses)
- Alert if scores drop below threshold
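A minimal sketch of this sample-and-alert loop (names like `record` and `maybe_evaluate` are illustrative, and a real system would push scores to a metrics backend and pager rather than print):

```python
# Evaluate ~1% of production traffic and alert when the rolling average
# of a quality metric drops below a threshold.
import random
from collections import deque

SAMPLE_RATE = 0.01   # evaluate 1% of traffic
THRESHOLD = 0.75     # alert below this rolling faithfulness
window = deque(maxlen=500)  # rolling window of recent sampled scores

def record(score: float) -> bool:
    """Add one sampled score; True means the rolling average breached."""
    window.append(score)
    return sum(window) / len(window) < THRESHOLD

def maybe_evaluate(response: str, score_fn) -> None:
    # Sampling keeps cost low; real systems also run this asynchronously
    # so evaluation never slows down user-facing responses.
    if random.random() < SAMPLE_RATE:
        if record(score_fn(response)):
            print(f"ALERT: rolling faithfulness below {THRESHOLD}")
```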
Example:
Day 1-10: Average faithfulness = 0.85
Day 11: Average faithfulness = 0.65
Alert: "Faithfulness dropped 20%. Recent changes:
- Model updated from GPT-4 to GPT-4-turbo
- New embedding model deployed
Investigation needed."
Common Misconceptions#
“Evaluation is just testing”#
Traditional testing: Input → Code → Output (deterministic)
- Test: add(2, 3) == 5 (exact match)
LLM evaluation: Input → LLM → Output (probabilistic)
- Test: summarize(article) ≈ “good summary” (fuzzy match)
- Multiple correct answers
- Subjective quality judgments
“Higher scores always mean better quality”#
Counterexample: Optimizing for the wrong metric
Prompt optimized for BLEU score:
"The cat sat on the mat. The mat was sat on by the cat."
(Repetitive, awkward, but high n-gram overlap)
Prompt optimized for relevance:
"The cat rested on the mat."
(Natural, concise, lower BLEU but better quality)
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
Use multiple metrics to avoid gaming single metrics.
“One tool does everything”#
Reality: Most teams use 2-3 tools
- DeepEval: 60+ metrics, self-explaining, general-purpose
- Ragas: RAG-specific (retrieval quality)
- PromptFoo: Red teaming, security testing
- LangSmith: Observability + basic evaluation
Analogy: Software development uses multiple tools:
- Jest (unit tests)
- Cypress (E2E tests)
- Datadog (monitoring)
- Sentry (error tracking)
LLM evaluation is the same—different tools for different needs.
When to Invest in Evaluation#
Minimal (just starting)#
- Manual spot checks (review 10-20 outputs)
- Basic programmatic checks (JSON validity, length limits)
- ~20-50 test cases
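The "basic programmatic checks" mentioned above can be plain code run on every output before anything reaches a human or an LLM judge. A sketch (the specific checks and the 2000-character limit are illustrative):

```python
# Cheap deterministic checks: valid JSON, length limits, no leaked
# boilerplate. These run in microseconds and cost nothing per call.
import json

def programmatic_checks(output: str, max_chars: int = 2000) -> list:
    problems = []
    if len(output) > max_chars:
        problems.append("too long")
    try:
        json.loads(output)
    except json.JSONDecodeError:
        problems.append("invalid JSON")
    if "As an AI language model" in output:
        problems.append("boilerplate disclaimer")
    return problems

print(programmatic_checks('{"answer": "Your order ships in 3-5 business days"}'))  # []
print(programmatic_checks("not json at all"))  # ['invalid JSON']
```

Outputs with a non-empty problem list get flagged for review; everything else proceeds, which is the "health inspection" layer from the restaurant analogy.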
Moderate (production app)#
- Automated test suite in CI/CD
- LLM-as-judge for key metrics
- 100-500 test cases
- Basic dashboard
Comprehensive (critical application)#
- Continuous production evaluation (sample traffic)
- Multiple metric coverage (accuracy, safety, quality)
- 1,000+ test cases
- Regression alerts
- Human-in-the-loop for edge cases
Budget guidance:
| Scale | Evaluations/month | Tool cost | Engineering time |
|---|---|---|---|
| Minimal | <1,000 | $0-50 | 1 week setup |
| Moderate | 1,000-50,000 | $50-500 | 2-4 weeks |
| Comprehensive | >50,000 | $500-2,000 | Ongoing investment |
Cost Considerations#
Evaluation has direct costs:
Example: Customer support chatbot, 10,000 questions/day
Option 1: Human review
10,000 × 2 min/review = 333 hours/day
333 hours × $30/hour = $10,000/day
Annual cost: $3.6M
Option 2: LLM-as-Judge (GPT-4)
10,000 × $0.03/eval = $300/day
Annual cost: $110K (97% savings)
Option 3: LLM-as-Judge (GPT-3.5)
10,000 × $0.002/eval = $20/day
Annual cost: $7.3K (99.8% savings)
Option 4: Programmatic + sampling
Programmatic checks: Free (CPU only)
Human review of flagged 1%: 100 × 2 min × $30/hour = $100/day
Annual cost: $36K (99% savings)
Cost optimization:
- Use cheap models (GPT-3.5) for initial filtering
- Use expensive models (GPT-4) for edge cases
- Cache evaluation results for identical outputs
- Sample production traffic instead of evaluating everything
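A back-of-envelope cost model for the tiered approach: a cheap judge scores everything and a fraction escalates to a stronger judge. The per-eval prices are the illustrative figures from this section, and the 5% escalation rate is an assumption:

```python
# Tiered evaluation cost: cheap judge on all outputs, strong judge only
# on the fraction that gets flagged for escalation.

def daily_cost(volume: int,
               cheap_price: float = 0.002,    # e.g. GPT-3.5-class judge
               strong_price: float = 0.03,    # e.g. GPT-4-class judge
               escalation_rate: float = 0.05) -> float:
    cheap = volume * cheap_price
    strong = volume * escalation_rate * strong_price
    return cheap + strong

# 10k evals/day: $20 on the cheap tier + $15 on escalations
print(f"${daily_cost(10_000):.2f}/day")
```

Caching results for identical outputs and sampling traffic reduce `volume` directly, which is why those two optimizations usually come first.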
Glossary#
Chain-of-Thought (CoT): Prompting technique where the model shows reasoning steps before answering. Improves both generation and evaluation accuracy.
Faithfulness/Groundedness: Is the answer supported by the provided context? (Not hallucinated)
Hallucination: When an LLM generates information not present in its context or training data.
Golden Dataset: Curated test cases with human-verified expected outputs or scores.
RAG (Retrieval-Augmented Generation): Pattern where an LLM is given retrieved documents as context before generating an answer.
RAG Triad: Three core metrics for RAG evaluation: context relevance, faithfulness, answer relevance.
Rubric: Scoring criteria provided to an LLM judge explaining how to rate responses.
Self-Explaining Metric: Evaluation metric that provides both a score AND an explanation of why that score was given.
Synthetic Data: Machine-generated test cases, often used to expand test coverage cheaply.
Related Research:
- 1.111 (State Management)
- 1.113 (UI Components)
- 3.205 (Pronunciation Assessment - similar evaluation challenges)
Last Updated: 2026-02-02
S1: Rapid Discovery
S1 Synthesis: LLM Evaluation & Testing Frameworks#
Executive Summary#
The LLM evaluation landscape has matured significantly, with clear tool differentiation by use case. DeepEval emerges as the most comprehensive open-source option, while Ragas leads for RAG-specific evaluation. PromptFoo excels at quick iterations and security testing, LangSmith dominates observability, and TruLens offers OpenTelemetry-native tracing.
Comparison Matrix#
| Tool | Focus | Metrics | Pricing | Best For |
|---|---|---|---|---|
| DeepEval | Comprehensive | 60+ | Free + Enterprise | CI/CD, full coverage |
| PromptFoo | Prompt testing | Basic | Free | Quick iterations, red team |
| LangSmith | Observability | Custom | $39/seat+ | LangChain users, tracing |
| Ragas | RAG-specific | 5 core | Free | RAG pipelines |
| TruLens | Feedback functions | Extensible | Free | OTel users, custom evals |
Decision Framework#
Choose DeepEval when:#
- Need comprehensive metric coverage (RAG, agents, safety, multimodal)
- Want CI/CD integration with pytest-style tests
- Require self-explaining metrics for debugging
- Building production systems with regression detection
Choose PromptFoo when:#
- Doing rapid prompt engineering iterations
- Need security/red team testing
- Prefer YAML config over code
- Want lightweight CLI tool without SDK
Choose LangSmith when:#
- Using LangChain/LangGraph
- Need production observability + evaluation
- Want unified tracing and testing platform
- Have budget for commercial tooling
Choose Ragas when:#
- Evaluating RAG systems specifically
- Want lower-cost RAG metrics (vs LLM-as-judge)
- Need quick integration, pandas-like workflow
- Don’t need general LLM evaluation
Choose TruLens when:#
- Already using OpenTelemetry
- In Snowflake ecosystem
- Need custom feedback functions
- Want extensible evaluation framework
Common Stack Patterns#
Full Coverage Stack#
DeepEval + Ragas + PromptFoo
- DeepEval: Comprehensive metrics, CI/CD backbone
- Ragas: RAG-specific depth when retrieval quality matters
- PromptFoo: Security validation, red teaming
Lightweight Stack#
Ragas + PromptFoo
- Lower overhead for RAG-focused applications
- Good for teams not needing 60+ metrics
Enterprise Stack#
LangSmith + DeepEval
- Observability + comprehensive evaluation
- Best for LangChain-based production systems
Key Insights#
- No single tool covers everything - Most teams combine 2-3 tools
- DeepEval has widest metric coverage (60+) with self-explanation
- Ragas popularized RAG-Triad-style metrics - still best for retrieval-focused eval
- PromptFoo leads red teaming - best for security testing
- LangSmith = observability-first - evaluation is secondary
- TruLens differentiator - OpenTelemetry native, extensible
Cost Considerations#
| Tool | Free Tier | Paid Trigger |
|---|---|---|
| DeepEval | Full OSS | Enterprise features |
| PromptFoo | Full OSS | Hosted dashboard |
| LangSmith | Limited | Team collaboration |
| Ragas | Full OSS | N/A |
| TruLens | Full OSS | N/A |
Sources#
- DeepEval Alternatives Compared
- LLM Evaluation Frameworks Comparison
- Top LLM Evaluation Tools 2025
- TruLens Documentation
- PromptFoo Docs
- LangSmith Docs
DeepEval#
Overview#
- Type: Open-source Python framework
- License: Apache 2.0
- Adoption: 400k+ monthly downloads
- Focus: Comprehensive LLM evaluation (“Pytest for LLMs”)
Key Features#
- 60+ metrics: Prompt, RAG, chatbot, safety, multimodal
- Self-explaining metrics: Tells you WHY scores are low
- Pytest integration: Familiar unit-test interface
- CI/CD native: Built for continuous deployment workflows
- Safety testing: Red teaming, toxicity detection
Metric Categories#
- RAG: Faithfulness, contextual relevancy, answer relevancy
- Conversational: Coherence, engagement, knowledge retention
- Safety: Bias, toxicity, PII leakage, jailbreak detection
- Agentic: Tool use, task completion, reasoning
Enterprise Platform (Confident AI)#
- Cloud dashboard for team collaboration
- Dataset curation and annotation
- Production monitoring
- Regression detection
Limitations#
- Python-only (no JS/CLI-first option)
- Enterprise features require Confident AI platform
- Can be overkill for simple prompt testing
Best For#
- Teams needing comprehensive evaluation coverage
- CI/CD integration with automated testing
- Production monitoring and regression detection
- Multi-pattern evaluation (RAG, agents, chatbots)
Installation#
pip install deepeval
Pricing#
- Open-source: Free
- Confident AI: Free tier + paid plans for enterprise
LangSmith#
Overview#
- Type: Commercial SaaS platform
- Company: LangChain Inc.
- Focus: Tracing, observability, and evaluation for LLM apps
Key Features#
- Detailed tracing: Visibility into every execution step
- Dataset management: Create/organize test data
- Multiple evaluator types: Code-based, LLM-as-judge, composite
- Experiment tracking: Compare results across test runs
- Framework agnostic: Works with or without LangChain
Integration#
- Seamless with LangChain and LangGraph
- Python and TypeScript SDKs
- REST API for custom integrations
- No LangChain dependency required
Evaluation Capabilities#
- Custom evaluation logic
- Prebuilt assessment tools
- Quality tracking over time
- Consistency validation
Limitations#
- Commercial product (not fully open-source)
- Tracing-first, evaluation second
- Tighter integration with LangChain ecosystem
- Pricing can scale with usage
Best For#
- LangChain/LangGraph users
- Teams needing production observability
- Debugging complex multi-step chains
- Organizations wanting unified tracing + eval
Pricing#
- Developer: Free tier with limits
- Plus: $39/seat/month
- Enterprise: Custom pricing
PromptFoo#
Overview#
- Type: Open-source CLI and library
- License: MIT
- Community: 51,000+ developers
- Focus: Prompt testing, A/B testing, red teaming
Key Features#
- CLI-first: Simple command-line interface, no cloud required
- YAML configuration: Declarative test case definition
- Side-by-side comparison: Diff views for prompt variations
- Red teaming: Automated security testing (injections, toxic content)
- CI/CD ready: Integrates into deployment pipelines
Supported Providers#
- OpenAI, Anthropic, Azure, Google, HuggingFace
- Open-source models (Llama, etc.)
- Custom API providers
Evaluation Capabilities#
- Basic RAG metrics
- Safety/security testing
- LLM-as-judge evaluations
- Custom assertion logic
Limitations#
- Limited metric set compared to DeepEval (basic RAG + safety only)
- YAML-heavy workflow harder to customize at scale
- Less comprehensive than code-first alternatives
Best For#
- Quick prompt iterations
- Security/red team testing
- Teams preferring declarative config over code
- Lightweight experimentation without SDK dependencies
Installation#
npm install -g promptfoo
# or
npx promptfoo@latest
Pricing#
- Open-source: Free, self-hosted
- Cloud: Optional hosted dashboard
Ragas (Retrieval-Augmented Generation Assessment)#
Overview#
- Type: Open-source Python library
- License: Apache 2.0
- Focus: RAG-specific evaluation metrics
Key Features#
- RAG Triad: Structured evaluation framework
- Lightweight: Easy integration, pandas-like workflow
- Reference-free: No ground truth required
- Benchmarked: Against LLM-AggreFact, TREC-DL, HotPotQA
Core Metrics (first three form the RAG Triad)#
- Faithfulness: How accurately the answer reflects the retrieved evidence
- Context Relevancy: How relevant the retrieved docs are to the query
- Answer Relevancy: How relevant the answer is to the user question
- Context Recall: Coverage of relevant information
- Context Precision: Signal-to-noise in retrieved context
Extended Capabilities#
- Agentic workflow metrics
- Tool use evaluation
- SQL evaluation
- Multimodal faithfulness
- Noise sensitivity testing
Limitations#
- Metrics somewhat opaque (not self-explanatory)
- RAG-focused, not general LLM evaluation
- Need to combine with other tools for full coverage
- Lower metric count than DeepEval
Best For#
- RAG pipeline evaluation specifically
- Teams wanting targeted retrieval metrics
- Lower-cost alternative to LLM-as-judge for RAG
- Quick integration with existing RAG systems
Installation#
pip install ragas
Pricing#
- Open-source: Free
TruLens#
Overview#
- Type: Open-source Python library
- License: MIT
- Maintainer: Snowflake (acquired TruEra)
- Focus: Feedback functions and tracing for LLM apps
Key Features#
- Feedback functions: Programmatic evaluation without ground truth
- RAG Triad pioneer: Original structured RAG evaluation framework
- OpenTelemetry support: Interoperable observability
- Extensible: Custom feedback function framework
- Provider integrations: OpenAI, HuggingFace, LiteLLM, LangChain
Feedback Function Types#
- Generation-based: LLM-as-judge with rubrics
- Custom logic: Tailored evaluation tasks
- Chain-of-thought: Optional reasoning traces
- Few-shot: Example-guided evaluation
Tracing Capabilities#
- OpenTelemetry (OTel) native
- Integrates with existing observability stack
- Detailed execution traces
- Performance monitoring
Supported Use Cases#
- Question-answering
- Summarization
- RAG systems
- Agent-based applications
Limitations#
- Snowflake acquisition may affect roadmap
- Overlaps with Ragas on RAG evaluation
- Less comprehensive metrics than DeepEval
- Community-driven, less commercial support
Best For#
- Teams already using OpenTelemetry
- Snowflake ecosystem users
- Custom feedback function needs
- RAG evaluation with extensibility
Installation#
pip install trulens
Pricing#
- Open-source: Free