LLM Evaluation Frameworks: How to Actually Measure What Your Model Does
Benchmarks tell you how a model performs on known tasks. Evals tell you whether your application works. These are different things and require different approaches.
The evaluation gap
You deploy an LLM-powered feature. Users complain it is wrong sometimes, helpful sometimes, inconsistent. You have no metric to tell you whether a new prompt version is better or worse. You are flying blind.
Evaluation is the discipline that closes this gap. It is not optional — without it, you cannot improve systematically, you cannot catch regressions, and you cannot make confident deployment decisions.
What you are evaluating
LLM application evaluation has two distinct levels:
Model evaluation: how does the underlying model perform on standard tasks? This is what leaderboards (MMLU, HumanEval, HELM) measure. Useful for choosing a model, not for measuring your application.
Application evaluation: does your system — prompt + retrieval + model + post-processing — produce the right output for your users? This is what you need to build.
Dimensions to measure
For a RAG (retrieval-augmented generation) system, the standard dimensions are:
Faithfulness: does the generated answer stay within the retrieved context? An answer that introduces facts not in the context is hallucinating.
Answer relevance: does the answer address the question asked? A faithful answer can still be irrelevant if it addresses a different question.
Context precision: of the retrieved chunks, how many are actually useful for answering the question? Low precision means you are sending noise to the model.
Context recall: does the retrieved context contain the information needed to answer the question? Low recall means the answer is limited by missing context, not model capability.
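To make the last two dimensions concrete, here is a small sketch of precision and recall computed from binary relevance labels for one query. This is only the intuition behind the metrics; RAGAS and similar frameworks estimate relevance with an LLM judge rather than requiring labels:

def context_precision(retrieved_chunks, relevant_chunks):
    # Fraction of retrieved chunks that are actually useful for answering.
    if not retrieved_chunks:
        return 0.0
    hits = sum(1 for chunk in retrieved_chunks if chunk in relevant_chunks)
    return hits / len(retrieved_chunks)

def context_recall(retrieved_chunks, relevant_chunks):
    # Fraction of the needed chunks that made it into the retrieved context.
    if not relevant_chunks:
        return 1.0
    hits = sum(1 for chunk in relevant_chunks if chunk in retrieved_chunks)
    return hits / len(relevant_chunks)

retrieved = ["policy_chunk", "pricing_chunk", "unrelated_chunk"]
needed = {"policy_chunk", "pricing_chunk", "shipping_chunk"}
context_precision(retrieved, needed)  # 2/3: one retrieved chunk is noise
context_recall(retrieved, needed)     # 2/3: one needed chunk was never retrieved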
RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates these measurements using an LLM-as-judge approach. You provide a dataset of questions, ground truth answers, retrieved contexts, and generated answers. RAGAS uses an LLM to score each dimension.
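A minimal sketch of what that dataset can look like, built with the Hugging Face datasets library. The example rows are made up, and the exact column names have shifted between ragas releases, so treat them as an assumption to verify against your installed version:

from datasets import Dataset

# One row per evaluation question: the question, the generated answer,
# the retrieved context chunks, and a ground-truth reference answer.
dataset = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Refunds are available within 30 days of purchase with proof of payment."]],
    "ground_truth": ["Purchases can be refunded within 30 days."],
})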
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# dataset: one row per question with its generated answer, retrieved
# contexts, and ground-truth answer (built above).
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
RAGAS scores range from 0 to 1 for each metric. They are useful for catching regressions (did this prompt change hurt faithfulness?) and for comparing retrieval strategies (does chunking at 512 tokens outperform 256 tokens on context recall?).
Limitation: RAGAS uses an LLM to evaluate LLM output. The evaluator model has its own biases and errors. RAGAS scores are not ground truth — they are a useful proxy that correlates with human judgment at the population level, not at the individual example level.
LLM-as-judge
Beyond RAGAS, LLM-as-judge is a general pattern: use a strong model (GPT-4o, Claude Opus, Gemini Ultra) to evaluate outputs from a weaker or specialized model.
The pattern works best when you provide a rubric — explicit criteria for what makes an answer good or bad:
You are evaluating a customer support response.
Score from 1-5 on:
- Accuracy: does it correctly answer the question?
- Tone: is it professional and empathetic?
- Completeness: does it address all parts of the question?
Question: {question}
Response: {response}
Ground truth: {ground_truth}
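Wiring a rubric like this into an automated check is mostly prompt formatting plus score parsing. A minimal sketch; call_judge_model is a placeholder for whichever client you use to query the judge, and asking for JSON is one convenient way to get machine-readable scores:

import json

RUBRIC_PROMPT = """You are evaluating a customer support response.
Score from 1-5 on accuracy, tone, and completeness.
Return only JSON, e.g. {{"accuracy": 4, "tone": 5, "completeness": 3}}.

Question: {question}
Response: {response}
Ground truth: {ground_truth}"""

def rubric_scores(question, response, ground_truth, call_judge_model):
    # call_judge_model: placeholder that sends a prompt string to the judge
    # model and returns its text completion.
    prompt = RUBRIC_PROMPT.format(
        question=question, response=response, ground_truth=ground_truth
    )
    return json.loads(call_judge_model(prompt))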
Known failure modes:
- Verbosity bias: LLM judges tend to prefer longer answers even when shorter ones are more correct.
- Self-enhancement bias: a model tends to rate its own outputs higher.
- Position bias: in a pairwise comparison, the option presented first tends to be rated higher.
Mitigate these by using a judge model that is different from the model being evaluated, by writing rubrics that explicitly penalize verbosity, and by running pairwise comparisons in both orders and averaging the verdicts.
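A sketch of that last mitigation. The prompt wording is illustrative and call_judge_model is again a placeholder for your judge client:

def pairwise_winner(question, answer_a, answer_b, call_judge_model):
    # Judge the same pair in both orders so position bias cancels out.
    template = (
        "Question: {q}\n\n"
        "Answer 1:\n{first}\n\n"
        "Answer 2:\n{second}\n\n"
        "Which answer is better? Reply with exactly '1' or '2'."
    )
    verdict_ab = call_judge_model(template.format(q=question, first=answer_a, second=answer_b))
    verdict_ba = call_judge_model(template.format(q=question, first=answer_b, second=answer_a))
    # answer_a gets a point when it wins as "1" in the first order or as "2" in the second.
    a_wins = int(verdict_ab.strip() == "1") + int(verdict_ba.strip() == "2")
    return a_wins / 2  # 1.0: a preferred both ways, 0.0: b both ways, 0.5: split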
Human evaluation
Human eval is expensive and slow but it is the ground truth for qualitative dimensions (tone, helpfulness, cultural appropriateness). Use it to:
- Calibrate your automated metrics. Build a dataset of 200–500 examples with human labels. Measure how well your automated metrics correlate with those labels. If RAGAS faithfulness does not correlate with human faithfulness judgments on your domain, adjust the metric or the judge prompt until it does (a correlation check is sketched after this list).
- Evaluate edge cases. Automated metrics miss nuance. Human reviewers catch it.
- Set baselines. Before you can improve, you need a baseline human evaluation. Run it once per major version.
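One way to run that calibration is a rank correlation between the automated metric and the human labels on the same examples. A sketch using scipy; the numbers are illustrative:

from scipy.stats import spearmanr

# Paired scores for the same examples: one automated metric value and one
# human label (here a 1-5 rating) per example.
ragas_faithfulness = [0.91, 0.42, 0.77, 0.88, 0.35]
human_faithfulness = [5, 2, 4, 4, 1]

corr, p_value = spearmanr(ragas_faithfulness, human_faithfulness)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")
# A low correlation means the automated metric is not tracking human
# judgment on your domain and should not gate deployments on its own.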
Building an eval pipeline
A practical eval pipeline for a production LLM application:
- Golden dataset: 100–500 representative questions with ground truth answers. Curate from real usage, not synthetic examples.
- Regression suite: run on every prompt or model change. Flag the change if faithfulness drops by more than 0.02 or answer relevance by more than 0.05 on the 0-1 RAGAS scale (a gating sketch follows this list).
- A/B eval: when comparing two system versions, run both on the same dataset and compute delta. Statistical significance matters — with a small dataset, differences may be noise.
- Production sampling: log a sample of production queries. Run them through automated eval weekly. Catch distribution shift before users notice.
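A sketch of the regression gate and a paired bootstrap for the A/B comparison. The thresholds mirror the ones above and the score structure is illustrative, not part of any particular framework:

import random

THRESHOLDS = {"faithfulness": 0.02, "answer_relevancy": 0.05}

def regression_check(baseline, candidate):
    # Return the metrics where the candidate dropped by more than the threshold.
    return [
        name for name, max_drop in THRESHOLDS.items()
        if baseline[name] - candidate[name] > max_drop
    ]

regression_check(
    {"faithfulness": 0.90, "answer_relevancy": 0.85},
    {"faithfulness": 0.86, "answer_relevancy": 0.84},
)  # -> ["faithfulness"]: dropped 0.04, above the 0.02 threshold

def paired_bootstrap(scores_a, scores_b, iters=10_000):
    # scores_a and scores_b are per-example scores for the same golden dataset
    # under systems A and B. Returns the fraction of resamples in which B beats
    # A on the mean; values near 0.5 suggest the observed difference is noise.
    n = len(scores_a)
    wins = 0
    for _ in range(iters):
        idx = [random.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / iters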
The benchmark trap
Model benchmarks (MMLU, HumanEval) measure fixed, known tasks. They are designed to be hard to overfit, but developers overfit to them anyway by choosing models whose high benchmark scores do not transfer to their domain.
Your production eval dataset is more valuable than any public benchmark. Build it from real user queries. A model that scores 65 on your eval dataset but 80 on MMLU is worse for your product than one that scores 75 on your dataset and 72 on MMLU.
Evaluate what you deploy, not what benchmarks measure.