Evaluating LLMs and Parameter-Efficient Fine-Tuning

7.3 — Evaluating Generative Output

You cannot improve what you cannot measure. And measuring the quality of generated text remains one of the hardest open problems in applied NLP.

The difficulty is fundamental. A classification model outputs a label: you compare it to the ground truth and compute accuracy. A regression model outputs a number: you compute the residual. A generative model outputs free-form text, and there are infinitely many valid ways to say the same thing. “The quarterly revenue was $4.2M” and “Revenue for Q3 reached four point two million dollars” carry essentially the same information but share almost no surface-level features. Any metric that relies on word overlap will score these as dissimilar.

The Metrics That Don’t Work

BLEU (Bilingual Evaluation Understudy) counts n-gram overlaps between generated text and reference text. It was designed for machine translation in 2002. For modern generative output, BLEU has two fatal flaws: it penalizes valid paraphrasing, and it rewards degenerate outputs that happen to share n-grams with the reference. A model that copies fragments of the reference and stitches them together incoherently can score higher than a model that produces a fluent, accurate paraphrase.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures recall of reference n-grams in the generated text. It has the same paraphrasing problem as BLEU. A summary that captures the key facts in different words scores lower than a summary that copies sentences verbatim. ROUGE was introduced in 2004 to evaluate summarization systems that were mostly extractive, and it penalizes the abstractive summarization that modern LLMs excel at.
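To make the paraphrasing problem concrete, here is a minimal sketch of the overlap counts that BLEU and ROUGE are built on, applied to the two revenue sentences from earlier. It is a simplified illustration and not either metric's full definition: it leaves out BLEU's brevity penalty and its averaging over n-gram orders. The whitespace tokenizer, the `copied` example string, and the function names are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def overlap_precision_recall(
    candidate: str, reference: str, n: int = 1
) -> tuple[float, float]:
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


reference = "The quarterly revenue was $4.2M"
paraphrase = "Revenue for Q3 reached four point two million dollars"
# Hypothetical degenerate output: reference fragments stitched together.
copied = "The quarterly revenue was was $4.2M the"

for name, text in [("paraphrase", paraphrase), ("copied fragments", copied)]:
    p1, r1 = overlap_precision_recall(text, reference, n=1)
    p2, r2 = overlap_precision_recall(text, reference, n=2)
    print(f"{name}: unigram P={p1:.2f} R={r1:.2f} | bigram P={p2:.2f} R={r2:.2f}")
```

The accurate paraphrase shares a single unigram ("revenue") and no bigrams with the reference, while the incoherent stitched string reaches full unigram and bigram recall.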

Both metrics have a deeper problem: they require a reference answer. For open-ended generation, question answering over novel contexts, or creative tasks, no single reference answer exists. You cannot compute BLEU against a reference you do not have.

These metrics are not useless in all contexts — BLEU still has value for machine translation evaluation, and ROUGE can work for extractive tasks. But for evaluating LLM output in production data science applications, you need different tools.

LLM-as-a-Judge

The most practical evaluation method for generative output in 2025 is using a stronger model to evaluate a weaker one. GPT-4o evaluates GPT-4o-mini. Claude Opus evaluates Claude Sonnet. The judge model receives the question, the generated answer, and a structured rubric, then produces a score with justification.

This works because evaluation is easier than generation. You do not need the judge to produce a better answer — you need it to recognize quality dimensions: relevance, accuracy, completeness, and coherence. A model that struggles to generate a perfect legal brief can still reliably distinguish a good brief from a bad one.

from openai import OpenAI
from pydantic import BaseModel, Field


class EvaluationResult(BaseModel):
    """Structured evaluation of a generated answer."""
    relevance: int = Field(ge=1, le=5, description="Does the answer address the question?")
    accuracy: int = Field(ge=1, le=5, description="Are the factual claims correct?")
    completeness: int = Field(ge=1, le=5, description="Does the answer cover all aspects?")
    coherence: int = Field(ge=1, le=5, description="Is the answer well-structured and clear?")
    justification: str = Field(description="Brief explanation of the scores")


def evaluate_answer(
    question: str,
    answer: str,
    context: str | None = None,
    judge_model: str = "gpt-4o",
) -> EvaluationResult:
    """Evaluate a generated answer using an LLM judge."""
    client = OpenAI()

    context_section = ""
    if context:
        context_section = f"\nReference Context:\n{context}\n"

    completion = client.beta.chat.completions.parse(
        model=judge_model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator. Score the answer on each dimension "
                    "from 1 (poor) to 5 (excellent). Be strict — a 3 means acceptable, "
                    "a 5 means exceptional. Justify your scores in 2-3 sentences."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question:\n{question}\n"
                    f"{context_section}"
                    f"Answer to evaluate:\n{answer}"
                ),
            },
        ],
        response_format=EvaluationResult,
    )

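    parsed = completion.choices[0].message.parsed
    if parsed is None:
        # The judge refused or its output did not match the schema.
        raise ValueError("Judge returned no parsed evaluation")
    return parsed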


# Usage: evaluate a batch of generated answers
def evaluate_pipeline(
    test_cases: list[dict],
    generate_fn,
    judge_model: str = "gpt-4o",
) -> dict:
    """Run evaluation across a test set and compute aggregate scores."""
    results: list[EvaluationResult] = []
    for case in test_cases:
        answer = generate_fn(case["question"])
        evaluation = evaluate_answer(
            question=case["question"],
            answer=answer,
            context=case.get("context"),
            judge_model=judge_model,
        )
        results.append(evaluation)

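    # Aggregate scores (guard against an empty test set)
    n = len(results)
    if n == 0:
        raise ValueError("test_cases is empty; nothing to evaluate")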
    return {
        "avg_relevance": sum(r.relevance for r in results) / n,
        "avg_accuracy": sum(r.accuracy for r in results) / n,
        "avg_completeness": sum(r.completeness for r in results) / n,
        "avg_coherence": sum(r.coherence for r in results) / n,
        "n_evaluated": n,
    }

The cost of LLM-as-a-judge. Evaluating 100 answers with GPT-4o costs roughly $0.50–2.00 depending on answer length. This is cheap compared to human evaluation, but it compounds: if you re-evaluate on every prompt change, every chunking parameter update, and every model swap, you can run through $50–100 in evaluation costs during a single optimization session. Budget for it.
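The budgeting arithmetic can be sketched as a small helper. Every number passed in below is a placeholder assumption (token counts per call, per-million-token prices, and the number of evaluation runs in a session), so substitute your own measurements and your provider's current price list.

```python
def estimate_judge_cost(
    n_answers: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the cost in USD of one evaluation run over n_answers."""
    total_input = n_answers * input_tokens_per_call
    total_output = n_answers * output_tokens_per_call
    return (
        total_input * usd_per_million_input
        + total_output * usd_per_million_output
    ) / 1_000_000


# Placeholder inputs: 100 answers, ~1,500 prompt tokens and ~150 judge tokens
# per call, with illustrative prices of $2.50 / $10.00 per million tokens.
per_run = estimate_judge_cost(100, 1_500, 150, 2.50, 10.00)
runs_per_session = 100  # assumed: one run per prompt or parameter change
print(f"Per run: ${per_run:.2f}, per session: ${per_run * runs_per_session:.2f}")
```

Under these assumptions one run lands near the low end of the range quoted above, and a hundred re-runs in a session lands near the low end of the session figure.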

Bias warning. LLM judges have systematic biases: they prefer longer answers, they favor their own generation style, and they are sensitive to answer position in pairwise comparisons. Mitigate by randomizing order in pairwise evaluations and calibrating against a small set of human-judged examples.
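Order randomization for pairwise judging might look like the sketch below. The judge interface is an assumption for illustration: a callable that receives the question and two answers in presentation order and returns "first" or "second". In practice it would wrap an LLM call; here a deliberately position-biased stand-in shows why the shuffle matters.

```python
import random
from typing import Callable

# Assumed interface: (question, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def judge_with_random_order(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: PairwiseJudge,
    rng: random.Random,
) -> str:
    """Show the two answers in random order and map the verdict back to A/B."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"Unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # If the order was swapped, the first position held answer B.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def always_first(question: str, first: str, second: str) -> str:
    """Stand-in judge with maximal position bias: always prefers slot one."""
    return "first"


rng = random.Random(0)
wins_for_a = sum(
    judge_with_random_order("q", "answer A", "answer B", always_first, rng) == "A"
    for _ in range(1000)
)
print(f"A preferred in {wins_for_a / 1000:.1%} of trials")
```

Randomization does not remove the bias from any single judgment; it spreads the bias evenly across both candidates so that it cancels out in the aggregate win rate.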

BERTScore: Semantic Similarity That Works

BERTScore addresses the paraphrasing problem by comparing generated and reference texts at the embedding level rather than the surface level. It computes token-level cosine similarities between contextual embeddings from a pre-trained model, then aggregates precision, recall, and F1.

“The quarterly revenue was $4.2M” and “Revenue for Q3 reached four point two million dollars” will score high on BERTScore because their token embeddings are semantically similar, even though they share few exact words.
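The matching step can be sketched in a few lines. This toy version works on hand-written two-dimensional vectors that stand in for contextual token embeddings (an assumption for illustration; real embeddings come from the encoder model), and it leaves out the optional IDF weighting and baseline rescaling that the library offers.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(
    cand_vecs: list[list[float]], ref_vecs: list[list[float]]
) -> tuple[float, float, float]:
    """Precision, recall, and F1 from greedy token matching."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy "embeddings": the second candidate token is close to, but not the same
# as, the second reference token, as a paraphrased word would be.
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidate_tokens = [[1.0, 0.0], [0.6, 0.8]]
p, r, f1 = greedy_match_scores(candidate_tokens, reference_tokens)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Because similarity is graded rather than exact-match, a near-synonym still earns most of the credit that a verbatim token would.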

from bert_score import score as bert_score
import torch


def compute_bertscore(
    predictions: list[str],
    references: list[str],
    model_type: str = "microsoft/deberta-xlarge-mnli",
) -> dict:
    """Compute BERTScore between generated and reference texts.

    Uses DeBERTa-xlarge by default — the recommended model for English.
    Returns precision, recall, and F1 as lists (one score per pair).
    """
    precision, recall, f1 = bert_score(
        predictions,
        references,
        model_type=model_type,
        device="cuda" if torch.cuda.is_available() else "cpu",
        batch_size=32,
        verbose=False,
    )

    return {
        "precision": precision.tolist(),
        "recall": recall.tolist(),
        "f1": f1.tolist(),
        "mean_f1": f1.mean().item(),
    }


# Usage
predictions = [
    "The company reported revenue of $4.2 million for the quarter.",
    "Customer satisfaction declined by 12% year-over-year.",
]
references = [
    "Quarterly revenue reached $4.2M.",
    "Customer satisfaction dropped 12% compared to the same period last year.",
]

scores = compute_bertscore(predictions, references)
print(f"Mean BERTScore F1: {scores['mean_f1']:.4f}")
# Raw F1 values depend heavily on model_type, so compare runs that use the
# same model rather than reading the number as an absolute quality score.

BERTScore is not a replacement for human judgment — it is a fast, automated signal that correlates with human quality assessments better than BLEU or ROUGE. Use it for regression testing (did my prompt change make outputs worse?), not for absolute quality measurement.

Human Evaluation: The Ground Truth

When the stakes are high, you need human evaluators. No automated metric captures every quality dimension, and no LLM judge is free from systematic biases. Human evaluation is expensive and slow, but it is the reference standard that every automated method is calibrated against, provided the annotators agree with one another.

Two rating protocols, ordered by cost, plus one check that applies to both:

Likert scale rating. Show evaluators the question and generated answer. Ask them to rate on a 1–5 scale across dimensions (relevance, accuracy, fluency). Fast per judgment, but requires careful rubric design — without concrete anchors for each score level, inter-annotator agreement drops below useful levels. Define what a 3 means. Define what a 5 means. Show examples.

Pairwise comparison. Show evaluators the same question with two different answers (from two models, or two prompt variants). Ask: “Which answer is better?” This eliminates the calibration problem of absolute scores — humans are much better at relative judgments than absolute ones. Requires more evaluations but produces more reliable rankings.

Inter-annotator agreement. Measure it. If your three annotators agree on only 60% of judgments, your evaluation data is mostly noise. Cohen’s kappa above 0.6 for pairwise tasks, or Krippendorff’s alpha above 0.67 for rated tasks, is the minimum bar. Cohen’s kappa is defined for two annotators, so with three or more, average it over annotator pairs or use Fleiss’ kappa. Below that bar, your rubric needs work before your evaluation data is usable.
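For one pair of annotators, Cohen's kappa can be computed directly from the two label lists. The sketch below uses only the standard library, and the ten pairwise judgments are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: which of two answers (A or B) each annotator preferred.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "B"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

In this example the annotators agree on 8 of 10 items, yet half of that agreement is expected by chance with balanced labels, so kappa comes out at 0.6. Raw agreement percentages overstate reliability for exactly this reason.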

The real bottleneck. Building evaluation sets — curated questions with expert-annotated ground truth — costs more time than any other part of an NLP project. A 200-question evaluation set with validated answers, annotated relevant chunks, and human quality scores takes 40–80 hours of expert time to create. There is no shortcut. The teams that invest in evaluation sets ship better systems. The teams that skip it never know whether their systems are improving or degrading.

7.4 — Parameter-Efficient Fine-Tuning

Prompting is the default. RAG extends the default to domain-specific data. But sometimes the model’s behavior is wrong, not its knowledge. The model generates verbose responses when you need terse ones. It produces Markdown when you need JSON. It hedges with disclaimers when you need confident assertions. You can prompt your way past some of these issues, but there is a point where the prompt becomes a 2,000-word instruction manual that the model follows inconsistently.

Fine-tuning changes the model. Instead of telling it what to do at inference time, you bake the desired behavior into its weights. The model learns your output format, your domain vocabulary, your stylistic preferences. The prompt becomes simple because the model already knows what you want.

Why Full Fine-Tuning Is Impractical

A 7-billion parameter model stores each parameter as a floating-point number. At 16-bit precision (fp16), that is 14 GB just for the weights. During training, you also need to store:

Optimizer states (Adam stores two running averages per parameter): 28 GB at fp16
Gradients: 14 GB
Activations for backpropagation: variable, but typically 4–8 GB

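Total: 60+ GB of VRAM for a 7B model. A single A100 80GB GPU can barely fit this. These figures assume everything is held in fp16; standard mixed-precision training keeps fp32 optimizer states and a master copy of the weights, which pushes the total higher still. A 70B model needs roughly ten times as much, which means at least eight such GPUs. Full fine-tuning is not a technique you reach for casually; it is a datacenter-scale operation.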
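The arithmetic above can be reproduced with a short helper. It follows the same simplifying assumptions as the text (every tensor stored at 2 bytes per value, decimal gigabytes, activations added separately as a 4–8 GB range); the function name is illustrative.

```python
def full_finetune_memory_gb(
    n_params_billion: float, bytes_per_value: int = 2
) -> dict[str, float]:
    """Static training memory in GB, with every tensor at the same precision."""
    # 1e9 parameters * bytes per value / 1e9 bytes per GB
    weights = n_params_billion * bytes_per_value
    gradients = weights              # one gradient per parameter
    optimizer_states = 2 * weights   # Adam: two running averages per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "static_total": weights + gradients + optimizer_states,
    }


budget_7b = full_finetune_memory_gb(7)
low = budget_7b["static_total"] + 4   # plus the low activation estimate
high = budget_7b["static_total"] + 8  # plus the high activation estimate
print(f"7B: {budget_7b} -> {low}-{high} GB including activations")

budget_70b = full_finetune_memory_gb(70)
print(f"70B static total: {budget_70b['static_total']} GB")
```

For 70B the static total alone fills seven 80 GB cards with nothing left for activations, which is where the eight-GPU figure comes from.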

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) sidesteps the memory problem with a mathematical insight: the weight updates during fine-tuning have low intrinsic rank. You do not need to modify all 7 billion parameters — you can approximate the update with two small matrices.

The math: for a weight matrix $W \in \mathbb{R}^{d \times k}$, the fine-tuned weight is:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

If $W$ is a 4096 × 4096 attention matrix (16.8M parameters), and $r = 16$, then $B$ and $A$ together have $4096 \times 16 + 16 \times 4096 = 131{,}072$ parameters, about 0.8% of the original. You freeze $W$ entirely (no gradients, no optimizer states) and train only $B$ and $A$.
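A small sketch can check both the parameter count and the identity behind it. The tiny integer matrices are made up for illustration; the point is that applying the merged weight $W + BA$ gives the same result as keeping $W$ frozen and adding the low-rank path $B(Ax)$ on the side.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


def matmul(a: list[list[int]], b: list[list[int]]) -> list[list[int]]:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m: list[list[int]], v: list[int]) -> list[int]:
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 16)
print(f"full: {full:,}  LoRA: {low_rank:,}  ratio: {low_rank / full:.2%}")

# Toy example with d=3, k=4, r=2 (values chosen arbitrarily).
W = [[1, 0, 2, 0], [0, 1, 0, 1], [3, 0, 0, 1]]
B = [[1, 0], [0, 1], [1, 1]]
A = [[1, 2, 0, 0], [0, 0, 1, 1]]
x = [1, 1, 1, 1]

BA = matmul(B, A)
W_merged = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]
merged_out = matvec(W_merged, x)                      # (W + BA) x
side_path_out = [
    a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))
]                                                     # W x + B (A x)
print(merged_out, side_path_out)
```

The second form is what makes training cheap: $BA$ is never materialized as a full $d \times k$ matrix, and only the small factors receive gradients.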

The memory savings are dramatic:

Base model weights (frozen, can be quantized): 14 GB → 3.5 GB at 4-bit
LoRA adapter parameters: ~50 MB
Optimizer states for adapters: ~100 MB
Gradients for adapters: ~50 MB
Total: ~4 GB before activations, which fits on a single consumer GPU

[Figure: LoRA architecture]

QLoRA: Quantize Then Adapt

QLoRA pushes the memory reduction further: quantize the frozen base weights to 4-bit precision using NormalFloat4, then train LoRA adapters in 16-bit. The base model that took 14 GB at fp16 now takes 3.5 GB at 4-bit. Combined with LoRA adapters, you can fine-tune a 7B model on a GPU with 6 GB of VRAM.

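The quality cost is surprisingly low. The QLoRA authors report that it matches 16-bit full fine-tuning on the benchmarks they tested. The frozen 4-bit weights are never updated; they are dequantized to 16-bit on the fly for each forward and backward pass, and the adapters are trained on top of the quantized model, so they learn to compensate for most of the quantization error.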
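To give a feel for what blockwise 4-bit storage involves, here is a toy sketch of absmax quantization with 16 uniformly spaced levels. This is a stand-in and not NormalFloat4: NF4 places its 16 levels at quantiles of a normal distribution instead of spacing them evenly. The block values and function names are invented for the example.

```python
def quantize_block(values: list[float], levels: int = 16) -> tuple[list[int], float]:
    """Map one block of weights to integer codes 0..levels-1 plus one scale."""
    scale = max(abs(v) for v in values) or 1.0  # absmax of the block
    half = (levels - 1) / 2
    codes = [round(v / scale * half + half) for v in values]
    return codes, scale


def dequantize_block(codes: list[int], scale: float, levels: int = 16) -> list[float]:
    """Recover approximate weights from the codes and the block scale."""
    half = (levels - 1) / 2
    return [(c - half) / half * scale for c in codes]


block = [0.12, -0.5, 0.33, 0.0, -0.07, 0.5, 0.25, -0.41]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(v - r) for v, r in zip(block, restored))
print(f"codes={codes}  scale={scale}  max error={max_error:.4f}")

# Storage for 7 billion weights at 4 bits each, in decimal GB
# (ignoring the small per-block overhead of storing the scales).
weights_gb = 7e9 * 4 / 8 / 1e9
print(f"4-bit base model: {weights_gb} GB")
```

Each code fits in 4 bits, and the only extra storage is one scale per block. The `bnb_4bit_use_double_quant` option in the pipeline below compresses those scales as well.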

Fine-Tuning Pipeline with PEFT

Here is a complete LoRA fine-tuning pipeline for a structured extraction task: given unstructured text, produce a JSON output conforming to a schema.

import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer


def prepare_training_data(examples: list[dict]) -> Dataset:
    """Format training examples as instruction-response pairs.

    Each example: {"input": str, "output": str}
    """
    formatted: list[str] = []
    for ex in examples:
        formatted.append(
            f"### Instruction:\nExtract structured data from the following text.\n\n"
            f"### Input:\n{ex['input']}\n\n"
            f"### Response:\n{ex['output']}"
        )
    return Dataset.from_dict({"text": formatted})


def fine_tune_with_lora(
    model_name: str = "mistralai/Mistral-7B-v0.3",
    train_data: list[dict] | None = None,
    output_dir: str = "./lora-adapter",
    num_epochs: int = 3,
    learning_rate: float = 2e-4,
    lora_rank: int = 16,
    lora_alpha: int = 32,
) -> None:
    """Fine-tune a model with QLoRA for structured extraction."""

    # Quantization config: 4-bit base weights, 16-bit compute
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # NormalFloat4 — better than uniform int4
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,  # Quantize the quantization constants
    )

    # Load base model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=False,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # LoRA configuration
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_rank,              # Rank of the update matrices
        lora_alpha=lora_alpha,    # Scaling factor (alpha/r scales the update)
        lora_dropout=0.05,
        target_modules=[          # Which layers to adapt
            "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
            "gate_proj", "up_proj", "down_proj",      # MLP
        ],
        bias="none",
    )

    # Apply LoRA to the model
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
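    # With all seven target modules at r=16 on Mistral-7B this reports about
    # 41.9M trainable parameters, well under 1% of the model. The exact totals
    # printed depend on the PEFT version and on how 4-bit weights are counted.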

    # Prepare dataset
    if not train_data:
        raise ValueError("train_data must be a non-empty list of examples")
    dataset = prepare_training_data(train_data)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch size: 16
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,                      # Match compute dtype
        optim="paged_adamw_8bit",       # Memory-efficient optimizer
        gradient_checkpointing=True,    # Trade compute for memory
        max_grad_norm=0.3,
        report_to="none",
    )

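    # Train
    # Keyword arguments for SFTTrainer differ between TRL releases. If your
    # version rejects `max_seq_length` here, set the length limit on
    # `trl.SFTConfig` instead.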
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=training_args,
        processing_class=tokenizer,
        max_seq_length=1024,
    )
    trainer.train()

    # Save only the adapter weights — not the full model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Adapter saved to {output_dir}")

Key decisions in this pipeline:

lora_rank=16. Higher rank captures more complex adaptations but costs more memory and risks overfitting on small datasets. Rank 8–32 covers most tasks. Start at 16 and reduce if you see overfitting.

target_modules. Adapting both attention and MLP layers gives the model more capacity to change. For simpler tasks (formatting changes, style adaptation), adapting only attention layers may suffice.

lora_alpha=32. The effective scaling factor is alpha / rank. With alpha=32 and rank=16, the scaling is 2.0 — the LoRA update contributes with double weight. Lower alpha makes the model more conservative.

gradient_checkpointing=True. Trades ~30% slower training for ~60% less activation memory. Always enable this on consumer GPUs.
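The effect of the rank and target-module choices on adapter size can be estimated before loading anything. The sketch below assumes the layer shapes from the published Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 blocks); the helper name is illustrative, and other model families need their own shape table.

```python
# (in_features, out_features) of each adapted linear layer in one block.
# Assumed from the Mistral-7B configuration; adjust for other models.
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
N_BLOCKS = 32

ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]


def lora_trainable_params(target_modules: list[str], rank: int) -> int:
    """Total LoRA parameters: rank * (in + out) per layer, over all blocks."""
    per_block = sum(
        rank * (MISTRAL_7B_SHAPES[name][0] + MISTRAL_7B_SHAPES[name][1])
        for name in target_modules
    )
    return per_block * N_BLOCKS


attention_only = lora_trainable_params(ATTENTION, rank=16)
attention_and_mlp = lora_trainable_params(ATTENTION + MLP, rank=16)
print(f"attention only:  {attention_only:,}")
print(f"attention + MLP: {attention_and_mlp:,}")

# Effective scaling applied to the LoRA update.
lora_alpha, lora_rank = 32, 16
print(f"scaling = alpha / r = {lora_alpha / lora_rank}")
```

Adding the MLP projections roughly triples the adapter (about 13.6M versus 41.9M parameters at rank 16), and the count grows linearly with rank, which is worth knowing before you raise either setting on a small dataset.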

Loading and Using the Fine-Tuned Model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_fine_tuned_model(
    base_model_name: str, adapter_path: str
) -> tuple[AutoModelForCausalLM, AutoTokenizer]:
    """Load a base model with a LoRA adapter for inference."""
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()  # Merge adapter into base weights

    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    return model, tokenizer


# Inference
model, tokenizer = load_fine_tuned_model(
    "mistralai/Mistral-7B-v0.3", "./lora-adapter"
)

prompt = (
    "### Instruction:\nExtract structured data from the following text.\n\n"
    "### Input:\nAcme Corp reported Q3 2025 revenue of $4.2M, up 15% YoY.\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:\n")[-1])
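Since the task is structured extraction, the decoded text usually needs to be validated as JSON before anything downstream consumes it. One possible way to do that is sketched below, assuming the prompt template shown earlier. The helper name and the sample output (including its field names) are hypothetical; no schema is defined in this section.

```python
import json
from typing import Optional

RESPONSE_MARKER = "### Response:\n"


def extract_json_object(generated_text: str) -> Optional[dict]:
    """Return the first JSON object after the response marker, or None."""
    # Keep only what follows the first marker; empty if the marker is missing.
    _, _, completion = generated_text.partition(RESPONSE_MARKER)
    completion = completion.lstrip()
    try:
        # raw_decode parses one JSON value and ignores any trailing text,
        # which helps when the model keeps generating past the object.
        parsed, _end = json.JSONDecoder().raw_decode(completion)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical decoded output, including runaway text after the JSON.
sample = (
    "### Input:\nAcme Corp reported Q3 2025 revenue of $4.2M, up 15% YoY.\n\n"
    "### Response:\n"
    '{"company": "Acme Corp", "revenue_usd": 4200000}\n'
    "### Instruction:\nExtract structured data"
)
print(extract_json_object(sample))
print(extract_json_object("### Response:\nSorry, I cannot parse that."))
```

Tracking how often this returns `None` on a held-out set is a cheap, fully automatic format-compliance metric to run alongside the quality evaluations from section 7.3.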

When to Fine-Tune vs. When to Use RAG

This is not a matter of preference. It is a function of what you are trying to change:

Dimension	RAG	Fine-Tuning
What it changes	The model’s knowledge at query time	The model’s behavior permanently
Data freshness	Real-time — update the vector store anytime	Static — retraining required for new behavior
Setup cost	Hours (vector DB + ingestion pipeline)	Days (training data curation + training + evaluation)
Inference cost	Higher (embedding + vector search + LLM call)	Lower (single model call, can use smaller model)
Latency	Higher (retrieval adds 50–200ms)	Lower (no retrieval step)
Accuracy on domain facts	High if chunks are correct	Moderate — model may still hallucinate
Output format control	Moderate — depends on prompt compliance	High — model learns the format
Training data required	None (just documents)	100–10,000 examples depending on task complexity

Use RAG when:

The task requires knowledge that changes (product catalog, documentation, news)
You need source attribution (cite which document the answer came from)
You have documents but not labeled training examples
Accuracy on specific facts matters more than output format

Use fine-tuning when:

You need a specific output format that prompting cannot reliably produce
You want to reduce inference cost by using a smaller, specialized model
The model’s tone, style, or verbosity needs systematic adjustment
You have labeled examples and the patience to build an evaluation pipeline

Use both when:

You need domain knowledge (RAG) and specific behavior (fine-tuning)
Fine-tune the model for output format, then retrieve context for domain facts
This is the architecture behind most production-grade domain-specific assistants

The teams that ship the best NLP systems are not the ones that pick one approach and optimize it. They are the ones that match the approach to the failure mode. If the model does not know something, add retrieval. If the model knows but misbehaves, fine-tune. If the model knows and behaves but the answer is wrong, fix your evaluation pipeline — you have a measurement problem, not a model problem.