Precision at k, Recall, and Choosing the Right Metric

The Symptom

The team measures precision@10, averaged across the test queries, and sees 0.85. They are satisfied. But the three most important results for each query consistently appear at positions 7, 8, and 9. Precision@10 does not penalize this, because it ignores the order of results within the top 10. NDCG does.

The Internals

Precision@k measures the fraction of the top k results that are relevant:

$$\text{Precision@k} = \frac{\text{number of relevant documents in top k}}{k}$$

Precision@5 of 0.80 means 4 of the top 5 results are relevant. It does not say anything about the order of those 4 relevant results. A ranking of [relevant, irrelevant, relevant, relevant, relevant] and [irrelevant, relevant, relevant, relevant, relevant] both have precision@5 of 0.80.
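The order-insensitivity described above can be checked directly. The following is a minimal sketch, assuming binary relevance and hypothetical document ids ("a" through "d" relevant, "x" irrelevant); the class and method names are illustrative only.

```java
import java.util.List;
import java.util.Set;

class PrecisionOrderDemo {

    // Precision@k with binary relevance: relevant hits in the top k, divided by k.
    static double precisionAtK(List<String> ranking, Set<String> relevant, int k) {
        long hits = ranking.stream()
            .limit(k)
            .filter(relevant::contains)
            .count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        // Hypothetical ids: "x" is the single irrelevant result.
        Set<String> relevant = Set.of("a", "b", "c", "d");
        List<String> irrelevantSecond = List.of("a", "x", "b", "c", "d");
        List<String> irrelevantFirst = List.of("x", "a", "b", "c", "d");

        System.out.println(precisionAtK(irrelevantSecond, relevant, 5)); // 0.8
        System.out.println(precisionAtK(irrelevantFirst, relevant, 5));  // 0.8
    }
}
```

Both rankings score 0.8 even though the second one puts an irrelevant document in the first position.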

NDCG@k (defined in the parent chapter) accounts for the position of relevant results. Each result's gain is discounted by its rank, so a relevant document at position 5 contributes less to the score than the same document at position 1.
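To make the position effect concrete, here is a rough sketch of NDCG@k. It assumes a common formulation (linear gain, a log2(rank + 1) discount) and, as a simplification, builds the ideal ordering from the grades of the returned results only; the parent chapter's definition may differ in these details. The class and method names are hypothetical.

```java
import java.util.Arrays;

class DcgDiscountDemo {

    // DCG@k with linear gain and a log2(rank + 1) discount (an assumed, common variant).
    static double dcgAtK(int[] grades, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, grades.length); i++) {
            int rank = i + 1;
            double discount = Math.log(rank + 1) / Math.log(2);
            dcg += grades[i] / discount;
        }
        return dcg;
    }

    // NDCG@k: DCG of the ranking divided by the DCG of the same grades sorted best-first.
    static double ndcgAtK(int[] grades, int k) {
        int[] ideal = grades.clone();
        Arrays.sort(ideal); // ascending, reversed below
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idcg = dcgAtK(ideal, k);
        return idcg == 0.0 ? 0.0 : dcgAtK(grades, k) / idcg;
    }

    public static void main(String[] args) {
        // One relevant document, first at position 1, then at position 5.
        System.out.println(ndcgAtK(new int[] {1, 0, 0, 0, 0}, 5)); // 1.0
        System.out.println(ndcgAtK(new int[] {0, 0, 0, 0, 1}, 5)); // about 0.387
    }
}
```

Precision@5 is 0.20 for both rankings, while this NDCG@5 sketch drops from 1.0 to roughly 0.39 when the relevant document moves to the last position.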

Mean Average Precision (MAP) computes precision at every rank position where a relevant document appears, then averages:

$$\text{AP} = \frac{1}{R} \sum_{k=1}^{n} \text{Precision@k} \times \text{rel}(k)$$

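Here, $R$ is the total number of relevant documents for the query, $n$ is the number of results returned, and $\text{rel}(k)$ is 1 if the document at position $k$ is relevant and 0 otherwise. MAP is the mean of AP over all queries in the test set.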
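As a worked example, take a hypothetical ranking of five results [relevant, irrelevant, relevant, irrelevant, relevant] and assume these three are the only relevant documents for the query ($R = 3$, $n = 5$). Only positions 1, 3, and 5 contribute to the sum:

$$\text{AP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) = \frac{1}{3} \times \frac{34}{15} \approx 0.76$$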

The Implementation

import java.util.List;
import java.util.Map;

public class SearchMetrics {

    /**
     * Precision at k: fraction of top k results that are relevant.
     * Binary relevance: any grade > 0 is relevant.
     */
    public static double precisionAtK(List<String> results,
            Map<String, Integer> grades, int k) {

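        // Guard against division by zero and a negative stream limit.
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }

        long relevant = results.stream()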
            .limit(k)
            .filter(docId -> grades.getOrDefault(docId, 0) > 0)
            .count();

        return (double) relevant / k;
    }

    /**
     * Mean Average Precision across multiple queries.
     */
    public static double meanAveragePrecision(
            Map<String, List<String>> queryResults,
            Map<String, Map<String, Integer>> queryGrades) {

        double sumAP = 0.0;

        for (var entry : queryResults.entrySet()) {
            String queryId = entry.getKey();
            List<String> results = entry.getValue();
            // A query without judgments gets an empty grade map, so its AP is 0.
            Map<String, Integer> grades =
                queryGrades.getOrDefault(queryId, Map.of());

            sumAP += averagePrecision(results, grades);
        }

        // Avoid 0.0 / 0 (NaN) when there are no queries.
        return queryResults.isEmpty() ? 0.0 : sumAP / queryResults.size();
    }

    private static double averagePrecision(List<String> results,
            Map<String, Integer> grades) {

        int relevantSoFar = 0;
        double sumPrecision = 0.0;
        int totalRelevant = (int) grades.values().stream()
            .filter(g -> g > 0)
            .count();

        if (totalRelevant == 0) return 0.0;

        for (int i = 0; i < results.size(); i++) {
            if (grades.getOrDefault(results.get(i), 0) > 0) {
                relevantSoFar++;
                sumPrecision += (double) relevantSoFar / (i + 1);
            }
        }

        return sumPrecision / totalRelevant;
    }
}

The Measurement

Compare metrics on the same query test set to understand their behavior:

Query	Top 5 Results (R=relevant, I=irrelevant)	P@5	NDCG@5	AP
Q001	R, R, R, I, I	0.60	0.85	0.78
Q002	I, R, R, R, I	0.60	0.72	0.62
Q003	R, I, R, I, R	0.60	0.79	0.76

All three queries have identical precision@5 (0.60). NDCG and AP differentiate them. Q001, with relevant documents at the top, scores highest. Q002, with an irrelevant document at position 1, scores lowest. NDCG and AP capture what precision@k misses: position matters.
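The three patterns can be replayed in code. This is a sketch under stated assumptions, not a reproduction of the table: it assumes binary relevance, exactly three relevant documents per query, and no results beyond rank 5, none of which the table states. Under those assumptions the absolute AP values can differ from the AP column, which presumably reflects judgments not shown here, but the ordering of the three queries is the same. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TablePatternDemo {

    // Precision@k for a pattern such as "RRRII" ('R' = relevant, 'I' = irrelevant).
    static double precisionAtK(String pattern, int k) {
        long hits = pattern.chars().limit(k).filter(c -> c == 'R').count();
        return (double) hits / k;
    }

    // AP for a pattern, assuming totalRelevant relevant documents exist for the query.
    static double averagePrecision(String pattern, int totalRelevant) {
        int relevantSoFar = 0;
        double sumPrecision = 0.0;
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) == 'R') {
                relevantSoFar++;
                sumPrecision += (double) relevantSoFar / (i + 1);
            }
        }
        return totalRelevant == 0 ? 0.0 : sumPrecision / totalRelevant;
    }

    public static void main(String[] args) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("Q001", "RRRII");
        patterns.put("Q002", "IRRRI");
        patterns.put("Q003", "RIRIR");

        for (var entry : patterns.entrySet()) {
            System.out.println(String.format(Locale.ROOT, "%s P@5=%.2f AP=%.2f",
                entry.getKey(),
                precisionAtK(entry.getValue(), 5),
                averagePrecision(entry.getValue(), 3)));
        }
        // Q001 P@5=0.60 AP=1.00
        // Q002 P@5=0.60 AP=0.64
        // Q003 P@5=0.60 AP=0.76
    }
}
```

Precision@5 is identical for all three patterns, while AP ranks Q001 first, Q003 second, and Q002 last.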

The Decision Rule

Use NDCG@k as the primary metric when relevance grades are multi-level (0, 1, 2, 3). NDCG leverages the graded judgments to distinguish between “perfect” and “marginal” results in the ranking.

Use precision@k as a supplementary metric when you need a simple, interpretable number for stakeholder communication. “4 of the top 5 results are relevant” is easier to explain than “NDCG@5 is 0.78.”

Use MAP when you care about the full recall of relevant documents, not just the top k. Because AP divides by the total number of relevant documents, every relevant document that is missing from the results lowers the score, whereas NDCG@k only looks at the top k. This makes MAP appropriate for search use cases where finding all relevant documents matters (e.g., legal document review, prior art search).
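A small worked example illustrates the recall sensitivity. Suppose, hypothetically, that a query has four relevant documents ($R = 4$) and the system returns only three results, all of them relevant. Precision@3 is a perfect 1.0, but AP still charges for the missing document:

$$\text{AP} = \frac{1}{4}\left(\frac{1}{1} + \frac{2}{2} + \frac{3}{3}\right) = 0.75$$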

For the documentation platform, NDCG@5 is the primary metric because users rarely look beyond the first five results, and the difference between a “perfect” result (the exact page they need) and a “relevant” result (a related page) matters for user experience.