Skip to main content
fast by design

HashMap Internals and Beyond: Load Factor, Collision Chains, and When to Use Something Else

13 min read Chapter 28 of 90

HashMap Internals and Beyond: Load Factor, Collision Chains, and When to Use Something Else

HashMap is the most-used data structure in Java after ArrayList. It is also the most misunderstood. Developers treat it as a black box: put a key-value pair in, get it out by key, O(1) average case, done. But O(1) average case hides a range of outcomes from O(1) best case (no collisions, direct bucket hit) to O(log n) worst case (hash collisions causing tree traversal) to O(n) pathological case (terrible hashCode, all keys in one bucket before treeification kicks in).

The content platform uses HashMap extensively. The article metadata cache maps article IDs to metadata objects. The category index maps category strings to article lists. The user session store maps session tokens to user state. The recommendation engine maps user IDs to precomputed feature vectors. Each of these has different key types, different sizes, different access patterns, and different correctness requirements under concurrency.

This chapter opens the HashMap implementation, shows how bucket structure, load factor, and hash quality interact, benchmarks the impact of each, and identifies when HashMap is the wrong choice.

The Bucket Array

HashMap is backed by a Node<K,V>[] table. The table length is always a power of two. When you call put(key, value):

  1. Compute the hash: hash = key.hashCode(), then spread it: hash ^ (hash >>> 16). The spread mixes the high bits into the low bits to reduce collisions when the table length is small (since the bucket index uses only the low bits).

  2. Compute the bucket index: index = hash & (table.length - 1). Because the table length is a power of two, this is a fast bitwise AND instead of a modulo operation.

  3. If table[index] is null, create a new Node and store it there.

  4. If table[index] is not null, walk the chain at that bucket. If a node with the same key exists (by equals()), replace its value. Otherwise, append a new node to the chain.

  5. If the chain length at any bucket exceeds TREEIFY_THRESHOLD (8), and the table length is at least MIN_TREEIFY_CAPACITY (64), convert that bucket’s linked list into a red-black tree. This converts worst-case lookup from O(n) to O(log n).

  6. If the total number of entries exceeds capacity * loadFactor, resize the table to double its length and rehash all entries.

HashMap internals: bucket array, collision chains, and treeification

The diagram shows a HashMap with 16 buckets. Most buckets have zero or one entry. Bucket 3 has a collision chain of three nodes linked together. Bucket 7 has exceeded the treeification threshold and been converted to a red-black tree for O(log n) lookup instead of O(n) chain traversal.

Load Factor: The Space-Time Trade-Off

The default load factor is 0.75. This means the table resizes when 75% of buckets are occupied. A lower load factor means more empty buckets, fewer collisions, faster lookups, but more wasted memory. A higher load factor means fewer empty buckets, more collisions, slower lookups, but less memory.

// Default: load factor 0.75, initial capacity 16
Map<String, ArticleMetadata> cache = new HashMap<>();
// Resizes at 12 entries, then at 24, 48, 96, ...

// Lower load factor: fewer collisions, more memory
Map<String, ArticleMetadata> cache = new HashMap<>(64, 0.5f);
// Resizes at 32 entries. 50% of buckets stay empty.

// Higher load factor: more collisions, less memory
Map<String, ArticleMetadata> cache = new HashMap<>(64, 1.0f);
// Resizes at 64 entries. Buckets nearly full, chains form.
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class LoadFactorBenchmark {

    @Param({"10000"})
    private int size;

    private String[] keys;
    private Map<String, String> map050;
    private Map<String, String> map075;
    private Map<String, String> map100;
    private Map<String, String> map150;

    @Setup
    public void setup() {
        keys = new String[size];
        for (int i = 0; i < size; i++) {
            keys[i] = "article-" + UUID.randomUUID();
        }
        map050 = buildMap(0.5f);
        map075 = buildMap(0.75f);
        map100 = buildMap(1.0f);
        map150 = buildMap(1.5f);
    }

    private Map<String, String> buildMap(float loadFactor) {
        Map<String, String> map = new HashMap<>(16, loadFactor);
        for (String key : keys) {
            map.put(key, "value-" + key);
        }
        return map;
    }

    @Benchmark
    public String loadFactor050() {
        return map050.get(keys[ThreadLocalRandom.current().nextInt(size)]);
    }

    @Benchmark
    public String loadFactor075() {
        return map075.get(keys[ThreadLocalRandom.current().nextInt(size)]);
    }

    @Benchmark
    public String loadFactor100() {
        return map100.get(keys[ThreadLocalRandom.current().nextInt(size)]);
    }

    @Benchmark
    public String loadFactor150() {
        return map150.get(keys[ThreadLocalRandom.current().nextInt(size)]);
    }
}

Results for 10,000 entries with UUID-based keys (good hash distribution):

Load FactorGet (ns)Table SizeMemory (KB)Avg Chain Length
0.501832,7682560.31
0.752016,3841280.61
1.002416,3841280.61
1.50328,192641.22

With good hash distribution, the default 0.75 is near optimal. Dropping to 0.5 saves 2ns per get but doubles memory. Raising to 1.5 saves memory but costs 12ns per get (60% slower) due to longer collision chains.

The key insight: load factor tuning is almost never the performance lever you think it is. With a good hashCode(), the default 0.75 produces short chains regardless of size. Load factor tuning matters only when hash quality is poor or when you need deterministic worst-case latency.

Pre-Sizing: Avoid Rehashing

HashMap resizes by allocating a new table at double the length, then rehashing every entry into the new table. For a map that grows from 0 to 10,000 entries with default settings:

Resize at: 12 → 24 → 48 → 96 → 192 → 384 → 768 → 1536 → 3072 → 6144 → 12288
That is 11 resize operations, each rehashing all existing entries.
// SLOW: 11 resize operations
Map<String, ArticleMetadata> cache = new HashMap<>();
for (Article article : articles) { // 10,000 articles
    cache.put(article.id(), buildMetadata(article));
}

// FAST: zero resize operations
Map<String, ArticleMetadata> cache = new HashMap<>(
    (int) (10_000 / 0.75f) + 1 // capacity for 10k at load factor 0.75
);
for (Article article : articles) {
    cache.put(article.id(), buildMetadata(article));
}

The formula (expectedSize / loadFactor) + 1 computes the initial capacity that avoids any resizing. Since Java 19, you can use the factory method:

// FAST: Java 19+ factory method
Map<String, ArticleMetadata> cache = HashMap.newHashMap(10_000);

HashMap.newHashMap(int numMappings) computes the optimal capacity internally.

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class MapSizingBenchmark {

    @Param({"1000", "10000", "100000"})
    private int size;

    private String[] keys;
    private String[] values;

    @Setup
    public void setup() {
        keys = new String[size];
        values = new String[size];
        for (int i = 0; i < size; i++) {
            keys[i] = "key-" + i;
            values[i] = "val-" + i;
        }
    }

    @Benchmark
    public Map<String, String> defaultCapacity() {
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < size; i++) {
            map.put(keys[i], values[i]);
        }
        return map;
    }

    @Benchmark
    public Map<String, String> preSized() {
        Map<String, String> map = HashMap.newHashMap(size);
        for (int i = 0; i < size; i++) {
            map.put(keys[i], values[i]);
        }
        return map;
    }
}
SizeDefault (us)Pre-sized (us)Savings
1,000685224%
10,00082058029%
100,00011,2007,40034%

The savings come from eliminating table allocation and entry rehashing during growth. At 100,000 entries, the default path rehashes entries 17 times, copying increasingly large numbers of entries each time.

hashCode() Quality: The Silent Bottleneck

A bad hashCode() concentrates entries in a few buckets, creating long chains that degrade get/put from O(1) to O(n) (or O(log n) after treeification).

// SLOW: Bad hashCode clusters keys into few buckets
public class BadHashArticleId {
    private final String category;
    private final int sequence;

    // Uses only category for hash. All articles in same category
    // go to the same bucket.
    @Override
    public int hashCode() {
        return category.hashCode(); // SLOW: ignores sequence
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BadHashArticleId other)) return false;
        return category.equals(other.category) && sequence == other.sequence;
    }
}

If the content platform has 10 categories with 1,000 articles each, this hashCode creates 10 buckets with 1,000-element chains. Lookup becomes O(125) on average (1,000 / 8 treeified) instead of O(1).

// FAST: Good hashCode uses all identifying fields
public class GoodHashArticleId {
    private final String category;
    private final int sequence;

    @Override
    public int hashCode() {
        int result = category.hashCode();
        result = 31 * result + sequence; // FAST: includes sequence
        return result;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GoodHashArticleId other)) return false;
        return category.equals(other.category) && sequence == other.sequence;
    }
}
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class HashQualityBenchmark {

    private Map<BadHashArticleId, String> badMap;
    private Map<GoodHashArticleId, String> goodMap;
    private BadHashArticleId[] badKeys;
    private GoodHashArticleId[] goodKeys;
    private static final int SIZE = 10_000;
    private static final String[] CATEGORIES = {
        "tech", "science", "business", "health", "sports",
        "politics", "entertainment", "travel", "food", "education"
    };

    @Setup
    public void setup() {
        badMap = HashMap.newHashMap(SIZE);
        goodMap = HashMap.newHashMap(SIZE);
        badKeys = new BadHashArticleId[SIZE];
        goodKeys = new GoodHashArticleId[SIZE];

        for (int i = 0; i < SIZE; i++) {
            String cat = CATEGORIES[i % CATEGORIES.length];
            badKeys[i] = new BadHashArticleId(cat, i);
            goodKeys[i] = new GoodHashArticleId(cat, i);
            badMap.put(badKeys[i], "article-" + i);
            goodMap.put(goodKeys[i], "article-" + i);
        }
    }

    @Benchmark
    public String badHashGet() {
        return badMap.get(badKeys[ThreadLocalRandom.current().nextInt(SIZE)]);
    }

    @Benchmark
    public String goodHashGet() {
        return goodMap.get(goodKeys[ThreadLocalRandom.current().nextInt(SIZE)]);
    }
}
Hash QualityGet (ns)Collision chainsMax chain
Bad (category only)14510 buckets used~1,000 per bucket
Good (category + sequence)18~7,500 buckets used1-3 per bucket

The bad hashCode is 8x slower. With treeification, the worst case is O(log 1000) = ~10 tree node comparisons per lookup. Without treeification (in the linked list phase before 8 collisions in early JDK versions, or if UNTREEIFY_THRESHOLD is reached after resize), it would be O(500) average comparisons.

The Treeification Threshold

When a bucket’s chain reaches 8 nodes (TREEIFY_THRESHOLD), HashMap converts it from a linked list to a red-black tree. The tree nodes are larger (56 bytes vs 32 bytes for regular nodes) but provide O(log n) lookup instead of O(n).

There is a counter-threshold: UNTREEIFY_THRESHOLD = 6. When a resize causes a tree bucket to split and one of the resulting buckets has 6 or fewer entries, it converts back to a linked list. This hysteresis prevents constant conversion back and forth at the boundary.

Treeification also requires the table size to be at least MIN_TREEIFY_CAPACITY = 64. If the table is smaller, HashMap resizes instead of treeifying. The reasoning: with a small table, collisions are likely due to insufficient buckets, and resizing (which doubles the table and redistributes entries) will resolve them. Treeification is reserved for genuine hash clustering that persists even with a large table.

The Content Platform Metadata Cache

The article metadata cache is the most accessed HashMap in the content platform. Every page render, every search result, every recommendation lookup hits this cache.

// SLOW: unprepared metadata cache
public class ArticleMetadataCache {

    // Default capacity 16, will resize 14 times to reach 50,000 entries
    private final Map<String, ArticleMetadata> cache = new HashMap<>(); // SLOW

    public void loadAll(List<Article> articles) {
        for (Article article : articles) {
            cache.put(article.id(), buildMetadata(article));
        }
    }

    public ArticleMetadata get(String articleId) {
        return cache.get(articleId);
    }
}
// FAST: pre-sized, unmodifiable after load
public class ArticleMetadataCache {

    private volatile Map<String, ArticleMetadata> cache = Map.of(); // FAST

    public void loadAll(List<Article> articles) {
        Map<String, ArticleMetadata> building = HashMap.newHashMap(articles.size());
        for (Article article : articles) {
            building.put(article.id(), buildMetadata(article));
        }
        cache = Map.copyOf(building); // immutable, thread-safe, optimized
    }

    public ArticleMetadata get(String articleId) {
        return cache.get(articleId);
    }
}

Map.copyOf() returns an unmodifiable map with several advantages:

  • No synchronization needed for reads (immutable data is inherently thread-safe)
  • No modCount tracking (skip unnecessary bookkeeping)
  • Optimized for lookup (JDK may use a different internal layout)
  • The volatile reference ensures visibility of the new map across threads

The pattern: build with a mutable HashMap, publish as an immutable snapshot via Map.copyOf(), swap the volatile reference. Readers see a consistent snapshot with no locking.

Map.of() and Map.copyOf() Internals

For small maps (up to 2 entries), Map.of() uses specialized classes that store key-value pairs in fields without any array or hashing:

// 0 entries: singleton empty map
Map.of();

// 1 entry: Map1 class with key1, value1 fields
Map.of("id", "article-123");

// 2 entries: Map2 class with key1, value1, key2, value2 fields
Map.of("id", "article-123", "title", "Performance");

For 3+ entries, Map.of() uses an array-based probe sequence that is denser than HashMap’s bucket array. No linked list nodes, no tree nodes. Lookup is a hash probe into a flat array.

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class MapOfVsHashMap {

    private Map<String, String> hashMap;
    private Map<String, String> mapOf;
    private String[] keys;
    private static final int SIZE = 100;

    @Setup
    public void setup() {
        keys = new String[SIZE];
        Map<String, String> temp = HashMap.newHashMap(SIZE);
        for (int i = 0; i < SIZE; i++) {
            keys[i] = "key-" + i;
            temp.put(keys[i], "val-" + i);
        }
        hashMap = new HashMap<>(temp);
        mapOf = Map.copyOf(temp);
    }

    @Benchmark
    public String hashMapGet() {
        return hashMap.get(keys[ThreadLocalRandom.current().nextInt(SIZE)]);
    }

    @Benchmark
    public String mapCopyOfGet() {
        return mapOf.get(keys[ThreadLocalRandom.current().nextInt(SIZE)]);
    }
}
ImplementationGet (ns)Memory (bytes)
HashMap (100 entries)20~6,400
Map.copyOf (100 entries)16~3,200

Map.copyOf is ~20% faster for gets and uses ~50% less memory. It achieves this by eliminating Node objects (stores keys and values directly in an array), removing resize/modCount overhead, and enabling the JIT to optimize more aggressively (no mutation possible).

When Not to Use HashMap

HashMap is not always the answer. Three JDK alternatives serve specific patterns better:

EnumMap: When keys are enum constants. Uses the enum’s ordinal as a direct array index. No hashing, no collision handling, O(1) guaranteed. Memory: one Object[] of exactly EnumClass.values().length.

IdentityHashMap: When key comparison should use == instead of equals(). Uses System.identityHashCode() and reference equality. Useful for object graph traversal where you need to track “have I visited this exact object” regardless of logical equality.

LinkedHashMap: When you need insertion-order or access-order iteration. Maintains a doubly-linked list across all entries. 8 bytes per entry overhead compared to HashMap. Use for LRU caches with removeEldestEntry().

Section 1 dives into HashMap’s rehashing mechanics with JMH proof of the resize cost. Section 2 covers ConcurrentHashMap’s segment structure and the specialized map alternatives in depth.