
How HashMaps Work
Diving into Java source code to learn to understand associative data structures
Written by: Alex Root-Roatch | Friday, July 26, 2024
What's a Hash Map?
A hash map is an associative datatype that stores key-value pairs. It is also called:
- A "dictionary" in Python
- An "associative array" in C++ and PHP,
- A "hash" in Ruby and Perl,
- A "map" in Clojure and
- An "object" in JavaScript (JavaScript also has "maps," but they are slightly different).
They are unordered, dynamically sized, and very fast for inserting, deleting, and retrieving data. This makes them ideal for large datasets that may require frequent modification.
Unordered
Hash maps are unordered, meaning that the order of elements in the collection cannot be guaranteed to maintain the
ordered in which they were added. For example, adding the number 0-9 to a HashMap in Java as key value pairs
of {"0"=0}
comes back in numerical order regardless of the order the entries are inserted:
void addNKeys(HashMap<String, Integer>map, int n){
for (int i = 9; i >= 0; i--){
map.put(Integer.toString(i), i);
}
System.out.println(map);
}
=> {0=0, 1=1, 2=2, 3=3, 4=4, 5=5, 6=6, 7=7, 8=8, 9=9}
void addNKeys(HashMap<String, Integer>map, int n){
for (int i = 0; i < 10; i++){
map.put(Integer.toString(i), i);
}
System.out.println(map);
}
=> {0=0, 1=1, 2=2, 3=3, 4=4, 5=5, 6=6, 7=7, 8=8, 9=9}
However, when inserting them using the spelled out names of the numbers as the keys, they come back in an unexpected order:
{nine=9, six=6, four=4, one=1, seven=7, ten=10, two=2, three=3, five=5, eight=8}
This is due to how keys are hashed, which is how a key is turned into an index in memory for fast retrieval. This is an important aspect of how hash maps are stored.
Storing Hash Maps
Hash maps are an interesting combination of arrays and linked lists. Looking at the Java source code, we see that each key-value pair in the map is stored as a node:
static class Node<K,V> implements Map.Entry<K,V> {
final int hash;
final K key;
V value;
Node<K,V> next;
Node(int hash, K key, V value, Node<K,V> next) {
this.hash = hash;
this.key = key;
this.value = value;
this.next = next;
}
This bears a strong resemblance to linked lists:
private static class Node<E> {
E item;
Node<E> next;
Node<E> prev;
Node(Node<E> prev, E element, Node<E> next) {
this.item = element;
this.next = next;
this.prev = prev;
}
}
Rather than this.item
in the linked list, there is this.key
and this.value
. There's also this.next
but
no this.prev
, meaning this forms a single-linked list as opposed to the double-linked list of java.util.LinkedList
.
But what about this.hash
?
Continuing to scroll down the source code reveals an array of key-value nodes:
transient Node<K,V>[] table;
It's this array that is actually storing the map, and each index of the array is a linked list. This is
where this.hash
comes in.
Hashing
A hash function is a function that converts data (e.g. a string, integer, or keyword) of any size into a value of a fixed size. Java's hash function outputs a 32-bit signed integer. In the context of hash maps, this hash function is used to convert the key of the key-value pair into a 32-bit signed integer that is then used as the index of the array.
It is possible for two or more keys to potentially lead to the same 32-bit integer, causing an unwanted collision.
In an array, if an element is being added at an occupied index, all the pre-existing elements are shifted
down one index to make room for the new element. To avoid this, the new key-value node is added to the one that is
already there, creating a linked list at that index. All the elements in one linked list will share the same value as
their this.hash
, so the hash
field of each node can simply be thought of as the index in the array that the linked
list occupies.
This is often described as each linked list at each index in the array being like a "bin" or "bucket" that elements are "dropped" into.
Memory Allocation
When dealing with arrays, the allocation of contiguous block of memory is an important concern. In Java, Hash Maps are initialized with a capacity of 16. Rather than waiting for this capacity to be reached before re-allocating more memory, hash maps will re-allocate memory and re-hash based on their load-factor, which is the capacity at which this will happen. In Java, the default load factor is 0.75, meaning that memory is re-allocated and keys are re-hashed when the array reaches 75% capacity.
The combination of array, linked list, and allocating more memory when the array is only 75% full means that hash maps
have more memory overhead than arrays or linked lists by themselves. More complex implementation of maps --
like java.util.SortedMap
, which maintains each elements position based on insertion order -- will have even more
memory overhead.
Data Retrieval
To retrieve an item from the hash map, the key being searched is hashed to find its index in the array. If there is more than one key-value pair node at that index, the linked list is traversed until the desired key-value pair is found. Let's look at how retrieving data from maps compares to array lists and linked lists:
- Retrieving the middle item out of ten items:
- HashMap -- 0.01643ms
- Array -- 0.005994ms
- Linked List -- 0.01964ms
In this case, hash maps and linked lists are close in performance, while arrays are a little more than twice as fast. The story changes dramatically with large datasets, though:
- Retrieving the middle item out of 1,000,000 items
- HashMap -- 0.014793ms
- Array -- 0.007988ms
- Linked List -- 4.633918ms
Traversing 500,000 nodes in a linked list took significantly longer, while the array barely saw a decrease in speed, and the hash map was actually slightly faster!
Conclusion
Hash maps are an array of linked list in which each node is a key-value pair. The index at which each item will be inserted is determined by the hash function, which turns each key into a 32-bit signed integer representing the index. Keys that evaluate to the same integer by the hash function are added to the linked list at that index. Because of this, each index can be thought of as a "bucket" that holds multiple elements. The array by default has an initial capacity of 16 and is expanded and the keys re-hashed when the array reach 75% capacity. This is known as the load-factor.