In this post, I break down how to implement a memory-efficient HashSet from scratch in Java using separate chaining, dynamic resizing, and amortized analysis, without relying on built-in data structures.
Problem Breakdown
In this LeetCode problem, we are asked to implement a class that simulates the functionality of a HashSet without using any built-in HashSet data structure.

We are required to implement the following methods:

- `void add(key)` - Inserts the value `key` into the HashSet.
- `bool contains(key)` - Returns whether the value `key` exists in the HashSet or not.
- `void remove(key)` - Removes the value `key` from the HashSet. If `key` does not exist, do nothing.
At first glance, this may seem straightforward, especially if you have used a `HashSet` in any language before. But the real challenge lies in achieving:

- O(1) average time complexity for all operations (add, contains, remove).
- Efficient memory usage, especially when we only store a small subset of values within a large keyspace that can go up to 1,000,000.

I saw many solutions that essentially just initialise a boolean array of size 1,000,001 and toggle its entries based on the key. This approach works, but it is extremely wasteful in terms of space, particularly when only a few keys are used.
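For comparison, here is roughly what that boolean-array approach looks like (a minimal sketch, not the design we will build):

```java
class NaiveHashSet {
    // one boolean flag per possible key: ~1,000,001 entries allocated up front,
    // no matter how few keys we actually store
    private final boolean[] present = new boolean[1_000_001];

    public void add(int key) { present[key] = true; }
    public void remove(int key) { present[key] = false; }
    public boolean contains(int key) { return present[key]; }
}
```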
Hence we get to the crux of our problem: how can we minimize memory usage while still maintaining fast performance?

Let's dive into a proper design that balances time efficiency and space optimization using a real-world-inspired data structure: a dynamic array of linked lists (i.e., buckets).
Solution Breakdown
Overview
- I will try to explain the solution step by step with small code snippets along the way. The full code can be found at the bottom of the post.
First, to design an efficient `HashSet` from scratch, we draw inspiration from how real-world hash tables work under the hood. The core idea is to:

- Use an array of buckets, where each bucket is implemented as a singly linked list.
- Store keys as `Node` instances within these buckets.
- Dynamically resize the hash table as more keys are added, based on a load factor threshold.
- Maintain average O(1) time complexity for `add`, `remove`, and `contains`, while using memory more efficiently than a flat 1-million-element array.

But why use this more complicated approach? Simpler solutions do exist (e.g., initializing a boolean array of size 1,000,001); however, they are extremely memory-inefficient in scenarios where only a few keys are inserted.

Our approach aims to offer a balance of time efficiency and space optimization, using a hash-based strategy with separate chaining.
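Conceptually, the structure we are about to build looks something like this (an illustrative sketch with made-up keys, assuming the initial capacity of 16):

```text
hashTable (16 buckets)
[ 0] -> 16 -> 32     // 16 % 16 == 0 and 32 % 16 == 0, so they share bucket 0
[ 1] -> 17 -> 1      // 17 and 1 collide in bucket 1
[ 2] -> null
 ...
[15] -> 31           // 31 % 16 == 15
```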
Data Structure Design
This is the given class we must implement:

```java
class MyHashSet {

    public MyHashSet() {
    }

    public void add(int key) {
    }

    public void remove(int key) {
    }

    public boolean contains(int key) {
    }
}
```

We begin by defining a static `Node` class to represent the elements stored in each bucket:

```java
static class Node {
    int val;
    Node next;

    Node(int val) {
        this.val = val;
    }
}
```

- Each `Node` stores a key (`val`) and a reference to the next node in the linked list (`next`).
- Buckets are simply linked lists of these nodes.

The `MyHashSet` class maintains:

```java
private Node[] hashTable;
private int size;

public MyHashSet() {
    this.hashTable = new Node[16];
}
```

- `hashTable` is the array of buckets (initially of size 16).
- `size` tracks the total number of keys currently inserted.
Hashing and Indexing Logic

We define a helper method to compute a stable hash value and another one to derive the target index within the hash table:

```java
private int hashKey(int key) {
    return Math.abs(Integer.hashCode(key));
}

private int getIndex(int key) {
    return hashKey(key) % hashTable.length;
}
```

- `Integer.hashCode(key)` returns the hash of the integer key (in this case, the key itself).
- Taking the modulo of the hash with the array length ensures the key maps to a valid index.
- We use `Math.abs` to avoid negative indices.
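As a quick illustration of the mapping (a hypothetical standalone snippet, not part of the class):

```java
public class IndexDemo {
    public static void main(String[] args) {
        int capacity = 16; // initial table size used above
        // Integer.hashCode(key) is the key itself, so the index is simply key % capacity
        for (int key : new int[]{1, 17, 42, 1_000_000}) {
            int index = Math.abs(Integer.hashCode(key)) % capacity;
            System.out.println(key + " -> bucket " + index);
        }
        // prints: 1 -> bucket 1, 17 -> bucket 1, 42 -> bucket 10, 1000000 -> bucket 0
        // note that 1 and 17 collide, which is exactly why each bucket is a linked list
    }
}
```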
Load Factor and Resizing

To avoid performance degradation due to collisions, we use the load factor to determine when to resize the hash table:

```java
private double getLoadFactor() {
    return (double) size / hashTable.length; // resize occurs at >= 0.75
}
```

- When the load factor reaches 0.75, we double the capacity of the hash table.
- This reduces the likelihood of collisions and keeps average operation time close to O(1).
- For example, if our hash table has 16 buckets and currently stores 12 keys, the load factor is exactly 12 / 16 = 0.75.
- At this point we would call the `resize()` method, which is discussed next.
The `resize()` method handles this reallocation:

```java
private void resize() {
    int newSize = hashTable.length * 2;
    Node[] newHashTable = new Node[newSize]; // new table will be twice the size

    for (Node head : hashTable) {
        while (head != null) {
            Node next = head.next;
            int newIndex = hashKey(head.val) % newSize;
            head.next = newHashTable[newIndex]; // in case of a collision, we insert the node at the head
            newHashTable[newIndex] = head;
            head = next;
        }
    }

    hashTable = newHashTable; // re-assign the old hash table reference to the new one
}
```

- This method creates a new hash table in memory that is twice the size of the old one.
- It then loops through each bucket (each head `Node`) in the old hash table.
- For each bucket, we traverse its linked list using a `while` loop.
- The first thing we do inside the `while` loop is store a reference to the next node so that we don't lose it.
- We then compute the index of the current node in the new hash table and store this value in `newIndex`.
- We then insert the current node at the head of the linked list found at `newHashTable[newIndex]`.
  - I was initially confused by this part, but we do it because collisions can also occur during the copying phase.
  - That is, `newHashTable[newIndex]` could already hold one or more nodes, so if we simply did `newHashTable[newIndex] = head`, we would overwrite the entire linked list already stored at `newHashTable[newIndex]`.
- For this reason, we first do `head.next = newHashTable[newIndex]` so that the current node `head` points to the beginning of that list.
- After that, we can safely do `newHashTable[newIndex] = head` so that a reference to `head` is stored at `newHashTable[newIndex]`.
- We then move to the next element of the old linked list with `head = next`. See why we stored a reference to the next node earlier?
- Once the entire old hash table has been copied over, we make `hashTable` point to `newHashTable`.
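To make the rehashing concrete: with the old capacity of 16, keys 1 and 17 both live in bucket 1 (1 % 16 == 17 % 16 == 1). After doubling to 32, key 1 stays in bucket 1 while key 17 moves to bucket 17 (17 % 32 == 17), so the old chain gets spread out across the larger table.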
Method Implementation
Now that we understand the helper methods, let's start with the `add()` method:

```java
public void add(int key) {
    if (getLoadFactor() >= 0.75) {
        resize(); // copy the table over to a new, larger one
    }

    // check if the key already exists in the table
    if (contains(key)) return;

    // get the key's index
    int index = getIndex(key);

    // insert the new node at the beginning of the bucket at index
    Node newNode = new Node(key);
    newNode.next = hashTable[index];
    hashTable[index] = newNode;
    size++;
}
```

- Before adding the `key`, we first check whether we have reached the load factor threshold. If we have, we call `resize()` to create a new hash table that is double the current size and copy all the elements over (as we saw above).
- This step is important because if we keep adding elements while the table stays fixed in size, the number of collisions keeps growing until it is virtually guaranteed that any new element maps to a bucket that is already occupied (i.e., a collision).
- Not resizing would cause the other two methods, `remove()` and `contains()`, to degrade to a time complexity of O(k) instead of O(1), where k is the number of elements in a given bucket.
- The `add()` method itself would always run in O(1) time even if we never resized the hash table. Adding the call to `resize()` makes the method occasionally run in O(n) time, but those calls become exponentially less frequent as the hash table grows, so `add()` runs in amortized O(1) time (see the worked numbers below).
  - Amortization is the term we use whenever there is a fixed cost within our system and we reduce the frequency at which this cost is incurred.
  - In our case, the hash table is resized less and less often over time, which amortizes the cost of `add()` down to O(1) per call.
- After checking the load factor, we check whether the `key` already exists in the hash table by calling the `contains()` method (discussed below). For now, just know that this method runs in average O(1) time, but can take O(k) time when there are collisions.
- If the key does not exist in the hash table, we get the index of the bucket it will be inserted into.
- We then create a new `Node` storing the `key` and insert it at the head of the linked list found at `hashTable[index]`.
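To put rough numbers on the amortized claim: starting from a capacity of 16 with a 0.75 threshold, inserting 1,000 distinct keys triggers a resize when the size reaches 12, 24, 48, 96, 192, 384, and 768. That is only 7 resizes, rehashing 12 + 24 + 48 + 96 + 192 + 384 + 768 = 1,524 nodes in total, or roughly 1.5 extra node moves per insertion on average.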
Now let's move on to the `remove()` method:

```java
public void remove(int key) {
    int index = getIndex(key);
    if (hashTable[index] == null) return; // key does not exist

    Node cur = hashTable[index];

    // key is the first node in the bucket
    if (cur.val == key) {
        hashTable[index] = cur.next;
        size--;
        return;
    }

    // find the node previous to the one that needs to be removed
    while (cur != null && cur.next != null) {
        if (cur.next.val == key) {
            cur.next = cur.next.next;
            size--;
            return;
        }
        cur = cur.next;
    }
    // key does not exist
}
```

- The first thing we do is get the index of `key` and check whether its bucket is empty. If it is, `key` does not exist in our hash table.
- However, if there is a `Node` at `hashTable[index]`, there is a possibility that `key` exists. We still cannot be 100% sure because of collisions.
- Before traversing the list, we check whether `key` is stored at the head. If it is, we remove it by overwriting `hashTable[index]` with `cur.next`, the second element in the list. We also decrement the size of our hash table and exit the method.
- If `key` is not stored at the head of the list, we must traverse the list to find the node previous to the one storing `key`.
- We look for the previous node so that we can remove any reference to the `key` node by pointing the previous node's `next` pointer at the node after it.
  - This effectively unlinks the `key` node from the list.
- The garbage collector will then automatically reclaim that node, since it is no longer referenced anywhere in our program.
- If we traverse the entire list without finding the node storing `key`, then it does not exist.
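As a quick trace: suppose the capacity is 16 and we added 1 and then 17, which both hash to bucket 1. Because new nodes go to the head, the chain is 17 -> 1. Calling `remove(1)` sees that the head holds 17, not 1, so it walks the chain; at the node holding 17, `cur.next.val == 1`, so `cur.next` is set to `cur.next.next` (null here), leaving the bucket with just 17.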
Finally, we can look at `contains()`, our last method:

```java
public boolean contains(int key) {
    int index = getIndex(key);
    Node cur = hashTable[index];
    while (cur != null) {
        if (cur.val == key) return true;
        cur = cur.next;
    }
    return false;
}
```

- This method simply checks whether a `key` exists in our hash table.
- It does so by finding the index the `key` would map to.
- It then traverses the list found at `hashTable[index]`, comparing the `val` of each node with `key`.
- We return `true` if we find a node whose value equals `key`; otherwise, we return `false` after exiting the loop.
Full Code
```java
class MyHashSet {

    static class Node {
        int val;
        Node next;

        Node(int val) {
            this.val = val;
        }
    }

    private Node[] hashTable;
    private int size;

    public MyHashSet() {
        this.hashTable = new Node[16];
    }

    /**
     * Worst-case time complexity: O(k), where k is the number of nodes in a bucket (due to collisions).
     * Amortized time complexity: O(1) for add/remove/contains when keys are well-distributed.
     */
    public void add(int key) {
        if (getLoadFactor() >= 0.75) {
            resize(); // copy the table over to a new, larger one
        }

        // check if the key already exists in the table
        if (contains(key)) return;

        // get the key's index
        int index = getIndex(key);

        // insert the new node at the beginning of the bucket at index
        Node newNode = new Node(key);
        newNode.next = hashTable[index];
        hashTable[index] = newNode;
        size++;
    }

    public void remove(int key) {
        int index = getIndex(key);
        if (hashTable[index] == null) return; // key does not exist

        Node cur = hashTable[index];

        // key is the first node in the bucket
        if (cur.val == key) {
            hashTable[index] = cur.next;
            size--;
            return;
        }

        // find the node previous to the one that needs to be removed
        while (cur != null && cur.next != null) {
            if (cur.next.val == key) {
                cur.next = cur.next.next;
                size--;
                return;
            }
            cur = cur.next;
        }
        // key does not exist
    }

    public boolean contains(int key) {
        int index = getIndex(key);
        Node cur = hashTable[index];
        while (cur != null) {
            if (cur.val == key) return true;
            cur = cur.next;
        }
        return false;
    }

    private int getIndex(int key) {
        return hashKey(key) % hashTable.length;
    }

    private int hashKey(int key) {
        return Math.abs(Integer.hashCode(key));
    }

    private double getLoadFactor() {
        return (double) size / hashTable.length; // resize occurs at >= 0.75
    }

    private void resize() {
        int newSize = hashTable.length * 2;
        Node[] newHashTable = new Node[newSize]; // new table will be twice the size

        for (Node head : hashTable) {
            while (head != null) {
                Node next = head.next;
                int newIndex = hashKey(head.val) % newSize;
                head.next = newHashTable[newIndex]; // in case of a collision, insert the node at the head
                newHashTable[newIndex] = head;
                head = next;
            }
        }

        hashTable = newHashTable; // re-assign the old hash table reference to the new one
    }
}
```
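Here is a quick usage sketch of the finished class (assuming it is compiled together with a standard `main` method):

```java
public class Main {
    public static void main(String[] args) {
        MyHashSet myHashSet = new MyHashSet();
        myHashSet.add(1);
        myHashSet.add(2);
        System.out.println(myHashSet.contains(1)); // true
        System.out.println(myHashSet.contains(3)); // false
        myHashSet.add(2);                          // 2 is already present, so nothing changes
        System.out.println(myHashSet.contains(2)); // true
        myHashSet.remove(2);
        System.out.println(myHashSet.contains(2)); // false
    }
}
```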
Complexity Analysis
- The hash set uses separate chaining to handle collisions, storing elements in linked lists within each bucket. Operations like add, remove, and contains are generally O(1) on average when keys are well-distributed.
- In the worst case, when all keys hash to the same bucket, operations degrade to O(k), where k is the number of elements in that bucket, as mentioned earlier.
- The hash table resizes (doubles in capacity) when the load factor reaches 0.75, which ensures low collision probability.
- Resizing is O(n) but occurs infrequently, so its cost is amortized over multiple operations.
- Space complexity accounts for both the hash table array (m buckets) and all inserted elements (n keys).
Time Complexity Table
| Operation    | Average Case     | Worst Case |
|--------------|------------------|------------|
| `add()`      | O(1) (amortized) | O(k)       |
| `remove()`   | O(1) (amortized) | O(k)       |
| `contains()` | O(1) (amortized) | O(k)       |
Space Complexity Table
| Component        | Space Used | Explanation                                            |
|------------------|------------|--------------------------------------------------------|
| Hash Table Array | O(m)       | m is the number of buckets (doubles on resize)         |
| Stored Elements  | O(n)       | Each inserted key is stored as a node in a linked list |
| Total Space      | O(n + m)   | Accounts for both table size and number of elements    |
Possible Improvements
- In production-grade hash tables like Java's `HashMap`, buckets that exceed a certain threshold (typically 8 nodes) are converted from linked lists into balanced trees (e.g., red-black trees) to improve worst-case lookup time from O(k) to O(log k), where k is the total number of elements in the bucket.
- Memory usage can be further optimized by shrinking the hash table when the load factor drops below a lower threshold (commonly 0.25), which helps avoid wasted space after many deletions; a rough sketch follows after this list.
- Advanced hash functions or open addressing (e.g., linear probing, quadratic probing) could be explored as alternatives to separate chaining, though they come with their own trade-offs.
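As a sketch of that shrinking idea (hypothetical additions to the class above; `MIN_CAPACITY` and `maybeShrink()` are names I made up, and `remove()` would call `maybeShrink()` after a successful deletion):

```java
private static final int MIN_CAPACITY = 16;

private void maybeShrink() {
    // shrink when the table is less than a quarter full, but never below the initial capacity
    if (hashTable.length > MIN_CAPACITY && getLoadFactor() < 0.25) {
        int newSize = hashTable.length / 2;
        Node[] newHashTable = new Node[newSize];
        // rehash every node into the smaller table, mirroring resize()
        for (Node head : hashTable) {
            while (head != null) {
                Node next = head.next;
                int newIndex = hashKey(head.val) % newSize;
                head.next = newHashTable[newIndex];
                newHashTable[newIndex] = head;
                head = next;
            }
        }
        hashTable = newHashTable;
    }
}
```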
If you found this breakdown helpful or took a different approach to designing HashSet, I’d love to hear your thoughts!
Feel free to drop a comment, leave a ❤️, or share how you'd optimize it further.
Follow me for more deep dives into LeetCode design problems and backend engineering concepts.