When we’re working with data, we often need to store and retrieve certain values. This can become problematic, especially when the dataset is substantial. Traditionally, arrays are used to index data, but these can take undesirable lengths of time to compute. Hashing is an approach to solve this issue and is often used with data structures. Continue reading to get the lowdown on what hashing is, how it works and what we can use it for in programming.
What Is Hashing and What Is It Used For?
In simple words, hashing is a technique that’s used to make data storage and retrieval more efficient. Consider the scenario where you have an encyclopedia full of technical terms. If you want to find a specific explanation in the book, you could just check every page successively, although this would take a very long time on average. You could also use the index page to locate the information, although this could still take a while if the encyclopedia is especially large. A much quicker way to find the information you need would be by hashing.
With hashing, we map the data to an array index, using something known as a hash function. This function produces a hash code or hash value for each input, and this is used to index the data. We can use hashing with a lot of different data structures, including hash tables, hashed heaps, cryptographic hash functions and bloom filters.
Here is a quick overview of these structures.
This is a structure where a hash function is used to map inputs to indexes in an array, in what we know as a key-value pair. The key is the input, and a hash value is produced, which gives the index of the value.
We can consider this a combination of a hash table with a heap (a tree structure). Typically, hashed heaps are used for modifying, inserting and deleting elements, as well as retrieving minimum and maximum elements.
Cryptographic Hash Functions
These have a complicated name but aren’t too difficult to understand. Cryptographic simply means that these functions are intended to be used in security-based applications. The function is used to take an input and produce a hash value. This value is then generally encrypted to ensure data integrity. Some key elements of a cryptographic hash function are that the result must always be the same, given the same input, and it must be impossible to generate the input from the resulting hash value.
This is a type of data structure known as a probabilistic structure. Basically, this means that they test the likelihood of whether an element belongs to a specific category. Bloom filters work by hashing a value and using a bit array (an array where the bits are set with values 0 or 1). The bits that correspond to the element are set to 1. When testing another element, the element is hashed using the same function and its bits are compared. If they’re all 1, we can be fairly confident this element belongs to the same set.
How Does Hashing Work?
As previously mentioned, hashing requires a hash function and a hash table. In the key-value pair, the key is used to identify the data, and the hash function is used to map this. For example, if we want to map a color to the corresponding type of fruit, we could use the color as the key. The related hash values would then be the fruit type. The hash function could be several things, but a relatively simple function would count the length of the string input, assuming that all the fruit colors contain a different number of characters (i.e. red for strawberry, green for kiwi, yellow for banana, blue for blueberry).
In this case, we would compute the hash code for the color and place the key and value at the correct index in the respective array. So, for a strawberry, the hash code would have a length of 3, therefore the color “Red” would be stored at the 3rd position in the key array, and the fruit “Strawberry” would be placed in the 3rd position in the value array.
What Are Collisions and How Do We Avoid Them?
While hashing in data structures is an extremely useful process, you can run into some problems. The most commonly encountered issue is a collision. Simply put, this is where two different input values end up with identical hash codes, and therefore should theoretically be stored in the same index. This can lead to unpredictable behavior and major problems when trying to store or retrieve data. We mainly employed two methods to manage collisions effectively: separate chaining and open addressing. We’re covering both of these next.
Here, we use an extra data structure to avoid collisions. This often takes the form of a linked list, but not always. In the hash table, each index is pointed to a linked list of elements. When we encounter a value with a duplicated index, this element is added to the linked list at the same index. When we want to retrieve a value, we look at the linked list and compare the key to find our data. Typically, separate chaining doesn’t require too much memory to handle, but if the list is appreciably long, you can run into memory issues.
Another way to handle collisions is by open addressing. This doesn’t rely on any new structures like a linked list. Instead, if we run into a new element with the same index, it’s simply placed into the next slot that’s available in the hash table. When searching for the data value, the function begins at the expected element index. If the element isn’t found here, subsequent indices are searched sequentially until the element is found. This can be a more useful method when storage is tight, but open addressing can be slower than separate chaining. There are 3 main methods used for open addressing – linear probing, quadratic probing and double hashing. The example we just gave demonstrated linear probing, so let’s briefly cover the other two here.
This is similar to linear probing, except the element is placed according to a quadratic function, rather than a linear function. This can help to reduce the possibility of clustering (where elements are located close together) because elements spread more widely across the hash table. Quadratic probing can also lead to an infinite loop, where the function is continuously returning the index occupied by another element. However, we can somewhat manage this by using the double hashing method instead.
When we use double hashing, we use two hash functions instead of one. If the desired index is already occupied, the secondary hash function is used to generate a step size (the number of steps to take to reach the next empty index). If this index isn’t empty, then the steps are incremented until an empty index is reached. We can still run into infinite loops if the hash function isn’t sufficient, but double hashing typically gives more efficient performance when the table is densely populated.
What Are the Main Applications of Hashing?
We can use hashing for many things in programming. Here are a few common applications, along with the type of hashing used.
- Hash tables: indexing databases, caching and spell checkers
- Hash heaps: Sorting algorithms and shortest-path algorithms
- Cryptographic hash functions: security, digital signatures, password storage and message authentication
- Bloom filters: databases, data compression, networking, web query optimization, and DNA sequence analysis.
Hashing in data structures is an essential technique when working with data. This is particularly true for large datasets where you desire quick information storage and retrieval. There are many methods used to implement hashing, such as hash heaps, hash tables, bloom filters and cryptographic functions. Whichever method is appropriate will depend on your use case. Collision handling is an important aspect to be aware of when hashing. We can handle collisions by using either open addressing or separate chaining methods.
The image featured at the top of this post is ©graphicwithart/Shutterstock.com.