© graphicwithart / Shutterstock.com

When we’re working with data, we often need to store and retrieve certain values. This can become problematic, especially when the dataset is substantial. Traditionally, arrays are used to index data, but these can take undesirable lengths of time to compute. Hashing is an approach to solve this issue and is often used with data structures. Continue reading to get the lowdown on what hashing is, how it works and what we can use it for in programming.

What Is Hashing and What Is It Used For?

In simple words, hashing is a technique that’s used to make data storage and retrieval more efficient. Consider the scenario where you have an encyclopedia full of technical terms. If you want to find a specific explanation in the book, you could just check every page successively, although this would take a very long time on average. You could also use the index page to locate the information, although this could still take a while if the encyclopedia is especially large. A much quicker way to find the information you need would be by hashing.

With hashing, we map the data to an array index, using something known as a hash function. This function produces a hash code or hash value for each input, and this is used to index the data. We can use hashing with a lot of different data structures, including hash tables, hashed heaps, cryptographic hash functions and bloom filters.

Here is a quick overview of these structures.

Hash Tables

This is a structure where a hash function is used to map inputs to indexes in an array, in what we know as a key-value pair. The key is the input, and a hash value is produced, which gives the index of the value. 

Hashed Heaps 

We can consider this a combination of a hash table with a heap (a tree structure). Typically, hashed heaps are used for modifying, inserting and deleting elements, as well as retrieving minimum and maximum elements. 

Cryptographic Hash Functions

These have a complicated name but aren’t too difficult to understand. Cryptographic simply means that these functions are intended to be used in security-based applications. The function is used to take an input and produce a hash value. This value is then generally encrypted to ensure data integrity. Some key elements of a cryptographic hash function are that the result must always be the same, given the same input, and it must be impossible to generate the input from the resulting hash value.

Bloom Filters

This is a type of data structure known as a probabilistic structure. Basically, this means that they test the likelihood of whether an element belongs to a specific category. Bloom filters work by hashing a value and using a bit array (an array where the bits are set with values 0 or 1). The bits that correspond to the element are set to 1. When testing another element, the element is hashed using the same function and its bits are compared. If they’re all 1, we can be fairly confident this element belongs to the same set.

How Does Hashing Work?

hash function
Hash functions map keys to values in matching pairs.

As previously mentioned, hashing requires a hash function and a hash table. In the key-value pair, the key is used to identify the data, and the hash function is used to map this. For example, if we want to map a color to the corresponding type of fruit, we could use the color as the key. The related hash values would then be the fruit type. The hash function could be several things, but a relatively simple function would count the length of the string input, assuming that all the fruit colors contain a different number of characters (i.e. red for strawberry, green for kiwi, yellow for banana, blue for blueberry).

In this case, we would compute the hash code for the color and place the key and value at the correct index in the respective array. So, for a strawberry, the hash code would have a length of 3, therefore the color “Red” would be stored at the 3rd position in the key array, and the fruit “Strawberry” would be placed in the 3rd position in the value array.

What Are Collisions and How Do We Avoid Them?

While hashing in data structures is an extremely useful process, you can run into some problems. The most commonly encountered issue is a collision. Simply put, this is where two different input values end up with identical hash codes, and therefore should theoretically be stored in the same index. This can lead to unpredictable behavior and major problems when trying to store or retrieve data. We mainly employed two methods to manage collisions effectively: separate chaining and open addressing. We’re covering both of these next.

Separate Chaining

Here, we use an extra data structure to avoid collisions. This often takes the form of a linked list, but not always. In the hash table, each index is pointed to a linked list of elements. When we encounter a value with a duplicated index, this element is added to the linked list at the same index. When we want to retrieve a value, we look at the linked list and compare the key to find our data. Typically, separate chaining doesn’t require too much memory to handle, but if the list is appreciably long, you can run into memory issues.

Open Addressing

Another way to handle collisions is by open addressing. This doesn’t rely on any new structures like a linked list. Instead, if we run into a new element with the same index, it’s simply placed into the next slot that’s available in the hash table. When searching for the data value, the function begins at the expected element index. If the element isn’t found here, subsequent indices are searched sequentially until the element is found. This can be a more useful method when storage is tight, but open addressing can be slower than separate chaining. There are 3 main methods used for open addressing – linear probing, quadratic probing and double hashing. The example we just gave demonstrated linear probing, so let’s briefly cover the other two here.

Quadratic Probing 

This is similar to linear probing, except the element is placed according to a quadratic function, rather than a linear function. This can help to reduce the possibility of clustering (where elements are located close together) because elements spread more widely across the hash table. Quadratic probing can also lead to an infinite loop, where the function is continuously returning the index occupied by another element. However, we can somewhat manage this by using the double hashing method instead.

Double Hashing

When we use double hashing, we use two hash functions instead of one. If the desired index is already occupied, the secondary hash function is used to generate a step size (the number of steps to take to reach the next empty index). If this index isn’t empty, then the steps are incremented until an empty index is reached. We can still run into infinite loops if the hash function isn’t sufficient, but double hashing typically gives more efficient performance when the table is densely populated.

email client ex Eudora
Hash functions are used to add signatures to emails.

What Are the Main Applications of Hashing?

We can use hashing for many things in programming. Here are a few common applications, along with the type of hashing used.

  • Hash tables: indexing databases, caching and spell checkers
  • Hash heaps: Sorting algorithms and shortest-path algorithms
  • Cryptographic hash functions: security, digital signatures, password storage and message authentication
  • Bloom filters: databases, data compression, networking, web query optimization, and DNA sequence analysis.

Wrapping Up

Hashing in data structures is an essential technique when working with data. This is particularly true for large datasets where you desire quick information storage and retrieval. There are many methods used to implement hashing, such as hash heaps, hash tables, bloom filters and cryptographic functions. Whichever method is appropriate will depend on your use case. Collision handling is an important aspect to be aware of when hashing. We can handle collisions by using either open addressing or separate chaining methods.

Types of Hashing in Data Structures – Explained in Plain English FAQs (Frequently Asked Questions) 

What is hashing?

Hashing is a technique to map inputs to outputs efficiently, meaning that data storage and retrieval are quicker than before. A hash function is used to map a particular input to an output using a specific hash code.

How does hashing differ from a traditional array?

Whereas a typical array maps an input to a specific index, a hash function accesses the data using a hash code or hash value, which is used to index the data. This is typically a lot quicker than a typical array.

What is a hash table?

A hash table is a structure where a hash function is used to map input keys to output values.

What makes for a good hash function?

A good hash function typically tries to map each input to a unique index. This is hard to achieve in reality, however, as we often have a structure that changes in size, as well as elements trying to occupy the same index. We could increase the hash table size so that it can accommodate each element, but this isn’t feasible for large amounts of data. In this case, we aim to provide a hash function that’s easy to execute, spreads keys fairly uniformly, has a low load factor, reduces collisions as much as possible and doesn’t allow for the input to be reproduced from the output.

What does load factor mean in hashing?

The load factor represents the number of items in a hash table divided by the hash table size. We want this factor to be as small as possible. If the load factor is too great, that means the table is too densely populated.

What is a collision?

A collision is where the new element to be mapped has the same index as another element. We must use collision handling techniques, such as linear probing, quadratic probing, double hashing, or separate chaining to overcome this problem.

What is rehashing?

This is where we hash again, usually if the load factor is too high for our liking. The size of the hash table is increased, values are hashed again and the load factor is reduced, along with the complexity of the table.

What are the advantages of hashing?

Hashing allows for quick data storage and retrieval, especially compared to conventional arrays or search trees. The time for searching and modifying elements is also constant.

What are the disadvantages of hashing?

Hashing can be inefficient when we have a lot of collisions, which can be hard to manage with especially large datasets. Hashing also doesn’t allow for null values, which can be problematic depending on the type of data we’re working with.

What are the applications of hashing?

We can use hashing for hash tables, password storage, message authentication, digital signatures, databases, caching, query optimization, and even scientific pursuits like DNA sequence analysis.

About the Author

More from History-Computer

  • Knowledge Hut Available here: https://www.knowledgehut.com/blog/programming/hashing-in-data-structure#what-is-hashing
  • Geeks For Geeks Available here: https://www.geeksforgeeks.org/introduction-to-hashing-data-structure-and-algorithm-tutorials/
  • Free Code Camp Available here: https://www.freecodecamp.org/news/what-is-hashing/