Home

 › 

Articles

 › 

Software

 › 

Understanding Huffman Coding, With Examples

huffman coding

Understanding Huffman Coding, With Examples

If you’re a software developer using any language, you’ll know that Huffman coding is one of the basic functions of any good program. Developed in the 1950s, this algorithm significantly reduces the size of files, making them easier to transfer. 

If you’re hoping to add this model to your tool belt, we’ve got you covered. This article breaks down the Huffman code, explaining how it works and how you might use it. We even give you the basic syntax to get started. Keep reading to learn how to compress your files today.

What Is Huffman Coding?

Have you wondered how files get compressed? In 1952, David Huffman created a coding algorithm that reduced the size of digital files while keeping all the original data. Self-named, the Huffman code became one of the first lossless data compression techniques and is still frequently used today.

The algorithm relies on counting the number of each character in a text or data file. It then assigns a code based on frequency. The resulting code makes the file much smaller because the most common characters have the shortest lines. 

How Does the Huffman Coding Algorithm Work?

The Huffman coding algorithm implements a few steps to shorten the size of files. We’ll use the phrase “Hello world” to show how it works.

First, the algorithm counts the frequency of every letter in the phrase. In this example, the letter “L” appears three times, the letter “O” twice, and the rest once.

Next, the algorithm combines the lowest-frequency letters into one node. For example, the letters “H” and “E” might be paired together. This action happens until every node ties together, creating a root node. This step builds a binary tree.

After the tree is built, the algorithm gives each letter a code. The higher frequency of the letter, the shorter the code. In this example, the letter “L” may receive a code such as “0” and the letter “O” might receive a code such as “01.” When every letter has a code, the tree can be traversed.

After traversing the completed binary tree, the algorithm can encode the message. The message comes out as a binary string, significantly shortening the file size. And when the text needs to be decoded, the algorithm goes back to the binary tree to retrieve the letters.

How Do You Write the Algorithm?

Huffman Coding is commonly used for file compression and has a function in several programming languages, including:

  • C
  • C++
  • Java
  • Python
  • Ruby
  • Swift

The basic syntax is slightly different for each language, but the algorithm’s foundation stays the same. Here’s how you might write Huffman coding in Python:

from heapq import heappush, heappop, heapify
from collections import defaultdict
 
def huffman_encoding(message):
    # Step 1: Count the frequency of each character in the message
    frequency = defaultdict(int)
    for character in message:
        frequency[character] += 1
 
    # Step 2: Build a binary tree
    heap = [[weight, [symbol, '']] for symbol, weight in frequency.items()]
    heapify(heap)
    while len(heap) > 1:
        left = heappop(heap)
        right = heappop(heap)
        for pair in left[1:]:
            pair[1] = '0' + pair[1]
        for pair in right[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [left[0] + right[0]] + left[1:] + right[1:])
 
    # Step 3: Traverse the binary tree to assign codes to each character
    code = dict(heappop(heap)[1:])
 
    # Step 4: Encode the message using the assigned codes
    encoded_message = ''.join([code[character] for character in message])
 
    return encoded_message, code
 
 
def huffman_decoding(encoded_message, code):
    # Step 5: Decode the message using the binary tree to convert the codes back into the original characters
    reverse_code = {value: key for key, value in code.items()}
 
    decoded_message = ''
    i = 0
    while i < len(encoded_message):
        j = i + 1
        while encoded_message[i:j] not in reverse_code:
            j += 1
        decoded_message += reverse_code[encoded_message[i:j]]
        i = j
 
    return decoded_message

What Applications Does the Huffman Coding Algorithm Have?

Created over 70 years ago, the Huffman code algorithm is one of the key functions in programming. Therefore, every developer should have some familiarity with the model and how to implement it. Here are a few examples of how to compress different types of files:

File Compression

One of the more common uses for Huffman coding, developers can use the algorithm to reduce the size of files significantly. Here, the algorithm reads input in binary mode and writes the compressed file in the same mode. It also stores the characters of a file in a lookup table. 

Here’s what the Python syntax would look like for file compression:

import heapq
import os

class HuffmanCoding:
    def __init__(self, path):
        self.path = path
        self.heap = []
        self.codes = {}
        self.reverse_mapping = {}

    class HeapNode:
        def __init__(self, char, freq):
            self.char = char
            self.freq = freq
            self.left = None
            self.right = None

        def __lt__(self, other):
            return self.freq < other.freq

        def __eq__(self, other):
            if(other == None):
                return False
            if(not isinstance(other, HeapNode)):
                return False
            return self.freq == other.freq

    def make_frequency_dict(self, text):
        frequency = {}
        for character in text:
            if not character in frequency:
                frequency[character] = 0
            frequency[character] += 1
        return frequency

    def make_heap(self, frequency):
        for key in frequency:
            node = self.HeapNode(key, frequency[key])
            heapq.heappush(self.heap, node)

    def merge_nodes(self):
        while(len(self.heap)>1):
            node1 = heapq.heappop(self.heap)
            node2 = heapq.heappop(self.heap)

            merged = self.HeapNode(None, node1.freq + node2.freq)
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)

    def make_codes_helper(self, root, current_code):
        if(root == None):
            return

        if(root.char != None):
            self.codes[root.char] = current_code
            self.reverse_mapping[current_code] = root.char
            return

        self.make_codes_helper(root.left, current_code + "0")
        self.make_codes_helper(root.right, current_code + "1")

    def make_codes(self):
        root = heapq.heappop(self.heap)
        current_code = ""
        self.make_codes_helper(root, current_code)

    def get_encoded_text(self, text):
        encoded_text = ""
        for character in text:
            encoded_text += self.codes[character]
        return encoded_text

    def pad_encoded_text(self, encoded_text):
        extra_padding = 8 - len(encoded_text) % 8
        for i in range(extra_padding):
            encoded_text += "0"

        padded_info = "{0:08b}".format(extra_padding)
        encoded_text = padded_info + encoded_text
        return encoded_text

    def get_byte_array(self, padded_encoded_text):
        if(len(padded_encoded_text) % 8 != 0):
            print("Encoded text not padded properly")
            exit(0)

        b = bytearray()
        for i in range(0, len(padded_encoded_text), 8):
            byte = padded_encoded_text[i:i+8]
            b.append(int(byte, 2))
        return b

    def compress(self):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + ".bin"

        with open(self.path, 'r+') as file, open(output_path, 'wb') as output:
            text = file.read()
            text = text.rstrip()

            frequency = self.make_frequency_dict(text)
            self.make_heap(frequency)
            self.merge_nodes()
            self.make_codes()

            encoded_text = self.get_encoded_text(text)
            padded_encoded_text = self.pad

Image Compression

This application works similarly to file compression, although it uses slightly different methods. In Python, the algorithm reads image files using the “Pillow” library. The image’s pixels pass through the Huffman code to receive their code before creating its binary tree.

Here’s what the syntax would look like for image compression in Python:

import heapq
import os
from PIL import Image

class HuffmanCoding:
    def __init__(self, path):
        self.path = path
        self.heap = []
        self.codes = {}
        self.reverse_mapping = {}

    class HeapNode:
        def __init__(self, char, freq):
            self.char = char
            self.freq = freq
            self.left = None
            self.right = None

        def __lt__(self, other):
            return self.freq < other.freq

        def __eq__(self, other):
            if(other == None):
                return False
            if(not isinstance(other, HeapNode)):
                return False
            return self.freq == other.freq

    def make_frequency_dict(self, pixels):
        frequency = {}
        for pixel in pixels:
            if not pixel in frequency:
                frequency[pixel] = 0
            frequency[pixel] += 1
        return frequency

    def make_heap(self, frequency):
        for key in frequency:
            node = self.HeapNode(key, frequency[key])
            heapq.heappush(self.heap, node)

    def merge_nodes(self):
        while(len(self.heap)>1):
            node1 = heapq.heappop(self.heap)
            node2 = heapq.heappop(self.heap)

            merged = self.HeapNode(None, node1.freq + node2.freq)
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)

    def make_codes_helper(self, root, current_code):
        if(root == None):
            return

        if(root.char != None):
            self.codes[root.char] = current_code
            self.reverse_mapping[current_code] = root.char
            return

        self.make_codes_helper(root.left, current_code + "0")
        self.make_codes_helper(root.right, current_code + "1")

    def make_codes(self):
        root = heapq.heappop(self.heap)
        current_code = ""
        self.make_codes_helper(root, current_code)

    def get_encoded_text(self, pixels):
        encoded_text = ""
        for pixel in pixels:
            encoded_text += self.codes[pixel]
        return encoded_text

    def pad_encoded_text(self, encoded_text):
        extra_padding = 8 - len(encoded_text) % 8
        for i in range(extra_padding):
            encoded_text += "0"

        padded_info = "{0:08b}".format(extra_padding)
        encoded_text = padded_info + encoded_text
        return encoded_text

    def get_byte_array(self, padded_encoded_text):
        if(len(padded_encoded_text) % 8 != 0):
            print("Encoded text not padded properly")
            exit(0)

        b = bytearray()
        for i in range(0, len(padded_encoded_text), 8):
            byte = padded_encoded_text[i:i+8]
            b.append(int(byte, 2))
        return b

    def compress(self):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + ".bin"

        image = Image.open(self.path)
        pixels = list(image.getdata())

        frequency = self.make_frequency_dict(pixels)
        self.make_heap(frequency)
        self.merge_nodes()
        self.make_codes()

        encoded_text = self.get_encoded_text(pixels)
        padded_encoded_text = self.pad_encoded_text(encoded_text)
        b = self

Audio Compression

Just as image compression uses a different compression style, so does audio. For this type of file, the data is read through the “Wave” library. The algorithm takes samples from the audio and passes them through the coding system before creating its binary tree.

Here’s what a sample code would look like for audio compression:

import heapq
import os
import wave

class HuffmanCoding:
    def __init__(self, path):
        self.path = path
        self.heap = []
        self.codes = {}
        self.reverse_mapping = {}

    class HeapNode:
        def __init__(self, char, freq):
            self.char = char
            self.freq = freq
            self.left = None
            self.right = None

        def __lt__(self, other):
            return self.freq < other.freq

        def __eq__(self, other):
            if(other == None):
                return False
            if(not isinstance(other, HeapNode)):
                return False
            return self.freq == other.freq

    def make_frequency_dict(self, samples):
        frequency = {}
        for sample in samples:
            if not sample in frequency:
                frequency[sample] = 0
            frequency[sample] += 1
        return frequency

    def make_heap(self, frequency):
        for key in frequency:
            node = self.HeapNode(key, frequency[key])
            heapq.heappush(self.heap, node)

    def merge_nodes(self):
        while(len(self.heap)>1):
            node1 = heapq.heappop(self.heap)
            node2 = heapq.heappop(self.heap)

            merged = self.HeapNode(None, node1.freq + node2.freq)
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)

    def make_codes_helper(self, root, current_code):
        if(root == None):
            return

        if(root.char != None):
            self.codes[root.char] = current_code
            self.reverse_mapping[current_code] = root.char
            return

        self.make_codes_helper(root.left, current_code + "0")
        self.make_codes_helper(root.right, current_code + "1")

    def make_codes(self):
        root = heapq.heappop(self.heap)
        current_code = ""
        self.make_codes_helper(root, current_code)

    def get_encoded_text(self, samples):
        encoded_text = ""
        for sample in samples:
            encoded_text += self.codes[sample]
        return encoded_text

    def pad_encoded_text(self, encoded_text):
        extra_padding = 8 - len(encoded_text) % 8
        for i in range(extra_padding):
            encoded_text += "0"

        padded_info = "{0:08b}".format(extra_padding)
        encoded_text = padded_info + encoded_text
        return encoded_text

    def get_byte_array(self, padded_encoded_text):
        if(len(padded_encoded_text) % 8 != 0):
            print("Encoded text not padded properly")
            exit(0)

        b = bytearray()
        for i in range(0, len(padded_encoded_text), 8):
            byte = padded_encoded_text[i:i+8]
            b.append(int(byte, 2))
        return b

    def compress(self):
        filename, file_extension = os.path.splitext(self.path)
        output_path = filename + ".bin"

        with wave.open(self.path, 'rb') as wave_file:
            frame_rate = wave_file.getframerate()
            num_channels = wave_file.getnchannels()
            sample_width = wave_file.getsampwidth()
            num_frames = wave_file.getnframes()

            samples = []

            for i in range(num_frames):
                wave_data = wave_file.readframes(

Frequently Asked Questions

What is Hoffman coding and how does it work?

Huffman coding is a lossless data compression technique used to reduce the size of digital files by assigning shorter codes to frequently used characters and longer codes to less frequently used characters. This results in a compressed file that preserves all the original information of the input file.

What is a lossless data compression algorithm?

A lossless data compression algorithm is a technique for compressing data in a way that preserves the exact original information. It enables the decompressed data to be identical to the original data, with no loss of data or degradation in quality.

What is a binary tree in programming?

A binary tree is a data structure composed of nodes, where each node has at most two child nodes called left and right. The left child node comes before the right child node, and they are arranged in a hierarchical order, forming a tree-like structure.

How is Hoffman coding used?

Applications of Huffman coding include file compression, image compression, and audio compression to reduce the size of digital files for storage and transmission.

To top