
© Profit_image / Shutterstock.com
If you’re a software developer using any language, you’ll know that Huffman coding is one of the basic functions of any good program. Developed in the 1950s, this algorithm significantly reduces the size of files, making them easier to transfer.
If you’re hoping to add this model to your tool belt, we’ve got you covered. This article breaks down the Huffman code, explaining how it works and how you might use it. We even give you the basic syntax to get started. Keep reading to learn how to compress your files today.
What Is Huffman Coding?
Have you wondered how files get compressed? In 1952, David Huffman created a coding algorithm that reduced the size of digital files while keeping all the original data. Self-named, the Huffman code became one of the first lossless data compression techniques and is still frequently used today.
The algorithm relies on counting the number of each character in a text or data file. It then assigns a code based on frequency. The resulting code makes the file much smaller because the most common characters have the shortest lines.
How Does the Huffman Coding Algorithm Work?
The Huffman coding algorithm implements a few steps to shorten the size of files. We’ll use the phrase “Hello world” to show how it works.
First, the algorithm counts the frequency of every letter in the phrase. In this example, the letter “L” appears three times, the letter “O” twice, and the rest once.
Next, the algorithm combines the lowest-frequency letters into one node. For example, the letters “H” and “E” might be paired together. This action happens until every node ties together, creating a root node. This step builds a binary tree.
After the tree is built, the algorithm gives each letter a code. The higher frequency of the letter, the shorter the code. In this example, the letter “L” may receive a code such as “0” and the letter “O” might receive a code such as “01.” When every letter has a code, the tree can be traversed.
After traversing the completed binary tree, the algorithm can encode the message. The message comes out as a binary string, significantly shortening the file size. And when the text needs to be decoded, the algorithm goes back to the binary tree to retrieve the letters.
How Do You Write the Algorithm?
Huffman Coding is commonly used for file compression and has a function in several programming languages, including:
- C
- C++
- Java
- Python
- Ruby
- Swift
The basic syntax is slightly different for each language, but the algorithm’s foundation stays the same. Here’s how you might write Huffman coding in Python:
from heapq import heappush, heappop, heapify
from collections import defaultdict
def huffman_encoding(message):
# Step 1: Count the frequency of each character in the message
frequency = defaultdict(int)
for character in message:
frequency[character] += 1
# Step 2: Build a binary tree
heap = [[weight, [symbol, '']] for symbol, weight in frequency.items()]
heapify(heap)
while len(heap) > 1:
left = heappop(heap)
right = heappop(heap)
for pair in left[1:]:
pair[1] = '0' + pair[1]
for pair in right[1:]:
pair[1] = '1' + pair[1]
heappush(heap, [left[0] + right[0]] + left[1:] + right[1:])
# Step 3: Traverse the binary tree to assign codes to each character
code = dict(heappop(heap)[1:])
# Step 4: Encode the message using the assigned codes
encoded_message = ''.join([code[character] for character in message])
return encoded_message, code
def huffman_decoding(encoded_message, code):
# Step 5: Decode the message using the binary tree to convert the codes back into the original characters
reverse_code = {value: key for key, value in code.items()}
decoded_message = ''
i = 0
while i < len(encoded_message):
j = i + 1
while encoded_message[i:j] not in reverse_code:
j += 1
decoded_message += reverse_code[encoded_message[i:j]]
i = j
return decoded_message
What Applications Does the Huffman Coding Algorithm Have?
Created over 70 years ago, the Huffman code algorithm is one of the key functions in programming. Therefore, every developer should have some familiarity with the model and how to implement it. Here are a few examples of how to compress different types of files:
File Compression
One of the more common uses for Huffman coding, developers can use the algorithm to reduce the size of files significantly. Here, the algorithm reads input in binary mode and writes the compressed file in the same mode. It also stores the characters of a file in a lookup table.
Here’s what the Python syntax would look like for file compression:
import heapq
import os
class HuffmanCoding:
def __init__(self, path):
self.path = path
self.heap = []
self.codes = {}
self.reverse_mapping = {}
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def __eq__(self, other):
if(other == None):
return False
if(not isinstance(other, HeapNode)):
return False
return self.freq == other.freq
def make_frequency_dict(self, text):
frequency = {}
for character in text:
if not character in frequency:
frequency[character] = 0
frequency[character] += 1
return frequency
def make_heap(self, frequency):
for key in frequency:
node = self.HeapNode(key, frequency[key])
heapq.heappush(self.heap, node)
def merge_nodes(self):
while(len(self.heap)>1):
node1 = heapq.heappop(self.heap)
node2 = heapq.heappop(self.heap)
merged = self.HeapNode(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged)
def make_codes_helper(self, root, current_code):
if(root == None):
return
if(root.char != None):
self.codes[root.char] = current_code
self.reverse_mapping[current_code] = root.char
return
self.make_codes_helper(root.left, current_code + "0")
self.make_codes_helper(root.right, current_code + "1")
def make_codes(self):
root = heapq.heappop(self.heap)
current_code = ""
self.make_codes_helper(root, current_code)
def get_encoded_text(self, text):
encoded_text = ""
for character in text:
encoded_text += self.codes[character]
return encoded_text
def pad_encoded_text(self, encoded_text):
extra_padding = 8 - len(encoded_text) % 8
for i in range(extra_padding):
encoded_text += "0"
padded_info = "{0:08b}".format(extra_padding)
encoded_text = padded_info + encoded_text
return encoded_text
def get_byte_array(self, padded_encoded_text):
if(len(padded_encoded_text) % 8 != 0):
print("Encoded text not padded properly")
exit(0)
b = bytearray()
for i in range(0, len(padded_encoded_text), 8):
byte = padded_encoded_text[i:i+8]
b.append(int(byte, 2))
return b
def compress(self):
filename, file_extension = os.path.splitext(self.path)
output_path = filename + ".bin"
with open(self.path, 'r+') as file, open(output_path, 'wb') as output:
text = file.read()
text = text.rstrip()
frequency = self.make_frequency_dict(text)
self.make_heap(frequency)
self.merge_nodes()
self.make_codes()
encoded_text = self.get_encoded_text(text)
padded_encoded_text = self.pad
Image Compression
This application works similarly to file compression, although it uses slightly different methods. In Python, the algorithm reads image files using the “Pillow” library. The image’s pixels pass through the Huffman code to receive their code before creating its binary tree.
Here’s what the syntax would look like for image compression in Python:
import heapq
import os
from PIL import Image
class HuffmanCoding:
def __init__(self, path):
self.path = path
self.heap = []
self.codes = {}
self.reverse_mapping = {}
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def __eq__(self, other):
if(other == None):
return False
if(not isinstance(other, HeapNode)):
return False
return self.freq == other.freq
def make_frequency_dict(self, pixels):
frequency = {}
for pixel in pixels:
if not pixel in frequency:
frequency[pixel] = 0
frequency[pixel] += 1
return frequency
def make_heap(self, frequency):
for key in frequency:
node = self.HeapNode(key, frequency[key])
heapq.heappush(self.heap, node)
def merge_nodes(self):
while(len(self.heap)>1):
node1 = heapq.heappop(self.heap)
node2 = heapq.heappop(self.heap)
merged = self.HeapNode(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged)
def make_codes_helper(self, root, current_code):
if(root == None):
return
if(root.char != None):
self.codes[root.char] = current_code
self.reverse_mapping[current_code] = root.char
return
self.make_codes_helper(root.left, current_code + "0")
self.make_codes_helper(root.right, current_code + "1")
def make_codes(self):
root = heapq.heappop(self.heap)
current_code = ""
self.make_codes_helper(root, current_code)
def get_encoded_text(self, pixels):
encoded_text = ""
for pixel in pixels:
encoded_text += self.codes[pixel]
return encoded_text
def pad_encoded_text(self, encoded_text):
extra_padding = 8 - len(encoded_text) % 8
for i in range(extra_padding):
encoded_text += "0"
padded_info = "{0:08b}".format(extra_padding)
encoded_text = padded_info + encoded_text
return encoded_text
def get_byte_array(self, padded_encoded_text):
if(len(padded_encoded_text) % 8 != 0):
print("Encoded text not padded properly")
exit(0)
b = bytearray()
for i in range(0, len(padded_encoded_text), 8):
byte = padded_encoded_text[i:i+8]
b.append(int(byte, 2))
return b
def compress(self):
filename, file_extension = os.path.splitext(self.path)
output_path = filename + ".bin"
image = Image.open(self.path)
pixels = list(image.getdata())
frequency = self.make_frequency_dict(pixels)
self.make_heap(frequency)
self.merge_nodes()
self.make_codes()
encoded_text = self.get_encoded_text(pixels)
padded_encoded_text = self.pad_encoded_text(encoded_text)
b = self
Audio Compression
Just as image compression uses a different compression style, so does audio. For this type of file, the data is read through the “Wave” library. The algorithm takes samples from the audio and passes them through the coding system before creating its binary tree.
Here’s what a sample code would look like for audio compression:
import heapq
import os
import wave
class HuffmanCoding:
def __init__(self, path):
self.path = path
self.heap = []
self.codes = {}
self.reverse_mapping = {}
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def __eq__(self, other):
if(other == None):
return False
if(not isinstance(other, HeapNode)):
return False
return self.freq == other.freq
def make_frequency_dict(self, samples):
frequency = {}
for sample in samples:
if not sample in frequency:
frequency[sample] = 0
frequency[sample] += 1
return frequency
def make_heap(self, frequency):
for key in frequency:
node = self.HeapNode(key, frequency[key])
heapq.heappush(self.heap, node)
def merge_nodes(self):
while(len(self.heap)>1):
node1 = heapq.heappop(self.heap)
node2 = heapq.heappop(self.heap)
merged = self.HeapNode(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged)
def make_codes_helper(self, root, current_code):
if(root == None):
return
if(root.char != None):
self.codes[root.char] = current_code
self.reverse_mapping[current_code] = root.char
return
self.make_codes_helper(root.left, current_code + "0")
self.make_codes_helper(root.right, current_code + "1")
def make_codes(self):
root = heapq.heappop(self.heap)
current_code = ""
self.make_codes_helper(root, current_code)
def get_encoded_text(self, samples):
encoded_text = ""
for sample in samples:
encoded_text += self.codes[sample]
return encoded_text
def pad_encoded_text(self, encoded_text):
extra_padding = 8 - len(encoded_text) % 8
for i in range(extra_padding):
encoded_text += "0"
padded_info = "{0:08b}".format(extra_padding)
encoded_text = padded_info + encoded_text
return encoded_text
def get_byte_array(self, padded_encoded_text):
if(len(padded_encoded_text) % 8 != 0):
print("Encoded text not padded properly")
exit(0)
b = bytearray()
for i in range(0, len(padded_encoded_text), 8):
byte = padded_encoded_text[i:i+8]
b.append(int(byte, 2))
return b
def compress(self):
filename, file_extension = os.path.splitext(self.path)
output_path = filename + ".bin"
with wave.open(self.path, 'rb') as wave_file:
frame_rate = wave_file.getframerate()
num_channels = wave_file.getnchannels()
sample_width = wave_file.getsampwidth()
num_frames = wave_file.getnframes()
samples = []
for i in range(num_frames):
wave_data = wave_file.readframes(