Demystifying Lossless File Compression

The Idea

We all know the basic idea of file compression. We see ZIP files regularly, and know how to extract and compress them. Although most everyone knows how to use these compressed files, and understand that compressing files makes them smaller, most people have no idea how it works technically.

Though this may seem like magic, it's a fairly straightforward concept. In this article, we are going to be demystifying file compression.

The Two Classes of Compression

Although people typically think about ZIP files and pictures when talking about compression in general, many people don't realize that there are two different compression classes commonly used.

    • Lossless Compression is a type of compression in which the data is compressed, and can be later decompressed without any loss of data. This is useful for things in which the minor details can really matter. For example, on a research paper, every character in the document would need to be the exact same after decompression, or the document could be poorly graded due to compression ignoring the commas.
    • Lossy Compression is the other type of compression, where data can be slightly modified during the compression process. Pictures commonly are compressed this way, as we can not very well distinguish between two very similar colors. This is a good time to use lossy compression, and not be so specific when it comes to colors.

 

Color Difference Example
Although these two colors look the same, they're actually different. This is an example for why it's okay to use lossy compression in pictures.

The Steps for Lossless Compression

We'll be doing "Huffman Coding" to do this, as it's one of the simplest compression methods.

Compressing files in this way takes four major steps, listed below.

    1. Generate statistical model of the data (the letters in a text file for instance). This would include finding out which letters occur more often, and which series of letters are most common.
    2. Generate a "dictionary", which is essentially a tree with the most common items closer to the starting point than uncommon items. This is used as a reference for what compressed data translates to when not compressed.
    3. Rewrite the data (text in this case) in a compressed form according to the dictionary.
    4. Combine all the parts.

Although it seems really complicated, it's not too hard to follow when there's an example. We're going to be demonstrating how this works with the following string.

Human Readable: AABABCABCD

Binary Translation: 01000001 01000001 01000010 01000001 01000010 01000011 01000001 01000010 01000011 01000100

As a note, this binary is 80 bits long.

The Statistics

To begin this compression process, we're going to start off by statistically analyzing the text. Here's some notes of what I see:

    1. The letter A appears 4 times, and is 40% of the letters.
    2. The letter B appears 3 times, and is 30% of the letters.
    3. The letter C appears 2 times, and it 20% of the letters.
    4. The letter D appears 1 times, and is 10% of the letters.

The Dictionary (Tree)

Now that we have a statistical analysis of this text, we can build a dictionary (tree) based on putting the most common letters the closest to the starting point on our tree, and making directions to each item on the tree. Here's the example tree that I came up with.

The dictionary (tree) that I made for this example.
The dictionary (tree) that I made for this example.

We have to somehow represent our dictionary in binary, so that during decompression, it can be referenced. This is done by writing the binary directions to the first spot on the tree, then writing what the letter is in binary, then doing the same for all the other objects on the tree. Here's a table with the directions, the letter, the binary for the letter, and the combined directions and binary for the letter.

Directions to LetterLetterLetter in BinaryDirections+Binary
0A01000001001000001
10B010000101001000010
110C0100001111001000011
111D0100010011101000100

Since we have the directions and binary for the letter, we just put them all one after another, which leaves us:

001000001 1001000010 11001000011 11101000100

Compressing the Data

Now that we have a dictionary (tree), we rewrite the text by using the bits in the tree as the directions to go to get to the letter we are writing.

    1. A is written as 0
    2. B is written as 10
    3. C is written as 110
    4. D is written as 111

Rewriting the text using this formula gives us 0 0 10 0 10 110 0 10 110 111

Combining it All

Here's the stuff we have so far in binary:

    • Dictionary = 001000001 1001000010 11001000011 11101000100
    • Compressed data = 0 0 10 0 10 110 0 10 110 111

To finish writing the file, we just write it in the order of [dictionary][compressed data].

This leaves us with this:

001000001 1001000010 11001000011 11101000100 0 0 10 0 10 110 0 10 110 111

which is 60 bits long, 20 bits shorter than the original binary text.

Decompression Though?

Decompression works in the same way, just reversed. This article is not going to cover the decompression part of this, as extracting the dictionary from the start of the binary is a more complicated process. I may cover this later on, but just keep in mind that it is possible, it just takes many steps.

Final Notes

Although this example may seem complicated, this is the simplest one that exists. As such, this one has many variations, which can deal with probability in sections of data and such.

If this (somehow) wasn't technical enough for you, take a look at David Huffman's paper A Method for the Construction of Minimum-Redundancy Codes.

Leave a Reply