How compression works, with a simplified worked example.

There are two types of compression. Lossy compression removes part of the uncompressed file, and hopes the user will not notice the bits that are missing. Once those bits have gone there's no way to get them back. Obviously, this is not suitable for all files that need to be compressed, and so we have lossless compression. I'll be talking about lossless compression here.

Language has redundancy. In English we know that "e" is much more common than "z". We can use redundancy for compression. Because this is a simple example and 26 characters would be cumbersome imagine a language that had an alphabet composed of 4 characters: a, b, c, and d.

In our imaginary language we know that if we have a big enough text that 'a' will occur 50% of the time, that 'b' will occur 30% of the time, that 'c' will occur 15% of the time, and that 'd' will occur 5% of the time. (Just for simplicity there are no spaces in this language.)

Here's an example sentence.

ababbabcaacaabcaaad

There are a couple of steps to go before we have our compressed output. First we represent each letter with a number.

a = 0
b = 10
c = 110
d = 111

You can see that 'a' is represented with a single digit, and that 'd' is represented by 3 digits. Note also the lack of leading zeros, and that this sequence is not just counting up in binary. Let's write our sentence again, and you'll see why that's important.

01001010010110001100010110000111

To turn this back to characters you start reading from the left. You can see that '0' can be 'a', and that there is no character for '01'. Thus, the first '0' must be 'a', it cannot be anything else.

So far, our text is bigger than the original. If we stop here it's lousy compression!

Let's represent those 0s and 1s in something that uses less space.

00 = q
01 = w
10 = z
11 = x

Now we get

wqzzwwzqxqwwzqwz

We have compressed our 19 character sentence to 16 characters. Note that this process is reversible. You can turn that string of q, w, x, and z back into 0s and 1s, and you can turn those 0s and 1s back into the original text.

This is a very simple example of compression. You can see that we need some redundancy in the original text. Compression does not work when characters (or words or tokens or whatever) in the original text have an equal probability of occuring. You're likely to find this characteristic in randomly generated files, or in files that have been previously compressed. Because this is a very simple explanation of lossless compression I won't talk about the math. (But I'd welcome very simple explanations of the math that I can put here or point people to! I'd also like good descriptions of the math for people who understand these things.