huffman-encoder

Introduction

This application encodes a text file with a variable-length, optimal binary code tailored to that file: the Huffman code.

Huffman Code

The Huffman code is a prefix code: no code word is a prefix of another code word. This property allows code words of different lengths while keeping the encoded text unambiguously decodable.

As an example, let's encode the word "cat" in a very simple way: c -> '1', a -> '11', t -> '111'. This gives us 111111. Now, when we receive the encoded string, we cannot tell whether it should be read as 1 11 111, as 1 1 1 111, or even as 1 1 1 1 1 1.

With a prefix code, the meaning of the encoded string is clear. Let's try: c -> '0', a -> '10', t -> '11'. Now there is only one way to interpret 01011, namely 0 10 11.
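The difference between the two example codes can be checked mechanically. The following is a minimal Python sketch for illustration only, not code from the application; the helper names is_prefix_free and count_parses are hypothetical. It tests whether any code word is a prefix of another and counts the ways an encoded string can be split into code words.

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of a different code word."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def count_parses(bits, codes):
    """Count the ways `bits` can be split into a sequence of code words."""
    words = set(codes.values())
    # ways[i] = number of ways to parse the first i bits
    ways = [1] + [0] * len(bits)
    for i in range(1, len(bits) + 1):
        for w in words:
            if i >= len(w) and bits[i - len(w):i] == w:
                ways[i] += ways[i - len(w)]
    return ways[-1]


ambiguous = {"c": "1", "a": "11", "t": "111"}
prefix = {"c": "0", "a": "10", "t": "11"}

for codes in (ambiguous, prefix):
    encoded = "".join(codes[ch] for ch in "cat")
    print(encoded, is_prefix_free(codes), count_parses(encoded, codes))
# prints:
# 111111 False 24
# 01011 True 1
```

The first code admits 24 different readings of the same six bits, while the prefix code admits exactly one.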

One of the benefits of the Huffman code is that its length is optimal: for the symbol frequencies of the input, no other prefix code that assigns one code word to each symbol produces a shorter encoding.

Application

There are two use cases for the application:

  • Encode
  • Decode

Encode

The following actions will be performed:

  1. You need to enter the path to the file that needs to be encoded and the name of the output file.
  2. The input file will be inspected and an optimal alphabet (a code table that gives one of the shortest possible encodings) will be generated.
  3. The alphabet will be saved to a file on your disk.
  4. The file will be encoded according to the alphabet.
  5. The result of the encoding process will be saved to the output file.
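Steps 2 and 4 can be sketched as follows. This is a minimal Python illustration of the textbook Huffman construction, assuming the input has already been read into a string; it is not the application's own code, and the names build_alphabet and encode, the tie-breaking order, and the use of a string of '0'/'1' characters as output are all assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def build_alphabet(text):
    """Build a Huffman code table {symbol: bit string} for `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prepend one bit to every code.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


def encode(text, alphabet):
    """Concatenate the code words of all symbols in `text`."""
    return "".join(alphabet[ch] for ch in text)


alphabet = build_alphabet("aaaabbc")
print(alphabet)                     # prints {'c': '00', 'b': '01', 'a': '1'}
print(encode("aaaabbc", alphabet))  # prints 1111010100
```

The most frequent symbol ends up closest to the root of the tree and therefore receives the shortest code word.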

Decode

  1. You need to provide the encoded file, the alphabet file and the output file.
  2. The file is decoded according to the alphabet.
  3. The decoded text is saved to the output file.
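Decoding with a prefix code needs no separators: bits are read one at a time until they match a code word. The sketch below is a hypothetical Python illustration of that idea, assuming the encoded data is available as a string of '0' and '1' characters and the alphabet as a dictionary; the application's actual file handling may differ.

```python
def decode(bits, alphabet):
    """Decode a string of '0'/'1' characters with a prefix code table."""
    by_code = {code: sym for sym, code in alphabet.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        # In a prefix code, the first match is the only possible match.
        if current in by_code:
            out.append(by_code[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)


alphabet = {"c": "0", "a": "10", "t": "11"}
print(decode("01011", alphabet))  # prints cat
```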

Usage

Encode

  dotnet run -a <file_to_encode> <output_file>

Let's encode the LICENSE file and write the output to encoded.txt.

  dotnet run -a LICENSE encoded.txt

Two new files have been generated:

  • encoded.txt

  • encoded.txt.alphabet

For the LICENSE file, encoded.txt.alphabet contains one symbol:code pair per line:

e:000
h:0010
u:00110
d:00111
o:010
t:011
a:1000
f:10010
b:100110
g:100111
s:1010
r:1011
c:11000
l:11001
n:1101
m:111000
v:11100100
k:111001010
x:1110010110
j:1110010111
y:1110011
w:111010
p:111011
i:1111

The alphabet is built specifically for the file in question: letters that appear more frequently get shorter codes. Letters that do not occur in the file (here q and z) do not appear in the alphabet at all.
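The listing above can be checked for consistency. The following Python sketch is an illustration, not part of the application; it assumes the symbol:code line format shown above, and the names parse_alphabet and kraft_sum are hypothetical. It parses the listing and computes the Kraft sum, the sum of 2^(-length) over all code words, which equals 1 for a prefix code that leaves no bit pattern unused.

```python
from fractions import Fraction

ALPHABET_TEXT = """\
e:000
h:0010
u:00110
d:00111
o:010
t:011
a:1000
f:10010
b:100110
g:100111
s:1010
r:1011
c:11000
l:11001
n:1101
m:111000
v:11100100
k:111001010
x:1110010110
j:1110010111
y:1110011
w:111010
p:111011
i:1111
"""


def parse_alphabet(text):
    """Parse lines of the form `symbol:code` into a dictionary."""
    alphabet = {}
    for line in text.splitlines():
        if not line:
            continue
        symbol, code = line.split(":", 1)
        alphabet[symbol] = code
    return alphabet


def kraft_sum(alphabet):
    """Sum of 2**-len(code) over all code words, as an exact fraction."""
    return sum(Fraction(1, 2 ** len(code)) for code in alphabet.values())


alphabet = parse_alphabet(ALPHABET_TEXT)
print(len(alphabet), kraft_sum(alphabet))  # prints 24 1
```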

Decode

    dotnet run -a decode <file_to_decode> <alphabet_file> <output_file>

Now, let's reverse this process.

    dotnet run -a decode encoded.txt encoded.txt.alphabet decoded.txt

In the newly created file, we get:

mitlicensecopyrightcpermissionisherebygrantedfreeofchargetoanypersonobtainingacopyofthissoftwareandassociateddocumentationfilesthesoftwaretodealinthesoftwarewithoutrestrictionincludingwithoutlimitationtherightstousecopymodifymergepublishdistributesublicenseandorsellcopiesofthesoftwareandtopermitpersonstowhomthesoftwareisfurnishedtodososubjecttothefollowingconditionstheabovecopyrightnoticeandthispermissionnoticeshallbeincludedinallcopiesorsubstantialportionsofthesoftwarethesoftwareisprovidedasiswithoutwarrantyofanykindexpressorimpliedincludingbutnotlimitedtothewarrantiesofmerchantabilityfitnessforaparticularpurposeandnoninfringementinnoeventshalltheauthorsorcopyrightholdersbeliableforanyclaimdamagesorotherliabilitywhetherinanactionofcontracttortorotherwisearisingfromoutoforinconnectionwiththesoftwareortheuseorotherdealingsinthesoftware

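This is the text of the LICENSE file with all letters lowercased and every other character removed, which is exactly what had been encoded.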

More detail

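  • Capital letters are treated as lowercase letters, which keeps the alphabet small and the codes short.
  • All characters apart from letters are skipped, which shortens the encoded output even further.
  • Spaces are not encoded either. Since the space is usually the most common character in a text, dropping it saves many bits.
  • As a consequence, decoding does not restore capitalization, punctuation, digits or whitespace.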
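The preprocessing described in these points can be sketched in Python as follows. This is an assumption about the behaviour, not the application's code: the function name normalize is hypothetical, and "letters" is taken to mean the ASCII letters a to z, since how the application treats non-ASCII letters is not stated.

```python
def normalize(text):
    """Lowercase `text` and keep only the ASCII letters a-z."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


print(normalize("MIT License\n\nCopyright (c)"))  # prints mitlicensecopyrightc
```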