This application can be used to encode any text file with a specifically-tailored, variable-length, optimal binary code - namely, the Huffman Code.
The Huffman Code is a prefix code (Meaning that none of the words of the code is an index of another word - it makes it possible to allow variable-length words without losing the ability to be unambigously decoded).
As an example, let's encode the word "cat" in a very simple way. c -> '1', a -> '11', t-> '111'. This gives us 110101. Now, when we receive the encoded string, we don't know whether the code should be interpreted as 1 11 111, or maybe 1 1 1 111, or even maybe 1 1 1 1 1 1.
As far as prefix codes are concerned, the meaning of the code is clear. Let's try: c -> '0', a -> '10', t->'11'. Now, there is only one way to interpret 01011.
One of the benefits of the Huffman code is the fact that its length is optimal - this is virtually impossible to contruct a shorter code.
There are two use cases for the application:
- Encode
- Decode
The following actions will be performed:
- You need to enter the path to the file that needs to be encoded and the name of the output file.
- The input file will be inspected and an optimal (one of the shortest possible) alphabet will be generated.
- The alphabet will be saved to a file on your disk.
- The file will be encoded according to the alphabet.
- The result of the encoding process will be saved to the output file.
- You need to provide the encoded file, the alphabet file and the output file.
- The file is decoded according to the alphabet.
- The decoded text is saved to the output file.
dotnet run -a <file_to_encode> <output_file>
Let's encode the LICENSE file and write the output to encoded.txt.
dotnet run -a LICENSE encoded.txt
Two new files have been generated:
-
encoded.txt
1110001111011110011111110000001101101000011000010111011111001110111111100111001001111000111011000101111100011111010101011110101101111110100010000101100010011011100111001111011100011010110000011110010101100000001010010110000010100010111001110000110101000110111100111110110001011101001011010101001100111000111111011111110110011110001100001011101111100110101001001100101111101010100101001001111101010001011000100011010011110001010101001011000111110000110000011100111010110000011011100000011010111000011111101011011001011111100100010100110010000101001010010011111010100010110000110100011100010001100111111101011001000010100101001001111101010001011000111010111101100100100011001110110001010011101111111100001111110101101111111011100011001001100011111111101100111111010111101100100100011001111001111111100011110111000011111101011010110010000101111111001110010011101001101000110101000011000010111011111001111100001000111111110010111001111100000010111001110001110110011010011011001111110100010001111111101001110111111100110001100110001010001101001101100111111100000011011010000100011010011101010111010000110011100111000010111011111100010100101001001100100001010010100100111110101000101100010001101001110110101110110001011111000111101111101100010111010010110110100110101110100010010111000011001000010100101001001111101010001011000111110101001000110101111011111101000100000011101101000111010101001010100011010011011100101110001100001101101001100100001001001011001110010101110101111110110011111000010110100111111101111110101101101001100100001000100110010111001000001100001011101111100111011111110011100100111101010011111111000000100011010011101100101111101011101100010111110001111101010101111010110111010100111111110000001010001010001100111001100110000111111011100011001001100011100000111111111011000110011100111000010111011111100010100101011101000110100110101001110001101011111110001100111101101010110111111010110110100101001001100100001010010100100111110101000101100001100100001010010100100111110101000101100011111010111011101101011100100111100111000001111000101011111010111010111101100100100011001111101010001011101110001101011111001101010010100011011110011111001010111111010011100011100101101110111011000101010100101011111111100011101111001111100000111111111011100011001001100011111111101100111100110001100111101010011110011111111000111101100000111011010011001000011101010001011101110001101011111100010100101001011100000010111100000101000110101110001001101111110011111011111001110010111101111010001010101010010010101110001110111000101101111111100000110110011000101111101100110101111101101010100001000110100111110101011011111110110010101111111101100111000111000000110101111111101110101000011100100000110101110100010100011001110010110010000100000110011001001010111010010101111000010111011111001110111111100111001001100100101100100111000101110101001100001100111111000100110110010001001001010111000110111100111100011001100011111110000011110001110001000100111000101001010110100110010000101111001111110001001101111110011111011111001111101000100000110010000101111111101100011011000110000111111010110101010010110000101101011101110001100001101101010110110101011010011001000010111110101111101000010001011111110101111110110011110010101101011100001000110011010100100101011111111011100001011011101000110000111111010110111101011110110010011001000010100101001001111101010001011000010101101100100000011010100000101011010011001000010110011100010001100111111101100111101011111101011001000010100101001001111101010001011000
-
encoded.txt.alphabet
e:000
h:0010
u:00110
d:00111
o:010
t:011
a:1000
f:10010
b:100110
g:100111
s:1010
r:1011
c:11000
l:11001
n:1101
m:111000
v:11100100
k:111001010
x:1110010110
j:1110010111
y:1110011
w:111010
p:111011
i:1111
The alphabet is prepared specifically for the file in question. The letters that appear frequently have shorter codes.
dotnet run -a decode <file_to_decode> <alphabet_file> <output_file>
Now, let's reverse this process.
dotnet run -a decode encoded.txt encoded.txt.alphabet decoded.txt
In the newly created file, we get:
mitlicensecopyrightcpermissionisherebygrantedfreeofchargetoanypersonobtainingacopyofthissoftwareandassociateddocumentationfilesthesoftwaretodealinthesoftwarewithoutrestrictionincludingwithoutlimitationtherightstousecopymodifymergepublishdistributesublicenseandorsellcopiesofthesoftwareandtopermitpersonstowhomthesoftwareisfurnishedtodososubjecttothefollowingconditionstheabovecopyrightnoticeandthispermissionnoticeshallbeincludedinallcopiesorsubstantialportionsofthesoftwarethesoftwareisprovidedasiswithoutwarrantyofanykindexpressorimpliedincludingbutnotlimitedtothewarrantiesofmerchantabilityfitnessforaparticularpurposeandnoninfringementinnoeventshalltheauthorsorcopyrightholdersbeliableforanyclaimdamagesorotherliabilitywhetherinanactionofcontracttortorotherwisearisingfromoutoforinconnectionwiththesoftwareortheuseorotherdealingsinthesoftware
That is exactly what we had encoded.
- Capital letters are treated as lowercase letters (reducing code length)
- All characters apart from letters are skipped (reducing the length even further)
- There are no spaces in the code. (given that the space is the most commonly used character, we save so many bits...)