My variant is 10, so the file I needed to process, according to the information in `list_of_files.docx`, is `PIC`. The first part of the task asks to compute the empirical probability distribution of blocks of a given length in this file:
```haskell
file :: ByteString
file = $(embedFile "/home/volhovm/code/coding-theory/PIC")

-- Returns all substrings of `file` of length bSize
subBlocks :: Int -> [ByteString]
subBlocks bSize =
    map (\i -> BS.take bSize $ BS.drop i file)
        [0..(BS.length file - bSize - 1)]

-- Given a block size, returns a map from each substring to its number of occurrences
distributionMap :: Int -> M.Map ByteString Int
distributionMap bSize =
    foldr (M.alter $ Just . maybe 1 (+1)) M.empty $ subBlocks bSize
```
`file` is a value holding the contents of the file `PIC`. The file is embedded as a `ByteString`, which is essentially a packed array of bytes (`Word8`). `subBlocks` is a function that, given `bSize` (the length of the sliding window/block), returns all substrings of that length. `distributionMap` builds a map from `ByteString` to `Int`, where each key is a substring of length `bSize` and the associated value is the number of occurrences of that substring in the file.
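For completeness: the report does not show the module header for these listings, so the preamble below is a plausible reconstruction rather than the author's exact code. In particular, `intcast` is not defined in the text; treating it as a plain numeric conversion is an assumption on my part.

```haskell
{-# LANGUAGE TemplateHaskell #-}

import           Control.Monad   (forM_)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import           Data.FileEmbed  (embedFile)
import qualified Data.Map.Strict as M

-- Assumed helper: the listings use `intcast` as a generic numeric conversion;
-- defining it as `fromIntegral` reproduces that behaviour.
intcast :: (Integral a, Num b) => a -> b
intcast = fromIntegral
```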
The next step is computing the entropy:
```haskell
-- Computes the empirical per-character entropy (in nats) for a given block size
entropy :: Int -> Double
entropy bSize =
    let dm           = distributionMap bSize
        n            = intcast $ BS.length file - bSize
        distribution = M.map (\x -> intcast x / n) dm
        entr         = negate $ sum $ map (\x -> x * log x) $ M.elems distribution
    in entr / intcast bSize
```
The variable `dm` retrieves a distribution map using the function introduced above. `n` is the number of all sub-blocks/windows inspected. `distribution` is the same map, but with modified values: each occurrence count is divided by `n`, turning counts into empirical probabilities. Finally, `entr` is the entropy: for every element `x` of `distribution` it computes `x * log x`, sums these terms and negates the sum; the result is then divided by `bSize` to obtain the entropy per character.
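In formula form, the quantity computed is (my reading of the code above; Haskell's `log` is the natural logarithm, so the values are in nats):

$$H_n = -\frac{1}{n} \sum_{b} \hat{p}(b) \ln \hat{p}(b), \qquad \hat{p}(b) = \frac{\#\{\text{occurrences of } b\}}{N - n},$$

where $n$ is the block size, $N$ is the length of the file in bytes, and the sum runs over all distinct blocks $b$ of length $n$ that occur in the file.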
Finally, the `entropy` function is run:
```haskell
main :: IO ()
main = forM_ [1..4] $ \i -> putStr (show i ++ ": ") >> print (entropy i)
```
That produces the following result:
n | entropy per char (nats) |
---|---|
1 | 0.8388 |
2 | 0.7048 |
3 | 0.6328 |
4 | 0.5665 |
The next step is applying an archiver algorithm to the data and comparing the practical compression results with the theoretical ones. I chose to use the gzip algorithm from a library to avoid the metadata that a console archiver may add. The results with the best compression settings (where $s(U)$ is the size of the uncompressed data and $s(C)$ the size of the compressed data) are:
s(U) (bytes) | s(C) (bytes) |
---|---|
513216 | 52218 |
As can be seen, the compression ratio is $s(C)/s(U) = 52218/513216 \approx 0.102$. That corresponds to roughly $8 \cdot 0.102 \approx 0.81$ bits, i.e. about $0.57$ nats, per source byte, which agrees well with the length-4 entropy estimate from part 1.
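The compression code itself is not shown in the report; the following is a minimal sketch of how such numbers could be obtained with the `zlib` package's gzip bindings, assuming the embedded `file` from part 1 and the `bestCompression` level:

```haskell
import           Codec.Compression.GZip (bestCompression, compressLevel,
                                         compressWith, defaultCompressParams)
import qualified Data.ByteString.Lazy   as BSL

-- Compresses the embedded `file` with the strongest gzip setting and
-- prints the uncompressed and compressed sizes in bytes.
compressionReport :: IO ()
compressionReport = do
    let uncompressed = BSL.fromStrict file
        compressed   = compressWith
            defaultCompressParams { compressLevel = bestCompression }
            uncompressed
    putStrLn $ "s(U) = " ++ show (BSL.length uncompressed)
    putStrLn $ "s(C) = " ++ show (BSL.length compressed)
```

Compressing in memory through the library keeps the output to the bare gzip stream, so essentially no archive metadata (file names, directory entries) inflates $s(C)$.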
To construct a more efficient encoding, one can consider:

- Extending the alphabet, introducing fixed blocks of initial alphabet characters (say, of size $n$) as new alphabet characters.
- Taking into account the relation of the current character to the previous ones.
Theorem 1.5.c states that $H_n(X) \ge H(X \mid X^{n-1})$, so a lower (better) entropy estimate is obtained when previous symbols are taken into account while constructing the encoding algorithm.
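Spelled out in the usual notation (my reading; the exact statement from the course notes is not quoted here):

$$H_n(X) = \frac{1}{n}\,H(X_1,\dots,X_n), \qquad H(X \mid X^{n-1}) = H(X_n \mid X_1,\dots,X_{n-1}),$$

i.e. the per-symbol entropy of length-$n$ blocks is never smaller than the conditional entropy of a symbol given the $n-1$ symbols preceding it.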
The most naive approach is to compute, following the instructions of part 1, the entropy for the first and the second parts of our data file (a sketch of the modified computation is given at the end of this section). Here are the results, obtained from a modified solution of part 1:
n | First part | Second part |
---|---|---|
1 | 0.9961 | 0.6724 |
2 | 0.8370 | 0.5597 |
3 | 0.7456 | 0.4951 |
4 | 0.6589 | 0.4407 |
It can be seen that, although the entropy of the first part is higher, it would be inefficient to construct an encoding algorithm (for example, one that encodes blocks of constant size) using statistics gathered from the first part only. Since the probability distributions of the two parts differ, any algorithm built in the context of this task from one part's statistics will be less efficient on the other.
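For reference, here is a sketch of how the part-1 code could be modified to produce the table above (it reuses `file`, `intcast`, `BS` and `M` from the earlier listings). The report does not show the exact split, so dividing the file into two equal halves is an assumption, and the generalized names (`subBlocksOf`, `entropyOf`, `mainParts`) are illustrative:

```haskell
-- Same computation as in part 1, but parameterized over the input ByteString.
subBlocksOf :: ByteString -> Int -> [ByteString]
subBlocksOf bs bSize =
    map (\i -> BS.take bSize $ BS.drop i bs)
        [0..(BS.length bs - bSize - 1)]

entropyOf :: ByteString -> Int -> Double
entropyOf bs bSize =
    let dm           = foldr (M.alter $ Just . maybe 1 (+1)) M.empty $ subBlocksOf bs bSize
        n            = intcast $ BS.length bs - bSize
        distribution = M.map (\x -> intcast x / n) dm
    in negate (sum $ map (\x -> x * log x) $ M.elems distribution) / intcast bSize

-- Split the file into two parts (here: exact halves) and compare their entropies.
mainParts :: IO ()
mainParts = do
    let (firstPart, secondPart) = BS.splitAt (BS.length file `div` 2) file
    forM_ [1..4] $ \i ->
        putStrLn $ show i ++ ": " ++ show (entropyOf firstPart i)
                          ++ " | " ++ show (entropyOf secondPart i)
```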