This is a Haskell implementation of Python's Construct library. It provides a succinct and easy way to specify data formats. Before you get to the succinct part, though, you'll probably need a bunch of extensions and imports:
{-# LANGUAGE FlexibleInstances, StandaloneDeriving, TemplateHaskell #-}
module README where
import Data.Functor.Identity (Identity(Identity))
import Data.Word (Word8)
import Data.ByteString (ByteString)
import qualified Data.ByteString as ByteString
import qualified Data.ByteString.Char8 as ASCII
import qualified Rank2.TH
import Text.ParserCombinators.Incremental.LeftBiasedLocal (Parser, completeResults, feed, feedEof)
import Construct
import Prelude hiding ((*>), (<*))
With that out of the way, let's take the simple example format from the original. Here's what its specification looks like in Haskell:
data BitMap f = BitMap{
width :: f Word8,
height :: f Word8,
pixels :: f [[Word8]]
}
deriving instance Show (BitMap Identity)
$(Rank2.TH.deriveAll ''BitMap)
format :: Format (Parser ByteString) Maybe ByteString (BitMap Identity)
format = literal (ASCII.pack "BMP") *> mfix (\this-> record
BitMap{
width= byte,
height= byte,
pixels= count (fromIntegral $ height this) $
count (fromIntegral $ width this) byte
})
There are two parts to the specification.
The data BitMap
declaration specifies the in-memory layout of a simple bitmap. Note that it's declared as a record
with every field wrapped in a type constructor parameter. This declaration style is sometimes called Higher-Kinded
Data. To regain a regular record, just instantiate the
parameter to Identity
— a BitMap Identity
contains exactly one value of each field.
The other part of the specification is the format
definition that specifies the bi-directional mapping between the
in-memory and the serialized form of the bitmap. A bijection, to be precise. The two definitions are enough to automatically
serialize the in-memory record form into the binary form:
-- |
-- >>> let record = BitMap{width= pure 3, height= pure 2, pixels= pure [[7,8,9], [11,12,13]]}
-- >>> serialize format record
-- Just "BMP\ETX\STX\a\b\t\v\f\r"
and to parse the serialized binary form back into the record structure:
-- |
-- >>> let bytes = ASCII.pack "BMP\ETX\STX\a\b\t\v\f\r"
-- >>> completeResults $ feedEof $ feed bytes $ parse format
-- [(BitMap {width = Identity 3, height = Identity 2, pixels = Identity [[7,8,9],[11,12,13]]},"")]
Examples of more complex and realistic formats can be found in the
test
directory.
I owe the inspiration for this library to Yair Chuchem and his post that introduced me to Construct. I must also express gratitude to the authors of the original library of course. And finally, to the authors of the paper Invertible Syntax Descriptions:Unifying Parsing and Pretty Printing which I remembered reading just in time to avoid following some bad ideas.
I had to overcome two problems in the course of the implementation. The first difficulty, already touched on in the
aforementioned blog post, is how to convert a record of formats into a format of the record. As the author of
rank2classes, I went for an obvious solution: parameterize the
record as seen in the example, make it an instance of the
Rank2.Traversable
class,
and apply Rank2.traverse
to
it.
That little trick alone would have been enough for a nearly complete package, except the Python library also enables a record field to refer to any of the preceding fields. This is not a problem when serializing, but when parsing it means that a parser must be able to refer to an already-parsed part of the value that's still being parsed.
The standard solution to the problem of accessing the results (parsed values) of a computation (parsing) while within
the computation is known as the
MonadFix
class, though
perhaps not as widely as it deserves. You won't find any parsers in the list of its instances, though, and for a good
reason - it's quite impossible for a parser to obtain a value that it hasn't parsed yet. All we need, luckily, is
access an already parsed part of a partially parsed structure. A limited MonadFix
instance can do that, provided
that we also use a specialized form of Rank2.traverse
capable of preserving the record structure while it's being
parsed.
As it happens, I have an old parser library named
incremental-parser on Hackage, and the name seems quite
appropriate for what I'm doing here. I added the necessary functionality there, but another parser combinator library
should be capable of the same feat. It just needs to implement the mfix
and fixSequence
combinators and it can be
used with the present library.