I’m learning how to parse big CSV files in Haskell. This is my fourth attempt. I’ll be trying things that are (hopefully) almost directly translatable to my work, such as parsing addresses out of free text. Good luck to me!
The data we’ll be analysing are the public records of Australian patents. You can get the CSV files from data.gov.au. No, I will not include 785 MB of (compressed) data in this repository. However, you can find the head of that file in the ./data directory, obtained with:
$ head -n 20 IPGOD.IPGOD122B_PAT_ABSTRACTS.csv > pat_abstracts.csv
Read a CSV file as a stream, so I don’t need to load the entire thing to work on it.
Inspect the stream using something like take or show with indexing. I assume I would be doing this in GHCi.
Parse data from unstructured text, such as addresses. That’s a big part of what I do for work, and the main motivation for looking beyond Python. I want to move away from regular expressions and still have it run fast.
After parsing we must be able to subset the stream according to boolean constraints. These must be composable.
Reshape results in tabular form as a prelude to exporting to CSV or a database.
At some point we will want to analyse results for things such as counts.
Encode results back into an output file, or send them to a database.
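To make the "composable boolean constraints" goal a little more concrete, here is a minimal sketch using only base. Everything in it is an assumption for illustration: the Patent record, its fields, the sample rows, and the helper names (.&&., .||., notP, mentions, idAtLeast) are hypothetical and do not reflect the real columns of the patents file. A plain list stands in for the stream; because lists are lazy, filter only examines as many rows as the consumer asks for.

```haskell
module Main where

import Data.Char (toLower)
import Data.List (isInfixOf)

-- Hypothetical record type; the real CSV columns may well differ.
data Patent = Patent
  { patentId :: Int
  , abstract :: String
  } deriving (Show, Eq)

-- A predicate is just a function to Bool, so combining predicates
-- is ordinary function composition.
type Pred a = a -> Bool

(.&&.), (.||.) :: Pred a -> Pred a -> Pred a
p .&&. q = \x -> p x && q x
p .||. q = \x -> p x || q x

-- Same precedences as (&&) and (||), so mixed expressions read naturally.
infixr 3 .&&.
infixr 2 .||.

notP :: Pred a -> Pred a
notP p = not . p

-- Case-insensitive substring match on the abstract.
mentions :: String -> Pred Patent
mentions word = isInfixOf (map toLower word) . map toLower . abstract

idAtLeast :: Int -> Pred Patent
idAtLeast n = (>= n) . patentId

-- Made-up rows standing in for parsed CSV records.
sample :: [Patent]
sample =
  [ Patent 1 "A solar panel mounting bracket"
  , Patent 2 "Method for brewing coffee"
  , Patent 3 "Solar powered coffee grinder"
  ]

main :: IO ()
main = do
  -- Reads as: (solar AND NOT coffee) OR id >= 3
  let wanted = mentions "solar" .&&. notP (mentions "coffee") .||. idAtLeast 3
      hits   = filter wanted sample
  mapM_ print hits
  putStrLn ("count: " ++ show (length hits))
```

Loading this in GHCi and evaluating something like take 1 (filter (mentions "solar") sample) is one way to poke at the subset interactively. With a real streaming library the same predicates should carry over unchanged, since they only depend on the record type and not on how the rows arrive.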
Finally, I eagerly welcome help to move this forward. Get in touch!