dmvianna/conduit-patents

conduit-patents

I’m learning how to parse big CSV files in Haskell. This is my fourth attempt. I’ll be trying things that are (hopefully) almost directly translatable to my work, such as parsing addresses out of free text. Good luck to me!

The dataset

The data we’ll be analysing are the public records of Australian patents. You can get the CSV files from data.gov.au. No, I will not include 785 MB of (compressed) data in this repository. However, you can find the head of that file in the ./data directory, obtained with:

$ head -n 20 IPGOD.IPGOD122B_PAT_ABSTRACTS.csv > pat_abstracts.csv

What I plan to achieve

1. Read a stream

Read a CSV file as a stream, so I don’t need to load the entire thing to work on it.
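A minimal sketch of what this could look like, assuming the conduit package. The names sourceLines and countLines are made up for illustration. Note that it splits on newlines only, so it is not a real CSV parser: a quoted field containing a line break would be cut in two, and a proper CSV decoder would have to replace linesUnboundedC.

```haskell
import Conduit
import Data.Text (Text)

-- | Stream a file as lines of Text, chunk by chunk, in constant memory.
-- Assumption: rows never contain embedded newlines (not true of CSV in general).
sourceLines :: MonadResource m => FilePath -> ConduitT i Text m ()
sourceLines path =
  sourceFile path          -- read raw bytes in chunks
    .| decodeUtf8LenientC  -- bytes to Text, replacing invalid sequences
    .| linesUnboundedC     -- re-chunk the Text into one item per line

-- | Count the lines of a file without ever holding the whole file in memory.
countLines :: FilePath -> IO Int
countLines path = runConduitRes $ sourceLines path .| lengthC
```

In GHCi, countLines "data/pat_abstracts.csv" should then report the number of lines in the sample file (the path is an assumption based on the ./data directory mentioned above).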

2. Inspect the stream

I want to be able to inspect the stream using something like take or show with indexing. I assume I would be doing it in GHCi.
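One possible shape for this, again assuming the conduit package. peek and itemAt are hypothetical helper names, not part of any library; they only pull as many items from the source as they need.

```haskell
import Conduit

-- | The first n items of a source, as a list that GHCi can print.
peek :: Monad m => Int -> ConduitT () a m () -> m [a]
peek n src = runConduit $ src .| takeC n .| sinkList

-- | The item at a zero-based index, or Nothing if the stream is too short.
itemAt :: Monad m => Int -> ConduitT () a m () -> m (Maybe a)
itemAt i src = runConduit $ src .| (dropC i >> headC)
```

For a file-backed source the call needs a ResourceT wrapper, for example runResourceT (peek 3 (sourceFile "data/pat_abstracts.csv" .| decodeUtf8LenientC .| linesUnboundedC)), where the path is assumed from the ./data directory above.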

3. Extract relevant info

Extract relevant information, such as addresses, from unstructured text. That’s a big part of what I do for work, and the main motivation for looking beyond Python. I want to move away from regular expressions and do it fast.
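As a rough illustration of the non-regex route, here is a sketch using parser combinators from the attoparsec package. It looks for the first Australian state abbreviation followed by a four-digit postcode anywhere in a piece of text. The Locality type, the parser names, and the simplified "STATE 1234" format are all assumptions made for this example; real addresses need far more care (word boundaries, for one).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Control.Monad (when)
import Data.Attoparsec.Text
  (Parser, anyChar, choice, count, digit, parseOnly, peekChar, skipMany1, space, string)
import Data.Char (isDigit)
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical result type: a state abbreviation and a postcode.
data Locality = Locality { locState :: Text, locPostcode :: Text }
  deriving (Eq, Show)

-- | One of the Australian state or territory abbreviations (case-sensitive).
stateP :: Parser Text
stateP = choice (map string ["NSW", "VIC", "QLD", "TAS", "ACT", "WA", "SA", "NT"])

-- | A state, some whitespace, then exactly four digits.
localityP :: Parser Locality
localityP = do
  st <- stateP
  skipMany1 space
  pc <- count 4 digit
  next <- peekChar
  when (maybe False isDigit next) (fail "postcode has more than four digits")
  pure (Locality st (T.pack pc))

-- | Try the parser at every position in the text; return the first match.
firstLocality :: Text -> Maybe Locality
firstLocality = either (const Nothing) Just . parseOnly scan
  where
    scan = localityP <|> (anyChar *> scan)
```

Because firstLocality is a plain function, it can be dropped into a stream with mapC without any further plumbing.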

4. Filter the stream

After parsing we must be able to subset the stream according to boolean constraints. These must be composable.
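A small sketch of composable constraints, assuming the conduit and text packages. Predicates are ordinary functions, combined with two made-up operators (.&&. and .||.), and applied to the stream with filterC. The names Pred, mentions, and keep are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Text (Text)
import qualified Data.Text as T

-- | A boolean constraint on stream items.
type Pred a = a -> Bool

-- | Combine two predicates with logical AND / OR.
(.&&.), (.||.) :: Pred a -> Pred a -> Pred a
p .&&. q = \x -> p x && q x
p .||. q = \x -> p x || q x
infixr 3 .&&.
infixr 2 .||.

-- | Case-insensitive substring test.
mentions :: Text -> Pred Text
mentions needle = T.isInfixOf (T.toLower needle) . T.toLower

-- | Keep only the items that satisfy the predicate.
keep :: Monad m => Pred a -> ConduitT a a m ()
keep = filterC
```

For example, keep (mentions "valve" .&&. (not . mentions "pump")) passes through rows that mention valves but not pumps.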

5. Tabular results

Reshape results into tabular form as a prelude to exporting to CSV or a database.

6. Group by

At some point we will want to analyse results for things such as counts.
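Counting per group can be sketched as a strict fold over the stream into a Map, assuming the conduit and containers packages. countBy is a hypothetical name. Only the map of counts is kept in memory, not the rows themselves.

```haskell
import Conduit
import qualified Data.Map.Strict as Map

-- | Consume a stream and count how many items fall under each key.
countBy :: (Monad m, Ord k) => (a -> k) -> ConduitT a o m (Map.Map k Int)
countBy key = foldlC step Map.empty
  where
    -- Add one to the running count for this item's key.
    step acc x = Map.insertWith (+) (key x) 1 acc
```

This works as long as the number of distinct keys is small enough to fit in memory, which is the usual case for things like counts per country or per year.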

7. Output

Encode results back into an output file, or send them to a database.
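For the file case, a hand-rolled sketch assuming the conduit and text packages. Each row is a list of Text fields, quoted in the usual CSV way (fields containing a comma, double quote, or line break are wrapped in double quotes, with inner quotes doubled). renderField, renderRow, and sinkCsv are made-up names, and a dedicated CSV library would be a reasonable replacement for the rendering part.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Text (Text)
import qualified Data.Text as T

-- | Quote a field only when it contains a character that needs it.
renderField :: Text -> Text
renderField f
  | T.any (`elem` [',', '"', '\n', '\r']) f =
      "\"" <> T.replace "\"" "\"\"" f <> "\""
  | otherwise = f

-- | Join the fields of one row with commas.
renderRow :: [Text] -> Text
renderRow = T.intercalate "," . map renderField

-- | A sink that writes each incoming row as one line of a UTF-8 file.
sinkCsv :: MonadResource m => FilePath -> ConduitT [Text] o m ()
sinkCsv path = mapC renderRow .| unlinesC .| encodeUtf8C .| sinkFile path
```

A pipeline would then end with something like runConduitRes (rows .| sinkCsv "out.csv"), where rows is whatever source of [Text] the earlier steps produce.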

How can I achieve it?

Finally, I eagerly welcome help to move this forward. Get in touch!
