MT Archive Ingestion Checklist

Matt Post edited this page Jun 22, 2020 · 4 revisions

So you want to help ingest the many conferences in the MT Archive. Wonderful!

Important resources

We have converted the MT Archive into TSV files, roughly organized by venue and year:

https://github.com/mardub1635/mt-archive

There is a master list of conferences in a Google spreadsheet here:

https://docs.google.com/spreadsheets/d/1fpxmdV_BPwR6BQHyU9VJQxXeSOmy4__5nQCHBEviyAw/edit?usp=sharing

Basic task

Many of the volumes being ingested are special cases, partly because the MT Archive was created and managed by hand rather than generated from a database. These will require editorial judgment. Some volumes, however, are simple and straightforward, and for those the following ingestion process should work.

  1. Pick one of the venues in the mt-archive repo, under data/. For example, let's do 1994.amta.

  2. Find the corresponding page in the MT Archive. You can do this by (a) following the link in the Conference List spreadsheet or (b) poking around the MT Archive site. For 1994.amta, that brings us here.

  3. Verify that there is a bijection between the titles on the webpage and in the TSV file: every title on the webpage should appear exactly once in the TSV file, and vice versa.

  4. Check name spellings and similar details. Names with two parts (e.g., "Matt Post") can stay as they are. Names with more parts should be manually split into last and first names (e.g., "Van Durme, Benjamin"). There are many typos, so please take some time to find them. Please submit any corrections as a PR against that repo.

  5. Run the ingestion script (you need a copy of the acl-anthology repo).

    acl-anthology/bin/ingest_tsv.py mt-archive/data/amta/1994.amta.tsv mt-archive/data/amta/1996.amta.tsv mt-archive/conference-list.tsv
    

    (The conference-list.tsv file is just the Google spreadsheet above, exported to TSV.)

This will do two things: (a) create a file acl-anthology/data/xml/1994.amta.xml and (b) download and copy the PDFs to ~/anthology-files/pdf/amta/*.

  6. Add the XML file to the Anthology repo, and create a PR against our master branch. Tarball up the PDF files, and add them to the PR.
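
The manual checks in steps 3 and 4 can be partly automated. The following is a minimal sketch, not part of the acl-anthology or mt-archive tooling. It assumes you have copied the titles from the webpage into a list by hand, and that the TSV file has a header row. The column name `title` and the helper names (`read_column`, `compare_titles`, `names_needing_split`) are hypothetical, so adjust them to match the actual files.

```python
import csv
import io
from collections import Counter


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def read_column(tsv_text, column):
    """Return the values of one column from TSV text that has a header row.

    The column name is an assumption; check the header of the real file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column] for row in reader]


def compare_titles(web_titles, tsv_titles):
    """Return (missing_from_tsv, missing_from_web) as sorted lists.

    Titles are compared as multisets, so a title that is duplicated on
    one side only is reported too. Both lists are empty exactly when the
    normalized titles are in one-to-one correspondence.
    """
    web = Counter(normalize(t) for t in web_titles)
    tsv = Counter(normalize(t) for t in tsv_titles)
    missing_from_tsv = sorted((web - tsv).elements())
    missing_from_web = sorted((tsv - web).elements())
    return missing_from_tsv, missing_from_web


def names_needing_split(names):
    """Flag names with more than two parts and no comma.

    These are the ones that need a manual "Last, First" split.
    """
    return [n for n in names if len(n.split()) > 2 and "," not in n]
```

For example, `names_needing_split(["Matt Post", "Benjamin Van Durme"])` flags only the second name. A check like this only narrows down where to look; typos in titles and names still need a human eye.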