MT Archive Ingestion Checklist
So you want to help ingest the many conferences in the MT Archive. Wonderful!
We have converted the MT Archive into TSV files, roughly organized by venue and year:
There is a master list in a Spreadsheet here:
https://docs.google.com/spreadsheets/d/1fpxmdV_BPwR6BQHyU9VJQxXeSOmy4__5nQCHBEviyAw/edit?usp=sharing
Note that many of the volumes being ingested are special cases, partly because the MT Archive was created and managed by hand rather than generated from a database. These will require editorial judgment. Some volumes, however, are simple and straightforward. For such files, the following ingestion process should work.
- Pick one of the venues in the mt-archive repo, under data/. For example, let's do 1994.amta.
- Find the corresponding page in the MT Archive. You can do this by (a) following the link in the Conference List spreadsheet or (b) poking around the MT Archive site. For 1994.amta, that brings us here.
- Verify that there is a bijection between the titles on the webpage and those in the TSV file: every title on the page should appear exactly once in the TSV, and vice versa.
- Check name spellings and so on. Names with two parts (e.g., "Matt Post") can stay as they are. Names with more parts should be manually split into last and first names (e.g., "Van Durme, Benjamin"). There are many typos, so please take some time to find them. Please submit any corrections as a PR against the mt-archive repo.
- Run the ingestion script (you need a copy of the acl-anthology repo).
  acl-anthology/bin/ingest_tsv.py mt-archive/data/amta/1994.amta.tsv mt-archive/data/amta/1996.amta.tsv mt-archive/conference-list.tsv
  (The conference-list.tsv file is just the Google spreadsheet above, exported to TSV.)
  This will do two things: (a) create a file acl-anthology/data/xml/1994.amta.xml and (b) download and copy the PDFs to ~/anthology-files/pdf/amta/*.
- Add the XML file to the Anthology repo, and create a PR against our master branch. Tarball up the PDF files, and add them to the PR.
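The title and name checks in the steps above can be partly automated before you do the manual pass. The following is a minimal sketch, not part of either repo. It assumes the TSV has a header row and a column named `title`, which is a guess about the file layout, and that you have copied the webpage titles into a list by hand. The helper names (`read_column`, `compare_titles`, `names_needing_split`) and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def normalize(title):
    """Collapse whitespace and case so trivial differences are not reported."""
    return " ".join(title.split()).casefold()


def read_column(tsv_file, column):
    """Read one column from an open TSV file that has a header row.

    The column name is an assumption; check the header of the real file.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in reader]


def compare_titles(web_titles, tsv_titles):
    """Return the titles that break a one-to-one match between the two lists.

    Counters are used so that a title listed twice on one side and once on
    the other is also reported.
    """
    web = Counter(normalize(t) for t in web_titles)
    tsv = Counter(normalize(t) for t in tsv_titles)
    return {
        "only_on_web": sorted((web - tsv).elements()),
        "only_in_tsv": sorted((tsv - web).elements()),
    }


def names_needing_split(names):
    """Flag names with more than two parts that have no comma yet."""
    return [n for n in names if "," not in n and len(n.split()) > 2]


if __name__ == "__main__":
    # Made-up sample data standing in for a real TSV file and webpage.
    sample_tsv = io.StringIO(
        "title\tauthors\n"
        "A First Paper\tMatt Post\n"
        "A Second Paper\tBenjamin Van Durme\n"
    )
    tsv_titles = read_column(sample_tsv, "title")
    web_titles = ["A first  paper", "A Third Paper"]
    print(compare_titles(web_titles, tsv_titles))
    print(names_needing_split(["Matt Post", "Benjamin Van Durme"]))
```

Both result lists being empty suggests the titles match one to one, up to case and spacing. Typos in titles show up as a pair of near-identical entries, one on each side, so the output still needs a human look. The name check only flags candidates, and the split into last and first name remains an editorial decision.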