MT Archive Ingestion Checklist
So you want to help ingest the many conferences in the MT Archive. Wonderful!
The MT Archive is a large collection of papers in machine translation reaching back to the beginning of the field. The site was created and maintained by John Hutchins until around 2015 as a wonderful service to the community. The one downside is that he built all of the resources by hand, using Microsoft Word to make documents and then exporting them to HTML.
Through a collaboration with the EAMT, the Anthology has undertaken the two-step task of (a) digitizing the resources in the Archive and (b) ingesting them into the Anthology. We have mostly completed step (a), and are now hoping to crowdsource step (b), ingesting conferences one by one. This document will tell you how to do that.
We have converted the MT Archive into TSV files, roughly organized by venue and year. There is a master list in a spreadsheet here:
https://docs.google.com/spreadsheets/d/1fpxmdV_BPwR6BQHyU9VJQxXeSOmy4__5nQCHBEviyAw/edit?usp=sharing
Note that many of the volumes being ingested are special cases, partly because the MT Archive was created and managed by hand rather than from a database. These will require editorial judgment. Some volumes, however, are straightforward, and for those the following ingestion process should work.
- Pick one of the venues in the mt-archive repo, under data/. For example, let's do 1994.amta.
- Find the corresponding page in the MT Archive. You can do this by (a) following the link in the Conference List spreadsheet or (b) poking around the MT Archive site. For 1994.amta, that brings us here.
- Verify that there is a bijection between the titles on the webpage and in the TSV file: every title on the page should appear exactly once in the TSV file, and vice versa.
- Check name spellings and so on. Names with two parts (e.g., "Matt Post") can stay as they are. Names with more parts should be manually split into last and first names (e.g., "Van Durme, Benjamin"). There are many typos, so please take some time to find them. Please submit any corrections as a PR against the mt-archive repo.
- Run the ingestion script (you need a copy of the acl-anthology repo).

      acl-anthology/bin/ingest_tsv.py mt-archive/data/amta/1994.amta.tsv mt-archive/data/amta/1996.amta.tsv mt-archive/conference-list.tsv

  (The conference-list.tsv file is just the Google spreadsheet above, exported to TSV.)

  This will do two things: (a) create a file acl-anthology/data/xml/1994.amta.xml and (b) download and copy the PDFs to ~/anthology-files/pdf/amta/*.
- Add the XML file to the Anthology repo, and create a PR against our master branch. Tarball up the PDF files, and add them to the PR.
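
The title and name checks in the steps above can be partly automated. The following is a minimal sketch, not part of either repo. It assumes you have pasted the titles from the web page into a list, and that the TSV file starts with a header row containing columns named "title" and "authors". Both column names are assumptions, so adjust them to whatever the file actually uses. All function names are hypothetical.

```python
import csv
import io
import unicodedata
from collections import Counter


def normalize(title):
    """Case-fold and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(unicodedata.normalize("NFKC", title).casefold().split())


def read_column(tsv_text, column):
    """Read one column from TSV text that starts with a header row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row[column] for row in reader]


def compare_titles(page_titles, tsv_titles):
    """Return (missing_from_tsv, extra_in_tsv); both empty means a one-to-one match."""
    page = Counter(normalize(t) for t in page_titles)
    tsv = Counter(normalize(t) for t in tsv_titles)
    return sorted((page - tsv).elements()), sorted((tsv - page).elements())


def names_needing_split(names):
    """Flag names with more than two parts that have not yet been split with a comma."""
    return [n for n in names if "," not in n and len(n.split()) > 2]


if __name__ == "__main__":
    # Made-up sample data; the "title" and "authors" column names are assumptions.
    tsv_text = (
        "title\tauthors\n"
        "A Sample Paper\tMatt Post\n"
        "Another Paper\tBenjamin Van Durme\n"
    )
    page_titles = ["A sample paper", "A Third Paper"]
    print(compare_titles(page_titles, read_column(tsv_text, "title")))
    print(names_needing_split(read_column(tsv_text, "authors")))
```

Comparing multisets rather than sets means a title duplicated in the TSV file is reported as extra instead of silently matching. The sketch treats each author cell as a single name, so a cell listing several authors would need to be split first. It only flags candidates; typos and the actual last/first split still need a human eye.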
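
For the final step, the PDFs can be bundled with any tar tool. As one possible approach, here is a small sketch using Python's standard tarfile module. The function name and the archive name are hypothetical, and it assumes the downloaded files end in .pdf and sit directly in the venue directory named in the ingestion step.

```python
import tarfile
from pathlib import Path


def pack_pdfs(pdf_dir, archive_path):
    """Write every PDF in pdf_dir into a gzip-compressed tarball.

    Returns the member names in the order they were added.
    """
    pdf_dir = Path(pdf_dir).expanduser()
    members = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for pdf in sorted(pdf_dir.glob("*.pdf")):
            # Store each file under the venue directory name, e.g. amta/<file>.pdf
            arcname = f"{pdf_dir.name}/{pdf.name}"
            tar.add(str(pdf), arcname=arcname)
            members.append(arcname)
    return members
```

Calling `pack_pdfs("~/anthology-files/pdf/amta", "amta-pdfs.tar.gz")` would write the archive to the current directory; the returned list makes it easy to confirm that the number of PDFs matches the number of papers in the TSV file.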
There are many possible complications, such as those listed below. If you encounter one of these, it will likely have to be handled by the Anthology director, so please move on to a simpler conference.
- A conference with attached workshops
- Conferences with multiple volumes
Our first priority is the conferences EAMT, AMTA, and MT Summit. We prefer to ingest them chronologically.