Skip to content

Latest commit

 

History

History
265 lines (177 loc) · 13.3 KB

README.md

File metadata and controls

265 lines (177 loc) · 13.3 KB

Multilingual Open Text (MOT)

This is the repository for Multilingual Open Text (MOT), a project of the Broadening Linguistic Technologies (BLT) Lab at Brandeis University. MOT was created by Chester Palen-Michel, June Kim, and Constantine Lignos. This work was supported by a 2021 Brandeis University Provost Research Grant.

If you use the corpus please cite our LREC 2022 paper:

@InProceedings{palenmichel-kim-lignos:2022:LREC,
  author    = {Palen-Michel, Chester  and  Kim, June  and  Lignos, Constantine},
  title     = {Multilingual Open Text Release 1: Public Domain News in 44 Languages},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {2080--2089},
  abstract  = {We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001--2022 and collected from Voice of America's news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed using a creative commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.},
  url       = {https://aclanthology.org/2022.lrec-1.224}
}

Releases

The latest version of the MOT data can always be found at our latest GitHub release.

Languages

The current release contains 43 languages: Albanian (sqi), Amharic (amh), Armenian (hye), Azerbaijani (aze), Bambara (bam), Bangla (ben), Bosnian (bos), Burmese (mya), Dari (prs), English (eng), French (fra), Georgian (kat), Greek (ell), Haitian Creole (hat), Hausa (hau), Indonesian (ind), Khmer (khm), Kinyarwanda (kin), Korean (kor), Kurdish (kur), Lao (lao), Lingala (lin), Macedonian (mkd), Mandarin (cmn), Northern (nde), Oromo (orm), Pashto (pus), Persian (fas), Portuguese (por), Russian (rus), Serbian (srp), Shona (sna), Somali (som), Spanish (spa), Swahili (swh), Thai (tha), Tibetan (bod), Tigrinya (tir), Turkish (tur), Ukranian (ukr), Urdu (urd), Uzbek (uzb), and Vietnamese (vie).

Release Layout

The data is released in one gzipped tar file per crawled site in the source data. Each site file is prefixed with an ISO 639-3 code denoting its language.

There are sometimes multiple sites per language. For example, in English (language code eng), there's the main news site at https://www.voanews.com/, the editorials site at https://editorials.voa.gov/, and a site for learning English at https://learningenglish.voanews.com/.

Downloading and Decompressing the Latest Release

All command-line instructions in this section require the bash shell and cloning/downloading this repository.

We have provided two scripts to help download and decompress all the data. Since they download all sites (currently 5.6GB compressed), they take a while to run. If you only want a handful of sites, it's probably easiest to download them manually.

The fastest way to download the data is to set up the GitHub CLI, which allows for much faster release downloads. Once you have set it up, run gh_download_latest_release.sh.

If you don't have the GitHub CLI available, run download_latest_release.sh instead.

Both of the download scripts place compressed files (one per site) in the release directory. To decompress the downloaded files, run decompress_latest_release.sh.

Sentence Segmentation and Tokenization

Each JSON document in the release has paragraphs and n_paragraphs fields. These contain the text of each website divided by paragraphs and the number of paragraphs, respectively. We provide sentence segmentation and tokenization for all languages in MOT, which can be accessed with the fields sentences, n_sentences, tokens, and n_tokens.

Working with the Data

Overview

The motext script contains two commands that assist in accessing data in the MOT corpus.

To install motext, run pip install motext. (If you are working with a clone of the repository and want to make changes to motext, you can run pip install -e . from the root of the clone.)

Currently, two commands are supported by this script: search and extract. For a description of these commands, run motext --help:

Usage: motext [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  extract  Extract json documents into text files in the output directory.
  search   Search for json files with the keyword string in source.

Extract

Extracting given a source directory

The extract command takes in a folder and extracts text from the JSON files within it.

For example, let's say you have downloaded and decompressed the data so you have the VOA Zimbabwe site data located at release/eng_voazimbabwe. You want to extract the text of the news articles to text/eng_voazimbabwe, with each line containing space-separated tokens. You would run:

motext extract sentences release/eng_voazimbabwe text/eng_voazimbabwe --types article

This will produce a new directory text/eng_voazimbabwe containing an article subdirectory, just like in the source data. Extracted files will be sorted by content type in their respective directories.

The arguments are as follows:

  • units: Choose from [sentences, tokens, paragraphs]. Extract will print one sentence/paragraph per line or a sentence of space-separated tokens per line.
  • source: This is the source directory from which data will be extracted.
  • output_dir: This is the folder to which you would like the data extracted. If output_dir does not exist, a new directory will be created.
  • (optional) --num-files N: The maximum number of files to extract from source.
  • (optional) --max-per-file N: Allows controlling the maximum length of each extracted file. If extracting paragraphs, this limits the number of paragraphs; if extracting sentences or tokenized sentences, this limits the number of sentences.
  • (optional) --types type1,type2: The content types to extract from the source directory (choose from article, audio, video, and photo). By default, all available content types will be used. Select multiple types by separating with a comma, for example --types article,audio.
  • (optional) --include-title: Whether to include the title at the top of each document on its own line.
  • (optional) --include-authors: Whether to include authors below the title, each on their own line.

Say you would like to extract sentences from swh_voaswahili of all content types (this will include article, audio, video, and photo) to the directory output_folder including titles and authors.

Run the following:

motext extract tokens swh_voaswahili output_folder --include-title --include-authors

If instead you want one paragraph per line, run:

motext extract paragraphs swh_voaswahili output_folder --include-title --include-authors

If you only want 6 files total, run:

motext extract sentences swh_voaswahili output_folder --num-files 6 --include-title --include-authors

If you want to constrain the number of sentences per file to 7, run:

motext extract sentences swh_voaswahili output_folder --max-per-file 7

If you want all audio and photo content, run:

motext extract sentences swh_voaswahili output_folder --types audio,photo

Note that content types are separated by a comma and no spaces.

Extracting given a source text file

The source argument of extract can also be a text file containing paths directly to json files. Most efficiently, this will be a text file produced by the search function, outlined in the Search section below. To use extract in this way, it is the same syntax as if the text file were a directory. An example of this usage is as follows:

motext extract sentences filter_text_file.txt output_folder

Search

The search function allows the user to produce a text file of paths to json files that are tagged by a keyword. To call search, run:

motext search source output_dir filename keyword

Say you would like a list of all articles in swh_voaswahili that are tagged with "Afrika".

Run:

motext search swh_voaswahili searches_dir afrika_search Afrika

The list will be stored in afrika_search.txt in searches_dir.

Now say you want to constrain the content types you are searching through to only audio and videos.

Run:

motext search swh_voaswahili searches_dir afrika_search Afrika --types audio,video

Scraping, Extraction, and Creating Releases

This repository contains all the code used to create version 1 of MOT. While we provide this for transparency, replication, and in case it will be useful to others, we do not recommend using it due to its complexity. However, documentation for our release creation process is below.

Setup

We recommend using a conda environment when working with the codebase. Use Python 3.8 or higher. Install dependencies with pip install -r requirements.txt.

You will need to install MongoDB to store scraped documents. mongo installation instructions. To start the database: mongod --dbpath voa-mongodb/ --wiredTigerCacheSizeGB 16 --port 27200 We specify a specfic path to store the database and a specific port and limit the cache size.

To dump or restore the DB from a past archive: mongodump --port 27200 --archive=dump-7.30.21.gz --gzip mongorestore --port 27200 --archive=dump-7.30.21.gz --gzip Use --bypassDocumentValidation flag if the backedup db doesn't have all documents passing validation.

Running Scraping and Extraction

Run downloadsitemaps.py to get fresh sitemaps of VOA. This requires the voa-domain.tsv file with the different VOA domains. python extraction/downloadsitemaps.py voa-domains.tsv sitemaps-10.27.21 filemap-10.27.21.tsv (Sometimes this fails with 503 error, just run again if needed)

There are two ways to scrape. You can scrape from scratch or you can scrape with only the new urls after comparing with prior sitemaps.

Updating the scrape with only new urls

Diff the new sitemap with whatever the most recent previous sitemap is. python scripts/comparesitemaps.py filemap-8.16.21.tsv filemap-10.27.21.tsv --early-sitemap-dir sitemaps-8.16.21/ --late-sitemap-dir sitemaps-10.27.21/ --outdir sitemap-diff-10.27.21/

Back up the database if it hasn't been backed up lately: mongodump --port 27200 --archive=dump-7.30.21.gz --gzip

Scrape using the diffed sitemap urls: python extraction/scraper.py update sitemap-diff-10.27.21/new_urls-filemap-10.27.21.tsv --port 27200

Scraping from scratch

Skip comparesitemaps.py and use python extraction/scraper.py scrape filemap-10.27.21.tsv sitemaps-10.27.21 --port 27200

Dump documents: This step can be skipped if extracting from the mongo database directly. Skipping this saves a lot of wasted space writing to disk. python extraction/dump_documents.py <outdir> <filemap> --n-processes 20 If including the custom models download them from custom-ersatz-models or train your own models using Ersatz and use the flag --custom-segmentation-dir <custom-models-dir-path>.

Run extraction script: It is currently recommended without GPUs and just use a high number of cpus for extraction. extracttext.py can be run on dumped json documents from the database or can be run from the database directly. Use fromdb to query the database directly and do extraction without writing intermediary files. Use fromfiles to run extract text from intermediate json files dumped from the db.

Sample call with parameters that seem ok on our dev machine. time python extraction/extracttext.py fromdb ~/mot/extractions-03.02.22/ --port 27200 --n-extractors 50 --n-db-queriers 10 --batchsize 100 --filemap filemap-03.01.22.tsv --start-date 2001-01-01

In the event that the extraction script hangs, this is likely caused by one process crashing. Run with printing redirected to a log and search the log to see what went wrong. The rest of the processes will likely finish the rest of the work, but you may lose a handful of documents unless the error is fixed.

One-off Scripts

The directory scripts contains a number of one-off scripts that we used briefly but are not part of the main extraction process.

Quality Checks

The directory qualitychecks contains some scripts for analysis of the corpus.

Making a new release

Run release.sh <extractions-dir> <releaseable-extractions-dir> to filter categories that we do not include and create tgz files.

Install gh if it isn't already installed. conda install gh --channel conda-forge Login with gh auth login. Follow the steps for logging in through a browser. Create a release draft on github. gh release upload <release number> <dir with the final extractions for release>/*.tgz Check everything is uploaded and publish the release on github.