byu-now

BYU's "NOW" Corpus (News On the Web)

https://corpus.byu.edu/now/

Description

5.97 billion words / 6.0+ million texts (as of early Dec 2016; continually growing). 20 countries. The most up-to-date corpus of English. 4-5 million words added each day (130 million each month, 1.5 billion each year). Wide range of online newspapers and magazines (technology, entertainment, sports, politics, etc.). The NOW corpus (News on the Web) contains 5.9 billion words of data from web-based newspapers and magazines from 2010 to the present time. More importantly, the corpus grows by about 4-5 million words of data each day (from about 10,000 new articles), or about 150 million words each month.

This corpus was obtained from Mark Davies at BYU using the above URL in April, 2018.

File format

This corpus is available in three formats: Database (db), Word/lemma/POS (wlp), and plain text (text). Note that 10 out of every 200 tokens have been replaced by @ @ @ @ @ @ @ @ @ @ to get around copyright restrictions.
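When working with the plain-text format, it can help to locate the masked spans so they are not mistaken for real tokens. The following is a minimal sketch, assuming the placeholder appears exactly as shown above (ten `@` characters separated by whitespace); the function names and the sample sentence are made up for illustration and are not part of the corpus distribution.

```python
import re

# Assumption: a masked span is ten "@" tokens separated by whitespace,
# matching the placeholder shown in this README. Real files may differ.
MASK = re.compile(r"@(?:\s+@){9}")


def count_masks(text):
    """Count how many masked 10-token spans occur in the text."""
    return len(MASK.findall(text))


def split_on_masks(text):
    """Return the non-empty text fragments that lie between masked spans."""
    return [frag.strip() for frag in MASK.split(text) if frag.strip()]


# Hypothetical sample sentence, not taken from the corpus.
sample = "The council voted on @ @ @ @ @ @ @ @ @ @ after the meeting ended ."
print(count_masks(sample))
print(split_on_masks(sample))
```

Splitting on the mask, rather than deleting it, keeps the fragments on either side separate, so n-grams are not accidentally formed across a gap in the text.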

Note that this copy includes an update covering November 2016 to March 2018. For each of these months, there is a separate lexicon file and a separate sources file.

Restricted access

A campus-wide (academic multi-user) license was purchased for this corpus by Noah Smith and Yejin Choi (via Dallas Card and Eunsol Choi).

To be granted access to the BYU corpora, you must read and agree to the terms of the license (below).

After reading the license (pay careful attention to the distribution restrictions), contact one of the nlp-corpora maintainers to be granted access to the set of BYU corpora:

License

(The following is a copy of LICENSE.txt from the original corpus.)

Restrictions on use of the corpora

You must agree to these restrictions in order to obtain the data, or else obtain a waiver from us for a particular point listed below.

[ Data ]

  1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.

  2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc) just to those from the organization listed on the license agreement. Academic Single-User licenses do not allow the data to be distributed over a network.

  3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.

  4. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)
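
As an illustration of the "frequency bands" idea in item 4, the sketch below maps a word's rank to a coarse band label instead of exposing the rank itself. It is only an example under stated assumptions: the band boundaries are the ones from the license's own example, and `band_label`, `BAND_UPPER_BOUNDS`, and `MAX_BANDS` are hypothetical names. It is not an official tool and does not by itself guarantee compliance with the license.

```python
from bisect import bisect_left

# Assumption: inclusive upper bounds taken from the license's example
# (words 1-1000, 1001-3000, 3001-10,000). Ranks above the last bound
# fall into one open-ended band.
BAND_UPPER_BOUNDS = [1000, 3000, 10000]

# The license asks for no more than about 20 bands in an application.
MAX_BANDS = 20
assert len(BAND_UPPER_BOUNDS) + 1 <= MAX_BANDS


def band_label(rank, bounds=BAND_UPPER_BOUNDS):
    """Map a 1-based frequency rank to a coarse band label."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    i = bisect_left(bounds, rank)  # index of the first bound >= rank
    if i == len(bounds):
        return f"{bounds[-1] + 1}+"
    lower = 1 if i == 0 else bounds[i - 1] + 1
    return f"{lower}-{bounds[i]}"


print(band_label(304))
print(band_label(2500))
print(band_label(50000))
```

Only the band label would be shared with others; the exact rank and raw count stay inside the licensed organization.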

[ License ]

  1. Academic Single User license: the data can only be accessed by one single person. Otherwise, you will need an Academic Multi-User license (ACAD-2+).

  2. Academic licenses: are only valid for one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, they all need to purchase the data separately.

  3. Academic licenses: you can not use the data to create software or products that will be sold to others.

  4. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.

  5. Academic Multi-User and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.

  6. Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.

[ General ]

  1. Any publications or products that are based on the data should contain a reference to the source of the data: http://www.corpusdata.org.

  2. Note that a small, unique change will be made to each set of data, and this will serve as a "fingerprint" to identify you as the unique source of the datasets that you download. Automated Google searches are run daily to find copies of the data on the Web. If we find the data online and it is the data that was sent to you (and we will be able to determine that is the case), then you will be required to contact the administrators for that website, to have the data removed.