Version 1.0
Authors: Alan Juffs, Na-Rae Han, Ben Naismith
Contact: [email protected]
This repository contains the dataset, as well as additional tools and tutorials, for the University of Pittsburgh English Language Institute Corpus (PELIC).
Corpus citation:
Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977
- Overview
- Corpus description
- Data collection and processing
- Dataset contents
- Additional resources
- Pitt ELI Toolkit (pelitk)
- PELIC spelling
- Future data release
- References
- License
This `README.md` file introduces the dataset for the Pittsburgh English Language Institute Corpus (PELIC), a large learner corpus of written and spoken texts. These texts were collected in an English for Academic Purposes (EAP) context over seven years in the University of Pittsburgh’s Intensive English Program, and were produced by students with a wide range of linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are cross-sectional (Callies, 2015), PELIC is quasi-longitudinal, offering greater opportunities for tracking development in a natural classroom setting.
In Section 2 we describe some of the key characteristics of this corpus, and Section 3 addresses how the data were collected and processed. Section 4 provides information about each of the corpus files in the repository so that they may be easily accessed and used for linguistic research. Sections 5 and 6 look at accompanying bespoke resources for processing this, or other, corpus data. Finally, Section 7 suggests possible avenues for future data release and development.
Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available learner corpora, supplemented by a useful set of tools and tutorials for accessing these data. For information regarding publications and presentations based on PELIC data, as well as for information regarding the people and parties responsible for the corpus, please visit the Pitt ELI Corpus repository.
PELIC is based on data collected from students at the English Language Institute (ELI) at the University of Pittsburgh from 2005 to 2012 as part of the National Science Foundation project housed at Pitt and CMU. The Intensive English Program (IEP) at the ELI was one of seven site partners with the Pittsburgh Science of Learning Center that provided in vivo research contexts. Part of this research was collecting data from learners; other components were experimental. The IEP data include written and spoken production from writing classes, grammar classes, reading classes, and speaking classes. At present, the written data are publicly available (though see Future Data Release).
Table 1 provides the basic statistics regarding the composition and size of PELIC, and Figure 1 (Juffs, 2020) presents a snapshot of the first languages (L1s) of the students. The L1s represented by the largest number of students are Arabic, Chinese, Japanese, Korean, Spanish, and Turkish. Levels of proficiency in the dataset range from Level 2 (approximately equal to the Common European Framework (CEFR) A2) to Level 5 (CEFR B2/C1). There are few Level 2 students, as the ELI did not regularly offer that level during the period of data collection. Students contributed data in all skill areas, so ultimately researchers will be able to analyze the data from many students in many skill areas.
Table 1 - PELIC composition
PELIC | Total N |
---|---|
students | 1177 |
texts | 46230 |
tokens | 4250703 |
word types | 39623 |
lemma types | 39307 |
Figure 1 - Number of students by L1
Figure 2 (Juffs, 2020) gives an idea of the number of texts from writing assignments for each language group at each level. Note that many students contributed to several levels (e.g., 3, 4, and 5), making it possible to track the same students' development longitudinally over several semesters and across multiple skills. Furthermore, many assignments have several versions, with revisions based on teacher feedback, so that uptake of teacher comments and the influence on language development can be investigated (Tables 2 and 3).
Figure 2 - Number of texts by L1
Table 2 - Number of text versions
Text versions | Texts (N) |
---|---|
1 | 41589 |
2 | 4032 |
3 | 583 |
4+ | 22 |
Table 3 - Number of semesters attended
Semesters attended | Students (N) |
---|---|
1 | 529 |
2 | 373 |
3 | 204 |
4 | 53 |
5 | 16 |
6 | 2 |
Average (mean) | 1.86 |
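The mean in the last row of Table 3 can be checked directly from the counts above. The short calculation below is only a worked arithmetic example; the variable names are ours and do not come from the corpus files.

```python
# Counts copied from Table 3: semesters attended -> number of students.
semester_counts = {1: 529, 2: 373, 3: 204, 4: 53, 5: 16, 6: 2}

# Total number of students represented in the table.
total_students = sum(semester_counts.values())

# Total student-semesters: each student contributes the number of semesters attended.
total_semesters = sum(sem * n for sem, n in semester_counts.items())

mean_semesters = total_semesters / total_students

print(total_students)            # 1177
print(round(mean_semesters, 2))  # 1.86
```

The student total also matches the number of students reported in Table 1.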
Originally, the collected data resided on a server: relational data and textual data (such as paragraphs of text students wrote) were kept in a MySQL database, and all external files that went with them, such as MS Word documents and recorded audio files, were stored separately.
A web interface running MySQL queries with a drop-down menu then let researchers specify parameters of interest and access textual and audio data.
However, issues with server maintenance and advances in tools for corpus analysis led us to adopt a different approach: publishing the textual portion of the dataset in the form of `.csv` files, which researchers can then analyze in full with the aid of Python or R.
The initial stage of data processing focused on conversion and clean-up: tasks included data conversion, cleaning, standardization, data-point culling and anonymization.
First, the MySQL tables were converted to `.csv` (comma-separated values) format.
Spurious users (teachers, admins, test accounts) were purged, individual users were assigned unique identifiers, and their personal information entries in the database were removed.
Spurious data rows were likewise purged (deleted entries, test runs, etc.), and data fields with little information value were dropped.
Data column values were examined, cleaned up and converted into standardized values (for example, "home language" values were full of misspellings and variations).
In the second stage of data processing, we started to reach into the textual (i.e., corpus) data content and apply deep cleaning.
Excessive use of empty lines or symbols for formatting purposes (`******`, etc.) was pared down, and instances of defunct `\r` line breaks were removed.
All Unicode-based punctuation was converted to ASCII-based (`’` to `'`, for example).
A particularly vexing problem involved the appearance of the `?` character in place of `'` or some other punctuation in many of the student-written texts, which we suspect occurred during a particular period when the text encoding of the data collection system was misconfigured.
Fixing this involved a combination of automation and manual correction, as genuine tokens of `?` and broken characters were not always easily distinguished.
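The deep-cleaning steps above could be sketched roughly as follows. This is not the code used to build PELIC: the function name `clean_text`, the character mapping, and the thresholds for "excessive" runs are all assumptions made for illustration, and the manual repair of broken `?` characters is not covered.

```python
import re

# Illustrative mapping only; the full set of characters converted for PELIC
# is not documented here.
UNICODE_TO_ASCII = {
    "\u2019": "'",   # right single quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def clean_text(text):
    """Hypothetical helper: normalize line breaks, punctuation and decoration."""
    # Turn Windows line endings into \n, then drop any stray carriage returns.
    text = text.replace("\r\n", "\n").replace("\r", "")
    # Convert Unicode punctuation to ASCII equivalents.
    for uni_char, ascii_char in UNICODE_TO_ASCII.items():
        text = text.replace(uni_char, ascii_char)
    # Assumed threshold: collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Assumed threshold: shorten runs of four or more asterisks to three.
    text = re.sub(r"\*{4,}", "***", text)
    return text

print(repr(clean_text("It\u2019s fine\r\n\n\n\n******")))
```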
The next stage focused on anonymization within text. Certain textual units such as website URLs and email addresses were rounded up and converted into the place-holder tags `ANON_URLPAGE` and `ANON_EMAIL`.
Secondly, mentions of personal names of students and teachers were identified and replaced with `ANON_NAME_0`. Some texts contained mentions of multiple different personal names; in such cases, we differentiated them as `ANON_NAME_1`, `ANON_NAME_2`, etc. so as to keep such references distinguished.
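As a rough illustration of the placeholder scheme described above, the sketch below replaces URLs, email addresses, and a supplied list of names. The regular expressions, the `anonymize` function, and the idea of passing names in as a list are all assumptions; how names were actually identified in PELIC is not described here.

```python
import re

# Simplified, hypothetical patterns; real-world URLs and emails are messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def anonymize(text, names):
    """Replace emails, URLs and the given personal names with placeholder tags."""
    # Emails go first so that a domain inside an address is not treated as a URL.
    text = EMAIL_RE.sub("ANON_EMAIL", text)
    # Note: \S+ also swallows punctuation directly attached to the URL.
    text = URL_RE.sub("ANON_URLPAGE", text)
    # Each distinct name in the text gets its own index: ANON_NAME_0, ANON_NAME_1, ...
    for i, name in enumerate(names):
        text = re.sub(r"\b" + re.escape(name) + r"\b", f"ANON_NAME_{i}", text)
    return text

sample = "Ask Maria or John at john@example.com or www.example.com/page"
print(anonymize(sample, ["Maria", "John"]))
# Ask ANON_NAME_0 or ANON_NAME_1 at ANON_EMAIL or ANON_URLPAGE
```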
Finally, the last stage involved adding interpretive layers to the text, i.e., some basic levels of linguistic information such as number of tokens, tokenization, part-of-speech (POS) tags, and lemmatization.
Tokenization. We adopted NLTK's tokenization scheme, which is based on Penn Treebank conventions. This scheme has long been the standard within the natural language processing (NLP) community, so following it is crucial for applying popular NLP applications to our text. We augmented the scheme with additional pre- and post-processing. The pre-processing normalized punctuation, which, as one might expect, was highly irregular in learners' writing and contributed to tokenization errors. In post-processing, we further separated punctuation and tokens that had not been split properly. Importantly, we selectively broke up hyphenated tokens that should not be treated as a single lexical item. In deciding what constitutes a single lexical unit (e.g., well-known, so-called) and what does not (e.g., coffee-loving, twelve-foot-long), we consulted COCA's list of the top 100,000 most frequent words, treating tokens found in the list as single units.
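The selective splitting of hyphenated tokens might look something like the sketch below. The two-word `KNOWN_HYPHENATED` set is a stand-in for the COCA list, and keeping each hyphen as its own token is our assumption; the exact output format used for PELIC is not specified here.

```python
import re

# Hypothetical stand-in for hyphenated entries found in the COCA word list.
KNOWN_HYPHENATED = {"well-known", "so-called"}

def split_hyphenated(tokens, lexicon=KNOWN_HYPHENATED):
    """Split hyphenated tokens unless they are listed as single lexical units."""
    result = []
    for tok in tokens:
        if "-" in tok and tok.lower() not in lexicon:
            # The capturing group keeps the hyphens; empty strings are dropped.
            result.extend(part for part in re.split(r"(-)", tok) if part)
        else:
            result.append(tok)
    return result

print(split_hyphenated(["a", "well-known", "coffee-loving", "man"]))
# ['a', 'well-known', 'coffee', '-', 'loving', 'man']
```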
There was one remaining issue with adopting NLTK's tokenization scheme: it famously separates out all symbols and punctuation into their own tokens (`,` and `...` in the example below), which means its token count is greatly inflated compared to what is commonly thought of as "word count" in general and, further, to the concept of "text length" within the SLA community.
>>> import re
>>> import nltk  # word_tokenize requires NLTK's tokenizer models to be installed
>>> sent = "Well, Jenny didn't like Tom's shirt..."
>>> nltk.word_tokenize(sent)           # NLTK's tokenizer: 10 tokens
['Well', ',', 'Jenny', 'did', "n't", 'like', 'Tom', "'s", 'shirt', '...']
>>> re.findall(r"[A-Za-z_]+", sent)    # RE tokenizer: 8 tokens
['Well', 'Jenny', 'didn', 't', 'like', 'Tom', 's', 'shirt']
Because of this mismatch, we felt it necessary to provide a secondary token count that more closely reflects the common expectations.
One popular, robust and lexicon-agnostic method for tokenization is based on regular expressions.
`r"[A-Za-z_]+"` matches any stretch of alphabetic characters with `_` allowed inside (so that place-holder tokens such as `ANON_EMAIL` are matched as a whole).
As a result, the word counts of texts using Regular Expression (RE) based tokenization are smaller and reflect more closely how words are counted in the field of applied linguistics; in NLP, removal of punctuation marks is a common and important preprocessing step (Etaiwi & Naymat, 2017).
The example above showcases the tokenization of a short text using these two different methods. As we can see, the length of the sentence differs noticeably (10 vs. 8 tokens) depending on whether punctuation such as the comma and the ellipsis is counted as tokens.
However, NLTK-based tokenization is adopted for all other purposes as it allows for other NLTK-based processing, e.g. part-of-speech tagging and lemmatization.
In all future references to text lengths, we use these `re`-based token counts as reported in the `text_len` column in `answer.csv`. In using our dataset, we hope the research community will likewise take care to use this measure, especially in computing text-length-dependent metrics.
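To make the role of the regular-expression count concrete, here is a small sketch of a length-dependent metric computed against it. The helper names `re_token_count` and `rate_per_100` are ours, and the per-100-words rate is just one example of such a metric.

```python
import re

# Same pattern as above: alphabetic stretches, with "_" allowed inside.
TOKEN_RE = re.compile(r"[A-Za-z_]+")

def re_token_count(text):
    """Text length as the number of regular-expression tokens."""
    return len(TOKEN_RE.findall(text))

def rate_per_100(text, word):
    """Occurrences of `word` per 100 RE tokens (case-insensitive)."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() == word.lower())
    return 100 * hits / len(tokens)

print(re_token_count("Well, Jenny didn't like Tom's shirt..."))  # 8
print(re_token_count("Write to ANON_EMAIL"))                     # 3
print(rate_per_100("The cat saw the dog", "the"))                # 40.0
```

Note that this pattern does not match digits, so a string such as "2 cats" counts as one token.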
Part-of-speech tagging. Producing part-of-speech (POS) tags was not a primary goal for us but simply a means to assist with lemmatization.
For example, in lemmatizing rose, knowing its POS (noun or verb) is critical in picking between rose and rise.
While there are plenty of high-accuracy POS taggers available for English, we settled on NLTK's built-in POS tagger (`nltk.pos_tag()`) using the Penn Treebank POS tagset.
The resulting POS tags were not checked for quality: for processing tasks relying on accurate POS tags, we recommend that users produce their own using state-of-the-art POS taggers.
Lemmatization. Since lemmas are one of the more fundamental and useful linguistic units within SLA research, we decided to add a lemma layer. To our surprise, finding a good off-the-shelf lemmatizer for English proved difficult. Within the NLP community, working with fully inflected English words as types is the standard approach, so NLP suites tend to lack lemmatizing functions altogether; spaCy provides one, but we found its output unreliable.
This brought us to take it upon ourselves to produce lemmas for the learner-written texts.
We relied on two key pieces of information: POS tags (rationale given above) and frequency. The latter was used for disambiguation: does can be lemmatized as do or doe ("female deer"), but the former is far more likely.
The COCA+ 100k word forms list proved a valuable resource, as it provided frequency ranks of English words with POS as well as lemma information, all compiled via automated processing.
We also utilized the Someya Lemma List, which contains fewer (14k) but manually curated hence more reliable entries.
We also created a supplementary lemma dictionary for items not covered by these two resources (e.g., `ANON_NAME_0`, n't, 've, Mr. etc. as legitimate lemmas).
The lemmatization process can be summarized as follows: look up the token in our supplementary lemma dictionary; if not found, look it up in COCA and Someya; if there are multiple lemma candidates, refer to the token's POS; if still ambiguous, choose the most frequent lemma/POS; if the token is not found in any of these lists, output the original token form as the lemma. As a spot check of the lemmatizer's accuracy, 10 texts of over 50 words in length (2231 tokens total) were manually lemmatized. When compared to the automated lemmatization, there was a 99.3% agreement rate (2216/2231), indicating high reliability. Of the 15 items which were mis-lemmatized, the most common issue involved forms ending in -ing, which can be a noun form (keeping the -ing), a verb form (removing the -ing), or an adjective form (keeping the -ing). Context is important for determining the correct lemma form in such cases, and with student language, grammatical errors can make the intended form difficult to decipher.
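The lookup cascade summarized above can be sketched as follows. This is a simplified illustration rather than the actual PELIC lemmatizer: the dictionaries are tiny invented stand-ins for the supplementary dictionary and the COCA/Someya lists, the frequencies are made up, and reducing Penn Treebank tags to their first letter is our own simplification.

```python
# Hypothetical stand-in for the supplementary lemma dictionary.
SUPPLEMENT = {"n't": "n't", "'ve": "'ve", "ANON_NAME_0": "ANON_NAME_0"}

# Hypothetical stand-in for COCA/Someya: form -> (lemma, coarse POS, frequency).
LEMMA_LIST = {
    "does": [("do", "V", 1000), ("doe", "N", 5)],
    "rose": [("rise", "V", 300), ("rose", "N", 200)],
    "met": [("meet", "V", 800)],
}

def coarse_pos(penn_tag):
    """Simplification: 'VBD' -> 'V', 'NN' -> 'N', and so on."""
    return penn_tag[:1]

def lemmatize(token, penn_tag):
    # 1. Supplementary dictionary takes priority.
    if token in SUPPLEMENT:
        return SUPPLEMENT[token]
    # 2. Otherwise look the form up in the word lists.
    candidates = LEMMA_LIST.get(token.lower())
    # 5. Not found anywhere: fall back to the original token form.
    if not candidates:
        return token
    # 3. Several candidates: prefer those whose POS matches the tag.
    if len(candidates) > 1:
        matching = [c for c in candidates if c[1] == coarse_pos(penn_tag)]
        if matching:
            candidates = matching
    # 4. Still ambiguous (or no POS match): take the most frequent candidate.
    return max(candidates, key=lambda c: c[2])[0]

print(lemmatize("rose", "NN"))   # rose
print(lemmatize("rose", "VBD"))  # rise
print(lemmatize("does", "VBZ"))  # do
```

Case handling is left out of this sketch; the snippet from the published data shows lowercased lemmas (e.g., 'Nife' paired with 'nife'), which the fallback branch here does not reproduce.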
The `tok_lem_POS` column in the `answer.csv` file contains the triple `(token, lemma, POS)`. A snippet from the very first entry:
('I', 'i', 'PRP'), ('met', 'meet', 'VBD'), ('my', 'my', 'PRP$'), ('friend', 'friend', 'NN'),
('Nife', 'nife', 'NNP'), ('while', 'while', 'IN'), ('I', 'i', 'PRP'), ('was', 'be', 'VBD'),
('studying', 'study', 'VBG'), ('in', 'in', 'IN'), ('a', 'a', 'DT'), ('middle', 'middle', 'JJ'),
('school', 'school', 'NN'), ('.', '.', '.')
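When the file is read as plain text, each `tok_lem_POS` cell arrives as a string. Assuming the cell holds the string form of a Python list of tuples (an assumption worth checking against the actual file), it can be turned back into tuples with `ast.literal_eval`. The shortened cell below is adapted from the snippet above.

```python
import ast

# A shortened cell value, assumed to be stored as the text of a Python list.
cell = "[('I', 'i', 'PRP'), ('met', 'meet', 'VBD'), ('my', 'my', 'PRP$'), ('friend', 'friend', 'NN')]"

# literal_eval safely parses Python literals without executing arbitrary code.
triples = ast.literal_eval(cell)

# Pull out individual layers from the (token, lemma, POS) triples.
lemmas = [lemma for _tok, lemma, _pos in triples]
verbs = [tok for tok, _lem, pos in triples if pos.startswith("VB")]

print(lemmas)  # ['i', 'meet', 'my', 'friend']
print(verbs)   # ['met']
```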
There are five files in the `corpus_files` folder which contain all of the corpus texts, information about the texts, and information about the students: `answer.csv`, `course.csv`, `question.csv`, `student_information.csv`, and `test_scores.csv`.
In addition, there is a csv file, `PELIC_compiled.csv`, in the home directory, which combines data from the various corpus files. (For a tutorial on how to build `PELIC_compiled.csv`, please see Tutorials.)
level_id | Level description | CEFR level |
---|---|---|
2 | Pre-Intermediate | A2/B1 |
3 | Intermediate | B1 |
4 | Upper-Intermediate | B1+/B2 |
5 | Advanced | B2+/C1 |
class_id | Class description |
---|---|
g | Grammar |
l | Listening |
r | Reading |
s | Speaking |
w | Writing |
question_type_id | Question type |
---|---|
1 | Paragraph writing |
2 | Short answer |
3 | Multiple choice |
4 | Essay |
5 | Fill-in-the-blank |
6 | Sentence completion |
7 | Word bank |
8 | Chart |
9 | Word selection |
10 | Audio recording |
`answer.csv` is the largest file in the dataset, containing all of the written texts; that is, PELIC texts are not stored as separate .txt files. `answer.csv` is organized such that each row is a text with a unique identifier, the answer_id.
There are 9 columns in total, providing the text in various raw and processed forms, and information regarding the source of the text:
Column | Column name | Description |
---|---|---|
A | answer_id | a unique identifier for each text - a 1-5 digit integer, e.g. 19399 |
B | question_id | a code which links to question.csv , containing task information |
C | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
D | course_id | a code which links to course.csv , containing course information, e.g. level, class type, semester |
E | version | the version number of the text (typically 1, 2 or 3) |
F | text_len | the number of tokens using re-based tokenization |
G | text | the raw text produced by the student (as a single string) |
H | tokens | the tokenized text using NLTK-based tokenization (each token is a string) |
I | tok_lem_POS | a list of three-part tuples - the token, lemma, and part of speech for each token in column H |
`course.csv` contains information about every course in which PELIC texts were produced. `course.csv` is organized such that each row is a unique course with a unique identifier, the course_id. There are five columns:
Column | Column name | Description |
---|---|---|
A | course_id | a unique identifier for each course - a 1-4 digit integer, e.g. 987 |
B | class_id | a code to identify in which of the five class types the text was produced (see Glossary above) |
C | level_id | a code to identify in which of the four levels the text was produced (see Glossary above) |
D | semester | the year and semester (fall, spring, summer) in which the text was produced |
E | section | the class section as there are sometimes multiple identical classes running in parallel |
`question.csv` contains information about the questions, tasks, or prompts that the texts are based on. `question.csv` is organized such that each row is a unique question/task/prompt with a unique identifier, the question_id. There are four columns:
Column | Column name | Description |
---|---|---|
A | question_id | a unique identifier for each question/task/prompt - a 1-4 digit integer, e.g. 6107 |
B | question_type_id | a code to identify the type of task (see Glossary above) |
C | stem | the text for the question/task/prompt |
D | allow_text | tasks which allow students to write an answer (like essays) are 1, tasks where students do not write an answer (like choosing a word from a word bank) are 0 |
`student_information.csv` is a large file, containing all of the information about the students. `student_information.csv` is organized such that each row is a student with a unique anonymous identifier, the anon_id.
There are 21 columns in total, providing all available information about the students relating to their background and history of language learning:
Column | Column name | Description |
---|---|---|
A | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
B | gender | 'Male', 'Female', or 'Unknown' based on student responses |
C | birth year | four digit year |
D | native language | students input their own first language (not from a drop-down menu) |
E | language_used_at-home | language used at home in their home country, not in the US |
F | non-native_language_1 | the L2 with which the student feels they have the highest proficiency |
G | yrs_of_study_lang1 | the number of years the student has studied the L2 provided in column F |
H | study_in_classroom_lang1 | whether or not the student studied their L2 from column F in a classroom setting ('yes' or 'no') |
I | ways_of_study_lang1 | students selected from a menu how they studied their L2 from column F, e.g. Practiced reading aloud |
J, N | non-native_language_2, 3 | same as column F but for an additional non-native language |
K, O | yrs_of_study_lang2, 3 | same as column G but for an additional non-native language |
L, P | study_in_classroom_lang2, 3 | same as column H but for an additional non-native language |
M, Q | ways_of_study_lang2, 3 | same as column I but for an additional non-native language |
R | course_history | a list of all the courses attended (course_id codes) |
S | yrs_of_english_learning | the number of years the student has been learning English, selected from a drop-down list |
T | yrs_in_english_environment | the number of years the student has lived in an English-speaking environment, selected from a drop-down list |
U | age | the student's age at the time of enrollment |
`test_scores.csv` contains information about students' test scores from their initial placement tests upon entering the ELI. `test_scores.csv` is organized such that each row is a unique student with a unique identifier, the anon_id. There are 18 columns which provide scores for the different components of the placement test:
Column | Column name | Description |
---|---|---|
A | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
B | LCT_Form_1 | in-house listening test (LCT) version number, first time test taken |
C | LCT_Score_1 | in-house listening test (LCT) score, first time test taken |
D | MTELP_Form_1 | Michigan Test of English Language Proficiency (MTELP) version number |
E | MTELP_I_1 | MTELP Grammar section |
F | MTELP_II_1 | MTELP Reading section |
G | MTELP_III_1 | MTELP Listening section |
H | MTELP_Conv_Score_1 | MTELP total combined score |
I | Writing_Sample_1 | in-house writing test score (scale of 1-6) |
J - Q | Same as B-I | Same as columns B-I but for the second time students took the tests |
`PELIC_compiled.csv` is a compilation of the files described above. Like `answer.csv`, `PELIC_compiled.csv` is organized such that each row is a unique text with a unique identifier, the answer_id. Accompanying each text are data relating to the author (from `student_information.csv`) and the course (`course.csv`). These columns have been selected for their usefulness in conducting linguistic analysis. However, other columns may be added or deleted as desired; see the build_PELIC_compiled tutorial for how to create and manipulate `PELIC_compiled.csv`.
There are 13 columns in the pre-supplied version of `PELIC_compiled.csv` in the repository:
Column | Column name | Source |
---|---|---|
A | answer_id | answer.csv column A |
B | anon_id | answer.csv column C |
C | L1 | student_information.csv column D |
D | gender | student_information.csv column B |
E | course_id | course.csv column A |
F | level_id | course.csv column C |
G | class_id | course.csv column B |
H | question_id | answer.csv column B |
I | version | answer.csv column E |
J | text_len | answer.csv column F |
K | text | answer.csv column G |
L | tokens | answer.csv column H |
M | tok_lem_POS | answer.csv column I |
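The way the compiled table draws on the individual files can be illustrated with a small join. The sketch below uses only the standard library and tiny in-memory stand-ins; every id and value in it is invented, the column headers follow the tables above but may not match the real files exactly, and the build_PELIC_compiled tutorial remains the authoritative procedure.

```python
import csv
import io

# Invented miniature versions of three corpus files (subset of columns).
ANSWER_CSV = """answer_id,question_id,anon_id,course_id,version,text_len
1,12,eq0,149,1,177
2,13,am8,115,1,137
"""
STUDENT_CSV = """anon_id,gender,native language
eq0,Male,Arabic
am8,Female,Thai
"""
COURSE_CSV = """course_id,class_id,level_id
149,g,4
115,w,4
"""

def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def index_by(rows, key):
    """Build a lookup table from a unique identifier to its row."""
    return {row[key]: row for row in rows}

students = index_by(read_rows(STUDENT_CSV), "anon_id")
courses = index_by(read_rows(COURSE_CSV), "course_id")

# One output row per text, enriched with student and course information.
compiled = []
for answer in read_rows(ANSWER_CSV):
    student = students[answer["anon_id"]]
    course = courses[answer["course_id"]]
    compiled.append({
        "answer_id": answer["answer_id"],
        "anon_id": answer["anon_id"],
        "L1": student["native language"],
        "gender": student["gender"],
        "course_id": answer["course_id"],
        "level_id": course["level_id"],
        "class_id": course["class_id"],
        "question_id": answer["question_id"],
        "version": answer["version"],
        "text_len": answer["text_len"],
    })

print(compiled[0]["L1"], compiled[0]["class_id"])  # Arabic g
```

All values stay as strings here because `csv.DictReader` does not convert types; numeric columns such as `text_len` would need explicit conversion before any arithmetic.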
The `corpus_stats` folder currently contains PELIC frequency statistics. All of these frequency data can be calculated from the original files in the `corpus_files` folder or `PELIC_compiled.csv`. However, for quicker access to frequency information, the files in this folder may be useful.
The `corpus_stats` folder contains the following files:
File | Description |
---|---|
README.md | a README file containing a description of the folder contents |
frequency_stats.ipynb | a Jupyter notebook describing how word_frequencies.csv and lemma_frequencies.csv were created |
word_frequencies.csv | a csv file containing the total frequency and per million frequency for every word in PELIC |
lemma_frequencies.csv | a csv file containing the total frequency and per million frequency for every lemma in PELIC |
- Distributions do not take capitalization into account – a capitalized word and the same non-capitalized word will go towards the same count.
- Frequencies are based on the NLTK-based tokenized text tokens. As described in Section 3, punctuation is therefore also included in the distributions. If considering frequency ranking (for example for frequency bands), it is important to first exclude punctuation.
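The two notes above can be illustrated with a short sketch: counts are case-insensitive, and punctuation tokens are dropped before ranking. The function names are ours, and the choice to keep punctuation in the per-million denominator is an assumption; see `frequency_stats.ipynb` for how the published files were actually produced.

```python
from collections import Counter
import string

def word_frequencies(tokens):
    """Map each lowercased token to (raw count, frequency per million tokens)."""
    lowered = [tok.lower() for tok in tokens]
    counts = Counter(lowered)
    total = len(lowered)  # assumption: punctuation tokens count toward the total
    return {word: (n, n * 1_000_000 / total) for word, n in counts.items()}

def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)

def ranked_words(freqs):
    """Rank non-punctuation tokens by count (ties broken alphabetically)."""
    words = [(w, n) for w, (n, _per_million) in freqs.items() if not is_punctuation(w)]
    return sorted(words, key=lambda item: (-item[1], item[0]))

tokens = ["The", "cat", ",", "the", "dog", "."]
freqs = word_frequencies(tokens)
print(ranked_words(freqs))  # [('the', 2), ('cat', 1), ('dog', 1)]
```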
The `tutorials` folder contains three tutorials relating to the PELIC dataset:
- `build_PELIC_compiled.ipynb`
  The build_PELIC_compiled notebook provides a tutorial for creating the `PELIC_compiled.csv` from the PELIC corpus files in the `corpus_files` folder. The final csv file is also available in the home directory. For more information on `PELIC_compiled.csv`, see Section 4.
- `exploratory_data_analysis.ipynb`
  The exploratory_data_analysis notebook provides a standard first step (EDA) in any data exploration and corpus analysis. It presents and demonstrates basic statistics of PELIC's composition, including the figures and statistics presented in this file. The statistics relate to aspects of the corpus such as the students, their first languages, their genders, the classes and semesters, and of course the texts themselves.
- `PELIC_concordancing_tutorial.ipynb`
  The PELIC_concordancing_tutorial notebook provides a short example of the type of linguistic investigation that can be carried out with the data in PELIC. The focus of the investigation is a set of verbs which are important indicators of syntactic complexity. The tutorial has two aims:
  - to present a straightforward and replicable way of accessing and processing the corpus data necessary to answer genuine research questions, using tools from the Pitt ELI Toolkit (pelitk)
  - to demonstrate how to build a concordance list and dataframe using the PELIC data
`pelitk` is a Python package that contains implementations of various lexical analysis tools that are useful for Second Language Acquisition (SLA) work. These modules can be imported and used in Python. At present, there are two modules available:
- `conc.py` - functions for creating concordances to show selected key words in context
- `lex.py` - functions measuring lexical sophistication and diversity using a range of indices
For details of `pelitk` contents and example usage, please see the `pelitk` repo README.md file.
`PELIC-spelling` is a repository which provides information and code about applying spelling correction to the PELIC dataset. Spelling correction is an important element to consider in any corpus study involving learner data. The decision whether to correct texts or not will invariably impact results: in some instances it may be preferable to use the raw text, maintaining its integrity and avoiding an additional layer of processing. However, for other projects, corrected text may provide a more accurate representation of the language features being investigated.
For details of `PELIC-spelling` contents and example usage, please see the `PELIC-spelling` repo README.md file.
The spoken data from speaking classes will be available in both .wav format (analyzable in PRAAT) and .mp3 format and will include the students' transcriptions of their own spoken data. A publication based on a small subset of these data is Vercellotti (2017).
- Recorded Speaking and Grammar Activities (.wav files): Arabic 20,678; Chinese 9,870; Japanese 3,564; Korean 11,827
- A small subset of these files, annotated in CHAT/CLAN, and a list of published research are available at Talkbank.org.
- Callies, M. (2015). Learner Corpus Methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (pp. 35-55). New York: Cambridge University Press.
- Etaiwi, W., & Naymat, G. (2017). The Impact of applying Different Preprocessing Steps on Review Spam Detection. Procedia Computer Science, 113, 273-279. https://doi.org/10.1016/j.procs.2017.08.368
- Juffs, A. (2020). Aspects of Language Development in an Intensive English Program. New York: Routledge.
- Leńko-Szymańska, A. (2019). Defining and Assessing Lexical Proficiency. Routledge.
- Sinclair, J. (2003). Reading concordances: an introduction. London; New York: Pearson/Longman.
- Vercellotti, M. L. (2017). The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38, 90-111.
PELIC dataset by Alan Juffs, Na-Rae Han, Ben Naismith is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on a work at https://github.com/ELI-Data-Mining-Group/PELIC_dataset.