
The University of Pittsburgh English Language Institute Corpus (PELIC)

Version 1.0
Authors: Alan Juffs, Na-Rae Han, Ben Naismith
Contact: [email protected]


This repository contains the dataset, as well as additional tools and tutorials, for the University of Pittsburgh English Language Institute Corpus (PELIC).

Corpus citation:
Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. http://doi.org/10.5281/zenodo.3991977


Table of contents

  1. Overview
  2. Corpus description
  3. Data collection and processing
  4. Dataset contents
  5. Additional resources
  6. Pitt ELI Toolkit (pelitk)
  7. PELIC spelling
  8. Future data release
  9. References
  10. License

1. Overview

This README.md file introduces the dataset for the Pittsburgh English Language Institute Corpus (PELIC), a large learner corpus of written and spoken texts. These texts were collected in an English for Academic Purposes (EAP) context over seven years in the University of Pittsburgh’s Intensive English Program, and were produced by students with a wide range of linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are cross-sectional (Callies, 2015), PELIC is quasi-longitudinal, offering greater opportunities for tracking development in a natural classroom setting.

In Section 2 we describe some of the key characteristics of this corpus, and Section 3 addresses how the data were collected and processed. Section 4 provides information about each of the corpus files in the repository so that they may be easily accessed and used for linguistic research. Sections 5, 6 and 7 look at accompanying bespoke resources for processing this, or other, corpus data. Finally, Section 8 suggests possible avenues for future data release and development.

Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available learner corpora, supplemented by a useful set of tools and tutorials for accessing these data. For information regarding publications and presentations based on PELIC data, as well as for information regarding the people and parties responsible for the corpus, please visit the Pitt ELI Corpus repository.


2. Corpus description

PELIC is based on data collected from students at the English Language Institute (ELI) at the University of Pittsburgh from 2005-2012 as part of the National Science Foundation project housed at Pitt and CMU. The Intensive English Program (IEP) at the ELI was one of seven site partners with the Pittsburgh Science of Learning Center that provided in vivo research contexts. Part of this research was collecting data from learners; other components were experimental. The IEP data include written and spoken production from writing classes, grammar classes, reading classes, and speaking classes. At present, the written data is publicly available (though see Future Data Release).

Table 1 provides the basic statistics regarding the composition and size of PELIC, and Figure 1 (Juffs, 2020) presents a snapshot of the first languages (L1s) of the students. L1s represented by the largest number of students are Arabic, Chinese, Japanese, Korean, Spanish, and Turkish. Levels of proficiency in the dataset range from Level 2 (approximately equal to the Common European Framework of Reference (CEFR) A2) to Level 5 (CEFR B2/C1). There are few Level 2 students as the ELI did not regularly offer that level during the period of data collection. Students contributed data in all skill areas, so ultimately researchers will be able to analyze the data from many students in many skill areas.

Table 1 - PELIC composition

| PELIC | Total N |
| --- | --- |
| students | 1177 |
| texts | 46230 |
| tokens | 4250703 |
| word types | 39623 |
| lemma types | 39307 |

Figure 1 - Number of students by L1

The number of texts from writing assignments for each language group at each level can be seen in Figure 2 (Juffs, 2020). Note that many students contributed to several levels (e.g., 3, 4, and 5), making it possible to track the same students' development longitudinally over several semesters and across multiple skills. Furthermore, many assignments have several versions, with revisions based on teacher feedback, so that uptake of teacher comments and the influence on language development can be investigated (Tables 2 and 3).

Figure 2 - Number of texts by L1


Table 2 - Number of text versions

| Text versions | Texts (N) |
| --- | --- |
| 1 | 41589 |
| 2 | 4032 |
| 3 | 583 |
| 4+ | 22 |

Table 3 - Number of semesters attended

| Semesters attended | Students (N) |
| --- | --- |
| 1 | 529 |
| 2 | 373 |
| 3 | 204 |
| 4 | 53 |
| 5 | 16 |
| 6 | 2 |
| Average (mean) | 1.86 |
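
As a quick consistency check, the mean in Table 3 can be recomputed from the per-semester counts. The sketch below only uses the numbers printed in the table; the variable names are ours.

```python
# Students per number of semesters attended, copied from Table 3.
semesters_to_students = {1: 529, 2: 373, 3: 204, 4: 53, 5: 16, 6: 2}

total_students = sum(semesters_to_students.values())
total_semesters = sum(n * count for n, count in semesters_to_students.items())
mean_semesters = total_semesters / total_students

print(total_students)            # 1177, matching the student total in Table 1
print(round(mean_semesters, 2))  # 1.86
```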

3. Data collection and processing

Originally, the collected data resided on a server: relational data and textual data (such as paragraphs of text students wrote) were kept in a MySQL database, and the external files that went with them, such as MS Word documents and recorded audio files, were stored separately. A web interface that ran MySQL queries behind drop-down menus let researchers specify the parameters they were interested in and access the textual and audio data. However, issues with server maintenance and advances in tools for corpus analysis led us to adopt a different approach: publishing the textual portion of the dataset as .csv files, which researchers can then analyze in full with the aid of Python or R.

3.1. Conversion, clean-up and culling of data entries

The initial stage of data processing focused on conversion and clean-up: tasks included data conversion, cleaning, standardization, data-point culling and anonymization. First, the MySQL tables were converted to .csv (comma-separated values) format. Spurious users (teachers, admins, test accounts) were purged, individual users were assigned unique identifiers, and their personal information entries in the database were removed. Spurious data rows were likewise purged (deleted entries, test runs, etc.), and data fields with little information value were dropped. Data column values were examined, cleaned up and converted into standardized values (for example, "home language" values were full of misspellings and variations).
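
To illustrate the kind of value standardization described above, here is a minimal sketch that maps free-text language entries onto a standard label. The variant spellings in `LANGUAGE_VARIANTS` and the function name `standardize_language` are hypothetical; the actual mapping used for PELIC is not documented here.

```python
# Hypothetical variant spellings, invented for illustration only.
LANGUAGE_VARIANTS = {
    "arabic": "Arabic",
    "arbic": "Arabic",
    "arabic.": "Arabic",
    "korean": "Korean",
    "koren": "Korean",
}

def standardize_language(raw):
    """Map a free-text language entry to a standard label, or None if unknown."""
    # Trim, lowercase and collapse internal whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return LANGUAGE_VARIANTS.get(key)

print(standardize_language("  ARBIC "))  # Arabic
print(standardize_language("Klingon"))   # None: left for manual review
```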

3.2. Cleaning student-written text and anonymization

In the second stage of data processing, we turned to the textual (i.e., corpus) content itself and applied deep cleaning. Excessive use of empty lines or symbols for formatting purposes (******, etc.) was pared down, and instances of defunct \r line breaks were removed. All Unicode-based punctuation was converted to its ASCII-based equivalent (’ to ', for example). A particularly vexing problem involved the appearance of the ? character in place of ' or some other punctuation mark in many of the student-written texts, which we suspect occurred during a particular period when the text encoding of the data collection system was misconfigured. Fixing this involved a combination of automation and manual correction, as genuine tokens of ? and broken characters were not always easily told apart.
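
The automated part of this clean-up could look roughly like the sketch below. It is not the code used to build PELIC: the character mapping is an assumed subset, the thresholds for "excessive" symbols and blank lines are our own choice, and the repair of misencoded ? characters is left out because it required manual checking.

```python
import re

# Assumed subset of Unicode punctuation; the full mapping used for PELIC is not given.
UNICODE_TO_ASCII = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
}

def clean_text(text):
    """Apply the automated clean-up steps to one student text."""
    # Remove defunct carriage returns (this also turns \r\n into \n).
    text = text.replace("\r", "")
    for uni_char, ascii_char in UNICODE_TO_ASCII.items():
        text = text.replace(uni_char, ascii_char)
    # Pare down decorative runs of three or more identical symbols to one.
    text = re.sub(r"([*=_~#-])\1{2,}", r"\1", text)
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text

print(repr(clean_text("I don\u2019t know\r\n******\n\n\n\nEnd")))
# "I don't know\n*\n\nEnd"
```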

The next stage focused on anonymization within text. Certain textual units such as website URLs and email addresses were rounded up and converted into the place-holder tags ANON_URLPAGE and ANON_EMAIL. Secondly, mentions of personal names of students and teachers were identified and replaced with ANON_NAME_0. Some texts contained mentions of multiple different personal names; in such cases, we differentiated them as ANON_NAME_1, ANON_NAME_2, etc. so as to keep such references distinguished.
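
A minimal sketch of this tagging step follows. The regular expressions are simplified assumptions, and `known_names` is a hypothetical input: how personal names were actually identified in PELIC is not described here. Numbering names in order of first appearance within a text is also our assumption.

```python
import re

# Simplified patterns, assumed for illustration.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def anonymize(text, known_names):
    """Replace emails, URLs and listed personal names with place-holder tags."""
    # Emails first, so that their domain part is never treated as a URL.
    text = EMAIL_RE.sub("ANON_EMAIL", text)
    text = URL_RE.sub("ANON_URLPAGE", text)

    tags = {}  # name -> tag, numbered by first appearance in this text

    def tag_for(match):
        name = match.group(0)
        if name not in tags:
            tags[name] = f"ANON_NAME_{len(tags)}"
        return tags[name]

    if known_names:
        # Longest names first so that a full name wins over a shorter one inside it.
        ordered = sorted(known_names, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
        text = re.sub(pattern, tag_for, text)
    return text

sample = "Maria emailed tom@example.com about http://example.com/page and Maria told Ken."
print(anonymize(sample, ["Maria", "Ken"]))
# ANON_NAME_0 emailed ANON_EMAIL about ANON_URLPAGE and ANON_NAME_0 told ANON_NAME_1.
```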

3.3. Linguistic processing of text

Finally, the last stage involved adding interpretive layers to the text, i.e., some basic levels of linguistic information such as number of tokens, tokenization, part-of-speech (POS) tags, and lemmatization.

Tokenization For tokenization, we adopted NLTK's scheme based on Penn Treebank, which has long been the standard within the natural language processing (NLP) community and is therefore crucial for applying popular NLP applications to our text. We augmented this scheme with additional pre- and post-processing. The pre-processing normalized punctuation, which, as one might expect, was highly irregular in learners' writing and contributed to tokenization errors. In post-processing, we further separated punctuation and tokens that had not been properly split. Importantly, we selectively broke up hyphenated tokens that should not be treated as a single lexical item. In deciding what constitutes a single lexical unit (e.g., well-known, so-called) and what does not (e.g., coffee-loving, twelve-foot-long), we consulted COCA's list of the top 100,000 most frequent words, treating hyphenated forms found in the list as belonging to the former group.
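
The selective splitting of hyphenated tokens can be sketched as below. `KNOWN_WORDS` is a toy stand-in for the COCA list, and keeping each hyphen as its own token after a split is an assumption on our part; the text above does not say how the pieces were emitted.

```python
import re

# Toy stand-in for the COCA word list, which is not bundled with this sketch.
KNOWN_WORDS = {"well-known", "so-called"}

def split_hyphenated(tokens, lexicon=KNOWN_WORDS):
    """Keep hyphenated tokens found in the lexicon; split all others at the hyphens."""
    result = []
    for token in tokens:
        has_inner_hyphen = "-" in token.strip("-")
        if has_inner_hyphen and token.lower() not in lexicon:
            # The capturing group keeps each hyphen as a separate piece.
            result.extend(piece for piece in re.split(r"(-)", token) if piece)
        else:
            result.append(token)
    return result

print(split_hyphenated(["a", "well-known", "coffee-loving", "poet"]))
# ['a', 'well-known', 'coffee', '-', 'loving', 'poet']
```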

There was one remaining issue with adopting NLTK's tokenization scheme: it famously separates out all symbols and punctuation into their own tokens (, and ... in the example below), which means its token count is greatly inflated compared to what is commonly thought of as "word count" in general, and to the concept of "text length" within the SLA community in particular.

>>> import re, nltk
>>> sent = "Well, Jenny didn't like Tom's shirt..."
>>> nltk.word_tokenize(sent)             # NLTK's tokenizer: 10 tokens
['Well', ',', 'Jenny', 'did', "n't", 'like', 'Tom', "'s", 'shirt', '...']
>>> re.findall(r"[A-Za-z_]+", sent)      # RE tokenizer: 8 tokens
['Well', 'Jenny', 'didn', 't', 'like', 'Tom', 's', 'shirt']

Because of this mismatch, we felt it necessary to provide a secondary token count that more closely reflects the common expectations. One popular, robust and lexicon-agnostic method for tokenization is based on regular expressions. r"[A-Za-z_]+" matches any stretch of alphabetic characters with _ allowed inside (so that place-holder tokens such as ANON_EMAIL are matched as a whole). As a result, the word counts of texts using Regular Expression (RE) based tokenization are smaller and reflect more closely how words are counted in the field of applied linguistics; in NLP, removal of punctuation marks is a common and important preprocessing step (Etaiwi & Naymat, 2017).

The example above showcases the tokenization of a short text using these two different methods. As we can see, there is a significant difference in the length of the sentence depending on whether the comma and period are considered to be tokens or not. However, NLTK-based tokenization is adopted for all other purposes as it allows for other NLTK-based processing, e.g. part-of-speech tagging and lemmatization. In all future references to text lengths, we use these re-based token counts as reported in the text_len column in answer.csv. In using our dataset, we hope the research community will likewise take proper caution to use this measure, especially in computing text-length-dependent metrics.
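
For users who want to recompute RE-based lengths on their own texts, a minimal helper using the same pattern might look like this. The function name is ours; it is meant to mirror the text_len column, not to reproduce the exact code that generated it.

```python
import re

WORD_RE = re.compile(r"[A-Za-z_]+")

def re_text_len(text):
    """Count stretches of letters and underscores, ignoring punctuation and digits."""
    return len(WORD_RE.findall(text))

print(re_text_len("Well, Jenny didn't like Tom's shirt..."))  # 8
print(re_text_len("Write to ANON_EMAIL today."))              # 4: the tag counts once
```

Note that, with this pattern, tokens made up only of digits (such as 2005) do not contribute to the count.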

Part-Of-Speech tagging Producing part-of-speech (POS) tags was not a primary goal for us but simply a means to assist with lemmatization. For example, in lemmatizing rose, knowing its POS (noun or verb) is critical in picking between rose and rise. While there are plenty of high-accuracy POS taggers available for English, we settled on NLTK's built-in POS tagger (nltk.pos_tag()) using the Penn Treebank POS tagset. The resulting POS tags were not checked for quality: for processing tasks that rely on accurate POS tags, we recommend that users produce their own using state-of-the-art POS taggers.

Lemmatization Since lemmas are one of the more fundamental and useful linguistic units within SLA research, we decided to add a lemma layer. To our surprise, finding a good off-the-shelf lemmatizer for English proved difficult. Within the NLP community, working with fully inflected English words as types is the standard approach, so NLP suites tend to lack lemmatizing functions altogether; SpaCy provides one, but we found its output unreliable.

This brought us to take it upon ourselves to produce lemmas for the learner-written texts. We relied on two key pieces of information: POS tags (rationale given above) and frequency. The latter was used for disambiguation: does can be lemmatized as do or doe ("female deer"), but the former is far more likely. The COCA+ 100k word forms list proved a valuable resource, as it provided frequency ranks of English words with POS as well as lemma information, all compiled via automated processing. We also utilized the Someya Lemma List, which contains fewer (14k) but manually curated hence more reliable entries. We also created a supplementary lemma dictionary not covered by these two resources (e.g., ANON_NAME_0, n't, 've, Mr. etc. as legitimate lemmas).

The lemmatization process can be summarized as follows: look up the token in our supplementary lemma dictionary; if not found, look it up in COCA and Someya; if there are multiple lemma candidates, refer to its POS; if still ambiguous, opt for the most frequent lemma/POS; if the token was not found in these lists, output the original token form as the lemma. As a spot check of the lemmatizer's accuracy, 10 texts of over 50 words in length (2231 tokens total) were manually lemmatized. When compared to the automated lemmatization process, there was a 99.3% agreement rate (2216/2231), indicating high reliability. Of the 15 items which were mis-lemmatized, the most common issue was with forms ending in -ing, which can be a noun form (keeping the ing), a verb form (removing the ing), or an adjective form (keeping the ing). Context is important for determining the correct lemma form in such cases, and with student language, grammatical errors can make the intended form difficult to decipher.
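
The lookup order summarized above can be sketched as follows. The dictionaries are toy stand-ins for the supplementary dictionary and the COCA and Someya lists, the frequency ranks are invented, and reducing Penn Treebank tags to their first letter is a crude assumption. Lowercasing the fallback form is also an assumption, made to mirror the ('Nife', 'nife', 'NNP') triple in the first entry of answer.csv.

```python
# Toy stand-in for the supplementary lemma dictionary.
SUPPLEMENTARY = {"n't": "n't", "'ve": "'ve", "ANON_NAME_0": "ANON_NAME_0"}

# token -> candidate (lemma, coarse POS, frequency rank); lower rank = more frequent.
# Ranks are invented for illustration.
CANDIDATES = {
    "does": [("do", "V", 20), ("doe", "N", 40000)],
    "rose": [("rise", "V", 900), ("rose", "N", 3000)],
    "met": [("meet", "V", 300)],
}

def coarse_pos(penn_tag):
    """Reduce a Penn Treebank tag to its first letter (VBD -> V, NNS -> N)."""
    return penn_tag[:1]

def lemmatize(token, penn_tag):
    # 1. The supplementary dictionary takes priority.
    if token in SUPPLEMENTARY:
        return SUPPLEMENTARY[token]
    # 2. Otherwise consult the word lists.
    candidates = CANDIDATES.get(token.lower())
    if not candidates:
        # 5. Unknown token: fall back to the token form itself.
        return token.lower()
    # 3. With several candidates, prefer those matching the POS tag.
    if len(candidates) > 1:
        same_pos = [c for c in candidates if c[1] == coarse_pos(penn_tag)]
        if same_pos:
            candidates = same_pos
    # 4. If still ambiguous, pick the most frequent candidate.
    return min(candidates, key=lambda c: c[2])[0]

print(lemmatize("rose", "NN"), lemmatize("rose", "VBD"), lemmatize("does", "VBZ"))
# rose rise do
```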

The tok_lem_POS column in answer.csv contains (token, lemma, POS) triples. A snippet from the very first entry:

('I', 'i', 'PRP'), ('met', 'meet', 'VBD'), ('my', 'my', 'PRP$'), ('friend', 'friend', 'NN'),
('Nife', 'nife', 'NNP'), ('while', 'while', 'IN'), ('I', 'i', 'PRP'), ('was', 'be', 'VBD'),
('studying', 'study', 'VBG'), ('in', 'in', 'IN'), ('a', 'a', 'DT'), ('middle', 'middle', 'JJ'),
('school', 'school', 'NN'), ('.', '.', '.')
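
When the .csv file is read, each tok_lem_POS cell arrives as a string. Assuming the cell holds the Python literal form of a list of tuples (as the snippet above suggests, though this is worth verifying on the actual file), it can be converted back with `ast.literal_eval`:

```python
import ast

# A shortened cell value, written as it would appear as a string in the csv file.
cell = "[('I', 'i', 'PRP'), ('met', 'meet', 'VBD'), ('my', 'my', 'PRP$'), ('friend', 'friend', 'NN')]"

triples = ast.literal_eval(cell)  # safely parses literals; does not execute code

lemmas = [lemma for _token, lemma, _pos in triples]
verbs = [token for token, _lemma, pos in triples if pos.startswith("VB")]

print(lemmas)  # ['i', 'meet', 'my', 'friend']
print(verbs)   # ['met']
```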

4. Dataset contents

There are five files in the corpus_files folder (answer.csv, course.csv, question.csv, student_information.csv and test_scores.csv), which together contain all of the corpus texts, information about the texts, and information about the students.

In addition, there is a csv file, PELIC_compiled.csv, in the home directory, which combines data from the various corpus files. (For a tutorial on how to build PELIC_compiled.csv, please see Tutorials.)

Glossary of codes in files:

| level_id | Level description | CEFR level |
| --- | --- | --- |
| 2 | Pre-Intermediate | A2/B1 |
| 3 | Intermediate | B1 |
| 4 | Upper-Intermediate | B1+/B2 |
| 5 | Advanced | B2+/C1 |

| class_id | Class description |
| --- | --- |
| g | Grammar |
| l | Listening |
| r | Reading |
| s | Speaking |
| w | Writing |

| question_type_id | Question type |
| --- | --- |
| 1 | Paragraph writing |
| 2 | Short answer |
| 3 | Multiple choice |
| 4 | Essay |
| 5 | Fill-in-the-blank |
| 6 | Sentence completion |
| 7 | Word bank |
| 8 | Chart |
| 9 | Word selection |
| 10 | Audio recording |

answer.csv

answer.csv is the largest file in the dataset and contains all of the written texts; that is, PELIC texts are not stored as separate .txt files. Each row of answer.csv is a text with a unique identifier, the answer_id.

There are 9 columns in total, providing the text in various raw and processed forms, and information regarding the source of the text:

| Column | Column name | Description |
| --- | --- | --- |
| A | answer_id | a unique identifier for each text - a 1-5 digit integer, e.g. 19399 |
| B | question_id | a code which links to question.csv, containing task information |
| C | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
| D | course_id | a code which links to course.csv, containing course information, e.g. level, class type, semester |
| E | version | the version number of the text (typically 1, 2 or 3) |
| F | text_len | the number of tokens using re-based tokenization |
| G | text | the raw text produced by the student (as a single string) |
| H | tokens | the tokenized text using NLTK-based tokenization (each token is a string) |
| I | tok_lem_POS | a list of three-part tuples - the token, lemma, and part of speech for each token in column H |

course.csv

course.csv contains information about every course in which PELIC texts were produced. course.csv is organized such that each row is a unique course with a unique identifier, the course_id. There are five columns:

| Column | Column name | Description |
| --- | --- | --- |
| A | course_id | a unique identifier for each course - a 1-4 digit integer, e.g. 987 |
| B | class_id | a code to identify in which of the five class types the text was produced (see Glossary above) |
| C | level_id | a code to identify in which of the four levels the text was produced (see Glossary above) |
| D | semester | the year and semester (fall, spring, summer) in which the text was produced |
| E | section | the class section, as there are sometimes multiple identical classes running in parallel |

question.csv

question.csv contains information about the questions, tasks, or prompts that the texts are based on. question.csv is organized such that each row is a unique question/task/prompt with a unique identifier, the question_id. There are four columns:

| Column | Column name | Description |
| --- | --- | --- |
| A | question_id | a unique identifier for each question/task/prompt - a 1-4 digit integer, e.g. 6107 |
| B | question_type_id | a code to identify the type of task (see Glossary above) |
| C | stem | the text for the question/task/prompt |
| D | allow_text | tasks which allow students to write an answer (like essays) are 1, tasks where students do not write an answer (like choosing a word from a word bank) are 0 |

student_information.csv

student_information.csv is a large file, containing all of the information about the students. student_information.csv is organized such that each row is a student with a unique anonymous identifier, the anon_id.

There are 21 columns in total, providing all available information about the students relating to their background and history of language learning:

| Column | Column name | Description |
| --- | --- | --- |
| A | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
| B | gender | 'Male', 'Female', or 'Unknown' based on student responses |
| C | birth year | four digit year |
| D | native language | students input their own first language (not from a drop-down menu) |
| E | language_used_at-home | language used at home in their home country, not in the US |
| F | non-native_language_1 | the L2 with which the student feels they have the highest proficiency |
| G | yrs_of_study_lang1 | the number of years the student has studied the L2 provided in column F |
| H | study_in_classroom_lang1 | whether or not the student studied their L2 from column F in a classroom setting ('yes' or 'no') |
| I | ways_of_study_lang1 | students selected from a menu how they studied their L2 from column F, e.g. Practiced reading aloud |
| J, N | non-native_language_2, 3 | same as column F but for an additional non-native language |
| K, O | yrs_of_study_lang2, 3 | same as column G but for an additional non-native language |
| L, P | study_in_classroom_lang2, 3 | same as column H but for an additional non-native language |
| M, Q | ways_of_study_lang2, 3 | same as column I but for an additional non-native language |
| R | course_history | a list of all the courses attended (course_id codes) |
| S | yrs_of_english_learning | the number of years the student has been learning English, selected from a drop-down list |
| T | yrs_in_english_environment | the number of years the student has lived in an English-speaking environment, selected from a drop-down list |
| U | age | the student's age at the time of enrollment |

test_scores.csv

test_scores.csv contains information about students' test scores from their initial placement tests upon entering the ELI. test_scores.csv is organized such that each row is a unique student with a unique identifier, the anon_id. There are 18 columns which provide scores for the different components of the placement test:

| Column | Column name | Description |
| --- | --- | --- |
| A | anon_id | a unique anonymous identifier for each student - two letters and one integer, e.g. eq0 |
| B | LCT_Form_1 | in-house listening test (LCT) version number, first time test taken |
| C | LCT_Score_1 | in-house listening test (LCT) score, first time test taken |
| D | MTELP_Form_1 | Michigan Test of English Language Proficiency (MTELP) version number |
| E | MTELP_I_1 | MTELP Grammar section |
| F | MTELP_II_1 | MTELP Reading section |
| G | MTELP_III_1 | MTELP Listening section |
| H | MTELP_Conv_Score_1 | MTELP total combined score |
| I | Writing_Sample_1 | in-house writing test score (scale of 1-6) |
| J - Q | Same as B-I | Same as columns B-I but for the second time students took the tests |

PELIC_compiled.csv

PELIC_compiled.csv is a compilation of the files described above. Like answer.csv, PELIC_compiled.csv is organized such that each row is a unique text with a unique identifier, the answer_id. Accompanying each text are data relating to the author (from student_information.csv) and the course (from course.csv). These columns have been selected for their usefulness in conducting linguistic analysis. However, other columns may be added or deleted as desired; see the build_PELIC_compiled tutorial for how to create and manipulate PELIC_compiled.csv.

There are 13 columns in the pre-supplied version of PELIC_compiled.csv in the repository:

| Column | Column name | Source |
| --- | --- | --- |
| A | answer_id | answer.csv column A |
| B | anon_id | answer.csv column C |
| C | L1 | student_information.csv column D |
| D | gender | student_information.csv column B |
| E | course_id | course.csv column A |
| F | level_id | course.csv column C |
| G | class_id | course.csv column B |
| H | question_id | answer.csv column B |
| I | version | answer.csv column E |
| J | text_len | answer.csv column F |
| K | text | answer.csv column G |
| L | tokens | answer.csv column H |
| M | tok_lem_POS | answer.csv column I |
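
The way these columns are drawn together can be illustrated with the standard library alone. This is only a sketch of the join logic, not the code in the build_PELIC_compiled notebook: the rows below are invented toy data, only a subset of columns is carried over, and the real files should be opened from the corpus_files folder instead of from in-memory strings.

```python
import csv
import io

# Toy rows using the column names documented above; all values are invented.
answer_csv = io.StringIO(
    "answer_id,question_id,anon_id,course_id,version,text_len,text\n"
    "1,5,eq0,149,1,3,I like cats\n"
)
course_csv = io.StringIO(
    "course_id,class_id,level_id,semester,section\n"
    "149,g,4,2006_fall,M\n"
)
student_csv = io.StringIO(
    "anon_id,gender,native language\n"
    "eq0,Male,Arabic\n"
)

# Index the lookup tables by their identifiers.
courses = {row["course_id"]: row for row in csv.DictReader(course_csv)}
students = {row["anon_id"]: row for row in csv.DictReader(student_csv)}

compiled = []
for ans in csv.DictReader(answer_csv):
    course = courses[ans["course_id"]]
    student = students[ans["anon_id"]]
    compiled.append({
        "answer_id": ans["answer_id"],
        "anon_id": ans["anon_id"],
        "L1": student["native language"],
        "gender": student["gender"],
        "course_id": ans["course_id"],
        "level_id": course["level_id"],
        "class_id": course["class_id"],
        "question_id": ans["question_id"],
        "version": ans["version"],
        "text_len": ans["text_len"],
        "text": ans["text"],
    })

print(compiled[0]["L1"], compiled[0]["level_id"], compiled[0]["class_id"])
# Arabic 4 g
```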

5. Additional resources

Corpus stats

The corpus_stats folder currently contains PELIC frequency statistics. All of these frequency data can be calculated from the original files in the corpus_files folder or PELIC_compiled.csv. However, for quicker access to frequency information, the files in this folder may be useful.

The corpus_stats folder contains the following files:

| File | Description |
| --- | --- |
| README.md | a README file containing a description of the folder contents |
| frequency_stats.ipynb | a Jupyter notebook describing how word_frequencies.csv and lemma_frequencies.csv were created |
| word_frequencies.csv | a csv file containing the total frequency and per million frequency for every word in PELIC |
| lemma_frequencies.csv | a csv file containing the total frequency and per million frequency for every lemma in PELIC |

corpus_stats notes:

  • Distributions do not take capitalization into account – a capitalized word and the same non-capitalized word will go towards the same count.
  • Frequencies are based on the NLTK-based tokenized text tokens. As described in Section 3, punctuation is therefore also included in the distributions. If considering frequency ranking (for example for frequency bands), it is important to first exclude punctuation.
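
The two notes above can be illustrated with a small frequency count that ignores capitalization and drops punctuation-only tokens before ranking. This is a sketch rather than the code in frequency_stats.ipynb; in particular, computing the per-million rate over the remaining word tokens only is our assumption.

```python
import re
from collections import Counter

def word_frequencies(tokens):
    """Return {word: (raw count, per-million rate)} for case-folded word tokens."""
    # Keep only tokens containing at least one letter or digit, i.e. drop punctuation.
    words = [t.lower() for t in tokens if re.search(r"[A-Za-z0-9]", t)]
    counts = Counter(words)
    total = len(words)
    return {word: (n, n / total * 1_000_000) for word, n in counts.items()}

freqs = word_frequencies(["The", "cat", ",", "the", "dog", "."])
print(freqs["the"])  # (2, 500000.0)
print("," in freqs)  # False
```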

Tutorials

The tutorials folder contains three tutorials relating to the PELIC dataset:

  1. build_PELIC_compiled.ipynb
    The build_PELIC_compiled notebook provides a tutorial for creating the PELIC_compiled.csv from the PELIC corpus files in the corpus_files folder. The final csv file is also available in the home directory. For more information on PELIC_compiled.csv, see Section 4.

  2. exploratory_data_analysis.ipynb
    The exploratory_data_analysis notebook provides a standard first step (EDA) in any data exploration and corpus analysis. It presents and demonstrates basic statistics of PELIC's composition, including the figures and statistics presented in this file. The statistics relate to the aspects of the corpus such as the students, their first languages, their genders, the classes and semesters, and of course the texts themselves.

  3. PELIC_concordancing_tutorial.ipynb
    The PELIC_concordancing_tutorial notebook provides a short example of the type of linguistic investigation that can be carried out with the data in PELIC. The focus of the investigation is a set of verbs which are important indicators of syntactic complexity. The tutorial has two aims:

  • to present a straightforward and replicable way of accessing and processing the corpus data necessary to answer genuine research questions, using tools from the Pitt ELI Toolkit (pelitk)
  • to demonstrate how to build a concordance list and dataframe using the PELIC data
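
To give a flavour of what a concordance list looks like, here is a minimal key-word-in-context sketch in plain Python. It deliberately does not use the pelitk API (see the pelitk README for the actual functions); the function name and the `width` parameter are ours.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = "I met my friend while I was studying".split()
for left, match, right in concordance(tokens, "i", width=2):
    print(f"{left:>15} [{match}] {right}")
```

Each returned tuple can be used directly as a row when building a dataframe of concordance lines.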

6. Pitt ELI Toolkit (pelitk)

pelitk is a python package that contains implementations of various lexical analysis tools that are useful for Second Language Acquisition (SLA) work. These modules can be imported and used in Python. At present, there are two modules available:

  1. conc.py - functions for creating concordances to show selected key words in context
  2. lex.py - functions measuring lexical sophistication and diversity using a range of indices

For details of pelitk contents and example usage, please see the pelitk repo README.md file.


7. PELIC spelling

PELIC-spelling is a repository which provides information and code about applying spelling correction to the PELIC dataset. Spelling correction is an important element to consider in any corpus study involving learner data. The decision whether to correct texts or not will invariably impact results: in some instances it may be preferable to use the raw text, maintaining its integrity and avoiding an additional layer of processing. However, for other projects, corrected text may provide a more accurate representation of the language features being investigated.

For details of PELIC-spelling contents and example usage, please see the PELIC-spelling repo README.md file.


8. Future data release

The spoken data from speaking classes will be available in both .wav format (analyzable in PRAAT) and .mp3 format and will include the students' transcriptions of their own spoken data. A publication based on a small subset of these data is Vercellotti (2017).

  • Recorded Speaking and Grammar Activities (.wav files):
    Arabic 20,678; Chinese 9,870; Japanese 3,564; Korean 11,827
  • A small subset of these files which are annotated in CHAT/CLAN and a list of published research is available at Talkbank.org.

9. References


10. License

Creative Commons License
PELIC dataset by Alan Juffs, Na-Rae Han, Ben Naismith is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on a work at https://github.com/ELI-Data-Mining-Group/PELIC_dataset.

