A presentation on Capstone Project "PredictNextWord"
By Gyaan GM [May 1, 2016]
The Coursera Data Science Specialization Capstone Project from Johns Hopkins University (JHU) allows students to create a usable public data product that can show their skills to potential employers. For this iteration of the class, JHU partnered with SwiftKey (http://swiftkey.com/en/) to apply data science in the area of Natural Language Processing.
The objective of this project is to build a working predictive text model. The data used in the model came from a corpus called HC Corpora (www.corpora.heliohost.org). A corpus is body of text, usually containing a large number of sentences. [1]
[1] http://desilinguist.org/pdf/crossroads.pdf
The algorithm developed to predict the next word in a user-entered text string was based on a classic N-gram model. [2] Using a subset of cleaned data from blogs, twitter, and news Internet files, Maximum Likelihood Estimation (MLE) of unigrams, bigrams, and trigrams were computed.
To improve accuracy, Jelinek-Mercer smoothing was used in the algorithm, combining trigram, bigram, and unigram probabilities. [3] Where interpolation failed, part-of-speech tagging (POST) was employed to provide default predictions by part of speech. [4] Suggested word completion was based on the unigrams. A profanity filter was also utilized on all output using Google's bad words list. [5]
[2] http://en.wikipedia.org/wiki/N-gram [3] http://www.ee.columbia.edu/~stanchen/papers/h015l.final.pdf [4] http://en.wikipedia.org/wiki/Part-of-speech_tagging [5] https://badwordslist.googlecode.com/files/badwords.txt
Then a Shiny (http://shiny.rstudio.com/) app that accepts a phrase as input, suggests word completion from the unigrams, and predicts the most likely next word based on the linear interpolation of trigrams, bigrams, and unigrams is developed. The web-based application can be found here and the source files for this project can be found here. App user interface looks like this
The user interface of this application works as follows:When the text [**1**] is entered, the field with the predicted next word [**2**] refreshes instantaneously and also the whole text input [**3**] gets displayed with suggested completion.