This project, completed as part of the University of Ottawa's 2023 NLP course, classifies Project Gutenberg texts into their respective literary works. Using standard NLP techniques, it covers data collection, preprocessing, feature engineering, modeling, and model evaluation.
- Required libraries: scikit-learn, pandas, matplotlib (the sketches below additionally assume nltk, wordcloud, and gensim).
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
Text classification task: categorize five Gutenberg texts into their respective literary works (books).
```python
selected_books = ['austen-emma.txt', 'carroll-alice.txt', 'chesterton-brown.txt',
                  'edgeworth-parents.txt', 'shakespeare-hamlet.txt']
```
-
Data Preparation, Preprocessing, and Cleaning:
-
Listing all the books in Gutenberg’s library.
{'austen-emma.txt': 'Jane Austen', 'austen-persuasion.txt': 'Jane Austen', 'austen-sense.txt': 'Jane Austen', 'carroll-alice.txt': 'Lewis Carroll', 'chesterton-ball.txt': 'G. K. Chesterton', 'chesterton-brown.txt': 'G. K. Chesterton', 'chesterton-thursday.txt': 'G. K. Chesterton', 'edgeworth-parents.txt': 'Maria Edgeworth', 'melville-moby_dick.txt': 'Herman Melville', 'shakespeare-caesar.txt': 'William Shakespeare', 'shakespeare-hamlet.txt': 'William Shakespeare', 'whitman-leaves.txt': 'Walt Whitman'}
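The catalogue above can be reproduced with NLTK's Gutenberg corpus; a minimal sketch:

```python
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')  # fetch the corpus if it is not cached yet

# Print every file shipped with NLTK's Gutenberg sample and its length.
for fileid in gutenberg.fileids():
    print(fileid, '-', len(gutenberg.words(fileid)), 'words')
```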
-
Choose five different books by five different authors belonging to the same category (History).
-
Data preparation (a sketch follows this list):
- Removing stop words.
- Converting all words to lowercase.
- Tokenizing the text.
- Lemmatizing, which reduces each word to its base form.
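A minimal preprocessing sketch, assuming NLTK's English stop-word list and WordNet lemmatizer (the `preprocess` helper name is illustrative):

```python
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download(['punkt', 'stopwords', 'wordnet'])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(raw_text):
    """Lowercase, tokenize, drop stop words and punctuation, lemmatize."""
    tokens = word_tokenize(raw_text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

clean_tokens = preprocess(gutenberg.raw('austen-emma.txt'))
```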
-
Data Partitioning: partition each book into 200 documents, each document being a 100-word record.
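A sketch of the partitioning step, reusing the hypothetical `preprocess` helper above:

```python
def make_documents(fileid, n_docs=200, words_per_doc=100):
    """Partition a cleaned book into n_docs consecutive 100-word records."""
    tokens = preprocess(gutenberg.raw(fileid))
    # Assumes the cleaned book contains at least n_docs * words_per_doc tokens.
    return [' '.join(tokens[i:i + words_per_doc])
            for i in range(0, n_docs * words_per_doc, words_per_doc)]

emma_docs = make_documents('austen-emma.txt')
```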
-
Data labels are assigned as follows (kept as a simple mapping in the sketch after this list):
- austen-emma→ a
- chesterton-thursday→ b
- shakespeare-hamlet→ c
- chesterton-ball→ d
- carroll-alice→ e
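```python
labels = {
    'austen-emma.txt': 'a',
    'chesterton-thursday.txt': 'b',
    'shakespeare-hamlet.txt': 'c',
    'chesterton-ball.txt': 'd',
    'carroll-alice.txt': 'e',
}

# One (text, label) pair per 100-word record.
dataset = [(doc, label)
           for fileid, label in labels.items()
           for doc in make_documents(fileid)]
```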
-
Word Cloud Generation: generate a word cloud of the 100 most frequent words in each author's book.
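A word-cloud sketch using the `wordcloud` package (assumed installed; names are illustrative):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(fileid):
    """Show the 100 most frequent words of one book as a word cloud."""
    text = ' '.join(preprocess(gutenberg.raw(fileid)))
    cloud = WordCloud(max_words=100, background_color='white').generate(text)
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(fileid)
    plt.show()

plot_word_cloud('austen-emma.txt')
```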
-
Shuffle Dataset
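Shuffling can be done with pandas before splitting (a sketch):

```python
import pandas as pd

df = pd.DataFrame(dataset, columns=['text', 'label'])
# Shuffle the rows so the five classes are interleaved.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```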
-
-
Feature Engineering (a combined sketch follows this list):
- Transformation
- Bag of Words (BoW): represents the occurrence of words within a document; it involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
- Term Frequency - Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents; a score is computed for each word to signify its importance in the document and the corpus.
- N-grams
- Word Embedding (Word2Vec)
- Encoding
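A combined sketch of the transformations and the label encoding, using scikit-learn and gensim; parameter values are illustrative, not the project's exact settings:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

texts = df['text'].tolist()

# Bag of Words: raw term counts per document.
X_bow = CountVectorizer().fit_transform(texts)

# TF-IDF: term counts reweighted by inverse document frequency.
X_tfidf = TfidfVectorizer().fit_transform(texts)

# N-grams: unigrams and bigrams counted together.
X_ngram = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Word2Vec: represent each document by the average of its word vectors.
tokenized = [t.split() for t in texts]
w2v = Word2Vec(tokenized, vector_size=100, min_count=1, seed=42)
X_w2v = np.array([w2v.wv[tokens].mean(axis=0) for tokens in tokenized])

# Encoding: turn the class letters a-e into integer targets.
y = LabelEncoder().fit_transform(df['label'])
```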
-
Modeling: for each of the feature-engineering techniques above, the following models are trained and tested (a sketch follows this list).
- Random Forest
- Gaussian Naive Bayes
- K Nearest Neighbors
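A sketch of the training loop, shown with the TF-IDF features; the same pattern applies to the other transformations, and the split ratio is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# GaussianNB needs dense input, hence .toarray().
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf.toarray(), y, test_size=0.25, random_state=42)

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gaussian Naive Bayes': GaussianNB(),
    'K Nearest Neighbors': KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```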
-
Model Evaluation
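Per-class metrics and the confusion matrix give more detail than accuracy alone (a sketch, reusing the names above):

```python
from sklearn.metrics import classification_report, confusion_matrix

best = models['Gaussian Naive Bayes']
y_pred = best.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```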
-
Error Analysis of Champion Model:
Best model: Gaussian Naive Bayes
Accuracy and champion embedding: [0.98, 'N-Grams']
-
Reducing the number of words per document reduces the accuracy of our champion model (a sketch of this check follows the results):
- Accuracy with 100 words per document: 98.67 %
- Accuracy with 70 words per document: 97.33 %
- Accuracy with 50 words per document: 94.67 %
- Accuracy with 40 words per document: 94.67 %
- Accuracy with 30 words per document: 92.0 %
- Accuracy with 20 words per document: 84.0 %
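A sketch of how this check can be run, reusing the hypothetical helpers above; `max_features` caps the vocabulary so the dense matrix GaussianNB needs stays small:

```python
for words_per_doc in [100, 70, 50, 40, 30, 20]:
    docs = [(d, lab) for fid, lab in labels.items()
            for d in make_documents(fid, words_per_doc=words_per_doc)]
    texts_k, y_k = map(list, zip(*docs))
    X_k = CountVectorizer(ngram_range=(1, 2),
                          max_features=5000).fit_transform(texts_k).toarray()
    Xtr, Xte, ytr, yte = train_test_split(X_k, y_k, test_size=0.25,
                                          random_state=42)
    acc = GaussianNB().fit(Xtr, ytr).score(Xte, yte)
    print(f'Accuracy with {words_per_doc} words per document: {acc:.2%}')
```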
-
The n_estimators parameter does not significantly impact the model's performance on our dataset (a sketch follows the results):
- Accuracy with n_estimators = 100: 98.67 %
- Accuracy with n_estimators = 70: 98.67 %
- Accuracy with n_estimators = 50: 98.67 %
- Accuracy with n_estimators = 40: 98.67 %
- Accuracy with n_estimators = 30: 98.67 %
- Accuracy with n_estimators = 20: 98.67 %
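This sweep applies to the Random Forest model; a sketch, reusing the split from the modeling step:

```python
for n in [100, 70, 50, 40, 30, 20]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    acc = rf.fit(X_train, y_train).score(X_test, y_test)
    print(f'Accuracy with n_estimators={n}: {acc:.2%}')
```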