\section{Data Processing}
\label{sec:data_processing}
Data processing is used in this project to improve feature extraction from sentences: owing to their limited length, as much metadata as possible needs to be derived from each sentence to aid classification.
\subsection{Preprocessing}
\label{sec:preprocessing}
Sentence preprocessing was conducted to clean the dataset before it is fed into classifiers.
Preprocessing steps include stemming, lemmatisation, and stopword removal. These options are made available separately so that classifiers can apply them as required, as sketched below.
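A minimal sketch of these steps, assuming the NLTK library and its standard stemmer, lemmatiser, and English stopword list (the toolkit actually used in the project is not specified here):
\begin{verbatim}
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

def preprocess(tokens, stem=False, lemmatise=False, remove_stopwords=False):
    # Each step is an independent option, so classifiers can
    # request only the combination they need.
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    if lemmatise:
        tokens = [lemmatiser.lemmatize(t) for t in tokens]
    return tokens
\end{verbatim}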
\subsection{Word2Vec}
\label{sec:word2vec}
One-hot encoding of words leads to impractically large dimensionality, and it is desirable to instead create a lower-dimensional embedding for the vocabulary being used.
Word2Vec trains a shallow two-layer neural network to predict words from their neighbours in the input corpus, and uses the learned weights as word embeddings. The resulting space is of far lower dimensionality and preserves semantic relationships (e.g. $queen - king + man \approx woman$). A Word2Vec embedding was trained on our own corpus. However, because the dataset is relatively small, it is likely that this embedding does not strongly enforce relationships between words, and it will not contain words that appear only in the test data.
To address this, a pre-trained embedding was also used: a ``Global Vectors for Word Representation'' (GloVe) embedding trained on a corpus of 6 billion tokens drawn from Wikipedia and Gigaword.
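A sketch of loading these pre-trained vectors via gensim's downloader; the 100-dimensional variant is shown as an assumption, and the dimensionality used in this project may differ:
\begin{verbatim}
import gensim.downloader as api

# Downloads the pre-trained GloVe 6B vectors
# (100-dimensional variant shown as an assumption).
glove = api.load("glove-wiki-gigaword-100")

vector = glove["queen"]
# The classic analogy: king - man + woman approximately equals queen.
result = glove.most_similar(positive=["king", "woman"],
                            negative=["man"], topn=1)
\end{verbatim}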