
fake_news_classifier_glove_BLSTM

A fake news classifier built with a Bidirectional LSTM and 100-dimensional GloVe embeddings as features. Achieved a public score of 0.92 in the Kaggle submission: https://www.kaggle.com/mathav95raj/blstm-fake-news-classifier-with-glove-embeddings

Reason for choosing BiLSTM - Vanilla recurrent neural networks cannot remember long-range dependencies in a sentence. An LSTM layer has forget, update, and output gates, which give the model control over what information is retained and what is discarded. Since the full news text is available beforehand, a bidirectional layer lets the model read a sentence in both the forward and reverse directions.
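For reference, a minimal sketch of such a model in Keras is shown below. The layer sizes, names, and training settings are illustrative assumptions, not taken from this repository; the plain Embedding layer here stands in for the custom partly-trainable layer described further down.

```python
# Minimal BiLSTM classifier sketch (TensorFlow 2.x / Keras); sizes are illustrative.
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

MAX_LEN = 40          # padded title length (the actual value lives in the repo config)
VOCAB_SIZE = 10_000   # illustrative vocabulary size
EMBED_DIM = 100       # GloVe vectors are 100-dimensional

inputs = Input(shape=(MAX_LEN,), dtype="int32")
x = Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)   # replaced by the custom embedding layer below
x = Bidirectional(LSTM(64))(x)                 # reads the token sequence in both directions
outputs = Dense(1, activation="sigmoid")(x)    # probability that the article is fake

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```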

GloVe vectors - Learning embeddings from scratch would add a huge number of parameters to train, and the model would very likely overfit. To avoid this, transfer learning was performed by utilizing pre-trained embeddings from https://nlp.stanford.edu/projects/glove/

The GloVe vectors chosen for this project were trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 100d). NOTE: the GloVe vectors have to be downloaded separately; change the glove path accordingly.
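A minimal sketch of loading the downloaded file and building an embedding matrix is shown below; the path, the toy word_index, and the variable names are placeholders, not the repository's actual code.

```python
# Build a (vocab_size x 100) embedding matrix from glove.6B.100d.txt.
import numpy as np

GLOVE_PATH = "glove.6B.100d.txt"   # change this path to wherever the file was downloaded
EMBED_DIM = 100

glove = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# word_index maps each training-set word to an integer id (normally from a Keras Tokenizer)
word_index = {"fake": 1, "news": 2, "headline": 3}   # toy example

embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, idx in word_index.items():
    vector = glove.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector   # words missing from GloVe keep a zero row here
```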

Trainable parameters have been substantially reduced from 1 million to about 32 thousand.

A custom embedding layer class is implemented that performs backpropagation training on only a portion of the embedding matrix.

Example (the split is sketched in code after this list):

Vocabulary of the training set: 10k

Words in the training set that are also in GloVe: 8k (non-trainable)

Words in the training set that are not in GloVe but occur more than WORD_FREQUENCY (see config) times: 1k (trainable)

Words in the training set that are not in GloVe and occur less frequently: represented by a shared unknown-word token (trainable)

Padding a news title to a fixed length (see config): represented by a padding token (trainable)
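The split could look roughly like the sketch below; the WORD_FREQUENCY value and the token names are assumptions, not the repository's actual config.

```python
# Partition the training vocabulary into GloVe-backed (frozen) and learned (trainable) words.
from collections import Counter

WORD_FREQUENCY = 5            # frequency threshold from the config (value assumed)
PAD, UNK = "<pad>", "<unk>"   # padding and rare-word tokens (names assumed)

def split_vocabulary(train_tokens, glove_vocab):
    counts = Counter(train_tokens)
    fixed, trainable = [], [PAD, UNK]          # both special tokens get learned vectors
    for word, freq in counts.items():
        if word in glove_vocab:
            fixed.append(word)                 # reuse the pre-trained GloVe vector
        elif freq > WORD_FREQUENCY:
            trainable.append(word)             # frequent out-of-GloVe word: learn a vector
        # rare out-of-GloVe words are simply mapped to UNK later
    return fixed, trainable
```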

I faced an issue implementing the custom class: a simple keras.concatenate would not do the job, because the one-hot vector indices were renumbered when I created separate layers for the fixed and trainable embeddings. Instead of the embedding matrix looking like {0 1 ... 7999 | 8000 ... 9001}, it looked like {0 1 ... 7999 | 0 1 ... 1001}.

Thanks to https://github.com/sachinruk/deepschool.io for the custom class that concatenates the fixed and trainable embedding layers. His related question on Stack Overflow: https://stackoverflow.com/questions/46032700/concatenate-embedding-layers/46149153#46149153
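For illustration, here is a minimal sketch of such a partly-trainable embedding layer in the spirit of that Stack Overflow answer; it is not the repository's class, and the shapes and names are assumptions. The key point is that the frozen GloVe rows and the trainable rows are concatenated into one matrix before a single lookup, so token indices keep one continuous range (0..7999 for GloVe rows, 8000 onward for trainable rows).

```python
import numpy as np
import tensorflow as tf

class PartlyTrainableEmbedding(tf.keras.layers.Layer):
    """Embedding layer whose first rows are frozen (GloVe) and last rows are trainable."""

    def __init__(self, fixed_matrix, n_trainable, **kwargs):
        super().__init__(**kwargs)
        self.fixed = tf.constant(fixed_matrix, dtype=tf.float32)  # GloVe rows, never updated
        self.n_trainable = n_trainable
        self.embed_dim = fixed_matrix.shape[1]

    def build(self, input_shape):
        # rows for the padding token, the unknown-word token and frequent
        # out-of-GloVe words: learned by backpropagation
        self.trainable_rows = self.add_weight(
            name="trainable_embeddings",
            shape=(self.n_trainable, self.embed_dim),
            initializer="uniform",
            trainable=True,
        )

    def call(self, token_ids):
        full_matrix = tf.concat([self.fixed, self.trainable_rows], axis=0)
        return tf.nn.embedding_lookup(full_matrix, token_ids)

# Usage: indices 0..7999 hit the frozen GloVe rows, 8000..9001 the trainable rows.
glove_rows = np.random.rand(8000, 100).astype("float32")   # stand-in for the real GloVe matrix
layer = PartlyTrainableEmbedding(glove_rows, n_trainable=1002)
vectors = layer(tf.constant([[0, 7999, 8000, 9001]]))
print(vectors.shape)   # (1, 4, 100)
```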
