This model was developed to move away from a dictionary-based approach, which is crucial for Russian, a language known for its rich affixation.

The source data consists of:
- short texts
- labels denoting the positive, negative, or neutral tonality of each message
Because the labels were assigned automatically, a substantial share of the messages was mislabelled. To compose a reliable dataset, the initial data was distilled to ~33k items. Another problem was a strong correlation between the labels and the presence of emoticons in the text. To keep the model from overfitting to this cue and to base the classification entirely on the semantics of the texts, the most popular emoticons were filtered out.
Preprocessing is reduced to two stages (see the sketch after this list):
- elimination of emoticons, user ids, and noisy punctuation marks
- tokenization and encoding with the old-but-gold TensorFlow solution SubwordTextEncoder

This approach reduces the vocabulary to 2^12 items without losing the ability to capture the semantics of a given sequence. As a result, no stemming or lemmatization is needed, and there is no need to devise a dedicated OOV policy, since any word is bound to be split into a sequence of known tokens.
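A minimal sketch of both stages, assuming a `texts` list holding the raw messages; the cleaning regexes and variable names are illustrative, not the exact filters used here:

```python
import re
import tensorflow_datasets as tfds

# Stage 1: strip emoticons, user ids and runs of punctuation
# (illustrative patterns, not the project's exact filters).
def clean(text):
    text = re.sub(r"[:;=]-?[()DPp]", " ", text)   # common ASCII emoticons
    text = re.sub(r"@\w+", " ", text)             # user ids
    text = re.sub(r"[!?.,]{2,}", " ", text)       # repeated punctuation
    return text.strip()

# Stage 2: build a subword vocabulary of ~2^12 tokens from the corpus.
# In recent TFDS releases the class lives under tfds.deprecated.text.
corpus = [clean(t) for t in texts]   # `texts` is an assumed input
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus, target_vocab_size=2**12)

# Any word, even an unseen one, decomposes into known subwords,
# so neither lemmatization nor a dedicated OOV policy is needed.
ids = encoder.encode("синхрофазотрон")
assert encoder.decode(ids) == "синхрофазотрон"
```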
A number of architectures were tested (MLP, CNN, LSTM, BiLSTM); the best result was achieved with the BiLSTM.
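A sketch of what such a classifier could look like in Keras; the layer sizes and hyperparameters below are assumptions, not the saved model's exact configuration:

```python
import tensorflow as tf

VOCAB_SIZE = 2**12   # target subword vocabulary size from the encoding step
NUM_CLASSES = 3      # positive / negative / neutral

# Illustrative BiLSTM classifier; layer widths are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```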
The dataset was split into train/test subsets in an 85/15 proportion, and the model was then fit on the training subset with a validation split of 15%.
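Continuing the sketch above, the split and training step might look as follows; the padding length, batch size, epoch count, and the `labels` input are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Pad the encoded subword sequences to a common length (64 is illustrative).
padded = tf.keras.preprocessing.sequence.pad_sequences(
    [encoder.encode(t) for t in corpus], maxlen=64)
y = np.asarray(labels)   # `labels` holds the tonality classes (assumed input)

# 85/15 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    padded, y, test_size=0.15, random_state=42)

# Fit on the training subset, holding out 15% of it for validation.
model.fit(X_train, y_train, validation_split=0.15,
          epochs=5, batch_size=128)
```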
| Metric | Value |
| --- | --- |
| Validation accuracy | 0.8119 |
| F1-score (macro) on test | 0.8015 |
| Avg CPU time | < 0.015 s |
To make this result reproducible, the model weights have been saved and are then loaded by the Flask application.
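A minimal sketch of such an application; only the /sentiment route, the `text` form field, and port 8585 come from the commands below, while the file paths, padding length, and class order are assumptions:

```python
import tensorflow as tf
import tensorflow_datasets as tfds
from flask import Flask, request, jsonify

app = Flask(__name__)

# Restore the subword encoder and rebuild the network from the sketch
# above (the file paths here are illustrative).
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file("vocab")
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.build(input_shape=(None, 64))
model.load_weights("weights.h5")  # illustrative path to the saved weights

LABELS = ["negative", "neutral", "positive"]  # assumed class order

@app.route("/sentiment", methods=["POST"])
def sentiment():
    # The `text` form field matches the curl example below.
    ids = tf.keras.preprocessing.sequence.pad_sequences(
        [encoder.encode(request.form["text"])], maxlen=64)
    probs = model.predict(ids)[0]
    return jsonify({"label": LABELS[int(probs.argmax())]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8585)
```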
A dockerised application has been developed as an MVP. To build it, run:

```bash
docker build -t [name] .
```

To run it in detached mode:

```bash
docker run -d -p 8585:8585 [name]
```

To verify that it works, run:

```bash
curl -X POST -d 'text=синхрофазотрон' http://localhost:8585/sentiment
```
MIT