This is an attempt at topic modelling on the top 100 papers from the GitHub repo awesome-deep-learning-papers.
The repo lists 100 papers, but during crawling with a script,
access to one of them (Human-level control through deep reinforcement learning) was blocked,
so only 99 PDFs were downloaded.
A script then calls pdftotext to
convert the PDFs to plain text.
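A minimal sketch of that conversion step, assuming the PDFs sit in a `pdfs/` directory and the plain texts go to `txts/` (both directory names are assumptions, not taken from the repo):

```python
import subprocess
from pathlib import Path

def txt_path_for(pdf_path: Path, txt_dir: Path) -> Path:
    """Map e.g. pdfs/lenet.pdf to txts/lenet.txt."""
    return txt_dir / (pdf_path.stem + ".txt")

def convert_pdfs(pdf_dir: Path, txt_dir: Path) -> None:
    """Run pdftotext on every PDF in pdf_dir, writing .txt files to txt_dir."""
    txt_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # pdftotext <input.pdf> <output.txt> extracts the text layer of the PDF
        subprocess.run(["pdftotext", str(pdf), str(txt_path_for(pdf, txt_dir))],
                       check=True)
```

This requires the `pdftotext` binary (part of Poppler) to be on the PATH.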
In find_topics.py, all plain texts are concatenated into papers.txt,
which is about 4 MB,
i.e. roughly 4,000,000 characters of data.
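The concatenation step might look like the following sketch (the helper name and the `txts/` directory are assumptions, not the actual code of find_topics.py):

```python
from pathlib import Path

def concat_texts(txt_dir: Path, out_file: Path) -> int:
    """Join all per-paper .txt files into one corpus file; return its size in characters."""
    texts = [p.read_text(errors="ignore") for p in sorted(txt_dir.glob("*.txt"))]
    corpus = "\n".join(texts)
    out_file.write_text(corpus)
    return len(corpus)
```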
The gensim
library is used, as it is tailored for topic modelling. The findings are visualized with the pyLDAvis
library and stored as an .html file.