Information retrieval project: develop a Twitter search engine with personalized search.
Group members:
- Samuele Ventura, ID: 793060, email: [email protected]
- Federico Belotti, ID: 808708, email: [email protected]
The project has been tested on Pop!_OS 19.10 and MacOS 10.15.2.
If you're using Windows...use Linux!
In order to use this project, one needs to:
- Download and install git
- Download and install Docker and Docker Compose
- Download and install conda
- Download the data folder and unpack it in the project root: please remember that all the modules refer to the files in the data folder exactly as they are named and placed, so do not change their positions or names!
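To quickly verify that the tool prerequisites are installed and on your PATH, you can run a simple sanity check like the following (this is only a suggestion, not part of the project):

```
git --version
docker --version
docker-compose --version
conda --version
```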
The data folder contains (see the layout sketch after this list):
- A folder named `models`, which contains all the word embeddings trained with TWEC, plus the additional `bigrams.pkl` and `trigrams.pkl` models
- A folder named `query`, which contains `query.json`, with 649727 tweets on multiple topics (sport, economy, politics, music, cinema, technology) from 01-21-2020 to 02-04-2020, and `query.pkl`, which contains a list of lists with the preprocessed and tokenized tweets
- A folder named `twec`, which contains all the .txt files needed to train TWEC
- A folder named `users`, which contains all the users' tweets and their respective preprocessed and tokenized .pkl files
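For reference, the expected layout looks roughly like this (we assume the unpacked folder is named `data/`; the individual file names inside `twec/` and `users/` depend on your data and are only placeholders):

```
data/
├── models/   # TWEC word embeddings, plus bigrams.pkl and trigrams.pkl
├── query/    # query.json and query.pkl
├── twec/     # .txt files used to train TWEC
└── users/    # users' tweets and their preprocessed .pkl files
```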
Follow these instructions to create the workspace (the full sequence is also collected in a single block after this list):
- `git clone https://github.com/Sbarbagnem/sbarbasearch.git`
- `cd sbarbasearch`
- Copy the data folder in here
- Clone the TWEC repository with `git clone https://github.com/valedica/twec.git`, needed to train a particular version of the Temporal Word Embeddings. For more information see https://github.com/valedica/twec
- `conda env create --name sbarbasearch --file=environment.yml` to create the conda environment
- `conda activate sbarbasearch`
- `cd twec; pip install -e .; cd ..` to install TWEC
- `python requirements_installer.py` to install the packages needed by NLTK
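For convenience, here is the same setup as a single shell sequence (it assumes the data folder has already been copied into the project root):

```
git clone https://github.com/Sbarbagnem/sbarbasearch.git
cd sbarbasearch
git clone https://github.com/valedica/twec.git
conda env create --name sbarbasearch --file=environment.yml
conda activate sbarbasearch
cd twec; pip install -e .; cd ..
python requirements_installer.py
```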
If you're running a Linux-like distro, you'd better first run `sudo sysctl -w vm.max_map_count=262144`, otherwise the Kibana service won't start properly.
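Note that `sysctl -w` only changes the value until the next reboot; if you want it to persist, one common option (our suggestion, not a project requirement) is to add it to `/etc/sysctl.conf`:

```
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```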
Once you've downloaded the data folder and unpacked it in the project root, you can start the services by running `docker-compose up -d`, then `python -m indexer.create_index` to create the index, which can be managed at http://localhost:5601/app/kibana.
Run the webapp with `python -m webapp.app` and browse to http://127.0.0.1:5000/.
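Before creating the index, you can check that the containers came up correctly; something along these lines should work (the port assumes Elasticsearch is exposed on its default 9200 in `docker-compose.yml`):

```
docker-compose ps              # all services should be listed as "Up"
curl -s http://localhost:9200  # Elasticsearch should answer with a JSON banner
```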
A rule of thumb to run every runnable script in this project is: `python -m folder.script_name [options...]`.
Every script must be run from the project root. For example, if you want to run a preprocessing stage, from the project root run `python -m preprocess.preprocess --help` to see what options you have, then run the script with the desired options.
Here are listed all the runnable scripts:
- `python -m crawl_tweet.crawl_tweet_query` downloads or updates the query tweets using the tweepy API. Run it with `--help` to see all the available options
- `python -m crawl_tweet.crawl_tweet_users` downloads or updates the users' tweets using the tweepy API. Run it with `--help` to see all the available options
- `python -m embeddings.create_twec_data` creates the .txt files to train TWEC. Run it with `--help` to see all the available options
- `python -m embeddings.train_embeddings` trains the TWEC models. Run it with `--help` to see all the available options
- `python -m embeddings.train_phraser` trains bigram and/or trigram models. Run it with `--help` to see all the available options
- `python -m indexer.create_index` creates the index used by Elasticsearch during the search
- `python -m preprocess.preprocess` preprocesses the query and the users json files. Run it with `--help` to see all the available options
- `python -m user_profile.create_bow` creates the user profiles by fitting a TF model. Run it with `--help` to see all the available options
- `python -m webapp.app` starts the webapp at http://127.0.0.1:5000/
If you want, for example, to update the users' and query tweets, preprocess them, learn bigram or trigram models, and/or train new word embeddings with different parameters, please follow these steps in order (a consolidated command sequence is sketched right after this list):
- [Optional] `python -m crawl_tweet.crawl_tweet_query` to update the query tweets
- [Optional] `python -m crawl_tweet.crawl_tweet_users` to update the users' tweets
- `python -m preprocess.preprocess` to preprocess the query and the users json files, and save the preprocessed files as lists of sentences, where each sentence is a list of tokens. This is needed for both the phraser and the TWEC models
- `python -m embeddings.train_phraser` to learn bigram and/or trigram models
- `python -m embeddings.create_twec_data` to create the .txt files needed to train TWEC
- `python -m embeddings.train_embeddings` to train the TWEC models
- `python -m user_profile.create_bow` to update the users' profiles by fitting a new TF model
- `python -m indexer.create_index` to create the index used by Elasticsearch during the search
- `python -m webapp.app` to start the webapp at http://127.0.0.1:5000/
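Put together, a full refresh of the pipeline looks roughly like this (the optional crawling steps are included; check each script's `--help` for the exact options you need, since the flags are not repeated here):

```
python -m crawl_tweet.crawl_tweet_query   # optional
python -m crawl_tweet.crawl_tweet_users   # optional
python -m preprocess.preprocess
python -m embeddings.train_phraser
python -m embeddings.create_twec_data
python -m embeddings.train_embeddings
python -m user_profile.create_bow
python -m indexer.create_index
python -m webapp.app
```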