Murpheus is a powerful tool for analyzing large volumes of Twitter data from the Internet Archive. It is built to scale from laptop hardware to massive supercomputer clusters.
This project's main goal is to lower the barrier to entry for data scientists, researchers, and anyone else who wants to analyze and study Twitter datasets.
With this in mind, we want to build a powerful toolset that scales massively and works right out of the box with minimal setup (but if you do need any help, hop on over to our issues page!).
The main framework that we use is Dask, a powerful library that provides advanced parallelism for analytics, enabling performance at scale for the tools you love[1].
Before installing, ensure that you're running Python >= 3.6. We intend to support all of the latest releases of Python as they come out!
Installing with pip is as easy as:
pip install smpa-murphy
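Once that finishes, a quick import is an easy sanity check (a minimal sketch; it simply verifies that the `murphy` package used in the usage example below is importable):

python -c "from murphy import data_loader" # -> exits silently if the install worked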
Installing the latest development version from source is also quite simple. To do so, run the following commands:
git clone https://github.com/Social-Media-Public-Analysis/murphy.git # -> clone the repo
cd murphy # -> move over to the repo
python setup.py install # -> install the library directly!
Annddd you're done!
The usage is fairly straightforward:
from dask.distributed import Client # -> Importing the dask client
from murphy import data_loader
client = Client() # -> feel free to modify this to point to your dask cluster!
data = data_loader.DataLoader(file_find_expression='../data/test_data/*.json.bz2') # -> Here, you can change `file_find_expression` to point to any other location where you store your twitter data!
twitter_dataframe = data.twitter_dataframe # -> this returns a dask dataframe that is lazily computed
This is what we see when we view `twitter_dataframe` in a Jupyter cell:
| | created_at | id | id_str | text | source | truncated | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | user | geo | coordinates | place | contributors | is_quote_status | quote_count | reply_count | retweet_count | favorite_count | entities | favorited | retweeted | filter_level | lang | timestamp_ms | user_names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=2 | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | object | int64 | object | object | object | bool | object | object | object | object | object | object | object | object | object | object | bool | int64 | int64 | int64 | int64 | object | bool | bool | object | object | object | object |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
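Since the dataframe is lazy, nothing is actually read from disk until you ask for a result with .compute(). For example, here's a quick sketch that counts tweets by language (this uses the `lang` column from the schema above and standard Dask DataFrame operations, not a Murphy-specific API):

tweets_per_language = twitter_dataframe['lang'].value_counts().compute() # -> triggers the actual file reads and computation
print(tweets_per_language.head()) # -> a regular pandas Series with the most common tweet languages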
This is what we get when we run `help(data_loader)`:
class DataLoader(builtins.object)
| DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
|
| Methods defined here:
|
| __init__(self, file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
| This is where you can specify how you want to configure the twitter dataset before you start processing it.
|
| :param file_find_expression: unix-like path that is used for listing out all of the files that we need
|
| :param remove_emoji: flag for removing emojis from all of the twitter text
|
| :param remove_retweets_symbols: flag for removing retweet strings from all of the twitter text (`RT @<retweet_username>:`)
|
| :param remove_truncated_tweets: flag for removing all tweets that are truncated, as not all information can be
| found in them
|
| :param add_usernames: flag for adding in the user names from who tweeted as a separate column instead of parsing
| it from the `user` column
|
| :param tokenize: tokenize tweets to make them easier to process
|
| :param filter_stopwords: remove stopwords from the tweets to make them easier to process
|
| :param lemmatize: lemmatize text to make it easier to process
|
| :param language: select the language that you want to work with
Here, we can see that the `DataLoader` class has tons of configurable parameters that we can use to make development easier, including built-in tokenization, lemmatization, and more!
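For example, if you want to keep the raw tweet text untouched, you can switch the preprocessing flags off when constructing the loader (a sketch based on the signature above; the path is just the test-data glob from earlier):

data = data_loader.DataLoader(
    file_find_expression='../data/test_data/*.json.bz2', # -> unix-like glob pointing at your data
    remove_emoji=False,     # -> keep emojis in the tweet text
    tokenize=False,         # -> skip tokenization
    filter_stopwords=False, # -> keep stopwords
    lemmatize=False,        # -> skip lemmatization
)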
After installation, tests can be run simply with:
pytest tests/
We're currently quite early in our development cycle and are looking for people to help us out! It can be something as simple as designing our logo, adding high-level documentation, creating a website, or any other idea that you have! Please contact us through the issues page if you have any ideas or would like to add any improvements to our project!