
Murphy


Murpheus is a powerful analysis tool for studying large volumes of Twitter data from the Internet Archive. It is built to scale from laptop hardware to massive supercomputer clusters.

Motivation

This project's main goal is to lower the barrier to entry for data scientists, researchers, and anyone else who wants to analyze and study Twitter datasets.

With this in mind, we want to build a powerful toolset that scales massively and works right out of the box with minimal setup (but if you do need any help, hop on over to our issues page!).

Frameworks used:

The main framework we use is Dask, a powerful library that provides advanced parallelism for analytics, enabling performance at scale for the tools you love [1].
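To give a feel for Dask's lazy, parallel model (this is plain Dask, not murphy code), here is a minimal sketch assuming `dask` is installed:

```python
import dask.bag as db

# Build a lazy bag of numbers split across two partitions.
bag = db.from_sequence(range(10), npartitions=2)

# map/sum only build a task graph; nothing runs yet.
squares_sum = bag.map(lambda x: x * x).sum()

# .compute() executes the graph in parallel and returns the result.
print(squares_sum.compute())  # 285
```

The same deferred-then-compute pattern is what murphy builds on, just with tweet data instead of toy numbers.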

Installation:

Before installing, ensure that you're running Python >= 3.6. We intend to support all of the latest releases of Python as they come out!

Installing with PIP:

Installing with pip is as easy as:

pip install smpa-murphy

Install the latest version:

Installing the latest version is also quite simple. To do so, run the following commands:

git clone https://github.com/Social-Media-Public-Analysis/murphy.git # -> clone the repo
cd murphy                                                            # -> move over to the repo
python setup.py install                                              # -> install the library directly!

Annddd you're done!

Usage:

The usage is fairly straightforward:

from dask.distributed import Client   # -> Importing the dask client
from murphy import data_loader

client = Client()                     # -> feel free to modify this to point to your dask cluster!


data = data_loader.DataLoader(file_find_expression='../data/test_data/*.json.bz2') # -> Here, you can change `file_find_expression` to point to any other location where you store your twitter data!


twitter_dataframe = data.twitter_dataframe # -> this returns a Dask DataFrame that is lazily computed

This is what we see when we view twitter_dataframe in a Jupyter cell:

Out[13]:
Dask DataFrame Structure:
created_at id id_str text source truncated in_reply_to_status_id in_reply_to_status_id_str in_reply_to_user_id in_reply_to_user_id_str in_reply_to_screen_name user geo coordinates place contributors is_quote_status quote_count reply_count retweet_count favorite_count entities favorited retweeted filter_level lang timestamp_ms user_names
npartitions=2
object int64 object object object bool object object object object object object object object object object bool int64 int64 int64 int64 object bool bool object object object object
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: assign, 60 tasks
. . .

Having a deeper look at data_loader.DataLoader

This is what we get when we run help(data_loader.DataLoader):

class DataLoader(builtins.object)
 |  DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
 |  
 |  Methods defined here:
 |  
 |  __init__(self, file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
 |      This is where you can specify how you want to configure the twitter dataset before you start processing it.
 |      
 |      :param file_find_expression: unix-like path that is used for listing out all of the files that we need
 |      
 |      :param remove_emoji: flag for removing emojis from all of the twitter text
 |      
 |      :param remove_retweets_symbols: flag for removing retweet strings from all of the twitter text (`RT @<retweet_username>:`)
 |      
 |      :param remove_truncated_tweets: flag for removing all tweets that are truncated, as not all information can be
 |                                      found in them
 |      
 |      :param add_usernames: flag for adding in the user names from who tweeted as a separate column instead of parsing
 |                            it from the `user` column
 |      
 |      :param tokenize: tokenize tweets to make them easier to process
 |      
 |      :param filter_stopwords: remove stopwords from the tweets to make them easier to process
 |      
 |      :param lemmatize: lemmatize text to make it easier to process
 |      
 |      :param language: select the language that you want to work with

Here, we can see that the DataLoader class has tons of configurable parameters that we can use to make development easier, including built-in tokenization, lemmatization, and more!
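To give a feel for what flags like `tokenize` and `filter_stopwords` do conceptually, here is a deliberately naive stand-in (murphy's actual preprocessing is more sophisticated; the stopword list and `preprocess` helper below are made up purely for illustration):

```python
# Tiny stand-in stopword list; murphy uses full language-specific lists.
STOPWORDS = {"the", "a", "an", "is", "to"}

def preprocess(text):
    """Naive tokenization + stopword filtering, for illustration only."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The quick fox is going to the river"))
# ['quick', 'fox', 'going', 'river']
```

With `DataLoader`, the equivalent behavior is toggled through the constructor flags rather than written by hand.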

Testing:

Tests can be run after installation simply by running:

pytest tests/

Call for Contributions

We're currently quite early in our development cycle, and are looking for people to help us out! It can be something as simple as designing our logo, adding high-level documentation, creating a website, or whatever other ideas you have! Please contact us through the issues page if you have any ideas or would like to add any improvements to our projects!
