Murpheus is a powerful tool for analyzing large volumes of Twitter data from the Internet Archive. It is built to scale from laptop hardware to massive supercomputer clusters.
This project's main goal is to lower the barrier to entry for data scientists, researchers, and anyone else who wants to analyze and study Twitter datasets.
With this in mind, we want to build a powerful toolset that scales massively and works right out of the box with minimal setup (but if you do need any help, hop on over to our issues page!).
The main framework that we use is Dask, a powerful library that provides advanced parallelism for analytics, enabling performance at scale for the tools you love[1].
Before installing, ensure that you're running Python >= 3.6. We intend to support all of the latest releases of Python as they come out!
Installing with pip is as easy as:
pip install smpa-murphy
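Once that finishes, a quick import is an easy sanity check (a minimal sketch; it simply verifies that the `murphy` package used in the usage example below is importable):

python -c "from murphy import data_loader" # -> exits silently if the install worked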
Installing the latest development version from source is also quite simple. To do so, run the following commands:
git clone https://github.com/Social-Media-Public-Analysis/murphy.git # -> clone the repo
cd murphy # -> move over to the repo
python setup.py install # -> install the library directly!
Annddd you're done!
The usage is fairly straightforward:
from dask.distributed import Client # -> Importing the dask client
from murphy import data_loader
client = Client() # -> feel free to modify this to point to your dask cluster!
data = data_loader.DataLoader(file_find_expression='../data/test_data/*.json.bz2') # -> Here, you can change `file_find_expression` to point to any other location where you store your twitter data!
twitter_dataframe = data.twitter_dataframe # -> this returns a dask dataframe that is lazily computed
This is what we see when we view `twitter_dataframe` in a Jupyter cell:
| | created_at | id | id_str | text | source | truncated | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | user | geo | coordinates | place | contributors | is_quote_status | quote_count | reply_count | retweet_count | favorite_count | entities | favorited | retweeted | filter_level | lang | timestamp_ms | user_names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=2 | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | object | int64 | object | object | object | bool | object | object | object | object | object | object | object | object | object | object | bool | int64 | int64 | int64 | int64 | object | bool | bool | object | object | object | object |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
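Since the dataframe is lazy, nothing is actually read from disk until you ask for a result with .compute(). For example, here's a quick sketch that counts tweets by language (this uses the `lang` column from the schema above and standard Dask DataFrame operations, not a Murphy-specific API):

tweets_per_language = twitter_dataframe['lang'].value_counts().compute() # -> triggers the actual file reads and computation
print(tweets_per_language.head()) # -> a regular pandas Series with the most common tweet languages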
This is what we get when we run `help(data_loader)`:
class DataLoader(builtins.object)
| DataLoader(file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
|
| Methods defined here:
|
| __init__(self, file_find_expression: Union[str, pathlib.Path, List[pathlib.Path]], remove_emoji: bool = True, remove_retweets_symbols: bool = True, remove_truncated_tweets: bool = True, add_usernames: bool = True, tokenize: bool = True, filter_stopwords: bool = True, lemmatize: bool = True, language: str = 'english')
| This is where you can specify how you want to configure the twitter dataset before you start processing it.
|
| :param file_find_expression: unix-like path that is used for listing out all of the files that we need
|
| :param remove_emoji: flag for removing emojis from all of the twitter text
|
| :param remove_retweets_symbols: flag for removing retweet strings from all of the twitter text (`RT @<retweet_username>:`)
|
| :param remove_truncated_tweets: flag for removing all tweets that are truncated, as not all information can be
| found in them
|
| :param add_usernames: flag for adding in the user names from who tweeted as a separate column instead of parsing
| it from the `user` column
|
| :param tokenize: tokenize tweets to make them easier to process
|
| :param filter_stopwords: remove stopwords from the tweets to make them easier to process
|
| :param lemmatize: lemmatize text to make it easier to process
|
| :param language: select the language that you want to work with
Here, we can see that the `DataLoader` class has tons of configurable parameters that we can use to make development easier, including built-in tokenization, lemmatization, and more!
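For example, if you want to keep the raw tweet text untouched, you can switch the preprocessing flags off when constructing the loader (a sketch based on the signature above; the path is just the test-data glob from earlier):

data = data_loader.DataLoader(
    file_find_expression='../data/test_data/*.json.bz2', # -> unix-like glob pointing at your data
    remove_emoji=False,     # -> keep emojis in the tweet text
    tokenize=False,         # -> skip tokenization
    filter_stopwords=False, # -> keep stopwords
    lemmatize=False,        # -> skip lemmatization
)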
After installation, tests can be run simply with:
pytest tests/
We're currently quite early in our development cycle and are looking for people to help us out! It can be something as simple as designing our logo, adding high-level documentation, creating a website, or any other idea that you have! Please contact us through the issues page if you have any ideas or would like to add any improvements to our project!