Various unique "real-world" datasets curated specifically for deep learning purposes. The files are provided in .msgpack format and can also be used separately from Autonomio, for example with Pandas:
import pandas as pd

# NOTE: read_msgpack was deprecated in pandas 0.25 and removed in 1.0,
# so this call requires pandas < 1.0.
pd.read_msgpack('https://github.com/autonomio/datasets/raw/master/autonomio-datasets/election_in_twitter')
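On a modern pandas installation (1.0 or later), `read_msgpack` no longer exists. The files here were written with pandas' own msgpack serializer, so the most reliable route is still pandas < 1.0; as a rough sketch of the alternative, the generic `msgpack` package can unpack simple record-style payloads into a DataFrame. The helper name below and the assumption that a payload unpacks to a list of records are illustrative, not part of the Autonomio API:

```python
import msgpack
import pandas as pd

def records_to_frame(raw_bytes):
    """Unpack msgpack bytes into a DataFrame.

    Hypothetical helper: assumes the payload decodes to a list of
    dicts (plain records), which is NOT guaranteed for files written
    with pandas' own msgpack serializer.
    """
    records = msgpack.unpackb(raw_bytes, raw=False)
    return pd.DataFrame(records)

# Self-contained demo: round-trip a small table through msgpack.
payload = msgpack.packb([{"tweet": "example", "sentiment": 0.5}])
df = records_to_frame(payload)
```

For the hosted datasets themselves, pinning `pandas<1.0` in a throwaway environment and re-saving to a current format (e.g. CSV or Parquet) is the simpler path.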
Dataset consisting of 10-minute samples of 80 million tweets from the beginning of November 2016 to the end of December 2016. The keywords used to capture the tweets were 'Trump' and 'Hillary'.
Dataset with tweet text classified for sentiment using NLTK Vader, including word2vec word vectors for each tweet generated with spaCy.
4,000 sites with word vectors and 5 categories.
Data from both the buy side and the sell side, plus over 10 other sources.
9 years of monthly poll and unemployment numbers.
20,000 tweets with various data columns related to tweet quality, including whether or not the tweet is from a bot.
The training dataset provided as part of the hugely popular Kaggle Titanic survival prediction challenge.
20,000 sites with word vectors based on the landing page content.