Processed data

These are processed data sets for feeding into my seq2seq model. Each is based on a different subset of ~2.5 lines of Twitter feed data, with different NLP preprocessing filters applied.

The preprocessing pipeline is:

Strip non-alphanumeric characters (except question marks and periods)
Convert to lower case
Stem using the Porter stemmer
Count word frequencies and sequence lengths
Keep only sequences whose length x satisfies seq_min < x < seq_max
Remove all sequences containing rare tokens that occur fewer than n_min times
Assign <UNK> to all rare tokens whose count satisfies n_min < count < n_threshold, where n_threshold is determined heuristically
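
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code that produced these files: the function names `clean` and `build_dataset` are hypothetical, the tokenization (splitting `?` and `.` into their own tokens) is an assumption, and the stemmer is passed in as a callable so the sketch runs without extra dependencies (for example, NLTK's `PorterStemmer().stem` could be supplied). Treating a token whose count equals n_min as `<UNK>` is also an assumption, since the text leaves that boundary open.

```python
import re
from collections import Counter

UNK = "<UNK>"


def clean(text, stem=lambda w: w):
    """Steps 1-3: strip characters, lower-case, tokenize and stem."""
    # Drop everything except letters, digits, whitespace, '?' and '.'.
    text = re.sub(r"[^A-Za-z0-9\s?.]", "", text).lower()
    # Assumption: '?' and '.' become separate tokens.
    tokens = re.findall(r"[a-z0-9]+|[?.]", text)
    return [stem(tok) for tok in tokens]


def build_dataset(lines, seq_min, seq_max, n_min, n_threshold,
                  stem=lambda w: w):
    """Steps 4-7: count tokens, filter sequences, map rare tokens to <UNK>."""
    seqs = [clean(line, stem) for line in lines]

    # Step 4: word frequencies over all cleaned sequences.
    counts = Counter(tok for seq in seqs for tok in seq)

    kept = []
    for seq in seqs:
        # Step 5: keep only seq_min < length < seq_max.
        if not (seq_min < len(seq) < seq_max):
            continue
        # Step 6: drop sequences containing a token seen fewer than n_min times.
        if any(counts[tok] < n_min for tok in seq):
            continue
        # Step 7: remaining tokens below n_threshold become <UNK>
        # (assumption: a count equal to n_min is included here).
        kept.append([tok if counts[tok] >= n_threshold else UNK
                     for tok in seq])
    return kept, counts


sample = ["a b c", "a b c", "a b d", "a b", "a b d e"]
sequences, counts = build_dataset(sample, seq_min=2, seq_max=5,
                                  n_min=2, n_threshold=3)
```

Passing the stemmer as an argument keeps the filtering logic independent of any particular stemming library.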

README.md
processed_data.rar
processed_data_1764604.rar
processed_data_v02_twitter_py35_seq_length_4_15_sample_134241_full.pkl
processed_data_v02_twitter_py35_seq_length_4_20_sample_229828_full.pkl
word_dict_v02_twitter_py35_seq_length_4_15_sample_134241_full.pkl
word_dict_v02_twitter_py35_seq_length_4_20_sample_229828_full.pkl
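
The `.pkl` extension suggests these are Python pickle files, and `py35` in the names suggests they were written under Python 3.5; both are assumptions based on the file names alone. Under those assumptions, a file could be opened as sketched below. The helper name `load_pickle` is hypothetical, and the structure of the loaded object is not documented here, so inspect it before use. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
import pickle


def load_pickle(path):
    """Load one pickle file and return whatever object it contains."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Example usage (file name taken from the listing above):
# data = load_pickle(
#     "processed_data_v02_twitter_py35_seq_length_4_15_sample_134241_full.pkl")
# print(type(data))
```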
