These are processed data sets for feeding into my seq2seq model. Each is based on a different subset of ~2.5 lines of Twitter feed data, with different NLP preprocessing filters applied.
The preprocessing pipeline is:
- Strip non-alphanumeric characters (except question marks and periods)
- Lower case
- Stem using Porter stemmer
- Count word frequencies and sequence lengths
- Keep only sequences whose length x satisfies seqmin < x < seqmax
- Remove all sequences containing rare tokens that occur fewer than nmin times
- Assign <UNK> to all rare tokens whose count satisfies nmin < count < nthreshold, where nthreshold is determined heuristically.
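
The steps above can be sketched roughly as follows. This is only an illustration, not the exact script used to build these data sets: the function names `clean` and `preprocess`, the choice to split `?` and `.` into their own tokens, the ASCII-only character filter, and the handling of a token whose count equals exactly `nmin` (kept and mapped to `<UNK>`) are all assumptions. The stemmer is passed in as a function and defaults to a no-op so the sketch runs without extra packages.

```python
import re
from collections import Counter

UNK = "<UNK>"


def clean(line):
    """Strip characters other than ASCII letters, digits, '?', '.', then lowercase."""
    line = re.sub(r"[^A-Za-z0-9?.\s]+", " ", line).lower()
    # Assumption: '?' and '.' become separate tokens, so "ok?" -> ["ok", "?"].
    line = re.sub(r"([?.])", r" \1 ", line)
    return line.split()


def preprocess(lines, seqmin, seqmax, nmin, nthreshold, stem=lambda w: w):
    """Apply the filtering steps in the order listed above.

    `stem` is any word -> word function; the default leaves tokens unchanged.
    """
    # Strip, lowercase, and stem every line into a token sequence.
    seqs = [[stem(tok) for tok in clean(line)] for line in lines]

    # Count token frequencies over the full, unfiltered data.
    counts = Counter(tok for seq in seqs for tok in seq)

    # Keep sequences whose length lies strictly between seqmin and seqmax.
    seqs = [s for s in seqs if seqmin < len(s) < seqmax]

    # Drop any sequence containing a token seen fewer than nmin times.
    seqs = [s for s in seqs if all(counts[t] >= nmin for t in s)]

    # Map the remaining rare tokens (count below nthreshold) to <UNK>.
    return [[UNK if counts[t] < nthreshold else t for t in s] for s in seqs]
```

If NLTK is installed, passing `stem=nltk.stem.PorterStemmer().stem` would apply Porter stemming at the stemming step. Frequencies here are counted once, before any filtering; recounting after the length filter would be an equally plausible reading of the list.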