From ba252251e42a3d61a3aab620f72a692b5ea417cd Mon Sep 17 00:00:00 2001
From: Lauren Dyson
Date: Sun, 26 Feb 2017 14:35:02 -0600
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ce2bd15..1c1914d 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@ This project is using a [dataset published by Signal Media](http://research.sign
 
 From the raw article text, we generate the following features:
 
-1. Vectorized bigram Term Frequency-Inverse Document Frequency, with preprocessing to strip out named entities (people, places etc.) and replace them with anonymous placeholders (e.g. "Donald Trump" --> ""). We use Spacy for tokenization and entity recognition, and SkLearn for TFIDF vectorization.
+1. Vectorized bigram Term Frequency-Inverse Document Frequency, with preprocessing to strip out named entities (people, places etc.) and replace them with anonymous placeholders (e.g. "Donald Trump" --> "-PERSON-"). We use Spacy for tokenization and entity recognition, and SkLearn for TFIDF vectorization.
 2. Normalized frequency of parsed syntacical dependencies. Again, we use Spacy for parsing and SkLearn for vectorization. Here is an [excellent interactive visualization](https://demos.explosion.ai/displacy/) of Spacy's dependency parser.
 
 ## Pipeline