Latest commit

63b7450 · Apr 20, 2021

Bag of words model

Package/Script Name

--> Package installed - NLTK

  • NLTK stands for 'Natural Language Toolkit'. It provides common text-processing algorithms such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.
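As a small illustration of tokenizing with NLTK, the sketch below uses `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data downloads. The sample sentence is hypothetical, and the original script may use a different NLTK tokenizer.

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, chosen only for illustration.
sentence = "NLTK splits text into tokens."

# wordpunct_tokenize separates runs of word characters from runs of punctuation.
tokens = wordpunct_tokenize(sentence)
print(tokens)  # ['NLTK', 'splits', 'text', 'into', 'tokens', '.']
```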

--> Pandas

  • pandas is a library for storing, analyzing, and processing data in a row-and-column (tabular) representation.

--> from sklearn.feature_extraction.text import CountVectorizer

  • Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of term/token counts, with one row per document. It also supports pre-processing of the text before the vector representation is generated, which makes it a flexible feature-extraction module for text.
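A minimal sketch of what CountVectorizer produces, assuming default settings. The two sentences are hypothetical and are not taken from the script.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences, not taken from the original script.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
counts = vectorizer.fit_transform(sentences)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense_counts = counts.toarray()

print(vocabulary)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(dense_counts)  # one row per sentence, one column per word
```

Each row is one sentence and each column is one vocabulary word, so "the" contributes a count of 2 in both rows.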

Setup instructions

  1. Input the sentences you would like to vectorize.
  2. The script will tokenize the sentences.
  3. It transforms the text into vectors in which each word is a feature and its count is the value.
  4. The bag-of-words model is then ready.
  5. The script creates a DataFrame, which is analogous to an Excel spreadsheet.
  6. Open 'bowp.xlsx' in Excel and check the sheet named 'data', where the DataFrame is stored.
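The steps above can be sketched roughly as follows. This is not the script itself: the function names `build_bow_frame` and `save_bow_frame` and the input sentences are hypothetical, and here CountVectorizer does the tokenizing with its default settings rather than NLTK.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def build_bow_frame(sentences):
    """Return a DataFrame with one row per sentence and one column per word."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(sentences)
    # vocabulary_ maps each word to its column index; sort the words by that index.
    columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return pd.DataFrame(counts.toarray(), columns=columns)


def save_bow_frame(frame, path="bowp.xlsx", sheet_name="data"):
    """Write the frame to an Excel sheet (needs an Excel engine such as openpyxl)."""
    frame.to_excel(path, sheet_name=sheet_name, index=False)


# Hypothetical input sentences.
bow = build_bow_frame(["I like tea", "I like coffee and tea"])
print(bow)
```

Calling `save_bow_frame(bow)` would then write the table to 'bowp.xlsx' in the sheet 'data'. Note that with the default token pattern, single-character words such as "I" are dropped from the vocabulary.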

Output

Image

Author(s)

Disclaimers, if any

There are no disclaimers for this script.