Developed for the Information Retrieval course taught by Dr. Ahmad Nikabadi in Fall 2022.
The project unfolded in two phases:
Normalization: Using the library, the text is normalized.
Removing Punctuation: Eliminating punctuation marks from the text.
Tokenization: Splitting the text into individual tokens or words.
Stemming: Reducing words to their root forms (e.g., "wants" and "wanted" to "want").
Stop Words Removal: Removing common words (e.g., "the", "and").
TF-IDF Scoring: Using Term Frequency-Inverse Document Frequency (TF-IDF) to rank documents.
Cosine Similarity: Implementing cosine similarity to measure the similarity between query and document vectors.
Champion Lists: Creating champion lists to speed up query processing.
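The steps listed above can be sketched end to end in a few lines of Python. This is only an illustrative sketch, not the project's actual code: every function name (`naive_stem`, `preprocess`, `build_index`, `cosine`, `champion_lists`, `search`) is hypothetical, the stop-word list is a toy one, the stemmer is a crude suffix stripper standing in for a real one, and the weighting (log-scaled term frequency times log10 inverse document frequency) is one common TF-IDF variant assumed here for illustration.

```python
import math
import string
from collections import Counter, defaultdict

# Toy stop-word list (an assumption; the project's actual list is not shown).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, remove punctuation, tokenize, drop stop words, stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(t) for t in text.split() if t not in STOP_WORDS]


def build_index(docs):
    """Return (TF-IDF vector per document, IDF per term) for {doc_id: text}."""
    counts = {doc_id: Counter(preprocess(text)) for doc_id, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log10(len(docs) / d) for term, d in df.items()}
    weights = {
        doc_id: {t: (1 + math.log10(n)) * idf[t] for t, n in c.items()}
        for doc_id, c in counts.items()
    }
    return weights, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def champion_lists(weights, r):
    """For each term, keep only the r documents with the highest weight."""
    postings = defaultdict(list)
    for doc_id, vec in weights.items():
        for term, w in vec.items():
            postings[term].append((w, doc_id))
    return {
        term: [d for _, d in sorted(p, key=lambda x: x[0], reverse=True)[:r]]
        for term, p in postings.items()
    }


def search(query, weights, idf, champions, k=3):
    """Score only documents found in the champion lists of the query terms."""
    q_counts = Counter(t for t in preprocess(query) if t in idf)
    q_vec = {t: (1 + math.log10(n)) * idf[t] for t, n in q_counts.items()}
    candidates = {d for t in q_vec for d in champions.get(t, [])}
    ranked = sorted(candidates, key=lambda d: cosine(q_vec, weights[d]), reverse=True)
    return ranked[:k]


docs = {
    "d1": "The cat wanted fish.",
    "d2": "Dogs and cats play.",
    "d3": "Stock markets fell.",
}
weights, idf = build_index(docs)
champions = champion_lists(weights, r=2)
print(search("cat wants fish", weights, idf, champions))  # ['d1', 'd2']
```

Because `search` scores only the documents that appear in the champion lists of the query terms, it avoids computing cosine similarity against the whole collection, at the cost of possibly missing a relevant document that falls outside the top `r` for every query term.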
Mahla Sharifi