Developed for the Information Retrieval course taught by Dr. Ahmad Nikabadi in Fall 2022.
The project unfolded in two phases:
- Normalization: Normalizing the text with the `hazm` library.
- Removing Punctuation: Eliminating punctuation marks from the text.
- Tokenization: Splitting the text into individual tokens or words.
- Stemming: Reducing words to their root forms (e.g., "wants" and "wanted" to "want").
- Stop Words Removal: Removing common words (e.g., "the", "and").
- TF-IDF Scoring: Using Term Frequency-Inverse Document Frequency (TF-IDF) to rank documents.
- Cosine Similarity: Implementing cosine similarity to measure the similarity between query and document vectors.
- Champion Lists: Creating champion lists to speed up query processing.
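
The preprocessing steps listed above (normalization, punctuation removal, tokenization, stemming, stop-word removal) can be illustrated with a minimal sketch. This is not the project's code: the project relies on `hazm`, which targets Persian text, while the sketch below uses only the Python standard library with English stand-ins so that it runs without extra dependencies. The names `STOP_WORDS`, `SUFFIXES`, and `preprocess` are hypothetical, and the suffix-stripping stemmer is a deliberately naive placeholder.

```python
import re
import string

# Hypothetical stand-ins for illustration only; a real system would use
# a library-provided stop-word list and stemmer (the project uses hazm).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}
SUFFIXES = ("ing", "ed", "s")


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize(text: str) -> list[str]:
    """Split on whitespace."""
    return text.split()


def stem(token: str) -> str:
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Apply the steps in the order listed: normalize, strip punctuation,
    tokenize, stem, then drop stop words."""
    tokens = tokenize(remove_punctuation(normalize(text)))
    stems = [stem(t) for t in tokens]
    return [s for s in stems if s not in STOP_WORDS]
```

For example, `preprocess("She wants the book, and he wanted it!")` maps both "wants" and "wanted" to "want" and drops "the" and "and".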
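
The ranking steps listed above (TF-IDF scoring, cosine similarity, champion lists) can be sketched in the same spirit. The weighting scheme here, `(1 + log10(tf)) * log10(N / df)`, is one common TF-IDF variant and is an assumption, as are the function names `build_index`, `champion_lists`, and `search`; the source does not say which variant or data layout the project uses. Documents are assumed to be already preprocessed into token lists.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: dict mapping doc_id -> list of tokens.
    Returns (postings, idf, norms), where postings[term][doc_id] is the
    TF-IDF weight and norms[doc_id] is the document vector length."""
    n_docs = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log10(n_docs / d) for term, d in df.items()}

    postings = defaultdict(dict)
    norms = {}
    for doc_id, counts in tf.items():
        squared = 0.0
        for term, freq in counts.items():
            weight = (1 + math.log10(freq)) * idf[term]
            postings[term][doc_id] = weight
            squared += weight * weight
        norms[doc_id] = math.sqrt(squared)
    return postings, idf, norms


def champion_lists(postings, r):
    """Keep only the r highest-weighted documents for each term."""
    return {
        term: dict(sorted(plist.items(), key=lambda kv: kv[1], reverse=True)[:r])
        for term, plist in postings.items()
    }


def search(query_tokens, postings, idf, norms, k=10):
    """Rank documents by cosine similarity to the query vector.
    Pass champion lists as `postings` to score fewer candidates."""
    q_counts = Counter(t for t in query_tokens if t in idf)
    q_weights = {t: (1 + math.log10(f)) * idf[t] for t, f in q_counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    scores = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in postings.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight  # dot-product accumulation

    ranked = [
        (doc_id, score / (q_norm * norms[doc_id]))
        for doc_id, score in scores.items()
        if q_norm > 0 and norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:k]
```

Passing the output of `champion_lists(postings, r)` to `search` in place of the full postings trades some recall for speed: only the top `r` documents per query term are scored, so a relevant document outside every champion list is never considered.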

- Mahla Sharifi