Persian Search Engine

Developed for the Information Retrieval course taught by Dr. Ahmad Nikabadi in Fall 2022.

Development Phases

The project unfolded in two phases: text preprocessing (the first group of steps below) and ranked retrieval (the second group).

Normalization: Normalizing the text with the hazm library.
Removing Punctuation: Eliminating punctuation marks from the text.
Tokenization: Splitting the text into individual tokens or words.
Stemming: Reducing words to their root forms (e.g., "wants" and "wanted" to "want").
Stop Words Removal: Removing very common words that carry little value for retrieval (e.g., "the", "and").
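
The preprocessing steps above can be sketched as a small pipeline. This is a minimal illustration rather than the notebook's actual code: the project uses hazm for normalization, while the sketch below relies only on the Python standard library so it runs anywhere. The stop-word set, the suffix list, and all function names are hypothetical stand-ins, and the example uses English text to match the examples in the list.

```python
import re
import unicodedata

# Hypothetical stand-ins for illustration only; a real Persian pipeline
# would use a proper stop-word list and stemmer (for example from hazm).
STOP_WORDS = {"the", "and", "a", "to"}
SUFFIXES = ("ed", "s")


def normalize(text):
    # Unicode NFKC normalization, lowercasing, and whitespace collapsing.
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def remove_punctuation(text):
    # Replace every character that is neither a word character nor
    # whitespace with a space. Note: this also drops the zero-width
    # non-joiner used in Persian, which a real pipeline would handle
    # deliberately.
    return re.sub(r"[^\w\s]", " ", text)


def tokenize(text):
    # Split on runs of whitespace.
    return text.split()


def stem(token):
    # Toy suffix stripping: remove the first matching suffix if enough
    # of the word remains.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Apply the steps in the order listed above.
    tokens = tokenize(remove_punctuation(normalize(text)))
    stemmed = [stem(token) for token in tokens]
    return [token for token in stemmed if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("She wants the book, and he wanted it!"))
```

Keeping each step as its own function makes it easy to swap a stand-in for its library counterpart without touching the rest of the pipeline.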

TF-IDF Scoring: Using Term Frequency-Inverse Document Frequency (TF-IDF) to rank documents.
Cosine Similarity: Implementing cosine similarity to measure the similarity between query and document vectors.
Champion Lists: Creating champion lists to speed up query processing.
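
The three ranking components above fit together roughly as follows. This is a sketch under stated assumptions, not the notebook's implementation: it assumes the common logarithmic weighting, (1 + log10 tf) * log10(N / df), and the function names, the champion-list size `r`, and the result count `k` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_weights(docs):
    # docs maps doc_id -> list of preprocessed tokens.
    n_docs = len(docs)
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)
    idf = {term: math.log10(n_docs / df) for term, df in doc_freq.items()}
    weights = {
        doc_id: {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
        for doc_id, tf in term_freqs.items()
    }
    return weights, idf


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_champion_lists(weights, r):
    # For each term, keep only the r documents with the highest weight.
    postings = defaultdict(list)
    for doc_id, vector in weights.items():
        for term, weight in vector.items():
            postings[term].append((weight, doc_id))
    return {
        term: [d for _, d in sorted(entries, key=lambda e: (-e[0], e[1]))[:r]]
        for term, entries in postings.items()
    }


def search(query_tokens, weights, idf, champions, k):
    # Weight the query like a document, then score only the documents
    # found in the champion lists of the query terms.
    q_tf = Counter(t for t in query_tokens if t in idf)
    q_vec = {t: (1 + math.log10(c)) * idf[t] for t, c in q_tf.items()}
    candidates = {d for t in q_vec for d in champions.get(t, [])}
    scored = [(cosine(q_vec, weights[d]), d) for d in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:k]


if __name__ == "__main__":
    docs = {
        "d1": ["cat", "cat", "dog"],
        "d2": ["dog", "bird"],
        "d3": ["bird", "bird", "fish"],
    }
    weights, idf = build_weights(docs)
    champions = build_champion_lists(weights, r=2)
    print(search(["cat"], weights, idf, champions, k=2))
```

Champion lists trade some recall for speed: a document outside the top `r` for every query term is never scored, so `r` is usually chosen larger than the number of results requested.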
