Persian News Search Engine: Information Retrieval Course Project (Fall 2022)

mahlashrifi/Information-Retrieval-System


Persian Search Engine

Developed for the Information Retrieval course taught by Dr. Ahmad Nikabadi in Fall 2022.

Development Phases

The project unfolded in two phases:

Phase 1

  • Normalization: Normalizing the text with the hazm library.

  • Removing Punctuation: Eliminating punctuation marks from the text.

  • Tokenization: Splitting the text into individual tokens or words.

  • Stemming: Reducing words to their root forms (e.g., "wants" and "wanted" to "want").

  • Stop Words Removal: Removing common words (e.g., "the", "and").
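
The Phase 1 steps can be sketched as a small pipeline. This is a minimal, standard-library-only illustration and not the project's actual code: the stop-word list, the suffix list, and every function name here are hypothetical stand-ins. In a real setup the hazm library would presumably take over these roles (it provides `Normalizer`, `word_tokenize`, `Stemmer`, and `stopwords_list`).

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"و", "در", "به", "از", "که", "را", "با", "این"}

# Hypothetical plural suffixes for a toy stemmer; a real stemmer covers far more.
SUFFIXES = ("های", "ها")

ZWNJ = "\u200c"  # zero-width non-joiner, common between a Persian stem and its suffix


def normalize(text):
    """Unify Arabic yeh/kaf with their Persian forms and collapse whitespace."""
    text = text.replace("ي", "ی").replace("ك", "ک")
    return re.sub(r"\s+", " ", text).strip()


def remove_punctuation(text):
    """Replace every Unicode punctuation character with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


def tokenize(text):
    """Split on whitespace (assumes punctuation was already removed)."""
    return text.split()


def stem(token):
    """Strip one known suffix, plus any ZWNJ left dangling at the end."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)].rstrip(ZWNJ)
    return token


def preprocess(text):
    """Apply the steps in the order listed above and return the final tokens."""
    text = remove_punctuation(normalize(text))
    tokens = [stem(t) for t in tokenize(text)]
    return [t for t in tokens if t not in STOP_WORDS]
```

Stop words are removed last here only because that matches the order of the list above; filtering before stemming is an equally common choice.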

Phase 2

  • TF-IDF Scoring: Weighting terms by Term Frequency-Inverse Document Frequency (TF-IDF) so that documents can be ranked against a query.

  • Cosine Similarity: Implementing cosine similarity to measure the similarity between query and document vectors.

  • Champion Lists: Precomputing, for each term, a champion list of its highest-weighted documents, so that only those candidates are scored at query time.
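
The three Phase 2 ideas fit together as in the sketch below. It is a minimal illustration under stated assumptions, not the project's implementation: the weighting scheme (logarithmic term frequency times `log10(N / df)`), the function names, and the parameters `r` and `k` are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps doc_id -> list of tokens. Returns (weights, norms, idf)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    idf = {t: math.log10(n / d) for t, d in df.items()}

    weights = defaultdict(dict)  # term -> {doc_id: tf-idf weight}
    norms = {}                   # doc_id -> Euclidean length of its vector
    for doc_id, tokens in docs.items():
        vec = {t: (1 + math.log10(c)) * idf[t] for t, c in Counter(tokens).items()}
        norms[doc_id] = math.sqrt(sum(w * w for w in vec.values()))
        for t, w in vec.items():
            weights[t][doc_id] = w
    return weights, norms, idf


def build_champion_lists(weights, r):
    """Keep, for each term, only the r documents with the highest weight."""
    return {
        t: set(sorted(postings, key=postings.get, reverse=True)[:r])
        for t, postings in weights.items()
    }


def search(query_tokens, weights, norms, idf, champions=None, k=10):
    """Rank documents by cosine similarity to the query; return the top k."""
    q_tf = Counter(t for t in query_tokens if t in idf)  # drop unknown terms
    q_vec = {t: (1 + math.log10(c)) * idf[t] for t, c in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    if q_norm == 0:
        return []

    scores = defaultdict(float)
    for t, q_w in q_vec.items():
        postings = weights[t]
        # With champion lists, only the preselected documents are scored.
        candidates = champions[t] if champions is not None else postings
        for doc_id in candidates:
            scores[doc_id] += q_w * postings[doc_id]

    ranked = [
        (doc_id, dot / (norms[doc_id] * q_norm))
        for doc_id, dot in scores.items()
        if norms[doc_id] > 0
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Passing `champions=build_champion_lists(weights, r)` to `search` trades some recall for speed: a document outside every query term's champion list is never scored, so `r` should be noticeably larger than `k`.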
