search-engine

A mini search engine that searches the web and displays relevant results for a given search term. Web pages are crawled, and an inverted index is built and stored; the index is then used to fetch results at search time.

Design HLD

The search engine has a microservice architecture made up of three microservices:

Crawler Service
Indexer Service
Search Service

Microservices & their flow

Crawler Service
1. Picks up URLs to be crawled from SQS
2. Downloads the HTML content and stores it in S3
3. Sends a message to SQS with the crawled page's metadata
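
The three crawler steps above can be sketched as a single loop iteration. This is a minimal sketch, not the repository's code: the in-memory queues and map are assumed stand-ins for the two SQS queues and the S3 bucket, and `CrawlerSketch`, `PageMetadata`, `crawlNext`, and the `pages/` key scheme are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Sketch of one crawler iteration. The queues and the store are in-memory
// stand-ins (assumptions) for the two SQS queues and the S3 bucket.
final class CrawlerSketch {

    // Hypothetical shape of the crawled-page metadata message.
    static final class PageMetadata {
        final String url;
        final String storageKey;
        final int contentLength;

        PageMetadata(String url, String storageKey, int contentLength) {
            this.url = url;
            this.storageKey = storageKey;
            this.contentLength = contentLength;
        }
    }

    private final Queue<String> urlQueue;             // stand-in for the URL queue
    private final Queue<PageMetadata> metadataQueue;  // stand-in for the metadata queue
    private final Map<String, String> objectStore;    // stand-in for the S3 bucket
    private final Function<String, String> fetcher;   // returns the HTML for a URL

    CrawlerSketch(Queue<String> urlQueue, Queue<PageMetadata> metadataQueue,
                  Map<String, String> objectStore, Function<String, String> fetcher) {
        this.urlQueue = urlQueue;
        this.metadataQueue = metadataQueue;
        this.objectStore = objectStore;
        this.fetcher = fetcher;
    }

    // Runs steps 1-3 for a single URL; returns false if no URL was waiting.
    boolean crawlNext() {
        String url = urlQueue.poll();                 // 1. pick up a URL
        if (url == null) {
            return false;
        }
        String html = fetcher.apply(url);             // 2. download the HTML...
        String key = "pages/" + Integer.toHexString(url.hashCode()) + ".html";
        objectStore.put(key, html);                   //    ...and store it under a key
        metadataQueue.add(new PageMetadata(url, key, html.length())); // 3. publish metadata
        return true;
    }

    public static void main(String[] args) {
        Queue<String> urls = new ArrayDeque<>();
        Queue<PageMetadata> metadata = new ArrayDeque<>();
        Map<String, String> store = new HashMap<>();
        urls.add("https://example.com/");
        // Stubbed fetcher so the sketch runs without network access.
        CrawlerSketch crawler = new CrawlerSketch(urls, metadata, store,
                url -> "<html><title>Example</title></html>");
        crawler.crawlNext();
        PageMetadata message = metadata.peek();
        System.out.println(message.url + " stored as " + message.storageKey);
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline; a real service would presumably plug in an HTTP client and the AWS SDK clients at the same three points.
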
Indexer Service
1. Sends URLs to be crawled to the URLs SQS queue
2. Reads SQS messages from the HTML Metadata Queue and starts the orchestration
3. Downloads the HTML file from S3
4. Parses the HTML content
5. Updates URL metadata and state in RDS, then sends relevant child URLs for further crawling
6. Creates the inverted index after stemming the HTML content and stores it in the database
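
Steps 4 to 6 above (parsing, child-URL extraction, and building the inverted index) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the regex-based parsing, the `IndexerSketch` class, and especially `stem`, which is a naive suffix stripper standing in for whatever stemmer the service actually uses. The index maps each stemmed term to per-URL occurrence counts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// In-memory sketch of parsing a page and adding it to an inverted index.
final class IndexerSketch {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // stemmed term -> (url -> number of occurrences on that page)
    private final Map<String, Map<String, Integer>> index = new TreeMap<>();

    // Naive suffix stripping; a placeholder (assumption) for a real stemmer.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Collects absolute http(s) links, de-duplicated, in document order.
    static List<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return new ArrayList<>(links);
    }

    // Strips tags, tokenizes, stems, and counts each term for the given URL.
    void indexPage(String url, String html) {
        String text = TAG.matcher(html).replaceAll(" ").toLowerCase(Locale.ROOT);
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(stem(token), k -> new TreeMap<>())
                 .merge(url, 1, Integer::sum);
        }
    }

    // Looks up a term after applying the same normalization used at index time.
    Map<String, Integer> postings(String term) {
        Map<String, Integer> found = index.get(stem(term.toLowerCase(Locale.ROOT)));
        return found == null ? Collections.emptyMap() : found;
    }

    public static void main(String[] args) {
        IndexerSketch indexer = new IndexerSketch();
        String html = "<html><title>Crawling</title><body>Crawlers crawl pages. "
                + "<a href=\"https://example.com/b\">next page</a></body></html>";
        indexer.indexPage("https://example.com/a", html);
        System.out.println(indexer.postings("crawling"));
        System.out.println(extractLinks(html));
    }
}
```

Stemming the query term with the same function used at index time is what lets a search for "crawling" match a page that only contains "crawl".
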
Search Service
1. UI-facing service that exposes REST APIs
2. Given a search query, parses the query and looks up the search terms in the inverted index
3. Fetches the URL metadata of all the relevant URLs
4. Ranks the results with a separate ranking algorithm based on metrics such as term count, page rank, and isPresentInTitle
5. Displays the top results
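
The ranking step above combines several signals into one score. The sketch below shows one way such a combination could look; it is not the service's actual algorithm. The weights, the log damping of the term count, and the names `RankingSketch`, `Candidate`, `score`, and `topResults` are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of scoring candidates on term count, page rank, and title presence.
final class RankingSketch {

    // Hypothetical per-URL signals gathered from the index and URL metadata.
    static final class Candidate {
        final String url;
        final int termCount;
        final double pageRank;
        final boolean isPresentInTitle;

        Candidate(String url, int termCount, double pageRank, boolean isPresentInTitle) {
            this.url = url;
            this.termCount = termCount;
            this.pageRank = pageRank;
            this.isPresentInTitle = isPresentInTitle;
        }
    }

    // Assumed weights; the real service may weigh these signals differently.
    private static final double COUNT_WEIGHT = 1.0;
    private static final double PAGE_RANK_WEIGHT = 2.0;
    private static final double TITLE_BONUS = 1.5;

    static double score(Candidate c) {
        // log1p dampens the count so keyword-stuffed pages do not dominate.
        return COUNT_WEIGHT * Math.log1p(c.termCount)
                + PAGE_RANK_WEIGHT * c.pageRank
                + (c.isPresentInTitle ? TITLE_BONUS : 0.0);
    }

    // Returns the URLs of the best-scoring candidates, highest score first.
    static List<String> topResults(List<Candidate> candidates, int limit) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        Comparator<Candidate> byScore = Comparator.comparingDouble(RankingSketch::score);
        sorted.sort(byScore.reversed());
        List<String> urls = new ArrayList<>();
        for (Candidate c : sorted.subList(0, Math.min(limit, sorted.size()))) {
            urls.add(c.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("https://example.com/a", 10, 0.10, false),
                new Candidate("https://example.com/b", 3, 0.20, true),
                new Candidate("https://example.com/c", 1, 0.05, false));
        System.out.println(topResults(candidates, 2));
    }
}
```

With these assumed weights, a page with fewer matches but a title hit and higher page rank can outrank a page with more raw matches, which is the kind of trade-off a multi-signal ranker is meant to express.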

Prerequisites to run

AWS set up locally via the AWS CLI
Two SQS queues, required by the Crawler and Indexer Services
An S3 bucket, configured in both the Crawler and Indexer Services
Local RDS and MongoDB setup. Their config needs to be updated in the Indexer and Search Services

Steps to run

Check each service's application.properties file and update the configs, such as the SQS URLs, the MongoDB and RDS URLs, and the credentials
The port can also be changed in the application.properties file
In the indexservice Constants.java file, change MAX_NUMBER_OF_CRAWLED_PAGES to increase the number of pages crawled per run. The limit was added during testing to control AWS resource costs.
Once the configs are set correctly, start the services on different ports using java/mvn commands
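
To illustrate how a cap like MAX_NUMBER_OF_CRAWLED_PAGES can bound a run, here is a minimal sketch. Only the constant's name comes from the steps above; its value, the `CrawlBudgetSketch` class, and the `selectForCrawling` method are assumptions and do not reflect the actual contents of Constants.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of enforcing a per-run page cap before forwarding URLs for crawling.
final class CrawlBudgetSketch {
    // Same name as the constant in the steps above; the value is an assumption.
    static final int MAX_NUMBER_OF_CRAWLED_PAGES = 50;

    // Returns the prefix of candidate URLs that still fits in the remaining budget.
    static List<String> selectForCrawling(List<String> candidateUrls, int alreadyCrawled) {
        int remaining = Math.max(0, MAX_NUMBER_OF_CRAWLED_PAGES - alreadyCrawled);
        int take = Math.min(remaining, candidateUrls.size());
        return new ArrayList<>(candidateUrls.subList(0, take));
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
                "https://example.com/1", "https://example.com/2", "https://example.com/3");
        // With 48 pages already crawled, only two more fit under the cap of 50.
        System.out.println(selectForCrawling(children, 48));
    }
}
```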

Future Scope

Search for synonyms of the query terms as well
Implement an autocomplete system using a Trie
Improve the ranking algorithm so that results where the query words appear together get a higher rank
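
As a rough idea of the Trie-based autocomplete listed above, the following sketch stores words character by character and returns completions for a prefix in lexicographic order. It is only an illustration of the data structure; `AutocompleteTrie`, `insert`, and `suggest` are hypothetical names, and a production version would likely rank suggestions by popularity rather than alphabetically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal prefix tree that returns up to `limit` completions for a prefix.
final class AutocompleteTrie {

    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.isWord = true;
    }

    List<String> suggest(String prefix, int limit) {
        List<String> results = new ArrayList<>();
        Node node = root;
        // Walk down to the node that represents the prefix, if it exists.
        for (char ch : prefix.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return results;
            }
        }
        collect(node, new StringBuilder(prefix), limit, results);
        return results;
    }

    // Depth-first traversal that stops once `limit` words have been collected.
    private static void collect(Node node, StringBuilder path, int limit, List<String> results) {
        if (results.size() >= limit) {
            return;
        }
        if (node.isWord) {
            results.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            path.append(entry.getKey());
            collect(entry.getValue(), path, limit, results);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        AutocompleteTrie trie = new AutocompleteTrie();
        for (String word : new String[] {"search", "seat", "sea", "index"}) {
            trie.insert(word);
        }
        System.out.println(trie.suggest("sea", 10)); // prints [sea, search, seat]
    }
}
```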

nikunjagarwal321/search-engine
