A mini search engine that searches the web and displays relevant results for a given search term. Web pages are crawled, and an inverted index is built and stored; the index is then used to fetch results at search time.
The search engine has a microservice architecture consisting of three microservices:
- Crawler Service
- Indexer Service
- Search Service

Each service is responsible for the following:
- Crawler Service
  - Picks up URLs to be crawled from SQS
  - Downloads the HTML content and stores it in S3
  - Sends a message to SQS with the crawled page's metadata
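
The crawler steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the queues and the bucket are replaced by in-memory stand-ins, and the `CrawledPage` record, the `pages/...` key format, and the `fetcher` function are all hypothetical names chosen for the example (records require Java 16 or later).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;
import java.util.function.Function;

/** Sketch of one crawler step; the queues and store are in-memory stand-ins for SQS and S3. */
final class CrawlerSketch {

    /** Hypothetical metadata message published after a page has been stored. */
    record CrawledPage(String url, String storageKey, int contentLength) {}

    private final Queue<String> urlQueue = new ArrayDeque<>();           // stand-in for the URL queue
    private final Queue<CrawledPage> metadataQueue = new ArrayDeque<>(); // stand-in for the metadata queue
    private final Map<String, String> htmlStore = new HashMap<>();       // stand-in for the S3 bucket
    private final Function<String, String> fetcher;                      // downloads the HTML for a URL

    CrawlerSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    void enqueueUrl(String url) {
        urlQueue.add(url);
    }

    /** Processes one URL: fetch, store, publish metadata. Returns empty if no URL was waiting. */
    Optional<CrawledPage> crawlNext() {
        String url = urlQueue.poll();
        if (url == null) {
            return Optional.empty();
        }
        String html = fetcher.apply(url);
        // Assumed key scheme, for illustration only.
        String key = "pages/" + Integer.toHexString(url.hashCode()) + ".html";
        htmlStore.put(key, html);
        CrawledPage page = new CrawledPage(url, key, html.length());
        metadataQueue.add(page);
        return Optional.of(page);
    }

    String storedHtml(String key) {
        return htmlStore.get(key);
    }

    CrawledPage pollMetadata() {
        return metadataQueue.poll();
    }

    public static void main(String[] args) {
        // A fake fetcher keeps the demo offline.
        CrawlerSketch crawler = new CrawlerSketch(url -> "<html><title>" + url + "</title></html>");
        crawler.enqueueUrl("https://example.com/");
        crawler.crawlNext().ifPresent(System.out::println);
    }
}
```

In a real deployment the three stand-ins would presumably be backed by the AWS SDK clients; passing the fetcher in as a function keeps the step testable without network access.
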
- Indexer Service
  - Sends URLs to be crawled to the URL queue in SQS
  - Reads messages from the HTML metadata queue in SQS and starts the orchestration
  - Downloads the HTML file from S3
  - Parses the HTML content
  - Updates URL metadata and state in RDS, then sends relevant child URLs for further crawling
  - Creates the inverted index after stemming the HTML content and stores it in the database
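
To make the indexing step concrete, here is a small sketch of building an inverted index from stemmed page text. It is only an illustration under stated assumptions: the `stem` method is a naive suffix stripper standing in for a real stemmer (the project's stemmer is not specified here), tags are removed with a regex rather than a proper HTML parser, and the index lives in memory instead of a database.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of an in-memory inverted index: term -> (url -> occurrence count). */
final class InvertedIndexSketch {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    /** Naive suffix stripping; a stand-in for a real stemming algorithm. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            // Keep at least three characters so short words are left alone.
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Strips tags with a regex (fine for a sketch, not for real HTML) and indexes stemmed tokens. */
    void addPage(String url, String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            postings.computeIfAbsent(stem(token), t -> new HashMap<>())
                    .merge(url, 1, Integer::sum);
        }
    }

    /** Returns url -> count for the stemmed form of the term, or an empty map. */
    Map<String, Integer> lookup(String term) {
        return postings.getOrDefault(stem(term.toLowerCase(Locale.ROOT)), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage("https://example.com/a", "<p>Crawling crawled pages</p>");
        System.out.println(index.lookup("crawl"));
    }
}
```

Because both indexing and lookup pass through the same `stem` method, a query for "crawl" matches pages containing "crawling" or "crawled".
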
- Search Service
  - UI-facing service that exposes REST APIs
  - Given a search query, parses it and looks up the search terms in the inverted index
  - Fetches the URL metadata of all relevant URLs
  - Ranks results using metrics such as count, page rank, and isPresentInTitle, combined by a separate ranking algorithm
  - Displays the top results
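
One way the ranking step could combine these signals is a weighted sum. The sketch below is an assumption for illustration only: the `Candidate` record, the three weights, and the log damping of the count are invented here and do not describe the project's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of ranking candidates by a weighted combination of signals. */
final class RankingSketch {

    /** Hypothetical per-URL signals gathered from the index and the URL metadata. */
    record Candidate(String url, int termCount, double pageRank, boolean inTitle) {}

    // Illustrative weights only.
    static final double COUNT_WEIGHT = 1.0;
    static final double PAGE_RANK_WEIGHT = 5.0;
    static final double TITLE_BONUS = 3.0;

    static double score(Candidate c) {
        // log1p dampens very high term counts so they do not dominate the score.
        return COUNT_WEIGHT * Math.log1p(c.termCount())
                + PAGE_RANK_WEIGHT * c.pageRank()
                + (c.inTitle() ? TITLE_BONUS : 0.0);
    }

    /** Returns at most `limit` candidates, highest score first. */
    static List<Candidate> topResults(List<Candidate> candidates, int limit) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("https://example.com/a", 10, 0.1, false),
                new Candidate("https://example.com/b", 1, 0.1, true));
        topResults(candidates, 10).forEach(c -> System.out.println(c.url() + " " + score(c)));
    }
}
```

With these particular weights a title match outweighs a moderately higher term count; the balance is a tuning decision, not something the sketch settles.
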

Local setup:
- Set up AWS locally via the CLI
- Two AWS SQS queues are required for the Crawler and Indexer Services
- Set up the S3 bucket in both the Crawler and Indexer Services
- Set up RDS and MongoDB locally, then update their configuration in the Indexer and Search Services
- Check each application.properties file and update settings such as the SQS URLs, the MongoDB and RDS URLs, and credentials
- The port can also be changed in the application.properties file
- In indexservice, in the Constants.java file, change MAX_NUMBER_OF_CRAWLED_PAGES to increase the number of pages crawled per run. The limit exists to control AWS resource costs during testing.
- Once the configs are set correctly, start the services on different ports using java/mvn commands
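
As an illustration of how a cap like MAX_NUMBER_OF_CRAWLED_PAGES might gate crawling, the sketch below counts URLs as they are scheduled and refuses new ones once the limit is reached. Only the constant name comes from the project; its value here, the `CrawlBudgetSketch` class, and its methods are assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a per-run crawl budget enforced before a URL is sent for crawling. */
final class CrawlBudgetSketch {
    // Same name as the constant in Constants.java; the value is an assumption.
    static final int MAX_NUMBER_OF_CRAWLED_PAGES = 100;

    private final int limit;
    private final AtomicInteger scheduled = new AtomicInteger();

    CrawlBudgetSketch(int limit) {
        this.limit = limit;
    }

    /** Consumes one slot and returns true if another URL may still be scheduled. */
    boolean tryAcquire() {
        while (true) {
            int current = scheduled.get();
            if (current >= limit) {
                return false; // budget exhausted for this run
            }
            if (scheduled.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread won the race; retry with the fresh value.
        }
    }

    int remaining() {
        return limit - scheduled.get();
    }

    public static void main(String[] args) {
        CrawlBudgetSketch budget = new CrawlBudgetSketch(MAX_NUMBER_OF_CRAWLED_PAGES);
        System.out.println(budget.tryAcquire() + ", remaining " + budget.remaining());
    }
}
```
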

Possible future improvements:
- Search for synonyms of the query terms as well
- Implement an autocomplete system using a trie
- Improve the ranking algorithm so that pages where the query words appear together rank higher
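
For the autocomplete idea, a trie stores each known term one character per node, so all completions of a prefix sit under a single subtree. The following is a minimal sketch with hypothetical class and method names; it is not part of the project and ignores concerns such as ranking suggestions by popularity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of prefix autocomplete backed by a trie. */
final class AutocompleteTrieSketch {

    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns up to `limit` stored words that start with `prefix`, in alphabetical order. */
    List<String> suggest(String prefix, int limit) {
        Node node = root;
        for (char ch : prefix.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return List.of(); // no stored word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out, limit);
        return out;
    }

    /** Depth-first walk of the subtree, stopping once `limit` words are collected. */
    private void collect(Node node, StringBuilder current, List<String> out, int limit) {
        if (out.size() >= limit) {
            return;
        }
        if (node.isWord) {
            out.add(current.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            current.append(entry.getKey().charValue());
            collect(entry.getValue(), current, out, limit);
            current.setLength(current.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        AutocompleteTrieSketch trie = new AutocompleteTrieSketch();
        for (String term : List.of("search", "sea", "seat", "index")) {
            trie.insert(term);
        }
        System.out.println(trie.suggest("sea", 2)); // prints [sea, search]
    }
}
```
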