"Dynamic Scope" #915

Open
wants to merge 8 commits into
base: dev

geeknik
Contributor

@geeknik geeknik commented Jun 3, 2024

This PR integrates "Dynamic Scope", which cuts data usage while crawling by using a TF-IDF model to discard pages that are likely too similar to pages already crawled. 👍🏻
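
To make the idea concrete, here is a minimal sketch of TF-IDF based near-duplicate filtering. It is not the code from this PR: the SimilarityFilter type, its Accept method, the tokenizer, the smoothed IDF formula, and the 0.7 threshold are all assumptions chosen for illustration. Each accepted page is stored as a bag of term counts, and a new page is dropped when its TF-IDF cosine similarity to any accepted page reaches the threshold.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// SimilarityFilter is a hypothetical type, not part of katana.
// It remembers the term counts of every page it has accepted.
type SimilarityFilter struct {
	threshold float64              // cosine similarity at or above which a page is dropped
	docs      []map[string]float64 // raw term frequencies of accepted pages
	docFreq   map[string]int       // number of accepted pages containing each term
}

func NewSimilarityFilter(threshold float64) *SimilarityFilter {
	return &SimilarityFilter{threshold: threshold, docFreq: map[string]int{}}
}

// termFreq lowercases the text, splits on non-alphanumeric runes, and counts terms.
func termFreq(text string) map[string]float64 {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tf := make(map[string]float64, len(tokens))
	for _, t := range tokens {
		tf[t]++
	}
	return tf
}

// weigh converts raw term frequencies into TF-IDF weights using a smoothed
// IDF (an assumed variant): ln((1+N)/(1+df)) + 1.
func (f *SimilarityFilter) weigh(tf map[string]float64) map[string]float64 {
	n := float64(len(f.docs))
	out := make(map[string]float64, len(tf))
	for term, count := range tf {
		idf := math.Log((1+n)/(1+float64(f.docFreq[term]))) + 1
		out[term] = count * idf
	}
	return out
}

// cosine returns the cosine similarity of two sparse vectors (0 if either is empty).
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for term, w := range a {
		dot += w * b[term]
		na += w * w
	}
	for _, w := range b {
		nb += w * w
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Accept reports whether the page body is different enough from every
// previously accepted page, and records it if so.
func (f *SimilarityFilter) Accept(body string) bool {
	tf := termFreq(body)
	query := f.weigh(tf)
	for _, doc := range f.docs {
		if cosine(query, f.weigh(doc)) >= f.threshold {
			return false // too similar to a page already kept
		}
	}
	f.docs = append(f.docs, tf)
	for term := range tf {
		f.docFreq[term]++
	}
	return true
}

func main() {
	f := NewSimilarityFilter(0.7) // threshold is an illustrative guess
	pages := []string{
		"IBM cloud pricing plans for enterprise",
		"IBM cloud pricing plans for enterprise customers",
		"Quantum computing research papers and tutorials",
	}
	for _, p := range pages {
		fmt.Printf("%-50q keep=%v\n", p, f.Accept(p))
	}
}
```

In this sketch the second page is dropped as a near-duplicate of the first, while the third shares no terms with either and is kept. Comparing against every accepted page is linear in the number of pages kept, so a real crawler would likely need an index or a cap on stored pages.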

For example, running katana -d 1 -u https://www.ibm.com/ -j -o ibm.json produces an ibm.json of about 11MB. Adding -uds to the same command to enable Dynamic Scope reduces ibm.json to about 3.4MB, roughly a 69% reduction.

Crawl Fast, Crawl Smart. 🚀

@ehsandeep ehsandeep changed the base branch from main to dev June 3, 2024 19:15
@tarunKoyalwar tarunKoyalwar self-requested a review June 4, 2024 00:19
@ehsandeep ehsandeep requested review from Mzack9999 and removed request for tarunKoyalwar June 7, 2024 09:05
@Mzack9999
Member

@geeknik very interesting approach, I'm going to review this soon and compare it with BM25

2 participants