redefine how request probabilities are computed #3

Merged
merged 1 commit into from
Oct 27, 2023

Conversation

BurnzZ
Contributor

@BurnzZ BurnzZ commented Oct 26, 2023

Overview

This PR proposes to increase the resolution of the request priorities from 10 to 100 bins.

One problem this tries to address (although not fully) is that subCategory links extracted by ML models may have probabilities of 0.90 to 0.94 even though they're incorrect, whereas good subCategory links tend to have probabilities of 0.95 to 0.99. Previously, both ranges were binned into the same request priority, 9. Since the Scheduler uses a priorityQueue, it treated all of these links as being at the same priority level when handing out requests.
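To illustrate the collision, here is a minimal sketch of probability binning. The helper `bin_priority` and its floor-based formula are assumptions made for this example, not the code changed in this PR; it only shows why 10 bins cannot separate the two probability ranges while 100 bins can.

```python
def bin_priority(probability: float, bins: int) -> int:
    """Map a probability in [0, 1] to an integer priority in [0, bins - 1].

    Hypothetical helper: floor binning is an assumption for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    # Clamp so that probability == 1.0 falls into the top bin.
    return min(int(probability * bins), bins - 1)


weak_link = 0.925    # inside the 0.90 to 0.94 range (often incorrect)
strong_link = 0.975  # inside the 0.95 to 0.99 range (usually good)

# With 10 bins, both links land on priority 9 and are indistinguishable.
old_weak = bin_priority(weak_link, bins=10)
old_strong = bin_priority(strong_link, bins=10)

# With 100 bins, the strong link gets a higher priority than the weak one.
new_weak = bin_priority(weak_link, bins=100)
new_strong = bin_priority(strong_link, bins=100)

print(old_weak, old_strong)  # 9 9
print(new_weak, new_strong)  # 92 97
```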

Hopefully, in larger crawls, good subCategory links (priority >= 95) would always be processed, while low-quality ones (priority < 95) would lie dormant in the scheduler until the spider is manually stopped or reaches its maximum number of items, requests, etc.

Obviously, this wouldn't help when we wait for the spider to exhaust all links, since every queued link is eventually processed regardless of its priority.

Lastly, it also ensures that the nextPage link has a strictly higher priority than any of the subCategory links. This way we paginate through categories with actual product links rather than traversing deeply into nestedCategory links that might not have any products (e.g. some sites don't show the actual products until you've narrowed your subCategories down to the last level).

Before

Priorities:

  • 0 ⎯ normal links
  • 0 to 9 ⎯ subCategory links
  • 9 ⎯ nextPage links
  • 10 ⎯ item links

After

Priorities:

  • 0 ⎯ normal links
  • 0 to 99 ⎯ subCategory links
  • 100 ⎯ nextPage links
  • 100 to 199 ⎯ item links
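The effect of the new priorities on processing order can be sketched with a toy priority queue. `ToyScheduler`, `subcategory_priority`, and the example URLs are hypothetical stand-ins rather than the real Scheduler; the sketch only assumes that higher-priority requests are popped first, and takes the value 100 for nextPage links from the list above.

```python
import heapq
from itertools import count

NEXT_PAGE_PRIORITY = 100  # from the "After" list above


def subcategory_priority(probability: float) -> int:
    """Hypothetical mapping of a probability to the 0 to 99 range."""
    return min(int(probability * 100), 99)


class ToyScheduler:
    """Toy stand-in for a scheduler that pops the highest priority first."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = count()  # tie-breaker: FIFO among equal priorities

    def push(self, url: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


scheduler = ToyScheduler()
scheduler.push("https://example.com/category/weak", subcategory_priority(0.925))
scheduler.push("https://example.com/category/strong", subcategory_priority(0.975))
scheduler.push("https://example.com/category?page=2", NEXT_PAGE_PRIORITY)

# Pagination comes out first, then the strong subCategory link, then the weak one.
order = [scheduler.pop() for _ in range(len(scheduler))]
print(order)
```

If the crawl stops after a fixed number of requests, the links at the tail of this order are the ones that never get requested.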

@kmike kmike merged commit 755dbd8 into main Oct 27, 2023
7 checks passed
@wRAR wRAR deleted the redefined-request-proba branch October 27, 2023 15:35
3 participants