redefine how request probabilities are computed #3
Overview
This PR proposes to increase the resolution of the request priorities from 10 to 100 bins.
One problem this tries to address (although not fully) is that incorrect subCategory links extracted by the ML models may still get probabilities of 0.90 to 0.94, while good subCategory links tend to get 0.95 to 0.99. Previously, request priorities were computed such that both ranges fell into the same bin, request priority = 9. As a result, the Scheduler (which uses a priorityQueue) treated these links as having the same priority level.
Hopefully, in larger crawls, good subCategory links (>= 95) would always be processed, and low-quality ones (< 95) would lie dormant in the scheduler until the spider has been manually stopped or has reached the max number of items, requests, etc. Obviously, this wouldn't work when we wait for the spider to exhaust all links.
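The binning issue described above can be sketched as follows. This is only an illustration: the function name `probability_to_bin` and the floor-based formula are assumptions, not the code in this PR. It shows how 10 bins collapse 0.90 to 0.99 into a single priority, while 100 bins keep them apart.

```python
import math


def probability_to_bin(probability, bins):
    """Map a probability in [0, 1] to an integer bin in [0, bins - 1].

    Hypothetical helper: the PR's actual formula may differ.
    The small epsilon guards against float error such as 0.29 * 100
    evaluating to 28.999999999999996.
    """
    return min(math.floor(probability * bins + 1e-9), bins - 1)


# With 10 bins, a doubtful link (0.92) and a good link (0.97) are tied.
old_doubtful = probability_to_bin(0.92, bins=10)
old_good = probability_to_bin(0.97, bins=10)

# With 100 bins, the good link gets a strictly higher priority.
new_doubtful = probability_to_bin(0.92, bins=100)
new_good = probability_to_bin(0.97, bins=100)

print(old_doubtful, old_good)
print(new_doubtful, new_good)
```

With the finer bins, a priority queue can tell the two links apart, which is what lets the >= 95 links be processed first.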
Lastly, it also ensures that the nextPage link has a strictly higher priority than any of the subCategory links. This way we paginate through categories with actual product links rather than traversing deeply into nestedCategory links that might not have any products (e.g. some sites don't show you the actual products until you've narrowed down your subCategories to the last level).
Before
Priorities:
0 ⎯ normal links
0 to 9 ⎯ subCategory links
9 ⎯ nextPage links
10 ⎯ item links

After
Priorities:
0 ⎯ normal links
0 to 99 ⎯ subCategory links
100 ⎯ nextPage links
100 to 199 ⎯ item links
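
A minimal sketch of the "After" layout, to show the resulting dequeue order. All names here (`request_priority`, `pop_order`, the link-type strings) are hypothetical, and two points are assumptions: that item links are offset by 100 plus their probability bin, and that the scheduler dequeues higher numbers first. The queue is simulated with `heapq` and is not the project's Scheduler.

```python
import heapq
import math

BINS = 100


def to_bin(probability, bins=BINS):
    # Floor-based binning with a small epsilon against float error (assumed formula).
    return min(math.floor(probability * bins + 1e-9), bins - 1)


def request_priority(link_type, probability=0.0):
    """Return a priority following the 'After' table (hypothetical helper)."""
    if link_type == "subCategory":
        return to_bin(probability)          # 0 to 99
    if link_type == "nextPage":
        return BINS                         # 100, above every subCategory link
    if link_type == "item":
        return BINS + to_bin(probability)   # 100 to 199 (assumed offset)
    return 0                                # normal links


def pop_order(links):
    """Simulate a max-priority queue; ties keep insertion order."""
    heap = []
    for order, (link_type, probability) in enumerate(links):
        priority = request_priority(link_type, probability)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(heap, (-priority, order, link_type))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


links = [
    ("subCategory", 0.99),
    ("normal", 0.0),
    ("nextPage", 0.0),
    ("item", 0.5),
    ("subCategory", 0.92),
]
print(pop_order(links))
```

In this sketch even the best subCategory link (priority 99) is dequeued after the nextPage link (priority 100), which matches the pagination-first behaviour described in the overview.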