Add an option to turn off density-weighting #534
I added documentation and a test. |
👋 @dscho welcome! Thanks for jumping in. I like the direction here, though before releasing this as a feature I'd actually like to go further with the controls. For example, the function that calculates the balanced_score here imposes a certain style of ranking, and I'd love to expose numerical controls for how strong the
I'm imagining something like:
await pagefind.options({
    ranking: {
        word_distance: 0.5,
        site_frequency: 1.0,
        page_frequency: 1.0
    }
});
Turning
This isn't something I'm expecting you to do, I'm just brainstorming here for lack of a better place 😄 I'm on a soft break for a few more weeks, but wiring this up can be one of my initial jobs once I'm back on the tools. In the meantime, I'm interested to hear whether this has made a meaningful impact on your rankings, if you have given it a run through. If not having this released is a blocker for you, I could merge this PR and release a tagged alpha version now, and then replace it with the implementation above once I have time. Let me know if that's required 🙂 |
It is mighty useful to have the exact binary that was tested in the GitHub workflow, for example to use it in other workflows when introducing features specifically for those other workflows' use case... Signed-off-by: Johannes Schindelin <[email protected]>
When searching, Pagefind applies a heuristic that often works quite well to boost pages with a higher density, i.e. a higher number of hits divided by the number of words on the page. This is called "density weighting". In some instances, it is desirable, though, to just use the number of hits directly, without dividing by the number of words on the page. Let's support this via the search option `use_weighting`, which defaults to `true` to maintain the current behavior. Signed-off-by: Johannes Schindelin <[email protected]>
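To make the difference concrete, here is a minimal sketch of the two scoring modes described in this commit message. It is not Pagefind's actual implementation: `scorePage`, `rankPages`, the field names, and the sample pages are all made up for illustration.

```javascript
// Hypothetical sketch; NOT Pagefind's real scoring code.
// hits      = how often the search term occurs on the page
// wordCount = total number of indexed words on the page
function scorePage({ hits, wordCount }, useWeighting = true) {
  if (wordCount === 0) return 0; // nothing indexed, nothing to score
  // Density weighting divides by page length; otherwise use raw hits.
  return useWeighting ? hits / wordCount : hits;
}

// Sort a copy of the pages, best score first.
function rankPages(pages, useWeighting = true) {
  return [...pages].sort(
    (a, b) => scorePage(b, useWeighting) - scorePage(a, useWeighting)
  );
}

// Invented sample data: a tiny stub page and a long guide.
const pages = [
  { url: "/stub/", hits: 1, wordCount: 14 },
  { url: "/guide/", hits: 12, wordCount: 2400 },
];

// With density weighting the stub wins (1/14 > 12/2400).
console.log(rankPages(pages, true).map((p) => p.url));
// With raw hit counts the guide wins (12 > 1).
console.log(rankPages(pages, false).map((p) => p.url));
```

The point of the toggle is only that the same index can produce opposite orderings depending on whether page length enters the score.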
In addition to controlling how much of a role the "page frequency" plays in ranking pages, let's add more ways to modify the way pages are ranked. Signed-off-by: Johannes Schindelin <[email protected]>
Add an option to stop scoring shorter pages higher
When searching, Pagefind applies a heuristic that often works quite well to boost pages with a higher density, i.e. a higher number of hits divided by the number of words on the page. This is called "density weighting". In some instances, it is desirable, though, to just use the number of hits directly, without dividing by the number of words on the page. Let's support this via a new search option `ranking`, which as of right now contains a single field to specify how much "denser pages" should be favored. Signed-off-by: Johannes Schindelin <[email protected]>
So far, I seem to be unable to make this work, as it seems that `word_frequency` is always 0 in my tests... Signed-off-by: Johannes Schindelin <[email protected]>
@bglw I tried my hand at what you suggested, but for some reason I cannot seem to even make the Could you help me by guiding me to the code location that needs to be updated? |
Hey @dscho — I'm back on the wires this week so will look into this promptly and get back to you! |
@bglw thank you so much! |
Hello! Feedback 🙂
Setting rankings
I'd like this to be an option set once, rather than for every search. So configured once with a:
await pagefind.options({
    ranking: { /* ranks */ }
});
Which every
Making
|
Oh, and one extra note on naming:
|
Hey @bglw, do you think it would also make sense to consider minimum document length? We have a site that is currently under development where we've employed weighting for different content types. However, some pages are very short. They're so short in fact that they're essentially short stubs of reference info. This has presented a unique challenge for term ranking because it really invalidates our ranking for certain terms.
Let me provide an example: There are roughly ten types of indexable content on the site, but "Case Types" AKA "Practice Areas" are by far the most important to us. Consequently, we've added the highest
However, if you search for "car accidents", which is a case type with a lengthy associated document featuring "car accidents" as the h1, the search results provide a document from a different, lower-weighted content type as the top result. This is because this document contains a total of 14 indexable words (seen in the top header), one of which is "car accidents". Our desired top result is present in the results, but it's much, much lower down the results list. We'd like the current top result and similar results to continue to be indexed but at a lower priority reflected by the attached weight.
I'm not sure if any of the proposed attributes could really help here. It feels like what we really need is the ability to penalize or promote rankings based on a quality metric which includes document length in the calculus.
I'm happy to use this site as a test case. It has close to 1900 indexable pages and a little under 13 thousand words. Let me know your thoughts and any questions. If you feel that this is interesting, but is separate to the goals of this PR, just let me know and I'd be happy to move this discussion elsewhere as well. |
👋 @hu0p thanks for the feedback! I think this is definitely a good case to cover. Something that has been overdue for a while is changing the ranking from tf-idf to BM25, which helps with some of these issues around term saturation and page length. But it also might make sense to add a setting to adjust this further! I'll look into it. @dscho would you like me to take the PR over from this point and finish it off? |
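For readers unfamiliar with BM25, the following is a rough sketch of the textbook per-term formula, showing the two effects mentioned above: term saturation (controlled by `k1`) and page-length normalization (controlled by `b`). The function name, the argument names, and the defaults `k1 = 1.2` and `b = 0.75` are conventional textbook choices assumed here for illustration; this is not Pagefind's code and not necessarily its defaults.

```javascript
// Textbook BM25 score of one term for one document (illustrative sketch).
// tf            = occurrences of the term in the document
// docLength     = number of words in the document
// avgDocLength  = average document length across the index
// totalDocs     = number of documents in the index
// docsWithTerm  = number of documents containing the term
function bm25TermScore(
  { tf, docLength, avgDocLength, totalDocs, docsWithTerm },
  { k1 = 1.2, b = 0.75 } = {}
) {
  // Rarer terms get a higher inverse document frequency.
  const idf = Math.log(
    1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5)
  );
  // b = 0 ignores page length entirely; b = 1 normalizes fully.
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  // The tf term saturates: it can never exceed (k1 + 1).
  return (idf * (tf * (k1 + 1))) / (tf + k1 * lengthNorm);
}

// Invented numbers: same term frequency, very different page lengths.
const shared = { tf: 1, avgDocLength: 400, totalDocs: 1900, docsWithTerm: 20 };
const shortPage = { ...shared, docLength: 14 };
const longPage = { ...shared, docLength: 2400 };

console.log(bm25TermScore(shortPage), bm25TermScore(longPage));
// With b = 0 the length no longer matters and both pages score the same.
console.log(bm25TermScore(shortPage, { b: 0 }), bm25TermScore(longPage, { b: 0 }));
```

Lowering `b` is the knob that addresses the short-stub problem raised above, since it reduces how much a very short page is rewarded for a single hit.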
@bglw Thanks for getting back to me! I looked into BM25 and wow, yeah, that sounds perfect. BM25F may be even better for sites that are semantically well-formed too. I ended up here due to interest in activity around |
That would be wonderful! |
On it! Aiming to have something this/next week, I'll just continue on this branch. Thinking I'll roll the BM25 change in which will change a few structural things, but result in a much better system. |
Excellent @bglw ! Thank you so much! |
Long time coming @dscho & @hu0p, but finally got this merged! It has changed a bit, so now Pagefind implements BM25 rankings, and provides roughly the previous controls plus the BM25 parameters. It's available on the
Documentation for the ranking configuration can be seen here: https://unreleased.pagefind.app/docs/ranking/ (it's now a top-level Pagefind option rather than a search parameter). Need to do some housekeeping before a release, but I'll be looking to get |
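As a rough illustration of what "a top-level Pagefind option" looks like in practice, here is a small helper that sets the ranking once, before any search runs. The `pagefind.options({ ranking: ... })` shape comes from this thread; the four parameter names are written as I recall them from the linked ranking documentation and should be treated as an assumption to verify there. The numeric values and the helper name `configureRanking` are purely illustrative.

```javascript
// Sketch: configure ranking once, up front, rather than per search.
// `pagefind` is the loaded Pagefind module; in a browser it would typically
// come from something like `await import("/pagefind/pagefind.js")`
// (path assumed, adjust to wherever your bundle is served).
async function configureRanking(pagefind) {
  await pagefind.options({
    ranking: {
      // Parameter names assumed from the ranking docs; values are examples only.
      termFrequency: 1.0, // how much term frequency matters vs. raw counts
      pageLength: 0.5, // lower values favor short pages less
      termSimilarity: 1.0, // how strongly close matches to the query are preferred
      termSaturation: 1.4, // how quickly repeated terms stop adding score
    },
  });
  return pagefind;
}
```

Every later `pagefind.search(...)` call would then use these weights without having to pass them again.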
@bglw thank you so much! I integrated this into my current work to migrate https://git-scm.com/ from a Heroku app to a static website hosted on GitHub Pages: https://git.github.io/git-scm.com/. It's pretty good! If you look at https://git.github.io/git-scm.com/search/results?search=log&language=en, you see that the I had to play a couple of tricks, though, such as adding a hidden Another issue I have: when I typed |
Awesome! I'll release what we have now tomorrow.
Thankfully I do still have ideas for the future — primarily the #532 and #437 issues which will help a lot. Essentially giving you a way to boost based on the
It shouldn't be, at least I haven't seen it be so. The implementation is pretty simple. Skimming through git/git-scm.com@68338ec, I think there's a window where:
Hopefully that makes sense. In essence, you'll want to assign that HTML template to a variable, and after it has awaited all of its result fragments you want to make sure that you're still in the latest |
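One way to sketch the "still in the latest search" check described above: give every search a ticket number and only render if the ticket is still current after all awaits have finished. The helper name `createLatestOnlySearch` and the `search`/`render` callbacks are hypothetical; the only assumption about results is that each one exposes an async `data()` that loads its fragment.

```javascript
// Sketch of a "latest search wins" guard (names are hypothetical).
// search(term) -> Promise of an array of results, each with async data()
// render(term, items) -> draws the loaded items on the page
function createLatestOnlySearch(search, render) {
  let latestTicket = 0;

  return async function run(term) {
    const ticket = ++latestTicket; // remember which search this call is
    const results = await search(term);
    // Load every result fragment before touching the page.
    const items = await Promise.all(results.map((r) => r.data()));
    // A newer search started while we were waiting: drop these results.
    if (ticket !== latestTicket) return false;
    render(term, items);
    return true;
  };
}
```

With this in place, a slow search for an earlier, shorter term can finish after a fast search for the full term without overwriting the newer results.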
Ranking changes released in v1.1.0 🎉 |
Awesome, thanks for all your hard work @bglw ! |
We definitely want the search results for, say, "commit" to show the manual page of that command first. This took quite a bit of work to accomplish, including a change to Pagefind itself (CloudCannon/pagefind#534) as well as some extensive fiddling with the ranking weights (including, I kid you not, running a Powell's optimization of the weights). To ensure that the results are shown in the desired order, let's add an explicit check for that in the PR builds. Signed-off-by: Johannes Schindelin <[email protected]>
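The commit message does not show how the check itself is written, so the following is only a sketch of what such a check could look like using Pagefind's JavaScript search API (`pagefind.search()` returning `results`, each with an async `data()` whose object includes `url`). The helper names and the expected path in the usage comment are hypothetical.

```javascript
// Sketch of a "top result" check for CI (helper names are hypothetical).
// `pagefind` is the loaded Pagefind module or a compatible stub.
async function topResultUrl(pagefind, query) {
  const search = await pagefind.search(query);
  if (search.results.length === 0) return null; // no hits at all
  const first = await search.results[0].data(); // load the top fragment
  return first.url;
}

// Throws if the best hit for `query` does not end with `expectedSuffix`.
async function checkTopResult(pagefind, query, expectedSuffix) {
  const url = await topResultUrl(pagefind, query);
  if (url === null || !url.endsWith(expectedSuffix)) {
    throw new Error(
      `"${query}": expected top result ending in ${expectedSuffix}, got ${url}`
    );
  }
  return url;
}

// Hypothetical usage in a PR build:
//   await checkTopResult(pagefind, "commit", "/docs/git-commit");
```

Failing the build on a thrown error keeps later ranking tweaks from silently pushing the manual page out of first place.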
Thanks in large part to these new options, https://git-scm.com/ was able to migrate from a Rails app using Elasticsearch to a Hugo site using Pagefind! @bglw thank you so, so much! |
Anything I can help with?
Thank you! 😊 It was a huge effort, and I wasn't alone, but yes, it took almost 8 years and tons of people to get it done. But done we got it! 🎉 |
When searching, Pagefind applies a heuristic that often works quite well to boost pages with a higher density, i.e. a higher number of hits divided by the number of words on the page. This is called "density weighting".
In some instances (as pointed out here), it is desirable, though, to just use the number of hits directly, without dividing by the number of words on the page.
Let's support this via the search option `use_weighting`, which defaults to `true` to maintain the current behavior.