Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to turn off density-weighting #534

Merged
merged 12 commits into from
Mar 29, 2024

Conversation

dscho
Copy link
Contributor

@dscho dscho commented Dec 24, 2023

When searching, Pagefind applies a heuristic that often works quite well to boost pages with a higher density, i.e. a higher number of hits divided by the number of words on the page. This is called "density weighting".

In some instances (as pointed out here), it is desirable, though, to just use the number of hits directly, without dividing by the number of words on the page.

Let's support this via the search option use_weighting, which default to true to maintain the current behavior.

@dscho
Copy link
Contributor Author

dscho commented Jan 3, 2024

I added documentation and a test.

@bglw
Copy link
Contributor

bglw commented Jan 6, 2024

👋 @dscho welcome! Thanks for jumping in ☺️

I like the direction here, though before releasing this as a feature I'd actually like to go further with the controls — for example, the function that calculates the balanced_score here imposes a certain style of ranking, and I'd love to expose numerical controls for how strong the word_length_bonus is, how the word_frequency modulates the score, etc. This would also be a great foundation for addressing some existing issues such as #437.

I'm imagining something like:

await pagefind.options({
  ranking: {
    word_distance: 0.5,
    site_frequency: 1.0,
    page_frequency: 1.0
  }
});

Turning page_frequency to 0.0 would achieve the same effect as use_weighting: false in this PR, but would also allow it to be tuned more finely up and down.

This isn't something I'm expecting you to do, I'm just brainstorming here for lack of a better place 😄 I'm on a soft break for a few more weeks, but wiring this up can be one of my initial jobs once I'm back on the tools.

In the meantime, I'm interested to hear if this has made a meaningful impact to your rankings, if you have given it a run through?

If not having this released is a blocker for you, I could merge this PR and release a tagged alpha version now, and then replace it with the implementation above once I have time. Let me know if that's required 🙂

dscho added 5 commits January 6, 2024 16:32
It is mighty useful to have the exact binary that was tested in the
GitHub workflow, for example to use it in other workflows when
introducing features specifically for those other workflows' use case...

Signed-off-by: Johannes Schindelin <[email protected]>
When searching, Pagefind applies a heuristic that often works quite well
to boost pages with a higher density, i.e. a higher number of hits
divided by the number of words on the page. This is called "density
weighting".

In some instances, it is desirable, though, to just use the number of
hits directly, without dividing by the number of words on the page.

Let's support this via the search option `use_weighting`, which
default to `true` to maintain the current behavior.

Signed-off-by: Johannes Schindelin <[email protected]>
In addition to controlling how much of a role the "page frequency"
plays in ranking pages, let's add more ways to modify the way pages are
ranked.

Signed-off-by: Johannes Schindelin <[email protected]>
Add an option to stop scoring shorter pages higher

When searching, Pagefind applies a heuristic that often works quite well
to boost pages with a higher density, i.e. a higher number of hits
divided by the number of words on the page. This is called "density
weighting".

In some instances, it is desirable, though, to just use the number of
hits directly, without dividing by the number of words on the page.

Let's support this via a new search option `ranking`, which as of right
now contains a single field to specify how much "denser pages" should be
favored.

Signed-off-by: Johannes Schindelin <[email protected]>
So far, I seem to be unable to make this work, as it seems that
`word_frequency` is always 0 in my tests...

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho force-pushed the optionally-turn-off-density-weighting branch from b48ea1c to 41bb37e Compare January 6, 2024 18:56
@dscho
Copy link
Contributor Author

dscho commented Jan 6, 2024

@bglw I tried my hand at what you suggested, but for some reason I cannot seem to even make the siteFrequency thing work, I must surely have tried to modify the wrong code for that, assuming that word_frequency would be the indicator (but it is 0 all the time for me).

Could you help me by guiding me to the code location that needs to be updated?

@bglw
Copy link
Contributor

bglw commented Jan 30, 2024

Hey @dscho — I'm back on the wires this week so will look into this promptly and get back to you!

@dscho
Copy link
Contributor Author

dscho commented Jan 30, 2024

@bglw thank you so much!

@bglw
Copy link
Contributor

bglw commented Jan 31, 2024

Hello! Feedback 🙂

Setting rankings

I'd like this to be an option set once, rather than for every search. So configured once with a:

await pagefind.options({
  ranking: { /* ranks */ }
});

Which every .search() uses thereafter. Also ideally this prevents the double call into WASM for every search, once to add the weights and once to run the search, and the rankings can just be stored in the SearchIndex struct we carry around.

Making siteFrequency work

The code looks correct here! It's just the test that has an issue. Site frequency can be interpreted as "number of pages that contain this word at least once" — regardless of how many times that word is on each page — and the test sets up two pages that both contain cat and dog — so site frequency in this test isn't contributing anything. For that test to show something you'll want the target word to exist on <10% of pages on the site.

Generally

Numbers look good! Napkin-ed the calculations and it definitely looks like exposing these three toggles will work well.

I might change from the wasm_bindgen new this.backend.RankingWeights flow to something more akin to the other function signatures — mainly to insulate the JavaScript calling conventions from the Rust types. But no need to worry about that, I'm happy to merge this with that and I'll change it at some point in the future.

I'll likely tweak the documentation wording after merge — so also no stress there.

Overall looking good! Let me know any Qs, and as always let me know if/when you want to tag me in to finish things up. I'll also be more responsive now that the holiday period has passed 😁

@bglw
Copy link
Contributor

bglw commented Jan 31, 2024

Oh, and one extra note on naming:

  • pageFrequency
    • I think this key is good — you're controlling how much we care about a term being frequent on a page
  • siteFrequency
    • I'd maybe like a different term so we can phrase all of these as the "boost" rather than the "reduction". In this case, you get penalised for a term being frequent on a site. So perhaps this could be siteRarity. Very open to suggestions here.
  • wordDistance
    • I definitely want to change this one — in the future I want to add weightings for proximity/distance between different search terms on a page, and this would be too close to the setting for that. Current best suggestion would be termSimilarity. Also open to suggestions here 🙂

@hu0p
Copy link

hu0p commented Feb 12, 2024

Hey @bglw, do you think it would also make sense to consider minimum document length?

We have a site that is currently under development where we've employed weighting for different content types. However, some pages are very short. They're so short in fact that they're essentially short stubs of reference info. This has presented a unique challenge for term ranking because it really invalidates our ranking for certain terms. Let me provide an example:

There are roughly ten types of indexable content on the site, but "Case Types" AKA "Practice Areas" are by far the most important to us. Consequently, we've added the highest data-weight properties to those documents and included a progressively descending scale of data-weight values to other content types based on priority.

However, if you search for "car accidents", which is a case type with a lengthy associated document featuring "car accidents" as the h1, the search results provide a document from a different, lower-weighted content type as the top result. This is because this document contains a total of 14 indexable words (seen in the top header), one of which is "car accidents". Our desired top result is present in the results, but it's much, much lower down the results list. We'd like the current top result and similar results to continue to be indexed but at a lower priority reflected by the attached weight.

I'm not sure if any of the proposed attributes could really help here. It feels like what we really need is the ability to penalize or promote rankings based on a quality metric which includes document length in the calculus.

I'm happy to use this site as a test case. It has close to 1900 indexable pages and a little under 13 thousand words. Let me know your thoughts and any questions. If you feel that this is interesting, but is separate to the goals of this PR, just let me know and I'd be happy to move this discussion elsewhere as well.

@bglw
Copy link
Contributor

bglw commented Feb 22, 2024

👋

@hu0p thanks for the feedback! I think this is definitely a good case to cover. Something that has been overdue for a while is changing the ranking from tf-idf to BM25, which helps with some of these issues around term saturation and page length. But it also might make sense to add a setting to adjust this further! I'll look into it.

@dscho would you like me to take the PR over from this point and finish it off?

@hu0p
Copy link

hu0p commented Feb 23, 2024

@bglw Thanks for getting back to me! I looked into BM25 and wow, yeah, that sounds perfect. BM25F may be even better for sites that are semantically well-formed too.

I ended up here due to interest in activity around pagefind_web/src/lib.rs and pagefind_web/src/search.rs. You're officially responsible for coercing an American into learning the term "Fossicking", so thanks for that. Let me know if there's anything I can do to help. I've been bouncing in and out of familiarizing myself with the codebase.

@dscho
Copy link
Contributor Author

dscho commented Feb 23, 2024

@dscho would you like me to take the PR over from this point and finish it off?

That would be wonderful!

@bglw
Copy link
Contributor

bglw commented Feb 26, 2024

On it! Aiming to have something this/next week, I'll just continue on this branch. Thinking I'll roll the BM25 change in which will change a few structural things, but result in a much better system.

@dscho
Copy link
Contributor Author

dscho commented Feb 26, 2024

On it! Aiming to have something this/next week, I'll just continue on this branch. Thinking I'll roll the BM25 change in which will change a few structural things, but result in a much better system.

Excellent @bglw ! Thank you so much!

bglw
bglw previously approved these changes Mar 28, 2024
@bglw bglw merged commit 2a66612 into CloudCannon:main Mar 29, 2024
4 checks passed
@bglw
Copy link
Contributor

bglw commented Mar 29, 2024

Long time coming @dscho & @hu0p , but finally got this merged! It has changed a bit, so now Pagefind implements BM25 rankings, and provides roughly the previous controls plus the BM25 parameters.

It's available on the v1.1.0-rc1 version right now, if you want to take it for a spin and slip any feedback in before a release!

Documentation for the ranking configuration can be seen here: https://unreleased.pagefind.app/docs/ranking/ (it's now a top-level Pagefind option rather than a search parameter).

Need to do some housekeeping before a release, but I'll be looking to get v1.1.0 out in the coming week.

@dscho dscho deleted the optionally-turn-off-density-weighting branch March 29, 2024 05:03
@dscho
Copy link
Contributor Author

dscho commented Apr 2, 2024

@bglw thank you so much!

I integrated this into my current work to migrate https://git-scm.com/ from a Heroku app to a static website hosted on GitHub Pages: https://git.github.io/git-scm.com/. It's pretty good! If you look at https://git.github.io/git-scm.com/search/results?search=log&language=en, you see that the git-log manual page is shown first.

I had to play a couple of tricks, though, such as adding a hidden <h3> with maximum data-pagefind-weight. It would be nice if there was a way to boost certain search terms on a given page without having to add some hidden text.

Another issue I have: when I typed log really fast in the search box of https://git.github.io/git-scm.com/, the first time, the order of the search results was quite off, it did not show git-log as first entry. Is it possible that debouncedSearch() is racy? Or maybe I am holding this thing wrong in git/git-scm.com@68338ec?

@bglw
Copy link
Contributor

bglw commented Apr 2, 2024

Awesome! I'll release what we have now tomorrow.

I had to play a couple of tricks, though

Thankfully I do still have ideas for the future — primarily the #532 and #437 issues which will help a lot. Essentially giving you a way to boost based on the title metadata, without it having to live in the data-pagefind-body area. No ETA on that but it's up the list, hopefully the current state is enough of an unblocker!

Is it possible that debouncedSearch() is racy?

It shouldn't be, at least I haven't seen it be so. The implementation is pretty simple.

Skimming through git/git-scm.com@68338ec, I think there's a window where:

  1. You got results from debouncedSearch that were not null
  2. Your $("#search-results").html(...) is now running, which contains asynchronous loading of results with await e.data()
  3. While those are loading, a new debouncedSearch result comes through. also not null.
  4. $("#search-results").html(...) runs again, with different result fragments. These resolve really fast and the html is inserted onto the page
  5. Some slow request from 2. finally resolves, causing the html from the first search to be inserted onto the page.

Hopefully that makes sense. In essence, you'll want to assign that HTML template to a variable, and after it has awaited all of its result fragments you want to make sure that you're still in the latest debouncedSearch results.

@bglw
Copy link
Contributor

bglw commented Apr 2, 2024

Ranking changes released in v1.1.0 🎉

@dscho
Copy link
Contributor Author

dscho commented Apr 4, 2024

Ranking changes released in v1.1.0 🎉

Awesome, thanks for all your hard work @bglw !

dscho added a commit to dscho/git-scm.com that referenced this pull request Jul 3, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Jul 14, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Jul 21, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Aug 6, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Aug 6, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git/git-scm.com that referenced this pull request Aug 6, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git/git-scm.com that referenced this pull request Aug 19, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Sep 4, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Sep 7, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to dscho/git-scm.com that referenced this pull request Sep 11, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git/git-scm.com that referenced this pull request Sep 11, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git/git-scm.com that referenced this pull request Sep 20, 2024
We definitely want the search results for, say, "commit" to show the
manual page of that command first.

This took quite a bit of work to accomplish, including a change to
Pagefind itself (CloudCannon/pagefind#534) as
well as some extensive fiddling with the ranking weights (including, I
kid you not, running a Powell's optimization of the weights).

To ensure that the results are shown in the desired order, let's add an
explicit check for that in the PR builds.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho
Copy link
Contributor Author

dscho commented Sep 25, 2024

Thanks in large part to these new options, https://git-scm.com/ was able to migrate from a Rails app using Elasticsearch to a Hugo site using Pagefind!

@bglw thank you so, so much!

@bglw
Copy link
Contributor

bglw commented Sep 25, 2024

Amazing! Great to hear 😄

Hopefully more improvements to come in the future too! Aiming to tee up #437 on the horizon.

Big congrats on getting the site across the line @dscho 🙇

@dscho
Copy link
Contributor Author

dscho commented Sep 25, 2024

Hopefully more improvements to come in the future too! Aiming to tee up #437 on the horizon.

Anything I can help with?

Big congrats on getting the site across the line @dscho 🙇

Thank you! 😊 It was a huge effort, and I wasn't alone, but yes, it took almost 8 years and tons of people to get it done. But done we got it! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants