
🔨 Evaluate search performance for articles #3400

Merged: 5 commits merged into master from search-evaluation-algolia on May 1, 2024

Conversation

@larsyencken (Contributor) commented on Mar 26, 2024:

This PR introduces `make bench.search` and a search evaluation script.

Overview

It currently fetches a dataset of synthetic queries and evaluates the extent to which we surface good articles for the given queries.

The scoring algorithm chosen for articles is `precision@4`: the proportion of the first four results that are relevant, averaged over a large number of queries. The best possible score is 1, the worst is 0.

This metric was chosen because at most four articles are presented un-collapsed, and the value of getting those four right is much higher than getting any right further down the ranking.

It does not yet score chart or explorer search. When we do that, we may move to a more holistic search ranking score such as Mean Average Precision (MAP).
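
(For concreteness, precision@k amounts to the computation below. This is an illustrative sketch rather than the PR's actual code, and the `QueryResult` shape and field names are hypothetical.)

```ts
// Illustrative sketch only, not the PR's code. The QueryResult shape and
// field names here are hypothetical stand-ins for the evaluation data.
interface QueryResult {
    expectedSlugs: string[] // slugs judged relevant for this query
    returnedSlugs: string[] // slugs actually returned by search, in rank order
}

// precision@k: the share of the first k returned results that are relevant.
const precisionAtK = (result: QueryResult, k: number): number => {
    const topK = result.returnedSlugs.slice(0, k)
    if (topK.length === 0) return 0
    const relevant = topK.filter((slug) => result.expectedSlugs.includes(slug))
    return relevant.length / topK.length
}

// The reported score is the mean over all queries in the dataset.
const meanPrecisionAtK = (results: QueryResult[], k: number): number =>
    results.reduce((sum, r) => sum + precisionAtK(r, k), 0) / results.length
```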

Current results

{
  "name": "synthetic-queries-2024-03-25.json",
  "scope": "articles",
  "meanPrecision": 0.257,
  "numQueries": 4260,
  "algoliaApp": "74GKBOIDJQ",
  "algoliaIndex": "search-evaluation-algolia-pages"
}

Query datasets

larsyencken changed the title from "🔨 Set up staging server" to "🔨 Evaluate search performance" on Mar 26, 2024.
larsyencken changed the title from "🔨 Evaluate search performance" to "🔨 Evaluate search performance for articles" on Mar 26, 2024.
larsyencken force-pushed the search-evaluation-algolia branch from 92e631e to e5e1e85 on March 26, 2024, 11:41.
larsyencken marked this pull request as ready for review on March 26, 2024, 11:44.
The `precision@2` score reflects that we return two articles in the
instant search results, so we want to know if we make that better or
worse.
@larsyencken (Contributor, Author) commented:

Following discussion of the instant search results on Slack, it now reports both precision@2 and precision@4.

{
  "name": "synthetic-queries-2024-03-25.json",
  "scope": "articles",
  "scores": {
    "precision@2": 0.338,
    "precision@4": 0.257
  },
  "numQueries": 4260,
  "algoliaApp": "74GKBOIDJQ",
  "algoliaIndex": "search-evaluation-algolia-pages"
}

@marcelgerber (Member) left a comment:

Brilliant work!
This works very well, and is also super fast.

Some of my review comments are only about using the Algolia types `SearchClient` and `SearchIndex`...

ALGOLIA_ID,
ALGOLIA_SEARCH_KEY,
} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
@marcelgerber (Member) commented:

Wondering if it makes sense to set SEARCH_EVAL_URL in this file directly, since right now it's more of a constant than a setting?

@marcelgerber (Member) commented:

Nice! I'm wondering if it would be helpful to have a verbose output (to a JSON file) that enumerates all the searches and the good/bad results.
Could be helpful to get an overview and find some low-hanging fruit for improvements.
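
(A minimal sketch of what such a verbose dump could look like, assuming a per-query record of good and bad result slugs; the field names and file layout are hypothetical, not anything from the PR.)

```ts
import * as fs from "fs"

// Illustrative sketch only: one JSON object per line, so the output is easy
// to grep and diff between runs. The field names are hypothetical.
interface VerboseScoredQuery {
    query: string
    precisionAt4: number
    goodResults: string[] // returned slugs that were expected
    badResults: string[] // returned slugs that were not expected
}

const writeVerboseResults = (
    scores: VerboseScoredQuery[],
    path: string
): void => {
    const lines = scores.map((s) => JSON.stringify(s)).join("\n")
    fs.writeFileSync(path, lines + "\n", "utf-8")
}
```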

Comment on lines +145 to +166
let activeQueries = 0
let i = 0
const scores: ScoredQuery[] = []

const next = async () => {
    if (i >= queries.length) return
    const query = queries[i++]
    activeQueries++
    const score = await simulateQuery(index, query)
    scores.push(score)
    activeQueries--
    if (i < queries.length) {
        await next()
    }
}

const promises = []
while (activeQueries < CONCURRENT_QUERIES && i < queries.length) {
    promises.push(next())
}

await Promise.all(promises)
@marcelgerber (Member) commented:

Hm, this code is a bit hard to follow in my mind.
I haven't tried the below, but I think we can make it nicer using the p-map dependency we're already using.

Suggested change: replace the whole block above with

    scores = await pMap(queries, (query) => simulateQuery(index, query), { concurrency: CONCURRENT_QUERIES })
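
(Spelled out, the suggestion would look roughly like the sketch below; it is untested, assumes `pMap` is imported from the p-map package, and declares `scores` via this assignment rather than pushing to it in a loop.)

```ts
import pMap from "p-map"

// Sketch of the p-map variant: the library manages the concurrency window,
// replacing the hand-rolled next()/activeQueries bookkeeping above.
const scores: ScoredQuery[] = await pMap(
    queries,
    (query) => simulateQuery(index, query),
    { concurrency: CONCURRENT_QUERIES }
)
```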

} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
import { getIndexName } from "./searchClient.js"
import algoliasearch from "algoliasearch"
@marcelgerber (Member) commented:

Suggested change: replace

    import algoliasearch from "algoliasearch"

with

    import algoliasearch, { SearchClient, SearchIndex } from "algoliasearch"

}
}

const getClient = (): any => {
@marcelgerber (Member) commented:

Suggested change: replace

    const getClient = (): any => {

with

    const getClient = (): SearchClient => {
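
(For reference, a minimal sketch of how the typed client and index could fit together with the Algolia v4 client; this is illustrative rather than the PR's code, and the index name passed to initIndex is only a placeholder, where the PR itself imports a getIndexName helper.)

```ts
import algoliasearch, { SearchClient, SearchIndex } from "algoliasearch"
import {
    ALGOLIA_ID,
    ALGOLIA_SEARCH_KEY,
} from "../../settings/clientSettings.js"

// Illustrative sketch only: typed Algolia v4 client and index helpers.
const getClient = (): SearchClient =>
    algoliasearch(ALGOLIA_ID, ALGOLIA_SEARCH_KEY)

// Placeholder index name; the real code would derive it from its settings.
const getArticlesIndex = (client: SearchClient): SearchIndex =>
    client.initIndex("search-evaluation-algolia-pages")
```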

}

const simulateQuery = async (
index: any,
@marcelgerber (Member) commented:

Suggested change: replace

    index: any,

with

    index: SearchIndex,

index: any,
query: Query
): Promise<ScoredQuery> => {
const { hits } = await index.search(query.query)
@marcelgerber (Member) commented on Mar 27, 2024:

Just a suggestion, but since we only ever look at the first 4 slugs anyhow:

Suggested change: replace

    const { hits } = await index.search(query.query)

with

    const { hits } = await index.search(query.query, {
        attributesToRetrieve: ["slug"],
        hitsPerPage: N_ARTICLES_LONG_RESULTS,
    })

}

const simulateQueries = async (
index: any,
@marcelgerber (Member) commented:

Suggested change: replace

    index: any,

with

    index: SearchIndex,

@marcelgerber (Member) commented:

Some non-critical remarks on the synthetic queries and our way of scoring them:

We should be aware that it will be impossible to score a precision of 100%.
For example, for the search query `21st`, the synthetic file expects us to return only `military-long-run-spending-perspective`, which is unrealistic for this query.
This is, at least partially, a result of the query-prefix generation we're doing as part of the processing. It would be interesting to see the results just for the full-length queries.

Also, another slight flaw of the precision metric is that if four results are expected for a query but we only return a single one (that is among them), then we achieve a score of 100%. But I guess that's inherent to a precision metric, and we would need to also implement recall if we want to overcome this flaw.
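
(If recall were added, a counterpart along the lines of the earlier precision sketch could look like the following; illustrative only, with the same hypothetical field names.)

```ts
// Illustrative sketch only: recall@k penalises returning fewer of the expected
// results, which plain precision does not. Hypothetical field names as before.
interface QueryResult {
    expectedSlugs: string[]
    returnedSlugs: string[]
}

const recallAtK = (result: QueryResult, k: number): number => {
    if (result.expectedSlugs.length === 0) return 1
    const topK = result.returnedSlugs.slice(0, k)
    const found = result.expectedSlugs.filter((slug) => topK.includes(slug))
    return found.length / result.expectedSlugs.length
}
```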

github-actions bot commented:

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

The github-actions bot added the stale label on Apr 11, 2024 and closed this PR on Apr 14, 2024.
@marcelgerber (Member) commented:

@larsyencken I think it makes sense to merge this PR in its current state.
The tool, as it is, is already really good and helpful.

marcelgerber reopened this PR on Apr 15, 2024.
The github-actions bot removed the stale label on Apr 16, 2024.

github-actions bot commented on May 1, 2024:

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

The github-actions bot added the stale label on May 1, 2024.
@marcelgerber (Member) commented:

I'm just gonna merge this as-is now, seeing as it's already useful and it doesn't touch any existing code.

marcelgerber merged commit d30a2b7 into master on May 1, 2024 (11 of 17 checks passed).
marcelgerber deleted the search-evaluation-algolia branch on May 1, 2024, 12:29.
@larsyencken (Contributor, Author) commented:

Thanks Marcel!
