
🔨 Evaluate search performance for articles #3400

Merged: 5 commits merged into master from search-evaluation-algolia on May 1, 2024

Conversation

@larsyencken (Contributor) commented on Mar 26, 2024:

This PR introduces `make bench.search` and a search evaluation script.

Overview

It currently fetches a dataset of synthetic queries and evaluates the extent to which we surface good articles for the given queries.

The scoring algorithm chosen for articles is `precision@4`: the proportion of the first four results that are relevant, averaged over a large number of queries. The best possible score is 1, the worst is 0.

This metric was chosen because at most four articles are presented un-collapsed, and the value of getting those four right is much higher than getting any right further down the ranking.

It does not yet score chart or explorer search. When we do that, we may move to a more holistic search ranking score such as Mean Average Precision (MAP).
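
(For concreteness, precision@k amounts to the computation below. This is an illustrative sketch rather than the PR's actual code, and the `QueryResult` shape and field names are hypothetical.)

```ts
// Illustrative sketch only, not the PR's code. The QueryResult shape and
// field names here are hypothetical stand-ins for the evaluation data.
interface QueryResult {
    expectedSlugs: string[] // slugs judged relevant for this query
    returnedSlugs: string[] // slugs actually returned by search, in rank order
}

// precision@k: the share of the first k returned results that are relevant.
const precisionAtK = (result: QueryResult, k: number): number => {
    const topK = result.returnedSlugs.slice(0, k)
    if (topK.length === 0) return 0
    const relevant = topK.filter((slug) => result.expectedSlugs.includes(slug))
    return relevant.length / topK.length
}

// The reported score is the mean over all queries in the dataset.
const meanPrecisionAtK = (results: QueryResult[], k: number): number =>
    results.reduce((sum, r) => sum + precisionAtK(r, k), 0) / results.length
```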

Current results

{
  "name": "synthetic-queries-2024-03-25.json",
  "scope": "articles",
  "meanPrecision": 0.257,
  "numQueries": 4260,
  "algoliaApp": "74GKBOIDJQ",
  "algoliaIndex": "search-evaluation-algolia-pages"
}

Query datasets

larsyencken changed the title from "🔨 Set up staging server" to "🔨 Evaluate search performance" on Mar 26, 2024.
larsyencken changed the title from "🔨 Evaluate search performance" to "🔨 Evaluate search performance for articles" on Mar 26, 2024.
larsyencken force-pushed the search-evaluation-algolia branch from 92e631e to e5e1e85 on March 26, 2024, 11:41.
larsyencken marked this pull request as ready for review on March 26, 2024, 11:44.
The `precision@2` score reflects that we return two articles in the
instant search results, so we want to know if we make that better or
worse.
@larsyencken (Contributor, Author) commented:

Following discussion of the instant search results on Slack, it now reports both precision@2 and precision@4.

{
  "name": "synthetic-queries-2024-03-25.json",
  "scope": "articles",
  "scores": {
    "precision@2": 0.338,
    "precision@4": 0.257
  },
  "numQueries": 4260,
  "algoliaApp": "74GKBOIDJQ",
  "algoliaIndex": "search-evaluation-algolia-pages"
}

@marcelgerber (Member) left a comment:

Brilliant work!
This works very well, and is also super fast.

Some of my review comments are only about using the Algolia types `SearchClient` and `SearchIndex`...

ALGOLIA_ID,
ALGOLIA_SEARCH_KEY,
} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
@marcelgerber (Member) commented:

Wondering if it makes sense to set SEARCH_EVAL_URL in this file directly, since right now it's more of a constant than a setting?

@marcelgerber (Member) commented:

Nice! I'm wondering if it would be helpful to have a verbose output (to a JSON file) that enumerates all the searches and the good/bad results.
Could be helpful to get an overview and find some low-hanging fruit for improvements.
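
(A minimal sketch of what such a verbose dump could look like, assuming a per-query record of good and bad result slugs; the field names and file layout are hypothetical, not anything from the PR.)

```ts
import * as fs from "fs"

// Illustrative sketch only: one JSON object per line, so the output is easy
// to grep and diff between runs. The field names are hypothetical.
interface VerboseScoredQuery {
    query: string
    precisionAt4: number
    goodResults: string[] // returned slugs that were expected
    badResults: string[] // returned slugs that were not expected
}

const writeVerboseResults = (
    scores: VerboseScoredQuery[],
    path: string
): void => {
    const lines = scores.map((s) => JSON.stringify(s)).join("\n")
    fs.writeFileSync(path, lines + "\n", "utf-8")
}
```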

Comment on lines +145 to +166
let activeQueries = 0
let i = 0
const scores: ScoredQuery[] = []

const next = async () => {
    if (i >= queries.length) return
    const query = queries[i++]
    activeQueries++
    const score = await simulateQuery(index, query)
    scores.push(score)
    activeQueries--
    if (i < queries.length) {
        await next()
    }
}

const promises = []
while (activeQueries < CONCURRENT_QUERIES && i < queries.length) {
    promises.push(next())
}

await Promise.all(promises)
@marcelgerber (Member) commented:

Hm, this code is a bit hard to follow in my mind.
I haven't tried the below, but I think we can make it nicer using the p-map dependency we're already using.

Suggested change: replace the whole block above with

    scores = await pMap(queries, (query) => simulateQuery(index, query), { concurrency: CONCURRENT_QUERIES })
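
(Spelled out, the suggestion would look roughly like the sketch below; it is untested, assumes `pMap` is imported from the p-map package, and declares `scores` via this assignment rather than pushing to it in a loop.)

```ts
import pMap from "p-map"

// Sketch of the p-map variant: the library manages the concurrency window,
// replacing the hand-rolled next()/activeQueries bookkeeping above.
const scores: ScoredQuery[] = await pMap(
    queries,
    (query) => simulateQuery(index, query),
    { concurrency: CONCURRENT_QUERIES }
)
```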

} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
import { getIndexName } from "./searchClient.js"
import algoliasearch from "algoliasearch"
@marcelgerber (Member) commented:

Suggested change: replace

    import algoliasearch from "algoliasearch"

with

    import algoliasearch, { SearchClient, SearchIndex } from "algoliasearch"

}
}

const getClient = (): any => {
@marcelgerber (Member) commented:

Suggested change: replace

    const getClient = (): any => {

with

    const getClient = (): SearchClient => {
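
(For reference, a minimal sketch of how the typed client and index could fit together with the Algolia v4 client; this is illustrative rather than the PR's code, and the index name passed to initIndex is only a placeholder, where the PR itself imports a getIndexName helper.)

```ts
import algoliasearch, { SearchClient, SearchIndex } from "algoliasearch"
import {
    ALGOLIA_ID,
    ALGOLIA_SEARCH_KEY,
} from "../../settings/clientSettings.js"

// Illustrative sketch only: typed Algolia v4 client and index helpers.
const getClient = (): SearchClient =>
    algoliasearch(ALGOLIA_ID, ALGOLIA_SEARCH_KEY)

// Placeholder index name; the real code would derive it from its settings.
const getArticlesIndex = (client: SearchClient): SearchIndex =>
    client.initIndex("search-evaluation-algolia-pages")
```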

}

const simulateQuery = async (
index: any,
@marcelgerber (Member) commented:

Suggested change: replace

    index: any,

with

    index: SearchIndex,

index: any,
query: Query
): Promise<ScoredQuery> => {
const { hits } = await index.search(query.query)
@marcelgerber (Member) commented on Mar 27, 2024:

Just a suggestion, but since we only ever look at the first 4 slugs anyhow:

Suggested change: replace

    const { hits } = await index.search(query.query)

with

    const { hits } = await index.search(query.query, {
        attributesToRetrieve: ["slug"],
        hitsPerPage: N_ARTICLES_LONG_RESULTS,
    })

}

const simulateQueries = async (
index: any,
@marcelgerber (Member) commented:

Suggested change: replace

    index: any,

with

    index: SearchIndex,

@marcelgerber (Member) commented:

Some non-critical remarks on the synthetic queries and our way of scoring them:

We should be aware that it will be impossible to score a precision of 100%.
For example, for the search query `21st`, the synthetic file expects us to return only `military-long-run-spending-perspective`, which is unrealistic for this query.
This is, at least partially, a result of the query-prefix generation we're doing as part of the processing. It would be interesting to see the results just for the full-length queries.

Also, another slight flaw of the precision metric is that if four results are expected for a query but we only return a single one (that is among them), then we achieve a score of 100%. But I guess that's inherent to a precision metric, and we would need to also implement recall if we want to overcome this flaw.
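
(If recall were added, a counterpart along the lines of the earlier precision sketch could look like the following; illustrative only, with the same hypothetical field names.)

```ts
// Illustrative sketch only: recall@k penalises returning fewer of the expected
// results, which plain precision does not. Hypothetical field names as before.
interface QueryResult {
    expectedSlugs: string[]
    returnedSlugs: string[]
}

const recallAtK = (result: QueryResult, k: number): number => {
    if (result.expectedSlugs.length === 0) return 1
    const topK = result.returnedSlugs.slice(0, k)
    const found = result.expectedSlugs.filter((slug) => topK.includes(slug))
    return found.length / result.expectedSlugs.length
}
```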

github-actions bot commented:

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

The github-actions bot added the stale label on Apr 11, 2024 and closed this PR on Apr 14, 2024.
@marcelgerber (Member) commented:

@larsyencken I think it makes sense to merge this PR in its current state.
The tool, as it is, is already really good and helpful.

marcelgerber reopened this PR on Apr 15, 2024.
The github-actions bot removed the stale label on Apr 16, 2024.

github-actions bot commented on May 1, 2024:

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

The github-actions bot added the stale label on May 1, 2024.
@marcelgerber (Member) commented:

I'm just gonna merge this as-is now, seeing as it's already useful and it doesn't touch any existing code.

marcelgerber merged commit d30a2b7 into master on May 1, 2024 (11 of 17 checks passed).
marcelgerber deleted the search-evaluation-algolia branch on May 1, 2024, 12:29.
@larsyencken (Contributor, Author) commented:

Thanks Marcel!
