🔨 Evaluate search performance for articles #3400
Conversation
It fetches a dataset of synthetic queries and evaluates the extent to which we surface good articles for the given queries. The scoring algorithm chosen for articles is `precision@4`, meaning the proportion of the first four results that are relevant, averaged over a large number of queries. This was chosen since at most four articles are presented un-collapsed, and the value of getting those four right is much higher than getting any right further down in the ranking. It does not yet score chart or explorer search.
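For reference, here is a minimal sketch of how a precision@k score can be computed. The helper below is illustrative only; its names and types are assumptions, not the script's actual code:

// Illustrative sketch: precision@k is the fraction of the first k returned
// results whose slugs appear in the query's set of relevant article slugs.
const precisionAtK = (
    results: string[],
    relevantSlugs: Set<string>,
    k: number
): number => {
    const topK = results.slice(0, k)
    if (topK.length === 0) return 0
    return topK.filter((slug) => relevantSlugs.has(slug)).length / topK.length
}

// The reported score is this value averaged over all evaluated queries.
const mean = (values: number[]): number =>
    values.reduce((sum, value) => sum + value, 0) / values.length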
Force-pushed from 92e631e to e5e1e85
The `precision@2` score reflects that we return two articles in the instant search results, so we want to know if we make that better or worse.
Following discussion of the instant search results on Slack, it now reports:

{
    "name": "synthetic-queries-2024-03-25.json",
    "scope": "articles",
    "scores": {
        "precision@2": 0.338,
        "precision@4": 0.257
    },
    "numQueries": 4260,
    "algoliaApp": "74GKBOIDJQ",
    "algoliaIndex": "search-evaluation-algolia-pages"
}
Brilliant work!
This works very well, and is super fast also.
Some of my review comments are only about using the Algolia types SearchClient and SearchIndex ...
    ALGOLIA_ID,
    ALGOLIA_SEARCH_KEY,
} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
Wondering if it makes sense to set SEARCH_EVAL_URL in this file directly, because right now it's more of a constant than a setting?
Nice! I'm wondering if it would be helpful to have a verbose output (to a JSON file) that enumerates all the searches and the good/bad results.
It could be helpful for getting an overview and finding some low-hanging fruit for improvements.
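As a rough illustration of what that could look like (the file name, helper name, and output shape here are assumptions, not part of this PR):

import fs from "fs"

// Hypothetical sketch: dump every scored query to a JSON file so that
// poorly-scoring searches can be inspected by hand.
const writeVerboseResults = (
    scores: ScoredQuery[],
    outPath = "search-eval-verbose.json"
): void => {
    fs.writeFileSync(outPath, JSON.stringify(scores, null, 2))
}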
let activeQueries = 0
let i = 0
const scores: ScoredQuery[] = []

const next = async () => {
    if (i >= queries.length) return
    const query = queries[i++]
    activeQueries++
    const score = await simulateQuery(index, query)
    scores.push(score)
    activeQueries--
    if (i < queries.length) {
        await next()
    }
}

const promises = []
while (activeQueries < CONCURRENT_QUERIES && i < queries.length) {
    promises.push(next())
}

await Promise.all(promises)
Hm, this code is a bit hard to follow in my mind.
Haven't tried the below, but I think we can make it nicer using the p-map dependency we're using.
Suggested change:

-let activeQueries = 0
-let i = 0
-const scores: ScoredQuery[] = []
-const next = async () => {
-    if (i >= queries.length) return
-    const query = queries[i++]
-    activeQueries++
-    const score = await simulateQuery(index, query)
-    scores.push(score)
-    activeQueries--
-    if (i < queries.length) {
-        await next()
-    }
-}
-const promises = []
-while (activeQueries < CONCURRENT_QUERIES && i < queries.length) {
-    promises.push(next())
-}
-await Promise.all(promises)
+const scores = await pMap(queries, (query) => simulateQuery(index, query), {
+    concurrency: CONCURRENT_QUERIES,
+})
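For context: p-map caps the number of mapper calls in flight at the given concurrency and returns results in input order, so the manual activeQueries bookkeeping and the recursive next() become unnecessary.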
} from "../../settings/clientSettings.js"
import { SEARCH_EVAL_URL } from "../../settings/serverSettings.js"
import { getIndexName } from "./searchClient.js"
import algoliasearch from "algoliasearch"
Suggested change:

-import algoliasearch from "algoliasearch"
+import algoliasearch, { SearchClient, SearchIndex } from "algoliasearch"
    }
}

const getClient = (): any => {
Suggested change:

-const getClient = (): any => {
+const getClient = (): SearchClient => {
}

const simulateQuery = async (
    index: any,
Suggested change:

-    index: any,
+    index: SearchIndex,
    index: any,
    query: Query
): Promise<ScoredQuery> => {
    const { hits } = await index.search(query.query)
Just a suggestion, but since we only ever look at the first 4 slugs anyhow:
Suggested change:

-const { hits } = await index.search(query.query)
+const { hits } = await index.search(query.query, {
+    attributesToRetrieve: ["slug"],
+    hitsPerPage: N_ARTICLES_LONG_RESULTS,
+})
}

const simulateQueries = async (
    index: any,
Suggested change:

-    index: any,
+    index: SearchIndex,
Some non-important remarks on the synthetic queries and our way of scoring them: we should be aware that it will be impossible to score a precision of 100%. Also, another slight flaw of the precision metric is that if four results are expected for a query but we only return a single one (that is among them), then we achieve a score of 100%. But I guess that's just inherent to a precision metric, and we would need to implement recall as well if we want to overcome this flaw.
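To make the recall point concrete, here is an illustrative sketch (an assumption-laden example, not code from this PR):

// Hypothetical sketch: recall@k is the share of the expected slugs that
// appear in the top k results. Returning one correct result when four are
// expected scores 0.25 here, where the precision metric would give 1.0.
const recallAtK = (
    results: string[],
    expectedSlugs: Set<string>,
    k: number
): number => {
    if (expectedSlugs.size === 0) return 0
    const topK = new Set(results.slice(0, k))
    let found = 0
    for (const slug of expectedSlugs) if (topK.has(slug)) found++
    return found / expectedSlugs.size
}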
This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.
@larsyencken I think it makes sense to merge this PR in the current state.
This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.
I'm just gonna merge this as-is now, seeing as it is already useful, and it doesn't touch any existing code.
Thanks Marcel!
This PR introduces make bench.search and a search evaluation script.

Overview

It currently fetches a dataset of synthetic queries and evaluates the extent to which we surface good articles for the given queries.
The scoring algorithm chosen for articles is precision@4, meaning the proportion of the first four results that are relevant, averaged over a large number of queries. The best possible result is 1, the worst is 0. This was chosen since at most four articles are presented un-collapsed, and the value of getting those four right is much higher than getting any right further down in the ranking.
It does not yet score chart or explorer search. When we do that, we may move to a more holistic search ranking score like Mean Average Precision (MAP).
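For reference, a sketch of the per-query average precision that MAP is built on; this is illustrative only, with assumed helper names, not part of the script:

// Hypothetical sketch: average precision evaluates precision at each rank
// where an expected result appears, then averages over the number of
// expected results. MAP is the mean of this value across all queries.
const averagePrecision = (
    results: string[],
    expectedSlugs: Set<string>
): number => {
    if (expectedSlugs.size === 0) return 0
    let found = 0
    let sum = 0
    results.forEach((slug, rank) => {
        if (expectedSlugs.has(slug)) {
            found++
            sum += found / (rank + 1)
        }
    })
    return sum / expectedSlugs.size
}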
Current results
Query datasets