Is your feature request related to a problem? Please describe.
bm25 on a field with no value returns 0.0, just like a field whose value does not match the query.
This is an issue in learning to rank, as a missing value (NaN or null) does not necessarily mean that the field fails to match; usually it simply has not been fed yet.
In my specific case, the issue arose with the field related_queries, a list of queries that could have been generated by the document (https://arxiv.org/abs/1904.08375). This field is therefore an array of strings. It is costly to compute, so only selected high-performing documents get this signal (e.g. high-PageRank documents first).
We nevertheless observed that bm25(related_queries) induces a large overfitting bias because of the confusion between a missing value and an array of non-relevant queries.
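To make the confusion concrete, here is a minimal sketch (hypothetical feature values, not the actual pipeline) of why collapsing "never computed" and "computed, no match" into the same 0.0 loses information:

```python
import math

# Hypothetical ranking features for three documents. None means the
# costly related_queries field was never computed for that document.
docs = [
    {"id": 1, "bm25_related_queries": 2.7},   # field present, matches
    {"id": 2, "bm25_related_queries": 0.0},   # field present, no match
    {"id": 3, "bm25_related_queries": None},  # field never computed
]

def current_behavior(score):
    # Today: a missing field yields the same 0.0 as a non-match.
    return 0.0 if score is None else score

def desired_behavior(score):
    # Requested: keep "missing" distinguishable (NaN, null, -1, ...).
    return math.nan if score is None else score

print([current_behavior(d["bm25_related_queries"]) for d in docs])
# → [2.7, 0.0, 0.0]  (docs 2 and 3 become indistinguishable)
print([desired_behavior(d["bm25_related_queries"]) for d in docs])
# → [2.7, 0.0, nan]
```

Under the current behavior an LTR model cannot tell documents 2 and 3 apart, so it learns from document 3 as if its related queries were irrelevant.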
Describe the solution you'd like
BM25 should not return 0.0 when no computation was involved. It should instead return null, NaN, -1, or any other value that could later be understood as a marker for a missing value rather than a mismatch with the query.
The same behavior happens for the following fields:
Describe alternatives you've considered
I was able to circumvent the issue, but at quite a cost in both implementation and efficiency:
- expose the field as an attribute (memory overhead)
- use tensorFromLabels to convert the array to a tensor (memory + CPU overhead)
- declare a specific ranking expression that checks the tensor size and returns the desired value (CPU overhead)
Example for the news tutorial application:
schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index | attribute
            index: enable-bm25
            match: text
        }
    }
    rank-profile test_related_queries inherits default {
        function tensor_related_queries() {
            expression: tensorFromLabels(attribute(related_queries), rel_q)
        }
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, return bm25;
            # otherwise return -1 (not 0.0, the bm25() default value).
            expression: if (reduce(tensor_related_queries(), count, rel_q) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
        summary-features {
            tensor_related_queries
            bm25_related_queries_with_nulls
        }
    }
}
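Downstream, the -1 sentinel produced by this workaround can be mapped back to a proper missing value before training. A minimal sketch, assuming the sentinel value and example scores below (they are not part of the Vespa schema itself):

```python
import math

# The workaround's rank profile returns -1 when related_queries was
# never computed; map it back to NaN before feeding the feature to an
# LTR model. SENTINEL and the example scores are assumptions.
SENTINEL = -1.0

def decode_bm25(value, sentinel=SENTINEL):
    """Turn the sentinel back into a real missing value so libraries
    with native missing-value support (e.g. gradient-boosted trees)
    can treat 'missing' and 'no match' differently."""
    return math.nan if value == sentinel else value

scores = [3.2, 0.0, -1.0]  # present/match, present/no match, missing
print([decode_bm25(v) for v in scores])  # → [3.2, 0.0, nan]
```

This keeps the sentinel out of the training data, at the cost of one extra pass over the extracted features.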
Additional context
Add any other context or screenshots about the feature request here.
Plenitude-ai changed the title from "Bm25 on missing/unspecified fields returns 0.0, when it should return null" to "Bm25 (and other lexical features) on missing/unspecified fields returns 0.0, when it should return null" on Nov 20, 2024.