Bm25 (and other lexical features) on missing/unspecified fields returns `0.0`, when it should return `null` #32905

Plenitude-ai · 2024-11-20T14:46:24Z

Is your feature request related to a problem? Please describe.
bm25 on a field with no value returns 0.0, exactly as it does for a field whose value does not match the query.
This is a problem for learning to rank: a missing value (NaN or null) does not necessarily mean that the field fails to match; usually the value simply has not been fed yet.

In my case, the issue arose with the field related_queries, a list of queries that could have been generated from the document (https://arxiv.org/abs/1904.08375), hence an array of strings. It is costly to compute, so only selected high-performing documents receive this signal (e.g. those with high PageRank first).
However, we observed that bm25(related_queries) induces a strong overfitting bias, because the model cannot distinguish a missing value from an array of non-relevant queries.

Describe the solution you'd like
BM25 should not return 0.0 when no computation was involved. It could return null, NaN, -1, or any other value that can later be recognized as a marker for a missing value rather than a mismatch with the query.

The same behavior occurs for the following rank features:

elementCompleteness(related_queries).completeness
elementCompleteness(related_queries).fieldCompleteness
elementCompleteness(related_queries).queryCompleteness
elementCompleteness(related_queries).elementWeight
elementSimilarity(related_queries)

Describe alternatives you've considered
I was able to work around the issue, but at considerable cost in both implementation effort and efficiency:

expose the field as an attribute (memory overhead)
use tensorFromLabels to convert the array to a tensor (memory + CPU overhead)
declare a dedicated ranking expression that checks the tensor size and returns the desired value (CPU overhead)

Example for the news tutorial application:

schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index | attribute
            index: enable-bm25
            match: text
        }
    }
    rank-profile test_related_queries inherits default {
        function tensor_related_queries() {
            expression: tensorFromLabels(attribute(related_queries), rel_q)
        }
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if (reduce(tensor_related_queries(), count, rel_q) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
        summary-features {
            tensor_related_queries
            bm25_related_queries_with_nulls
        }
    }
}
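
On the training side, the -1 marker still has to be turned back into a real missing value before the features reach the learning-to-rank model. A minimal Python sketch, assuming the summary features of a hit have already been parsed into a dict of floats; the function name sentinel_to_nan and the SENTINEL constant are illustrative and not part of Vespa:

```python
import math

# Marker returned by bm25_related_queries_with_nulls when the field is missing.
# Assumption: -1 can never be a genuine score for the feature.
SENTINEL = -1.0


def sentinel_to_nan(features, names, sentinel=SENTINEL):
    """Return a copy of `features` where sentinel values for `names` become NaN."""
    cleaned = dict(features)
    for name in names:
        if cleaned.get(name) == sentinel:
            cleaned[name] = math.nan
    return cleaned


# Hypothetical feature values for one hit.
hit_features = {"bm25_related_queries_with_nulls": -1.0, "bm25(title)": 0.0}
cleaned = sentinel_to_nan(hit_features, ["bm25_related_queries_with_nulls"])
```

Only the listed features are rewritten, so a genuine 0.0 mismatch on another feature stays a 0.0. Gradient-boosted tree libraries such as XGBoost and LightGBM treat NaN as a missing value, which is the distinction the model needs here.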

jobergum · 2024-11-21T09:14:23Z

Although we cannot change the feature's default value in Vespa 8, that could be considered for Vespa 9.

A much cheaper workaround avoids both keeping all the queries in memory as an attribute and the cost of converting them to a tensor: store only the number of related queries in a separate attribute field.

schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index
            index: enable-bm25
            match: text
        }
        field number_related_queries type int {
            indexing: summary | attribute
        }
    }
    rank-profile test_related_queries inherits default {
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if(attribute(number_related_queries) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
    }
}
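
The rank expression above compares number_related_queries against 0, so documents that have no related queries yet need that field fed explicitly as 0. A minimal feed-side sketch in Python, assuming the count is computed by the feeding client; the helper name build_put and the news namespace in the document id are illustrative assumptions:

```python
def build_put(doc_id, related_queries=None):
    """Build a put operation for the `news` schema as a plain dict.

    number_related_queries is always set, so a document without related
    queries carries an explicit 0 instead of an unset attribute.
    """
    queries = list(related_queries or [])
    fields = {"number_related_queries": len(queries)}
    if queries:
        # Only feed the array when there is something to index.
        fields["related_queries"] = queries
    return {"put": f"id:news:news::{doc_id}", "fields": fields}


# Hypothetical documents: one with generated queries, one still waiting for them.
with_queries = build_put("1", ["vespa bm25 missing field", "learning to rank nulls"])
without_queries = build_put("2")
```

Serializing these dicts with json.dumps gives operations in the Vespa document JSON format; when the related queries are computed later, the count has to be updated together with the array.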

Plenitude-ai · 2024-11-21T12:59:43Z

Yes that's a way better idea
Thank you ! 💯

jobergum · 2024-11-27T10:29:57Z

Note that I made the mistake of including attribute in the indexing statement of related_queries in the example above; I've edited it:

document news {
    field related_queries type array<string> {
        indexing: summary | index
        index: enable-bm25
        match: text
    }
}

Plenitude-ai · 2024-11-27T13:01:35Z

Yes I saw that, no worries

Plenitude-ai changed the title ~~Bm25 on missing/unspecified fields returns 0.0, when it should return null~~ Bm25 (and other lexical features) on missing/unspecified fields returns 0.0, when it should return null Nov 20, 2024

hmusum added the Vespa 9 label Nov 27, 2024

hmusum added this to the later milestone Nov 27, 2024

hmusum assigned geirst Nov 27, 2024
