Bm25 (and other lexical features) on missing/unspecified fields returns 0.0, when it should return null #32905

Open
Plenitude-ai opened this issue Nov 20, 2024 · 4 comments

Plenitude-ai commented Nov 20, 2024

Is your feature request related to a problem? Please describe.
bm25 on a field with no value returns 0.0, the same score as a field whose value does not match the query.
This is a problem for learning to rank: a missing value (NaN or null) does not necessarily mean that the field would not match at all. Usually the field simply has not been fed yet.

In my specific case, the issue arose with the field related_queries, a list of queries that could have been generated by the document (https://arxiv.org/abs/1904.08375). This field is therefore an array of strings. It is costly to compute, so only selected high-performing documents get this signal (e.g. high PageRank first).
However, we observed that bm25(related_queries) induces a huge overfitting bias, because a missing value cannot be distinguished from an array of non-relevant queries.

Describe the solution you'd like
bm25 should not return 0.0 when no computation was involved. It could return null, NaN, -1, or any other value that can later be interpreted as a marker for a missing value rather than a mismatch with the query.

The same behavior occurs for the following features:

  • elementCompleteness(related_queries).completeness
  • elementCompleteness(related_queries).fieldCompleteness
  • elementCompleteness(related_queries).queryCompleteness
  • elementCompleteness(related_queries).elementWeight
  • elementSimilarity(related_queries)

Describe alternatives you've considered
I was able to work around the issue, but at a significant cost in both implementation effort and efficiency:

  • expose the field as an attribute (memory overhead)
  • use tensorFromLabels to convert the array to a tensor (memory + CPU overhead)
  • declare a dedicated ranking expression that checks the tensor size and returns the desired value (CPU overhead)

Example for the news tutorial application:

schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index | attribute
            index: enable-bm25
            match: text
        }
    }
    rank-profile test_related_queries inherits default {
        function tensor_related_queries() {
            expression: tensorFromLabels(attribute(related_queries), rel_q)
        }
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if (reduce(tensor_related_queries(), count, rel_q) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
        summary-features {
            tensor_related_queries
            bm25_related_queries_with_nulls
        }
    }
}
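
On the training side, the -1 sentinel returned by bm25_related_queries_with_nulls still has to be turned into a real missing value before it reaches the learning-to-rank model. The following is a minimal sketch of that step, not part of the original workaround. It assumes the function appears in the hit's summary features under its plain name and that bm25 itself never yields a negative score, so -1 is unambiguous. The helper name and the sample dictionaries are hypothetical.

```python
import math

# Sentinel emitted by bm25_related_queries_with_nulls when the field is missing.
SENTINEL = -1.0


def to_training_value(summaryfeatures, name="bm25_related_queries_with_nulls"):
    """Return the feature value, mapping the -1 sentinel to NaN.

    `summaryfeatures` is assumed to be the summary-features dict of one hit.
    """
    value = float(summaryfeatures[name])
    return math.nan if value == SENTINEL else value


# Hypothetical summary features for two hits:
# one without related_queries, one whose related_queries do not match.
missing = {"bm25_related_queries_with_nulls": -1.0}
mismatch = {"bm25_related_queries_with_nulls": 0.0}

print(to_training_value(missing))   # missing field, treated as NaN
print(to_training_value(mismatch))  # genuine mismatch, stays 0.0
```

Learners that handle NaN natively can then treat the two cases differently instead of seeing 0.0 for both.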


Plenitude-ai changed the title from "Bm25 on missing/unspecified fields returns 0.0, when it should return null" to "Bm25 (and other lexical features) on missing/unspecified fields returns 0.0, when it should return null" on Nov 20, 2024

jobergum (Member) commented Nov 21, 2024

Although we cannot change the feature's default value in Vespa 8, that could be considered for Vespa 9.

A much cheaper workaround avoids both putting all the queries in memory with attribute and the cost of converting them to a tensor: store the number of related queries in a separate int attribute and test that instead.

schema news {
    document news {
        field related_queries type array<string> {
            indexing: summary | index
            index: enable-bm25
            match: text
        }
        field number_related_queries type int {
            indexing: summary | attribute
        }
    }
    rank-profile test_related_queries inherits default {
        function bm25_related_queries_with_nulls() {
            # If related_queries is specified, returns bm25
            # else returns -1 (and not 0.0, the bm25() default value)
            expression: if (attribute(number_related_queries) == 0, -1, bm25(related_queries))
        }
        first-phase {
            expression: bm25_related_queries_with_nulls
        }
    }
}
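
The workaround above depends on number_related_queries being fed alongside related_queries. As a minimal sketch of the feeding side, the helper below builds a put operation in Vespa's document JSON format and derives the count from the list. The expression compares the attribute against 0, so this sketch assumes the feeder writes an explicit 0 for documents without related queries. The helper name, the "news" namespace, and the document IDs are hypothetical.

```python
import json


def to_feed_operation(doc_id, related_queries=None, namespace="news"):
    """Build a put operation for the news schema (hypothetical helper).

    number_related_queries is always written, with an explicit 0 when
    there are no related queries, so the rank expression can test == 0.
    """
    queries = list(related_queries or [])
    fields = {"number_related_queries": len(queries)}
    if queries:
        # Only send the array when it has content.
        fields["related_queries"] = queries
    return {"put": f"id:{namespace}:news::{doc_id}", "fields": fields}


print(json.dumps(to_feed_operation("doc-1", ["vespa bm25 missing field"])))
print(json.dumps(to_feed_operation("doc-2")))
```

Deriving the count in one place keeps the two fields consistent, which matters because the rank expression trusts the count rather than inspecting the array.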

Plenitude-ai (Author) commented

Yes, that's a much better idea.
Thank you! 💯

jobergum (Member) commented

Note that I mistakenly included attribute in the indexing statement of the example above; I've edited it:

document news {
    field related_queries type array<string> {
        indexing: summary | index
        index: enable-bm25
        match: text
    }
}

Plenitude-ai (Author) commented

Yes I saw that, no worries

hmusum added the Vespa 9 label on Nov 27, 2024
hmusum added this to the later milestone on Nov 27, 2024