-
-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
topk_queue::finalize sorts by both score and ID (#508) #515
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportBase: 93.06% // Head: 93.06% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #515 +/- ##
==========================================
- Coverage 93.06% 93.06% -0.01%
==========================================
Files 90 90
Lines 4497 4511 +14
==========================================
+ Hits 4185 4198 +13
- Misses 312 313 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! These changes look reasonable to me.
// elements to the heap. | ||
std::sort(m_q.begin(), m_q.end(), sort_order); | ||
|
||
auto search_order = [](auto entry, auto score) { return entry.first > score; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add a comment here to explain what this is doing? I know this is unmodified behaviour, but since you are on it would be better to document the code.
}; | ||
// We have to do a full sort because it is not the exact same ordering as when pushing | ||
// elements to the heap. | ||
std::sort(m_q.begin(), m_q.end(), sort_order); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are switching to sort
then why don't we just use stable_sort
on the scores such that we maintain as second sorting condition the insertion time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure stable_sort
will preserve that. I actually haven't analyzed this in detail, but are you saying that given our implementation, we can be sure that the more recently inserted entry is always above in the heap or at the same level but on the left side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that heap_sort
creates some issues with reproducibility since two documents with equal score have no guarantees on the order they will appear in the list of results.
Stable sort will keep the relative order of elements with equivalent values, fixing the reproducibility issue.
I prefer this solution because:
(1) it is easier and less error prone
(2) it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what you're suggesting is to change the semantic: instead of ensuring that within the top-k results ties are broken by the docID order, you're proposing to not give any such guarantee, but rather say that the order of elements of with the same score is undefined, but deterministic, i.e., reproducible across multiple runs.
Do I understand correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment
You'll have to be more explicit here, not sure which issue we're resolving. From my understanding, you're solution simply changes the semantic of what we do, but maybe I'm missing something, so if you could be a bit more clear about what you mean and how it solves it, that would help.
@@ -172,6 +181,26 @@ struct topk_queue { | |||
} | |||
} | |||
|
|||
// We use our own function (as opposed to `std::heap_push`), to ensure that | |||
// heap implementation is consistent across standard libraries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does to ensure that heap implementation is consistent across standard libraries
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Means that if I compile on libstdc++ 12 and you compile on libc++ 10, then both of us have the same logic, and thus if you run the same exact algorithm on the same exact collection, etc., you will get the same results.
With std::push
, the implementation may change from version to version and implementation to implementation (LLVM/GNU/MS/Apple) so it's not necessarily reproducible across environments.
Using pair ordering (score, doc) woulds of course solve all of that, but that's not efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand! Makes sense. Right now we are not aware of differences in the implementation of libc++ 10 and libstdc++ 12, right?
My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL
Normally, I would agree, but this is somewhat special circumstances. Technically speaking, we probably would want to strictly define the order (score + docid) but we're not because of efficiency. But also we are already implementing the more complex pop operation, so we might as well finish the job. It's very little effort for possibly a significant improvement for someone down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put it this way: you're doing research, and have some results, and then you recompile after system upgrade or something, and it doesn't match; it's not life changing, but that's very annoying imo, and I personally see no reason not to fix that with one little function. It could save someone a headache down the line.
The final sorting order is now by score (descending) and docid (ascending). Furthermore, `std::push_heap` is replaced with our own implementation to maintain consistency across standard libraries.
The final sorting order is now by score (descending) and docid (ascending). Furthermore,
std::push_heap
is replaced with our own implementation to maintain consistency across standard libraries.Fixes #508