-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MaxScore TAAT and blocked accumulators #14
base: main
Are you sure you want to change the base?
Conversation
@amallia I didn't have as much time as I thought I would, so I only managed to start some coding. Is |
Codecov Report
@@ Coverage Diff @@
## master #14 +/- ##
==========================================
+ Coverage 92.62% 92.81% +0.19%
==========================================
Files 87 90 +3
Lines 4960 5106 +146
==========================================
+ Hits 4594 4739 +145
- Misses 366 367 +1
Continue to review full report at Codecov.
|
From my understanding |
exhaustive_taat_query(Index const &index, WandType const &wdata, uint64_t k) | ||
: m_index(index), m_wdata(wdata), m_topk(k), m_accumulators(index.num_docs()) | ||
{ | ||
static_assert(accumulator_block_size >= 0, "must be non-negative"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always true since it is size_t, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess. I think I'm just not a big fan of unsigned integers for arithmetic. But I used it coz all the other sizes seem to be unsigned, and I left the assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it should be a signed integer. Just think about it. Someone could very well declare exhaustive_taat_query<.., .., -1>
thinking it's "no blocks" or by accident. It will compile and define the block size to a huge number. That's just weird and unexpected. That's precisely why you shouldn't use unsigned integers unless you do bit manipulation. Sadly, someone thought a while ago that a container size is size_t
, which is stupid.
auto cws = query::cursors_with_scores(m_index, m_wdata, terms); | ||
return taat(std::move(cws.first), std::move(cws.second)); | ||
} | ||
using score_function_type = std::function<float(uint64_t, uint64_t)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I agree the return type is always a float, not sure that it will take always the same input. Should we include this change in all the query algorithms? Maybe in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things here:
-
As to the separate PR, note that this is all internal, doesn't really affect anything outside. I wanted to extract it to another function so we have logic separated. You can revert it now but it makes little sense. Exposing it as an interface should, however, be done in a separate PR, I agree.
-
As to argument types, my thinking is the following: this is always the function of a posting: (doc, freq), which is (
int
,int
). The idea is for another function/method to return these functions. They might be, say, lambdas with captured index, which resolve the document size. It can be something completely different. The point is that by separating these two things, your processing algorithm cares not about what it must pass, it always passes the posting. Figuring out what to do with it falls onto index/scoring function, as it should.
for (uint32_t term = 0; term < cursors.size(); ++term) { | ||
auto &cursor = cursors[term]; | ||
auto &score_function = score_functions[term]; | ||
if constexpr (std::is_same_v<typename Cursor::enumerator_category, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not allow us to use encodings that are not block-based
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hence the if constexpr
here. This distinction is for three reasons:
- If we have blocks, it is more efficient to exhaustively iterate over a range instead of repeatedly calling
next()
. - I wanted to test out prefetching (not finished there yet).
- I want at some point I wanted to figure out if there's something we can do with SIMD there.
The 'standard' way would be implemented in else
.
auto const &freqs = cursor.frequency_buffer(); | ||
for (uint32_t idx = 0; idx < documents.size(); ++idx) { | ||
intrinsics::prefetch(&m_accumulators[documents[idx + 3]]); | ||
auto& accumulator = m_accumulators[documents[idx]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you measure any improvements here? There shouldn’t I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I haven't really found any yet.
There's simply no reason for you calling a pure function or a const method and not use the result. E.g., if you use: docenum.document_buffer(); it makes no sense, either you forgot to use it or it simply should be removed. The most reasonable approach is to give a warning there, which is precisely what
I would go even further:
Of course, there might be exceptions to the rules but I see no reason to avoid |
One thing to note here is that this is not finished. I'll push stuff sometimes to get it to the server. But I'm working slowly over Christmas so changes are slow to appear. That doesn't mean you can't comment, feel free, just keep that in mind that this is still work in progress to test out some stuff. |
I'm wondering about a few things.
|
In this case we need to make OpenMP required and link to the target:
ATM is not required because it is only used by |
src/queries.cpp
Outdated
@@ -107,32 +107,32 @@ void perftest(const std::string &index_filename, | |||
logger() << "Query type: " << t << std::endl; | |||
std::function<uint64_t(term_id_vec)> query_fun; | |||
if (t == "and") { | |||
query_fun = [&](term_id_vec query) { return and_query<false>()(index, query); }; | |||
query_fun = [&](term_id_vec query) { return and_query<IndexType, false>(index)(query); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the lambda here, do you? and_query(index)
is already a right type of function.
Removed "block traversal" since it was wrong and fixing it does not make any difference in terms of performance. pisa/include/pisa/block_posting_list.hpp Line 114 in 3672c4c
A block of values needs to be shifted by its universe. |
@amallia Should we merge this? |
I need to look at Maxscore and the blocked/lazy accumulators, but I already started. |
Nah, it's fine, just do your thing and let's merge once done. |
I have merged in a separate PR (#73) the simple |
Here's what should be done in this PR:
SIMD?