Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MaxScore TAAT and blocked accumulators #14

Open
wants to merge 39 commits into
base: main
Choose a base branch
from
Open

MaxScore TAAT and blocked accumulators #14

wants to merge 39 commits into from

Conversation

elshize
Copy link
Member

@elshize elshize commented Dec 20, 2018

Here's what should be done in this PR:

  • Exhaustive TAAT
    • Lazy initialization
    • Prefetching
    • SIMD?
    • Block-wise aggregation
  • TAAT MaxScore

@elshize elshize added the enhancement New feature or request label Dec 20, 2018
@elshize elshize self-assigned this Dec 20, 2018
@elshize
Copy link
Member Author

elshize commented Dec 20, 2018

@amallia I didn't have as much time as I thought I would, so I only managed to start some coding.

Is queries program supposed to be used for benchmarking?

@codecov
Copy link

codecov bot commented Dec 20, 2018

Codecov Report

Merging #14 into master will increase coverage by 0.19%.
The diff coverage is 99.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #14      +/-   ##
==========================================
+ Coverage   92.62%   92.81%   +0.19%     
==========================================
  Files          87       90       +3     
  Lines        4960     5106     +146     
==========================================
+ Hits         4594     4739     +145     
- Misses        366      367       +1
Impacted Files Coverage Δ
include/pisa/topk_queue.hpp 100% <ø> (ø) ⬆️
include/pisa/accumulator/simple_accumulator.hpp 100% <ø> (ø) ⬆️
include/pisa/query/queries.hpp 91.42% <ø> (ø) ⬆️
test/test_ranked_queries.cpp 100% <100%> (ø) ⬆️
include/pisa/accumulator/blocked_accumulator.hpp 100% <100%> (ø)
include/pisa/accumulator/lazy_accumulator.hpp 100% <100%> (ø)
...clude/pisa/query/algorithm/maxscore_taat_query.hpp 98.59% <98.59%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23ba4a3...48a0f80. Read the comment docs.

@amallia
Copy link
Member

amallia commented Dec 23, 2018

From my understanding nodiscard is used to enforce the programmer to use the returned value of a function. For example, this can be useful when an error is returned and the error checking has to be enforced. In this context I believe nodiscard can be avoided.

exhaustive_taat_query(Index const &index, WandType const &wdata, uint64_t k)
: m_index(index), m_wdata(wdata), m_topk(k), m_accumulators(index.num_docs())
{
static_assert(accumulator_block_size >= 0, "must be non-negative");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always true since it is size_t, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess. I think I'm just not a big fan of unsigned integers for arithmetic. But I used it coz all the other sizes seem to be unsigned, and I left the assert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it should be a signed integer. Just think about it. Someone could very well declare exhaustive_taat_query<.., .., -1> thinking it's "no blocks" or by accident. It will compile and define the block size to a huge number. That's just weird and unexpected. That's precisely why you shouldn't use unsigned integers unless you do bit manipulation. Sadly, someone thought a while ago that a container size is size_t, which is stupid.

auto cws = query::cursors_with_scores(m_index, m_wdata, terms);
return taat(std::move(cws.first), std::move(cws.second));
}
using score_function_type = std::function<float(uint64_t, uint64_t)>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree the return type is always a float, not sure that it will take always the same input. Should we include this change in all the query algorithms? Maybe in a separate PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

  1. As to the separate PR, note that this is all internal, doesn't really affect anything outside. I wanted to extract it to another function so we have logic separated. You can revert it now but it makes little sense. Exposing it as an interface should, however, be done in a separate PR, I agree.

  2. As to argument types, my thinking is the following: this is always the function of a posting: (doc, freq), which is (int, int). The idea is for another function/method to return these functions. They might be, say, lambdas with captured index, which resolve the document size. It can be something completely different. The point is that by separating these two things, your processing algorithm cares not about what it must pass, it always passes the posting. Figuring out what to do with it falls onto index/scoring function, as it should.

include/query/algorithm/exhaustive_taat_query.hpp Outdated Show resolved Hide resolved
for (uint32_t term = 0; term < cursors.size(); ++term) {
auto &cursor = cursors[term];
auto &score_function = score_functions[term];
if constexpr (std::is_same_v<typename Cursor::enumerator_category,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not allow us to use encodings that are not block-based

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence the if constexpr here. This distinction is for three reasons:

  1. If we have blocks, it is more efficient to exhaustively iterate over a range instead of repeatedly calling next().
  2. I wanted to test out prefetching (not finished there yet).
  3. I want at some point I wanted to figure out if there's something we can do with SIMD there.

The 'standard' way would be implemented in else.

auto const &freqs = cursor.frequency_buffer();
for (uint32_t idx = 0; idx < documents.size(); ++idx) {
intrinsics::prefetch(&m_accumulators[documents[idx + 3]]);
auto& accumulator = m_accumulators[documents[idx]];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you measure any improvements here? There shouldn’t I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I haven't really found any yet.

@elshize
Copy link
Member Author

elshize commented Dec 23, 2018

From my understanding nodiscard is used to enforce the programmer to use the returned value of a function. For example, this can be useful when an error is returned and the error checking has to be enforced. In this context I believe nodiscard can be avoided.

There's simply no reason for you calling a pure function or a const method and not use the result. E.g., if you use:

docenum.document_buffer();

it makes no sense, either you forgot to use it or it simply should be removed. The most reasonable approach is to give a warning there, which is precisely what nodiscard will do. So:

  • All pure free functions (not changing any state) should be nodiscard by default.
  • All const methods should be nodiscard by default.

I would go even further:

  • Any function whose return value in 99% of cases should be used, should be nodiscard by default. In the rare cases when you might not want to do that, there's always [[maybe_unused]] attribute.

Of course, there might be exceptions to the rules but I see no reason to avoid nodiscard. In fact, this might be a language default if not for backward-compatibility. For example, it is default in Rust.

@elshize
Copy link
Member Author

elshize commented Dec 23, 2018

One thing to note here is that this is not finished. I'll push stuff sometimes to get it to the server. But I'm working slowly over Christmas so changes are slow to appear.

That doesn't mean you can't comment, feel free, just keep that in mind that this is still work in progress to test out some stuff.

@elshize
Copy link
Member Author

elshize commented Jan 14, 2019

I'm wondering about a few things.

  1. Why two of the tests don't find OpenMP but the rest does?
  2. If we intend to use it, we should just require it in CMake. I don't see a reason not to, and optional dependencies complicate things that should be simple (like including a header file).

@amallia
Copy link
Member

amallia commented Jan 14, 2019

Why two of the tests don't find OpenMP but the rest does?

In this case we need to make OpenMP required and link to the target:

find_package(OpenMP REQUIRED)
target_link_libraries(target_name PRIVATE OpenMP::OpenMP_CXX)

If we intend to use it, we should just require it in CMake. I don't see a reason not to, and optional dependencies complicate things that should be simple (like including a header file).

ATM is not required because it is only used by stxxl and it is optional there, since it works without OpenMP too.

src/queries.cpp Outdated
@@ -107,32 +107,32 @@ void perftest(const std::string &index_filename,
logger() << "Query type: " << t << std::endl;
std::function<uint64_t(term_id_vec)> query_fun;
if (t == "and") {
query_fun = [&](term_id_vec query) { return and_query<false>()(index, query); };
query_fun = [&](term_id_vec query) { return and_query<IndexType, false>(index)(query); };
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the lambda here, do you? and_query(index) is already a right type of function.

@amallia
Copy link
Member

amallia commented Jan 16, 2019

Removed "block traversal" since it was wrong and fixing it does not make any difference in terms of performance.
BTW the problem was:

m_cur_docid = m_universe;

A block of values needs to be shifted by its universe.

@elshize
Copy link
Member Author

elshize commented Jan 17, 2019

@amallia Should we merge this?

@amallia
Copy link
Member

amallia commented Jan 17, 2019

I need to look at Maxscore and the blocked/lazy accumulators, but I already started.
We could create a separate branch containing simple accumulator and exhaustive taat and we merge it, in the meantime I look at the rest. What do you think?

@elshize
Copy link
Member Author

elshize commented Jan 17, 2019

I need to look at Maxscore and the blocked/lazy accumulators, but I already started.
We could create a separate branch containing simple accumulator and exhaustive taat and we merge it, in the meantime I look at the rest. What do you think?

Nah, it's fine, just do your thing and let's merge once done.

@amallia amallia mentioned this pull request Jan 21, 2019
3 tasks
@amallia amallia changed the title First take at exhaustive TAAT MaxScore TAAT and blocked/lazy accumulators Jan 21, 2019
@amallia
Copy link
Member

amallia commented Jan 21, 2019

I have merged in a separate PR (#73) the simple ranked_or_taat algorithm.
There were also some general improvements to the query algorithms, so I thought it was worth to merge separately.

@amallia amallia changed the title MaxScore TAAT and blocked/lazy accumulators MaxScore TAAT and blocked accumulators Apr 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wip
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants