New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

MaxScore TAAT and blocked accumulators #14

Open

elshize wants to merge 39 commits into main from taat

Member

elshize commented Dec 20, 2018 •

edited

Loading

Here's what should be done in this PR:

elshize added the enhancement label

elshize self-assigned this

Member Author

elshize commented Dec 20, 2018

@amallia I didn't have as much time as I thought I would, so I only managed to start some coding.

Is queries program supposed to be used for benchmarking?

codecov bot commented Dec 20, 2018 •

edited

Loading

Codecov Report

Merging #14 into master will increase coverage by 0.19%.
The diff coverage is 99.31%.

@@            Coverage Diff             @@
##           master      #14      +/-   ##
==========================================
+ Coverage   92.62%   92.81%   +0.19%     
==========================================
  Files          87       90       +3     
  Lines        4960     5106     +146     
==========================================
+ Hits         4594     4739     +145     
- Misses        366      367       +1

Impacted Files	Coverage Δ
include/pisa/topk_queue.hpp	`100% <ø> (ø)`	⬆️
include/pisa/accumulator/simple_accumulator.hpp	`100% <ø> (ø)`	⬆️
include/pisa/query/queries.hpp	`91.42% <ø> (ø)`	⬆️
test/test_ranked_queries.cpp	`100% <100%> (ø)`	⬆️
include/pisa/accumulator/blocked_accumulator.hpp	`100% <100%> (ø)`
include/pisa/accumulator/lazy_accumulator.hpp	`100% <100%> (ø)`
...clude/pisa/query/algorithm/maxscore_taat_query.hpp	`98.59% <98.59%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23ba4a3...48a0f80. Read the comment docs.

Member

amallia commented Dec 23, 2018

From my understanding nodiscard is used to enforce the programmer to use the returned value of a function. For example, this can be useful when an error is returned and the error checking has to be enforced. In this context I believe nodiscard can be avoided.

amallia reviewed

View reviewed changes

include/query/algorithm/exhaustive_taat_query.hpp Outdated

+                  exhaustive_taat_query(Index const &index, WandType const &wdata, uint64_t k)
+                      : m_index(index), m_wdata(wdata), m_topk(k), m_accumulators(index.num_docs())
+                  {
+                      static_assert(accumulator_block_size >= 0, "must be non-negative");

Member

amallia Dec 23, 2018

Always true since it is size_t, right?

Member Author

elshize Dec 23, 2018

I guess. I think I'm just not a big fan of unsigned integers for arithmetic. But I used it coz all the other sizes seem to be unsigned, and I left the assert.

Member Author

elshize Dec 23, 2018

Actually, it should be a signed integer. Just think about it. Someone could very well declare exhaustive_taat_query<.., .., -1> thinking it's "no blocks" or by accident. It will compile and define the block size to a huge number. That's just weird and unexpected. That's precisely why you shouldn't use unsigned integers unless you do bit manipulation. Sadly, someone thought a while ago that a container size is size_t, which is stupid.

include/query/algorithm/exhaustive_taat_query.hpp Outdated

+                      auto cws = query::cursors_with_scores(m_index, m_wdata, terms);
+                      return taat(std::move(cws.first), std::move(cws.second));
+                  }
+                  using score_function_type = std::function<float(uint64_t, uint64_t)>;

Member

amallia Dec 23, 2018

Although I agree the return type is always a float, not sure that it will take always the same input. Should we include this change in all the query algorithms? Maybe in a separate PR?

Member Author

elshize Dec 23, 2018

Two things here:

As to the separate PR, note that this is all internal, doesn't really affect anything outside. I wanted to extract it to another function so we have logic separated. You can revert it now but it makes little sense. Exposing it as an interface should, however, be done in a separate PR, I agree.
As to argument types, my thinking is the following: this is always the function of a posting: (doc, freq), which is (int, int). The idea is for another function/method to return these functions. They might be, say, lambdas with captured index, which resolve the document size. It can be something completely different. The point is that by separating these two things, your processing algorithm cares not about what it must pass, it always passes the posting. Figuring out what to do with it falls onto index/scoring function, as it should.

include/query/algorithm/exhaustive_taat_query.hpp Outdated Show resolved Hide resolved

include/query/algorithm/exhaustive_taat_query.hpp Outdated

+                      for (uint32_t term = 0; term < cursors.size(); ++term) {
+                          auto &cursor         = cursors[term];
+                          auto &score_function = score_functions[term];
+                          if constexpr (std::is_same_v<typename Cursor::enumerator_category,

Member

amallia Dec 23, 2018

This will not allow us to use encodings that are not block-based

Member Author

elshize Dec 23, 2018

Hence the if constexpr here. This distinction is for three reasons:

If we have blocks, it is more efficient to exhaustively iterate over a range instead of repeatedly calling next().
I wanted to test out prefetching (not finished there yet).
I want at some point I wanted to figure out if there's something we can do with SIMD there.

The 'standard' way would be implemented in else.

amallia reviewed

View reviewed changes

include/query/algorithm/exhaustive_taat_query.hpp Outdated

+                                  auto const &freqs     = cursor.frequency_buffer();
+                                  for (uint32_t idx = 0; idx < documents.size(); ++idx) {
+                                      intrinsics::prefetch(&m_accumulators[documents[idx + 3]]);
+                                      auto& accumulator = m_accumulators[documents[idx]];

Member

amallia Dec 23, 2018

Did you measure any improvements here? There shouldn’t I think

Member Author

elshize Dec 23, 2018

Yeah, I haven't really found any yet.

Member Author

elshize commented Dec 23, 2018

From my understanding nodiscard is used to enforce the programmer to use the returned value of a function. For example, this can be useful when an error is returned and the error checking has to be enforced. In this context I believe nodiscard can be avoided.

There's simply no reason for you calling a pure function or a const method and not use the result. E.g., if you use:

docenum.document_buffer();

it makes no sense, either you forgot to use it or it simply should be removed. The most reasonable approach is to give a warning there, which is precisely what nodiscard will do. So:

All pure free functions (not changing any state) should be nodiscard by default.
All const methods should be nodiscard by default.

I would go even further:

Any function whose return value in 99% of cases should be used, should be nodiscard by default. In the rare cases when you might not want to do that, there's always [[maybe_unused]] attribute.

Of course, there might be exceptions to the rules but I see no reason to avoid nodiscard. In fact, this might be a language default if not for backward-compatibility. For example, it is default in Rust.

Member Author

elshize commented Dec 23, 2018

One thing to note here is that this is not finished. I'll push stuff sometimes to get it to the server. But I'm working slowly over Christmas so changes are slow to appear.

That doesn't mean you can't comment, feel free, just keep that in mind that this is still work in progress to test out some stuff.

elshize mentioned this pull request

Enable using threshold/threshold estimates for performance/correctness test #23

Closed

amallia added the wip label

Member Author

elshize commented Jan 14, 2019

I'm wondering about a few things.

Why two of the tests don't find OpenMP but the rest does?
If we intend to use it, we should just require it in CMake. I don't see a reason not to, and optional dependencies complicate things that should be simple (like including a header file).

Member

amallia commented Jan 14, 2019 •

edited

Loading

Why two of the tests don't find OpenMP but the rest does?

In this case we need to make OpenMP required and link to the target:

find_package(OpenMP REQUIRED)
target_link_libraries(target_name PRIVATE OpenMP::OpenMP_CXX)

If we intend to use it, we should just require it in CMake. I don't see a reason not to, and optional dependencies complicate things that should be simple (like including a header file).

ATM is not required because it is only used by stxxl and it is optional there, since it works without OpenMP too.

elshize and others added 17 commits

January 16, 2019 12:34


          First take at exhaustive TAAT

653d21e


          Fetch an entire block at a time for TAAT

8f0b54a


          Add prefetching for TAAT

9f4e325


          Return buffer references from posting lists instead of moved buffers.

c28e5d2


          Blocked accumulator array for TAAT

5d5519b


          TAAT maxscore and lazy accumulator

886e580


          TAAT MaxScore and Blocked Accumulator

b0e2824


          Remove heap stuff

7737d1f


          TAAT optimizations

0664cba


          Vectorize lookup traversal.

9ae6b1a


          Simple but effective ranked_or with taat

26a05f1


          Remove OpenMP

41ca198


          Remove OpenMP

200be05


          Use template rather than std::function for faster processing

c3400d1


          Lazy accumulator fixed

87252f3


          Fix block traversal issue

b9dd1fd


          Removed ds2i namespace

5c539b4

amallia force-pushed the taat branch from 44a5b03 to 6e96d16 Compare

January 16, 2019 12:22

amallia added 3 commits

January 16, 2019 13:31


          code cleanup

f2c93f9


          Added comment [skip ci]

7ca53d5


          Improved queries interface

30782ac

elshize commented

View reviewed changes

src/queries.cpp Outdated

    
            @@ -107,32 +107,32 @@ void perftest(const std::string &index_filename,
          
                      logger() << "Query type: " << t << std::endl;

                      std::function<uint64_t(term_id_vec)> query_fun;

                      if (t == "and") {

                          query_fun = [&](term_id_vec query) { return and_query<false>()(index, query); };

                          query_fun = [&](term_id_vec query) { return and_query<IndexType, false>(index)(query); };

Member Author

elshize Jan 16, 2019

You don't need the lambda here, do you? and_query(index) is already a right type of function.

amallia added 2 commits

January 16, 2019 16:14


          removed unused lambdas

3649a7d


          Removed buffers

81db028

Member

amallia commented Jan 16, 2019

Removed "block traversal" since it was wrong and fixing it does not make any difference in terms of performance.
BTW the problem was:

pisa/include/pisa/block_posting_list.hpp

Line 114 in 3672c4c

m_cur_docid = m_universe;

A block of values needs to be shifted by its universe.


          Merge branch 'master' into taat

09fbc02

Member Author

elshize commented Jan 17, 2019

@amallia Should we merge this?

Member

amallia commented Jan 17, 2019

I need to look at Maxscore and the blocked/lazy accumulators, but I already started.
We could create a separate branch containing simple accumulator and exhaustive taat and we merge it, in the meantime I look at the rest. What do you think?

Member Author

elshize commented Jan 17, 2019

I need to look at Maxscore and the blocked/lazy accumulators, but I already started.
We could create a separate branch containing simple accumulator and exhaustive taat and we merge it, in the meantime I look at the rest. What do you think?

Nah, it's fine, just do your thing and let's merge once done.


          Simplified Maxscore

f1a8b64

amallia mentioned this pull request

Ranked OR TAAT #73

Merged

3 tasks

amallia added 2 commits

January 21, 2019 16:14


          Merge branch 'master' into taat

eeacec4


          Delete exhaustive_taat_query.hpp

d08a063

amallia changed the title ~~First take at exhaustive TAAT~~ MaxScore TAAT and blocked/lazy accumulators

Member

amallia commented Jan 21, 2019

I have merged in a separate PR (#73) the simple ranked_or_taat algorithm.
There were also some general improvements to the query algorithms, so I thought it was worth to merge separately.

amallia and others added 11 commits

January 21, 2019 17:51


          Update test_ranked_queries.cpp

f254277


          Update test_ranked_queries.cpp

b92e93d


          Faster MaxScore

c8c1b35


          Merge branch 'master' into taat

e7cd360


          Added extra check

e60142b


          Merge branch 'master' into taat

a1d25d4


          Merge remote-tracking branch 'origin/master' into taat

2e12eff


          Use int for 8 bits

18be6a0


          Bug fix: reset counter

48a0f80


          Merge branch 'master' into taat

abf75fc


          Merge branch 'master' into taat

cb67ba9

amallia changed the title ~~MaxScore TAAT and blocked/lazy accumulators~~ MaxScore TAAT and blocked accumulators

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement wip