
Implement group by #198

Draft · wants to merge 1 commit into `master`
Conversation

@arj03 (Member) commented Mar 22, 2022

Background

@mixmix was working on adding an option to crut list where you would get results ordered by updateTime. We would like to avoid running through all the results (db1 has to do this) in order to display, say, the latest 5 buttcasts. What we really need for this is a group by operator, similar to what a normal database offers. It is quite general: you could also use it to show the latest post for each feed.

I tried modelling this on the seek function. This means we still have to run through results to extract what we need, but since we are working directly on the bipf data, it's not really that bad. It can still be combined with regular where filtering, and when used with limit, it will stop once enough results are found.
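To illustrate the idea (a minimal sketch; `groupBy` and `seekKey` are hypothetical names for this example, not jitdb's actual internals), a group-by of this shape keeps the first record seen per key and bails out once `limit` groups are filled:

```javascript
// Hypothetical sketch of the proposed "group by": given records already
// ordered (e.g. by updateTime, newest first), keep only the first record
// seen for each group key, and stop as soon as `limit` groups are filled.
function groupBy(records, seekKey, limit) {
  const seen = new Set()
  const results = []
  for (const record of records) {
    const key = seekKey(record)
    if (seen.has(key)) continue // already have the latest for this group
    seen.add(key)
    results.push(record)
    if (results.length === limit) break // early bail-out, like `limit`
  }
  return results
}

// Latest post per feed, newest first:
const msgs = [
  { feed: 'alice', text: 'a2' },
  { feed: 'bob', text: 'b1' },
  { feed: 'alice', text: 'a1' }, // older, skipped
]
groupBy(msgs, (m) => m.feed, 2)
// → keeps { feed: 'alice', text: 'a2' } and { feed: 'bob', text: 'b1' }
```

In the real implementation the key would be extracted directly from the bipf buffer via a seek function, so no full record decoding is needed.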

```js
if (uniqueByValue.size == seq + limit) return false
},
(early) => {
  if (early) return
```
@arj03 (Member, Author) commented Mar 23, 2022

Not sure why we need this with push.drain; without it, we'll get called twice when bailing out early.

@github-actions

Benchmark results

| Part | Speed | Heap Change | Samples |
| --- | --- | --- | --- |
| Count 1 big index (3rd run) | 0.45ms ± 0.2ms | 18.33 kB ± 19.81 kB | 50 |
| Create an index twice concurrently | 593.75ms ± 6.74ms | -33.13 kB ± 56.61 kB | 92 |
| Load core indexes | 1.32ms ± 0.02ms | 97.3 B ± 224.57 B | 6960 |
| Load two indexes concurrently | 730.78ms ± 14.76ms | -23.32 kB ± 266.25 kB | 17 |
| Paginate 10 results | 24.92ms ± 1.02ms | 19.38 kB ± 13.07 kB | 27 |
| Paginate 20000 msgs with pageSize=5 | 5916.4ms ± 150.89ms | 2.61 MB ± 1.91 MB | 5 |
| Paginate 20000 msgs with pageSize=500 | 556.99ms ± 29.87ms | -523.32 kB ± 796.37 kB | 20 |
| Query 1 big index (1st run) | 824.39ms ± 9.06ms | 26.18 kB ± 58.09 kB | 66 |
| Query 1 big index (2nd run) | 257.78ms ± 0.85ms | 67.36 kB ± 51.14 kB | 52 |
| Query 3 indexes (1st run) | 870.22ms ± 8.33ms | 24.72 kB ± 78.52 kB | 63 |
| Query 3 indexes (2nd run) | 210.77ms ± 1.2ms | 36.53 kB ± 168.67 kB | 54 |
| Query a prefix map (1st run) | 255.5ms ± 31.03ms | -16.3 kB ± 248.32 kB | 24 |
| Query a prefix map (2nd run) | 11.68ms ± 1.01ms | 4.39 kB ± 24.52 kB | 21 |

@mixmix (Member) commented Mar 23, 2022

I don't quite understand the need for this, but I trust you.

To be clear, the way I get, e.g., the 5 most recently updated podcasts is: pull for type podcast (getting updates and roots), then run those through crut.read and throw out the duds. Once I have 5 good ones, kill the stream.

Oh, and we have a pull.unique in there too.
Is this what groupBy would be helping with?
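The pipeline described above, with the pull-stream plumbing stripped away, amounts to roughly the following (a sketch only: `latestFive`, `isPodcast`, `rootOf`, and `readRecord` are illustrative stand-ins for the pull.filter / pull.unique / crut.read steps, not real API):

```javascript
// Plain-JS restatement of the manual pipeline: filter by type, dedupe on
// the record's root (like pull.unique), read the full record (like
// crut.read), skip duds, and stop after 5 good results.
function latestFive(msgs, isPodcast, rootOf, readRecord) {
  const seenRoots = new Set()
  const results = []
  for (const msg of msgs) {
    if (!isPodcast(msg)) continue // pull for type podcast
    const root = rootOf(msg)
    if (seenRoots.has(root)) continue // pull.unique on the root
    const record = readRecord(root) // crut.read
    if (!record) continue // throw out the duds
    seenRoots.add(root)
    results.push(record)
    if (results.length === 5) break // kill the stream
  }
  return results
}

// Example: 8 podcast messages with distinct roots, no duds
const msgs = Array.from({ length: 8 }, (_, i) => ({
  type: 'podcast',
  root: 'r' + i,
}))
latestFive(msgs, (m) => m.type === 'podcast', (m) => m.root, (root) => ({ root }))
// stops after the first 5 roots
```

This is the same dedupe-and-take-N scan that a group-by operator would perform internally, just expressed at the application level.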

@arj03 (Member, Author) commented Mar 23, 2022

> to be clear, the way I get the 5 most recently updated eg podcasts is pull for type podcast (getting updates and roots). then run this through crut.read and throw out the duds. once I have 5 good ones kill the stream.
>
> oh, and we have a pull.unique in there too. is this what groupBy would be helping with.?

Now that you explain it, it is the same algorithm, so right now this is probably only marginally faster. It does maybe improve the discoverability of this kind of thing.

I do have an idea to use an index for this, which should make it faster. So challenge accepted :-) Changing this to draft.

@arj03 arj03 marked this pull request as draft March 23, 2022 07:19