
Implement group by #198

Draft · wants to merge 1 commit into `master`
Conversation

@arj03 (Member) commented Mar 22, 2022

Background

@mixmix was working on adding an option to crut list where you would get results ordered by updateTime. We would like to avoid running through all the results (db1 has to do this) in order to display, say, the latest 5 buttcasts. What we really need for this is a group by operator, similar to what a normal database offers. It is quite general: you could also use it to show the latest post for each feed.

I tried modelling this on the seek function. This means we still have to run through results to extract what we need, but since we are working directly on the bipf data, it's not really that bad. It can still be combined with regular where filtering, and when used with limit, it will stop once enough results are found.
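To illustrate the idea (a minimal sketch; `groupBy` and `seekKey` are hypothetical names for this example, not jitdb's actual internals), a group-by of this shape keeps the first record seen per key and bails out once `limit` groups are filled:

```javascript
// Hypothetical sketch of the proposed "group by": given records already
// ordered (e.g. by updateTime, newest first), keep only the first record
// seen for each group key, and stop as soon as `limit` groups are filled.
function groupBy(records, seekKey, limit) {
  const seen = new Set()
  const results = []
  for (const record of records) {
    const key = seekKey(record)
    if (seen.has(key)) continue // already have the latest for this group
    seen.add(key)
    results.push(record)
    if (results.length === limit) break // early bail-out, like `limit`
  }
  return results
}

// Latest post per feed, newest first:
const msgs = [
  { feed: 'alice', text: 'a2' },
  { feed: 'bob', text: 'b1' },
  { feed: 'alice', text: 'a1' }, // older, skipped
]
groupBy(msgs, (m) => m.feed, 2)
// → keeps { feed: 'alice', text: 'a2' } and { feed: 'bob', text: 'b1' }
```

In the real implementation the key would be extracted directly from the bipf buffer via a seek function, so no full record decoding is needed.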

```js
if (uniqueByValue.size == seq + limit) return false
},
(early) => {
  if (early) return
```
@arj03 (Member, Author) commented Mar 23, 2022

Not sure why we need this with push.drain; without it, we'll get called twice when bailing out early.

@github-actions

Benchmark results

| Part | Speed | Heap Change | Samples |
| --- | --- | --- | --- |
| Count 1 big index (3rd run) | 0.45ms ± 0.2ms | 18.33 kB ± 19.81 kB | 50 |
| Create an index twice concurrently | 593.75ms ± 6.74ms | -33.13 kB ± 56.61 kB | 92 |
| Load core indexes | 1.32ms ± 0.02ms | 97.3 B ± 224.57 B | 6960 |
| Load two indexes concurrently | 730.78ms ± 14.76ms | -23.32 kB ± 266.25 kB | 17 |
| Paginate 10 results | 24.92ms ± 1.02ms | 19.38 kB ± 13.07 kB | 27 |
| Paginate 20000 msgs with pageSize=5 | 5916.4ms ± 150.89ms | 2.61 MB ± 1.91 MB | 5 |
| Paginate 20000 msgs with pageSize=500 | 556.99ms ± 29.87ms | -523.32 kB ± 796.37 kB | 20 |
| Query 1 big index (1st run) | 824.39ms ± 9.06ms | 26.18 kB ± 58.09 kB | 66 |
| Query 1 big index (2nd run) | 257.78ms ± 0.85ms | 67.36 kB ± 51.14 kB | 52 |
| Query 3 indexes (1st run) | 870.22ms ± 8.33ms | 24.72 kB ± 78.52 kB | 63 |
| Query 3 indexes (2nd run) | 210.77ms ± 1.2ms | 36.53 kB ± 168.67 kB | 54 |
| Query a prefix map (1st run) | 255.5ms ± 31.03ms | -16.3 kB ± 248.32 kB | 24 |
| Query a prefix map (2nd run) | 11.68ms ± 1.01ms | 4.39 kB ± 24.52 kB | 21 |

@mixmix (Member) commented Mar 23, 2022

I don't quite understand the need for this, but I trust you.

To be clear, the way I get, e.g., the 5 most recently updated podcasts is: pull for type podcast (getting updates and roots), then run those through crut.read and throw out the duds. Once I have 5 good ones, kill the stream.

Oh, and we have a pull.unique in there too.
Is this what groupBy would be helping with?
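The pipeline described above, with the pull-stream plumbing stripped away, amounts to roughly the following (a sketch only: `latestFive`, `isPodcast`, `rootOf`, and `readRecord` are illustrative stand-ins for the pull.filter / pull.unique / crut.read steps, not real API):

```javascript
// Plain-JS restatement of the manual pipeline: filter by type, dedupe on
// the record's root (like pull.unique), read the full record (like
// crut.read), skip duds, and stop after 5 good results.
function latestFive(msgs, isPodcast, rootOf, readRecord) {
  const seenRoots = new Set()
  const results = []
  for (const msg of msgs) {
    if (!isPodcast(msg)) continue // pull for type podcast
    const root = rootOf(msg)
    if (seenRoots.has(root)) continue // pull.unique on the root
    const record = readRecord(root) // crut.read
    if (!record) continue // throw out the duds
    seenRoots.add(root)
    results.push(record)
    if (results.length === 5) break // kill the stream
  }
  return results
}

// Example: 8 podcast messages with distinct roots, no duds
const msgs = Array.from({ length: 8 }, (_, i) => ({
  type: 'podcast',
  root: 'r' + i,
}))
latestFive(msgs, (m) => m.type === 'podcast', (m) => m.root, (root) => ({ root }))
// stops after the first 5 roots
```

This is the same dedupe-and-take-N scan that a group-by operator would perform internally, just expressed at the application level.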

@arj03 (Member, Author) commented Mar 23, 2022

> to be clear, the way I get the 5 most recently updated eg podcasts is pull for type podcast (getting updates and roots). then run this through crut.read and throw out the duds. once I have 5 good ones kill the stream.
>
> oh, and we have a pull.unique in there too. is this what groupBy would be helping with.?

Now that you explain it, it is the same algorithm, so right now this is probably only marginally faster. It does maybe improve the discoverability of this kind of thing.

I do have an idea to use an index for this, which should make it faster. So challenge accepted :-) Changing this to draft.

@arj03 arj03 marked this pull request as draft March 23, 2022 07:19