⚡️ Indexd bulk fetching #448
Changes from all commits
```diff
@@ -2,6 +2,7 @@
 from functools import wraps
 from dateutil import parser
 from datetime import datetime
+from dataservice.extensions import db, indexd


 def paginated(f):
```
```diff
@@ -101,25 +102,41 @@ def indexd_pagination(q, after, limit):

     :returns: A Pagination object
     """
+    def prefetch_indexd(after):
+        """ Compute dids for the page and have indexd fetch them in bulk """
+        model = q._entities[0].mapper.entity
+        gfs = (q.order_by(model.created_at.asc())
+               .filter(model.created_at > after)
+               .with_entities(model.latest_did)
+               .limit(limit).all())
+        dids = [gf[0] for gf in gfs]
+        indexd.prefetch(dids)
+
+    indexd.clear_cache()
+    prefetch_indexd(after)
```
Review thread on `prefetch_indexd(after)`:

Reviewer: If the indexd bulk fetch were to happen after the Pagination object is initialized instead of before, then you could also speed up empty returns by not checking indexd the first time when total is 0. Is that doable?

Author: Yeah, there are probably a couple of optimizations to be made around reducing the number of db queries here. It would involve adding some sort of branching into the Pagination object, though, and I'd prefer to keep it simple since it's generalized to all entities at the moment. Right now, if the Pagination object were constructed first, it would result in the old behavior of constructing each object with its own request to indexd. Note that if total is 0, indexd won't actually be called, as the query for gfs will return empty.

Reviewer: Got it. All clear from me, then.
```diff
     pager = Pagination(q, after, limit)
     keep = []
     refresh = True
     next_after = None
     # Continue updating the page until we get a page with no deleted files
     while (pager.total > 0 and refresh):
         refresh = False
-        # Move the cursor ahead to the last valid file
-        next_after = keep[-1].created_at if len(keep) > 0 else after
-        # Number of results needed to fulfill the original limit
-        remain = limit - len(keep)
-        pager = Pagination(q, next_after, remain)

         for st in pager.items:
             if hasattr(st, 'was_deleted') and st.was_deleted:
```
Review thread on the `was_deleted` check:

Reviewer: Isn't this never going to happen now? Before the loop, you first populate the indexd page cache. Then you construct the Pagination object, which results in a bunch of calls to indexd.get(), which results in a bunch of indexd page cache lookups, which means the requests to indexd never go out. So then you won't know if a file was deleted?

Author: I believe this will still happen because the …
```diff
                 refresh = True
             else:
                 keep.append(st)

+        # Only fetch more if we saw there were some items that were deleted
+        if refresh:
+            # Move the cursor ahead to the last valid file
+            next_after = keep[-1].created_at if len(keep) > 0 else after
+            # Number of results needed to fulfill the original limit
+            remain = limit - len(keep)
+
+            prefetch_indexd(next_after)
+            pager = Pagination(q, next_after, remain)

     # Replace original page's items with new list of valid files
     pager.items = keep
     pager.after = next_after if next_after else after
```
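To make the mechanism under discussion concrete, here is a minimal sketch of a page-cache client in the spirit of `dataservice.extensions.indexd`. The class layout, endpoint paths, and response shapes are assumptions for illustration, not the project's actual client. The point is that `prefetch` turns N per-document lookups into one bulk request, and any did missing from the cache falls through to a single `get`.

```python
import requests


class IndexdClient:
    """ Hypothetical sketch of a page-cache wrapper around indexd lookups. """

    def __init__(self, url):
        self.url = url.rstrip('/')
        self.page_cache = {}  # did -> indexd document for the current page

    def clear_cache(self):
        """ Drop documents cached for the previous page """
        self.page_cache = {}

    def prefetch(self, dids):
        """ Fetch all documents for a page in a single bulk request """
        if not dids:
            # Empty page (e.g. total=0): no request to indexd goes out
            return
        # Assumed bulk endpoint; the real path and payload may differ
        resp = requests.post(self.url + '/bulk/documents', json=dids)
        for doc in resp.json():
            self.page_cache[doc['did']] = doc

    def get(self, did):
        """ Serve from the page cache, falling back to a single request """
        if did in self.page_cache:
            return self.page_cache[did]
        resp = requests.get('{}/index/{}'.format(self.url, did))
        if resp.status_code == 404:
            return None  # lets the caller flag the model as was_deleted
        return resp.json()
```

Note how an empty `dids` list short-circuits, which matches the author's point above that indexd is never called when the page query returns nothing.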
Review thread on `indexd_pagination`:

Reviewer: Hmm, indexd_pagination is confusing to me now. What I would think you want is: … I realize you'd have to refactor (maybe remove merge_indexd from the constructor) and decouple things quite a bit (separate the indexd get request from the actual merging of an indexd doc with an indexd model instance). So maybe that's why you didn't do it this way. But if you were able to implement the above, then you could probably also get rid of the entire while loop that checks for deleted files, right? You would just be able to return the page right away.

Author: That's true, but it would require decoupling the indexd property loading from construction of the object. Perhaps you could load the indexd properties only when you attempt to access them, but I'm not sure if that is possible.
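On that last point: deferring the indexd lookup until a property is first read is possible in Python with a lazy property. A rough sketch, using a hypothetical `GenomicFile` model (the real model's fields and merge logic may differ):

```python
from dataservice.extensions import indexd

_UNLOADED = object()  # sentinel so a None (deleted) doc is still cached


class GenomicFile:
    """ Hypothetical model that defers its indexd lookup until needed """

    def __init__(self, latest_did):
        self.latest_did = latest_did
        self._doc = _UNLOADED

    @property
    def indexd_doc(self):
        # No request is made at construction time, so building a
        # Pagination page alone would not touch indexd at all
        if self._doc is _UNLOADED:
            self._doc = indexd.get(self.latest_did)  # may return None
        return self._doc

    @property
    def was_deleted(self):
        # Reading this property triggers the (cached) indexd lookup
        return self.indexd_doc is None
```

With something like this, constructing the page would be cheap, and the deleted-file check would only hit the bulk-filled cache when `was_deleted` is read, which is roughly what the earlier suggestion about prefetching after Pagination construction would need.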