add pagination and base_url to JSON query_runner #6499
Conversation
@justinclift Here's the new PR. I'm not sure if this is the best way to configure it, especially if someone wants to support other patterns later, like page numbers, or offsets where the response doesn't say what to pass next. I was wondering if there were other ideas there, or if there's something like that on another data source.
@guidopetri @eradman Any interest in reviewing this? 😄
Ahhh, looks like this has a formatting issue that needs fixing.

Could we also add a test to make it clearer what is going on, and what the expected behavior is?
(force-pushed from 90bed4e to 14b10ba)
@guidopetri I restructured the configuration, which hopefully makes a bit more sense. For a config like this:

```yaml
pagination:
  type: url
  path: _links.next.href
```

As long as the API response has something at `_links.next.href`, it will keep requesting that next URL.

For the other pattern:

```yaml
pagination:
  type: token
  fields: [next_page_token, page_token]
```

It does the same kind of thing: as long as `next_page_token` is in the response, it gets passed along as the `page_token` parameter on the next request.

I looked into adding a test, but it doesn't look like there are any tests for the JSON data source at all yet, and this would be pretty complicated to mock. I was just testing it manually by querying a Chargebee API and a Vimeo VHX API.
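For readers skimming the thread, here is a rough, self-contained sketch of the loop these two configs drive. The helper, the `data` key, and the request shape are illustrative guesses, not the PR's actual code:

```python
import requests


def _get_path(data, path):
    """Walk a dotted path like "_links.next.href" (illustrative stand-in
    for the runner's path-search helper)."""
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


def fetch_all_pages(url, params, pagination):
    """Follow either pagination style until the API stops returning a cursor."""
    results = []
    while url:
        response = requests.get(url, params=params).json()
        results.extend(response.get("data", []))  # assumes rows live under "data"

        next_url = None
        if pagination["type"] == "url":
            # HATEOAS-style: the response carries a full next-page URL.
            next_url = _get_path(response, pagination["path"])
            params = {}  # a full next URL already encodes its query string
        elif pagination["type"] == "token":
            # Google-style: echo the response's next_page_token back
            # as the page_token request parameter.
            response_field, request_param = pagination["fields"]
            token = response.get(response_field)
            if token:
                params = {**params, request_param: token}
                next_url = url  # same endpoint, new token
        url = next_url
    return results
```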
I very much don't feel comfortable merging this in without a test. To be honest, I'm not sold on the idea of implementing pagination here in the first place. As mentioned in the comments below, I'm not sure there's a standard for pagination, and if we were to implement every pagination ruleset out there we'd have a very large codebase 😅. But if we do want to merge this in, I would very kindly request that we somehow add tests, even if we have to write some from scratch for the JSON query runner to start out with.
```python
has_more = False

result = _normalize_json(response, result_path)
if result:
```
Having this many nested `if` statements is kind of a code smell. Maybe we should split this up into multiple functions?
We could, but there are kind of a lot of variables to pass around to do so. I personally would just pass `query` around instead of breaking it up on line 168 and then passing the pieces around everywhere. I was just trying to change as little as possible. Should I make that switch?
In my opinion, yes - definitely worth it if it makes the code easier to follow.
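To make the trade-off concrete, a hedged sketch (function and key names invented for illustration, not the PR's code): unpacking early forces long helper signatures, while passing the whole `query` dict keeps them short at the cost of less explicit dependencies.

```python
# Option A: unpack the query once and thread each piece through.
def run_unpacked(query):
    url = query["url"]
    result_path = query.get("path")
    pagination = query.get("pagination")
    return _paginate(url, result_path, pagination)  # grows with each new option


def _paginate(url, result_path, pagination):
    ...


# Option B (what's suggested above): pass the whole query dict and let
# each helper pull out what it needs.
def run_packed(query):
    return _paginate_query(query)


def _paginate_query(query):
    url = query["url"]
    pagination = query.get("pagination")
    ...
```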
redash/query_runner/json_ds.py (outdated diff)
```python
if result:
    results.extend(result)

if pagination["type"] == "url":
```
I've never seen either of these two pagination patterns. I don't think there's a "standard" either, is there? I imagine every API has its own set of pagination rules that it follows. If that's the case, I'm not sure it makes sense to write in pagination rules, and rather we should emphasize to the end users that querying an API for JSONs directly like this is meant for testing and not large-scale analysis.
Especially since if too much data is returned, it's possible that the worker goes OOM. 😅
Some variant of these two seems to be the most common, because they don't require implementation specifics, like having to increment the page number or offset yourself. (The `type` values could be named something different if you have a better idea, though.) That's why Google outlines next_page_token/page_token as a best practice. It's just that sometimes the names are different: Chargebee uses an offset parameter, but rather than making you calculate it, it returns the next offset in the response. Google Analytics, AWS, and K8s also follow that pattern.

Shopify, SalesForce, Facebook's Graph API, and Trino all pass a full URL for the next page in some field of the response, which would be covered by the `url` type. HATEOAS is a standard, and it does define that the next page would be a link somewhere in the response. Although, at some point, it might be good to use a library instead of `_apply_path_search` to handle more complex cases.

Other types could be added later for `pageNumber` or something, but I don't really have a need for them right now, and that is where it could get pretty difficult to cover different APIs.

I was wondering about the memory thing, but it doesn't look like large result sets are handled anywhere else either. The Trino data source does a `fetchall` and then processes all of the results at once. This at least does the `_normalize_json` per page, which could trim things down.
Fair points; thanks for elaborating.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #6499      +/-   ##
==========================================
+ Coverage   61.53%   61.88%   +0.35%
==========================================
  Files         158      158
  Lines       12898    12965      +67
  Branches     1756     1770      +14
==========================================
+ Hits         7937     8024      +87
+ Misses       4704     4666      -38
- Partials      257      275      +18
```
I can work on adding some tests.
(force-pushed from 14b10ba to 5cfb03d)
(force-pushed from 5cfb03d to 2fc5633)
Odd, I didn't see any linting errors, but yeah, black still changed some things.
```python
return data, error


class TestJSON(TestCase):
```
Ah, sorry, I should have specified. We've been aiming for pytest tests instead of unittest; if it's not too much trouble, would you mind converting to those? I won't hold it against you if you're not feeling up to it though. I appreciate the work you've put into this PR and already recognize you've gone above and beyond :)
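For reference, the conversion being asked for is mostly mechanical. A minimal sketch with a made-up function under test, not the PR's actual tests:

```python
from unittest import TestCase


def add_page_token(params, token):
    """Stand-in for the behavior under test (hypothetical)."""
    return {**params, "page_token": token}


# unittest style: a TestCase subclass with assert* methods.
class TestAddPageToken(TestCase):
    def test_token_is_added(self):
        self.assertEqual(
            add_page_token({"q": "x"}, "abc"),
            {"q": "x", "page_token": "abc"},
        )


# pytest style: a plain module-level test_* function with a bare assert.
def test_token_is_added():
    assert add_page_token({"q": "x"}, "abc") == {"q": "x", "page_token": "abc"}
```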
I just used the other tests as an example. It could always be refactored later. It seems like some of that should be added to the contributing guide, like the testing, ruff, and rebasing process. Like, are you looking for a single commit per PR, or just going for a linear history?
That's a great point. I don't think we have a good handle on what our contribution guidelines are; I'll try and add some documentation for that to the wiki soon.
Approving, but we'll need the PR to be updated with the latest `master` to merge. I'll enable automerge whenever you push a merge commit updating the branch, so it should merge automatically once the tests pass.

Thanks again for all your work @Ken-Michalak !
(force-pushed from 2fc5633 to 6902fd4)
@Ken-Michalak We have an option turned on for this repo which requires PRs to be up-to-date with the master branch before they can be merged. Normally PRs have a button to "Update to the master branch" (or similar wording) that maintainers can click to automate that. It seems to be missing from this PR though. My guess is the PR was created without the "Allow maintainers to update this PR?" setting enabled. So... I think that means you'll need to update this PR yourself. That might let it all just get nicely merged automatically after the tests pass. 😄
I bypassed it - we didn't touch the poetry files, so it should be ok. Thanks for your contribution @Ken-Michalak !
@justinclift, I actually did that in that last force-push yesterday, but it looks like a few other commits were added before @guidopetri approved it again. (Bump @babel/traverse from 7.22.8 to 7.23.2) 🤷🏻 Thanks though.
No worries, it all worked out. Thanks heaps for getting this done @Ken-Michalak, it seems like this will be useful for people. 😄
What type of PR is this?
Description
Add the ability to paginate the JSON Query Runner and configure a base URL.
- Support Google's pagination design pattern, with `pagination: [next_page_token, page_token]`.
- Support a next URL, which is sometimes used in HATEOAS, with `pagination: _links.next.href` (a combined example sketch follows this list).
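A combined sketch of what a paginated query might look like. The endpoint, the `base_url` value, and the exact key names here are assumptions, and note that the review thread above restructures `pagination` into a `type`/`fields`/`path` form:

```yaml
url: /v1/items            # resolved against the data source's base_url
path: items               # hypothetical: where result rows live in each response
pagination:
  type: token
  fields: [next_page_token, page_token]
```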
.How is this tested?
Related Tickets & Documents
Original PR: #5067
Mobile & Desktop Screenshots/Recordings (if there are UI changes)