Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pagination and base_url to JSON query_runner #6499

Conversation

Ken-Michalak
Copy link
Contributor

@Ken-Michalak Ken-Michalak commented Oct 4, 2023

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • New Query Runner (Data Source)
  • New Alert Destination
  • Other

Description

Add the ability to paginate the JSON Query Runner and configure a base URL.
Support Google's pagination design pattern, with pagination: [next_page_token, page_token].
Support a next URL, which is sometimes used in HATEOAS, with pagination: _links.next.href.

How is this tested?

  • Unit tests (pytest, jest)
  • E2E Tests (Cypress)
  • Manually
  • N/A

Related Tickets & Documents

Original PR: #5067

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

@Ken-Michalak
Copy link
Contributor Author

@justinclift Here's the new pr.

I'm not sure if this is the best way to configure it, especially if someone wants to support other patterns later, like page numbers or offsets without anything in response for what to pass next. I was wondering if there were other ideas there or if there's something like that on another data source.

@justinclift justinclift added python Pull requests that update Python code Team Review Meets PR criteria, ready for full review labels Oct 5, 2023
@justinclift
Copy link
Member

@guidopetri @eradman Any interest in reviewing this? 😄

@justinclift
Copy link
Member

Ahhh, looks like this has a formatting issue that needs fixing:

Run ruff check .
redash/query_runner/json_ds.py:1:1: I001 [*] Import block is un-sorted or un-formatted

@guidopetri
Copy link
Contributor

Could we also add a test to make it clearer what is going on, and what the expected behavior is?

@Ken-Michalak Ken-Michalak force-pushed the add-json-query-runner-pagination branch from 90bed4e to 14b10ba Compare October 9, 2023 18:55
@Ken-Michalak
Copy link
Contributor Author

@guidopetri I restructured the configuration, which hopefully makes a bit more sense.

For a config like this:

pagination:
  type: url
  #path: _links.next.href

As long as the API response has something in _links.next.href, it will follow it, do another request, and include those results.

For the other pattern:

pagination:
    type: token
    #fields: [next_page_token, page_token]

It does the same kind of thing, where as long as next_page_token isn't "" or null, it continues getting the next page of results by putting that token in the page_token parameter.

I looked into adding a test, but it doesn't look like there are any tests for the json data source at all yet, and this would be pretty complicated to mock. I was just testing it manually by querying a Chargebee API and a Vimeo VHX API.

Copy link
Contributor

@guidopetri guidopetri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much don't feel comfortable merging this in without a test. To be honest I'm not sold on the idea of implementing pagination here in the first place. As mentioned in the comments below, I'm not sure there's a standard for pagination, and if we were to implement every pagination ruleset out there we'd have a very large codebase 😅 but if we do want to merge this in I would very kindly request that we somehow add tests, even if we have to write some from scratch for the JSON query runner to start out with.

has_more = False

result = _normalize_json(response, result_path)
if result:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this many nested if statements is kind of a code smell. Maybe we should split this up into multiple functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but there are kind of a lot of variables to pass around to do so. I personally would just pass query around instead of breaking it up on line 168 and then passing that around everywhere. I was just trying to change as little as possible. Should I do that switch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, yes - definitely worth it if it makes the code easier to follow.

if result:
results.extend(result)

if pagination["type"] == "url":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen either of these two pagination patterns. I don't think there's a "standard" either, is there? I imagine every API has its own set of pagination rules that it follows. If that's the case, I'm not sure it makes sense to write in pagination rules, and rather we should emphasize to the end users that querying an API for JSONs directly like this is meant for testing and not large-scale analysis.

Especially since if too much data is returned, it's possible that the worker goes OOM. 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some variant of these 2 seem to be the most common, because they don't require specifics of implementation, like having to add to the page number or offset yourself. (The type could be named something different if you have a better idea though.) That's why Google outlines the next_page_token/page_token as a best practice. It's just that sometimes the names are different, like Chargebee uses an offset parameter, but rather than having to calculate it, it returns the next offset in the response. Google Analytics, AWS, and K8s also follow that pattern.

Shopify, SalesForce, Facebook's Graph API, and Trino all pass a full URL for the next page in some field of the response, which would be covered by the url type. HATEOAS is a standard and does define that the next page would be a link somewhere in the response. Although, at some point, it might be good to use a library instead of _apply_path_search to handle more complex cases.

Other types could be added later for pageNumber or something, but I don't really have a need for them right now, and that is where it could be pretty difficult to cover different APIs.

I was wondering about the memory thing, but it doesn't look like there's any other way that large result sets are handled elsewhere. The Trino data source does a fetchall and then processes all of the results at once. This at least does the _normalize_json per page, which could trim things down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points; thanks for elaborating.

redash/query_runner/json_ds.py Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 10, 2023

Codecov Report

Merging #6499 (6902fd4) into master (69d1e03) will increase coverage by 0.35%.
Report is 3 commits behind head on master.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #6499      +/-   ##
==========================================
+ Coverage   61.53%   61.88%   +0.35%     
==========================================
  Files         158      158              
  Lines       12898    12965      +67     
  Branches     1756     1770      +14     
==========================================
+ Hits         7937     8024      +87     
+ Misses       4704     4666      -38     
- Partials      257      275      +18     
Files Coverage Δ
redash/query_runner/json_ds.py 64.20% <66.66%> (+40.35%) ⬆️

@Ken-Michalak
Copy link
Contributor Author

I very much don't feel comfortable merging this in without a test. To be honest I'm not sold on the idea of implementing pagination here in the first place. As mentioned in the comments below, I'm not sure there's a standard for pagination, and if we were to implement every pagination ruleset out there we'd have a very large codebase 😅 but if we do want to merge this in I would very kindly request that we somehow add tests, even if we have to write some from scratch for the JSON query runner to start out with.

I can work on adding some tests.

@Ken-Michalak Ken-Michalak force-pushed the add-json-query-runner-pagination branch from 14b10ba to 5cfb03d Compare October 10, 2023 20:44
@Ken-Michalak Ken-Michalak force-pushed the add-json-query-runner-pagination branch from 5cfb03d to 2fc5633 Compare October 13, 2023 16:25
@Ken-Michalak
Copy link
Contributor Author

Odd, I didn't see any linting errors, but yeah, black still changed some things.

return data, error


class TestJSON(TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I should have specified. We've been aiming for pytest tests instead of unittest; if it's not too much trouble, would you mind converting to those? I won't hold it against you if you're not feeling up to it though. I appreciate the work you've put into this PR and already recognize you've gone above and beyond :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just used the other tests as an example. It could always be refactored later. It seems like some of that should be added to the contributing guide, like the testing, ruff, and rebasing process. Like are you looking for a single commit per pr, or just going for a linear history?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point. I don't think we have a good handle on what our contribution guidelines are; I'll try and add some documentation for that to the wiki soon.

Copy link
Contributor

@guidopetri guidopetri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but we'll need the PR to be updated with the latest master to merge. I'll enable automerge whenever you push a merge commit updating the branch, so it should merge automatically once the tests pass.

Thanks again for all your work @Ken-Michalak !

@Ken-Michalak Ken-Michalak force-pushed the add-json-query-runner-pagination branch from 2fc5633 to 6902fd4 Compare October 17, 2023 14:35
@guidopetri guidopetri enabled auto-merge (squash) October 18, 2023 02:55
@justinclift
Copy link
Member

justinclift commented Oct 18, 2023

@Ken-Michalak We have an option turned on for this repo, which requires PR's to be up-to-date with the master branch before they can be merged.

Normally PRs have a button to "Update to the master branch" (or similar wording), that maintainers can click to automate that.

It seems to be missing from this PR though. My guess is the PR was created without the "Allow maintainers to update this PR?" setting enabled.

So... I think that means you'll need to update this PR yourself. That might let it all just get nicely merged automatically after the tests pass. 😄

@guidopetri guidopetri disabled auto-merge October 18, 2023 13:08
@guidopetri guidopetri merged commit 7b03e60 into getredash:master Oct 18, 2023
13 checks passed
@guidopetri
Copy link
Contributor

I bypassed it - we didn't touch the poetry files so it should be ok.

Thanks for your contribution @Ken-Michalak !

@Ken-Michalak
Copy link
Contributor Author

@Ken-Michalak We have an option turned on for this repo, which requires PR's to be uo-to-date with the master branch before they can be merged.

Normally PRs have a button to "Update to the master branch" (or similar wording), that maintainers can click to automate that.

It seems to be missing from this PR though. My guess is the PR was created without the "Allow maintainers to update this PR?" setting enabled.

So... I think that means you'll need to update this PR yourself. That might let it all just get nicely merged automatically after the tests pass. 😄

@justinclift, I actually did that in that last force-push yesterday, but it looks like before @guidopetri approved it again, there were a few other commits added. (Bump @babel/traverse from 7.22.8 to 7.23.2) 🤷🏻 Thanks though.

@Ken-Michalak Ken-Michalak deleted the add-json-query-runner-pagination branch October 18, 2023 14:59
@justinclift
Copy link
Member

No worries, it all worked out. Thanks heaps for getting this done @Ken-Michalak, it seems like this will be useful for people. 😄

harveyrendell pushed a commit to pushpay/redash that referenced this pull request Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Pull requests that update Python code Team Review Meets PR criteria, ready for full review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants