Specify an optional page index to help clients get data more efficiently #109

Open
jfuerth opened this issue Mar 5, 2021 · 1 comment
Labels
feature New feature or request

Comments

@jfuerth
Contributor

jfuerth commented Mar 5, 2021

This issue was raised during discussion on #98.

The goals of this optional "page index" would be:

  1. Tell consumers what partitions exist in the data, so they can skip partitions they don't care about
  2. Tell consumers which attributes the data is already sorted by
  3. Allow consumers to jump directly to the pages of interest
  4. Allow consumers to pull data in parallel (the strict singly-linked pagination structure in Search 1.0 does not lend itself to consuming pages in parallel)
  5. Remain fully optional for data producers/publishers (a producer is never required to publish a page index)
  6. Remain fully optional for data consumers (even if a page index is present, the consumer can still read all the data in order by following the singly-linked next_page_url sequence)

Background

We've experimented with something very basic along these lines with the index attribute here. This index (a nonstandard experimental extension to Search) allows consumers to jump directly to a catalog of interest without having to page through all preceding catalogs in the sequence.

Imagine something like this on a Table object, which advertises information about partitions and sortedness already built into the data:

  "index": {
    "ordered_by": [ "id" ],
    "pages": [
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_1",
        "partitions": {
          "id": {
            "min": "PMID:27435956-Naz_Villalba-2016-NLRP3-proband",
            "max": "PMID:29174093-Szczałuba-2018-GNB1-proband"
          }
        }
      },
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_2",
        "partitions": {
          "id": {
            "min": "PMID:26833330-Jansen-2016-TMEM199-F1-II2",
            "max": "PMID:27974811-Haliloglu-2017-PIEZO2-Patient"
          }
        }
      }
    ]
  }
}

This would indicate the data is partitioned by id and that the data consumer could skip partitions it does not need, and consume those that it does need in parallel.
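
As a rough illustration of goals 1, 3, and 4, a consumer could use the index along these lines. This is only a sketch against the mock-up format above: `pages_overlapping`, `fetch_pages`, and the caller-supplied `fetch` function are hypothetical names, and comparing ids as plain strings is an assumption the proposal does not settle.

```python
from concurrent.futures import ThreadPoolExecutor


def pages_overlapping(index, attribute, low, high):
    """Return URLs of pages whose [min, max] range for `attribute` overlaps [low, high].

    Pages that carry no range for `attribute` are kept, because they
    cannot be ruled out. Values are compared as plain strings (assumption).
    """
    urls = []
    for page in index["pages"]:
        bounds = page.get("partitions", {}).get(attribute)
        if bounds is None or (bounds["min"] <= high and bounds["max"] >= low):
            urls.append(page["url"])
    return urls


def fetch_pages(urls, fetch, max_workers=4):
    """Fetch the selected pages concurrently; results come back in the order of `urls`.

    `fetch` is a caller-supplied function (hypothetical) that takes a URL
    and returns the parsed page.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

A consumer that ignores the index entirely is unaffected: it can still walk the pages one by one via next_page_url.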

This could be extended to partitions across multiple attributes, nested attributes, and even multiple available sort orders. We will have to balance complexity against performance benefits. Thanks to @ifokkema for initial feedback!

The above is just a rough illustration of the idea; don't take the exact format too seriously.

@ifokkema

This looks great! It'll allow me to indicate how to skip to other chromosomes so that clients don't need to paginate through the entire data. Three suggestions and one question:

  • Indicate that data producers may provide additional pages not represented in the index (e.g. my index will contain various chromosomes, but all chromosomes can still be paginated further)
  • Related: make partitions optional, or at least the min/max, so that producers don't need to pre-sort their data to work out all pages beforehand. To still indicate what the URL points to, you could allow for something like (using your example data):
"index": [
  {
    "ordered_by": [ "id" ],
    "pages": [
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_1",
        "id": "PMID:26833330-Jansen-2016-TMEM199-F1-II2"
      },
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_2",
        "id": "PMID:27435956-Naz_Villalba-2016-NLRP3-proband"
      },
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_3",
        "id": "PMID:27974811-Haliloglu-2017-PIEZO2-Patient"
      }
    ]
  }
]

For me, id would be chromosome, which would probably make more sense as an example.

  • Make index an array of objects rather than an object, allowing for multiple sorting schemes. Clients not interested in the sorting may then just pick the first index, while those interested in sorting can choose from the options provided. Not needed for me, but this makes the format more flexible for others.
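
To sketch how a client might use the last two suggestions together: pick an index by its sort order (or take the first one), then binary-search the per-page start values to find where to begin. The function names are hypothetical, and the sketch assumes that each page entry carries the first value of the attribute on that page and that pages are listed in ascending order of that value; neither is fixed by the proposal.

```python
from bisect import bisect_right


def choose_index(indexes, wanted_order=None):
    """Pick the index whose `ordered_by` equals `wanted_order`.

    Falls back to the first index when no order is requested or none matches.
    """
    if wanted_order is not None:
        for index in indexes:
            if index.get("ordered_by") == wanted_order:
                return index
    return indexes[0]


def start_page_url(index, attribute, value):
    """Return the URL of the last listed page whose start key is <= `value`.

    Assumes each page entry carries its first `attribute` value and that
    pages are listed in ascending order of that value. If `value` sorts
    before every start key, the first page is returned.
    """
    starts = [page[attribute] for page in index["pages"]]
    position = bisect_right(starts, value) - 1
    return index["pages"][max(position, 0)]["url"]
```

From the returned page, the client would keep following next_page_url as usual, which also covers pages that are not listed in the index.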

And a question: your example (likely just because of the mockup) provides min/max values that don't seem to be sorted; they should be, right?
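
To make the question concrete, here is a small check a client or producer could run over an index in the min/max format. It is a sketch only: `ranges_are_ordered` is a hypothetical helper, and it assumes values compare as plain strings, which the proposal does not specify.

```python
def ranges_are_ordered(index, attribute):
    """Return True if every page has min <= max for `attribute` and each
    page starts at or after the previous page's max (plain string comparison)."""
    previous_max = None
    for page in index["pages"]:
        bounds = page["partitions"][attribute]
        if bounds["min"] > bounds["max"]:
            return False
        if previous_max is not None and bounds["min"] < previous_max:
            return False
        previous_max = bounds["max"]
    return True
```

Under that string-comparison assumption, the mock-up index above fails this check, because the second page's min sorts before the first page's max.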

@mcupak mcupak modified the milestones: 1.0.0, 1.1.0 Apr 14, 2021
@mcupak mcupak added the feature New feature or request label Apr 28, 2021
@mcupak mcupak modified the milestone: 1.1.0 May 26, 2021
@mcupak mcupak removed this from the 1.1.0 milestone Dec 6, 2023

3 participants