Connector/Source: Publish the *request* object in the record transformation like the config, stream_slice, stream_interval, etc. are #50395

Open
rpopov opened this issue Dec 22, 2024 · 2 comments

Comments


rpopov commented Dec 22, 2024

Discussed in #49971

Originally posted by rpopov December 20, 2024
Status in Airbyte 1.2.0
The record transformation allows removing existing fields from the record and adding new fields by calculating them using:

  • JINJA templates
  • pre-defined objects found in the transformation's context:
    • record
    • stream_state
    • stream_slice
    • stream_interval
    • stream_partition
  • Access to the raw response data goes only through the record object, which by definition is just a part of the response. Converting the response into useful records sometimes needs data outside the record object, and that data is not accessible.

Example
The JIRA /issue response:

{
  "expand": "renderedFields,names,schema,operations,editmeta,changelog,versionedRepresentations",
  "id": "2759961",
  "self": "https://jira-test.paysafe.cloud/rest/api/2/issue/2759961",
  "key": "EF-3131",
  "fields": {
    "customfield_19440": {
      "self": "https://jira-test.paysafe.cloud/rest/api/2/customFieldOption/23882",
      "value": "Perú",
      "id": "23882",
      "disabled": false
    },
...
    "customfield_12073": "EF-3130"
  },
  "names": {
    "customfield_19442": "End Time",
...
    "customfield_17709": "BU Legal Lead"
  },
  "schema": {
    "customfield_19442": {
      "type": "date",
      "custom": "com.atlassian.jira.plugin.system.customfieldtypes:datepicker",
      "customId": 19442
    },
...
    "customfield_17709": {
      "type": "string",
      "custom": "com.atlassian.jira.plugin.system.customfieldtypes:textfield",
      "customId": 17709
    }
  }
}

In JIRA the list of custom fields is highly dynamic, so it makes no practical sense to keep a single record per issue that combines all standard and custom fields.
Instead, the JIRA issue can be represented using 2 tables: ISSUE 1 --- * CUSTOM_FIELD in a one-to-many / master-detail relation.
Turning the response into a list of records, one per custom field, could be done by iterating over the schema.* sub-map in the response, while taking the values from the fields map and the human-readable names from the names map.
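
As an illustration only, the iteration described above could look like the following Python sketch when the whole response is in hand. The function name `custom_field_records` and the output column names are hypothetical and are not part of Airbyte; the sample data is a trimmed copy of the response above.

```python
from typing import Any, Dict, Iterator


def custom_field_records(issue: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one CUSTOM_FIELD record per entry of the response's `schema` map."""
    fields = issue.get("fields", {})
    names = issue.get("names", {})
    for field_id, field_schema in issue.get("schema", {}).items():
        yield {
            "issue_key": issue["key"],        # link back to the ISSUE table
            "field_id": field_id,
            "name": names.get(field_id),      # human-readable name, if listed
            "type": field_schema.get("type"),
            "value": fields.get(field_id),    # None when the issue has no value
        }


# Trimmed sample based on the response shown above.
issue = {
    "key": "EF-3131",
    "fields": {"customfield_12073": "EF-3130"},
    "names": {"customfield_17709": "BU Legal Lead"},
    "schema": {"customfield_17709": {"type": "string", "customId": 17709}},
}
records = list(custom_field_records(issue))
```

The sketch needs `fields`, `names` and `schema` at the same time, which is exactly the data that a transformation restricted to a single record cannot reach.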

Problem
In this configuration, the fields and names maps are not accessible. They would be if the response object were available in the Transformations' context. The response object is already available in the context of the Pagination section.

Suggestion

  • Publish the response object in the context of the Transformations section/phase, as available in the Pagination section.
rpopov added a commit to rpopov/airbyte-python-cdk that referenced this issue Jan 13, 2025
At the record extraction step, add to each record the service field $root, holding a reference to the object the record is extracted from:
* the root response object, when parsing JSON format
* the original record, when parsing JSONL format
More service fields could be added in the future.
The service fields are available in the record's filtering and transform steps.

Avoid:
* reusing the maps/dictionaries produced, thus avoid building cyclic structures
* transforming the service fields in the Flatten transformation.

Explicitly clean up the service field(s) after the transform step, thus making them:
* local to the filter and transform steps
* not visible to the next mapping and store steps (as they should be)
* not visible in the tests beyond the test_record_selector (as they should be)
This allows the record transformation logic to define its "local variables" to reuse
some interim calculations.
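
A minimal sketch of the idea described in this commit message, for the JSON case only. It is not the actual CDK code: the function names `extract` and `select` and the `records_path` parameter are hypothetical, and only the `$root` field name comes from the text above.

```python
from typing import Any, Callable, Dict, List

ROOT_FIELD = "$root"  # service field name taken from the commit message


def extract(response_body: Dict[str, Any], records_path: str) -> List[Dict[str, Any]]:
    """Copy each record and attach a reference to the root response object."""
    # A shallow copy keeps the response itself free of back-references,
    # so no cyclic structure is built.
    return [{**record, ROOT_FIELD: response_body}
            for record in response_body.get(records_path, [])]


def select(
    response_body: Dict[str, Any],
    records_path: str,
    transform: Callable[[Dict[str, Any]], None],
) -> List[Dict[str, Any]]:
    """Extract, transform, then drop the service field before later steps see it."""
    selected = []
    for record in extract(response_body, records_path):
        transform(record)              # may read record["$root"]
        record.pop(ROOT_FIELD, None)   # service field stays local to this step
        selected.append(record)
    return selected


def add_total(record: Dict[str, Any]) -> None:
    # Hypothetical transformation: copy a response-level value into the record.
    record["total"] = record[ROOT_FIELD]["total"]


response = {"total": 2, "issues": [{"key": "A-1"}, {"key": "A-2"}]}
result = select(response, "issues", add_total)
```

Here `result` holds the two records with the `total` value copied in and without the `$root` field, while `response` is left unchanged.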

The contract of body parsing seems irregular in representing the cases of bad JSON, no JSON and empty JSON.
It cannot be unified, as that irregularity is already relied upon.

Update the development environment setup documentation
* to organize and present the setup steps explicitly
* to avoid misunderstandings and wasted efforts.

Update CONTRIBUTING.md to
* collect and organize the knowledge on running the test locally.
* state the actual testing steps.
* clarify and make explicit the procedures and steps.

The unit, integration, and acceptance tests in this exact version succeed under Fedora 41, while
one of them fails under Oracle Linux 8.7, for a reason not related to the contents of this PR.
The integration tests of the CDK fail due to missing `secrets/config.json` file for the Shopify source.
See airbytehq#197

rpopov commented Jan 13, 2025

Replaced with airbytehq/airbyte-python-cdk#214

@rpopov rpopov closed this as completed Jan 13, 2025

rpopov commented Jan 17, 2025

The JIRA API can surprise you with another case that is impossible to store in (second-level) tables:
API call: GET https://<JIRA host>/rest/api/2/issue/PPP-49571?fields=key&expand=changelog
Real-life response:

{
  "expand": "renderedFields,names,schema,operations,editmeta,changelog,versionedRepresentations",
  "id": "1589758",
  "self": "https://<JIRA host>/rest/api/2/issue/1589758",
  "key": "PPP-49571",
  "changelog": {
    "startAt": 0,
    "maxResults": 1,
    "total": 1,
    "histories": [
      {
        "id": "12613893",
        "author": {
          "self": "https://jira.paysafe.cloud/rest/api/2/user?username=patrickpoell",
          "name": "patrickpoell",
          "key": "patpoe",
          "emailAddress": "[email protected]",
          "avatarUrls": {
            "48x48": "https://<JIRA host>/secure/useravatar?ownerId=patpoe&avatarId=18562",
            "24x24": "https://<JIRA host>/secure/useravatar?size=small&ownerId=patpoe&avatarId=18562",
            "16x16": "https://<JIRA host>/secure/useravatar?size=xsmall&ownerId=patpoe&avatarId=18562",
            "32x32": "https://<JIRA host>/secure/useravatar?size=medium&ownerId=patpoe&avatarId=18562"
          },
          "displayName": "Patrick Poell",
          "active": true,
          "timeZone": "Europe/London"
        },
        "created": "2021-03-11T15:27:33.000+0000",
        "items": [
          {
            "field": "summary",
            "fieldtype": "jira",
            "from": null,
            "fromString": "CLONE - investigate",
            "to": null,
            "toString": "remove from mypins-integration-test-components"
          },
          {
            "field": "labels",
            "fieldtype": "jira",
            "from": null,
            "fromString": "Appium",
            "to": null,
            "toString": null
          }]}]}}

Storing the elements of the change log (path: histories[*].items[*]) is not possible even with the change suggested above, as each such element requires the timestamp from its parent histories[*] entry and the issue key from the root JSON object.
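
To make the dependency concrete, here is a rough Python sketch of the flattening that would be needed if the whole response were traversable. The function name `changelog_item_records` and the output column names are hypothetical; the sample is a trimmed copy of the response above.

```python
from typing import Any, Dict, Iterator


def changelog_item_records(issue: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one record per changelog item, enriched with data from its owners."""
    for history in issue.get("changelog", {}).get("histories", []):
        for item in history.get("items", []):
            yield {
                "issue_key": issue["key"],      # from the root JSON object
                "history_id": history["id"],    # from the parent history entry
                "created": history["created"],  # timestamp of the parent entry
                "field": item["field"],
                "from_string": item.get("fromString"),
                "to_string": item.get("toString"),
            }


# Trimmed sample based on the response shown above.
issue = {
    "key": "PPP-49571",
    "changelog": {
        "histories": [
            {
                "id": "12613893",
                "created": "2021-03-11T15:27:33.000+0000",
                "items": [
                    {"field": "summary", "fromString": "CLONE - investigate",
                     "toString": "remove from mypins-integration-test-components"},
                    {"field": "labels", "fromString": "Appium", "toString": None},
                ],
            }
        ]
    },
}
rows = list(changelog_item_records(issue))
```

Each output row combines values from three nesting levels, which is why an item record alone does not carry enough information.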

Suggestion

  • In each nested JSON object add a $parent reference to the owner JSON object.

Pros and cons

  • (pro) This change would allow navigating from any object, no matter how deeply (but statically) nested in the response, to its parent objects and inheriting their data, so that a single response can be stored in several separate tables.
  • (con) Automated traversal of such enhanced JSON objects (deep copy, print, etc.) would run into infinite recursion.
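
A small Python sketch of the $parent suggestion and of the trade-off it brings. The helper `link_parents` is hypothetical and only illustrates the idea; it assumes that elements of a list are owned by the object that holds the list. In Python, for instance, `json.dumps` rejects the resulting cyclic structure with a `ValueError`, and any hand-written recursive traversal would have to skip the `$parent` field.

```python
import json
from typing import Any, Optional

PARENT_FIELD = "$parent"  # field name taken from the suggestion above


def link_parents(node: Any, parent: Optional[dict] = None) -> None:
    """Add a $parent back-reference to every nested JSON object, in place."""
    if isinstance(node, dict):
        children = list(node.values())  # snapshot before adding the back-reference
        if parent is not None:
            node[PARENT_FIELD] = parent
        for child in children:
            link_parents(child, node)
    elif isinstance(node, list):
        for child in node:
            link_parents(child, parent)  # list elements share the list's owner


issue = {
    "key": "PPP-49571",
    "changelog": {
        "histories": [
            {"created": "2021-03-11T15:27:33.000+0000",
             "items": [{"field": "summary"}]}
        ]
    },
}
link_parents(issue)

# Navigate upwards from a deeply nested changelog item.
item = issue["changelog"]["histories"][0]["items"][0]
history = item[PARENT_FIELD]
created = history["created"]
issue_key = history[PARENT_FIELD][PARENT_FIELD]["key"]  # history -> changelog -> issue

# The structure is now cyclic, so plain JSON serialization is refused.
try:
    json.dumps(issue)
    serializable = True
except ValueError:
    serializable = False
```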

@rpopov rpopov reopened this Jan 17, 2025