
CSV Target assumes first record has same headers as the rest #3

Open
anthonyp opened this issue Apr 9, 2017 · 2 comments

Comments


anthonyp commented Apr 9, 2017

This target uses the flattened keys from the first record as the headers for the entire CSV file. However, in some cases a tap will produce records with varying keys (for example, this happens with many streams in the HubSpot tap). When this occurs, the data rows in the CSV will not match the headers.
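The failure mode is easy to reproduce with the standard library's csv.DictWriter (an illustrative sketch, not this target's actual code): when fieldnames come from the first record only, a later record with an extra flattened key is rejected.

```python
import csv
import io

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com", "logic_path__1": "7624040"},  # extra flattened key
]

out = io.StringIO()
# Fieldnames are taken from the first record only -- the root of the bug.
writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
writer.writeheader()

error = None
try:
    for record in records:
        writer.writerow(record)
except ValueError as exc:  # DictWriter rejects keys missing from fieldnames
    error = exc

print(error)
```

A target that writes raw value lists instead of using DictWriter would fail more quietly: the row would simply be misaligned with the header.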

@timvisher

I see two separate solutions here.

  1. If a tap emits schemas for its records, this target should use the
    schema to generate the header for the file, and all records should go
    to that file. If the records are truly non-rectangular (meaning the
    schema is a superset of the various shapes of data returned from the
    API), then missing columns for a given record should be marked with
    _SINGER_MISSING_COLUMN in the resulting CSV to disambiguate between
    real NULL values in a record and records that were of a different
    shape than other records in the same stream.

  2. If a tap does not emit schemas, then this target should cut a new
    file with new headers matching the given record each time it
    encounters records of a different shape than what it had seen already
    for that stream. As @micaelbergeron suggested, it would probably be
    good for the target to track header configurations as it proceeds so
    that it can append to an existing file if the API is returning
    records of differing shapes in an interleaved fashion.

    Micaël Bergeron [3:06 PM] I would probably keep a schema -> file
    reference somewhere so you can append back
    if it comes back to the old schema

    md5(schema) -> file, bingo
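The sentinel idea in solution 1 maps neatly onto csv.DictWriter's restval parameter (a minimal sketch; the flattened schema dict below is hypothetical, and a real target would derive it from the stream's SCHEMA message):

```python
import csv
import io

# Hypothetical already-flattened schema for the stream.
schema_properties = {"id": {}, "email": {}, "logic_path__1": {}}
fieldnames = list(schema_properties)

out = io.StringIO()
# restval fills columns a record never had, so a real NULL (empty cell)
# stays distinguishable from a column that was missing from the record.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="_SINGER_MISSING_COLUMN")
writer.writeheader()
writer.writerow({"id": 1, "email": "a@example.com"})                    # no logic_path__1 key at all
writer.writerow({"id": 2, "email": None, "logic_path__1": "7624040"})   # email is a real NULL

print(out.getvalue())
```

The first data row ends in `_SINGER_MISSING_COLUMN`, while the second shows a genuinely empty `email` cell.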

Note: No solution that requires buffering the entire resultset in
memory should be considered acceptable. We still need to stream. :)

PRs are very welcome for either or both of those solutions.


abij commented Mar 16, 2020

Running into the same issue. The main problem in my case is nested fields which may or may not have content, resulting in new keys:

"logic_path": { "1": "7624040", "2": "7624106" }

which flatten to:

logic_path__1, logic_path__2

The first record does not have those fields.

I made a nice workaround; call it option 3: add missing fields to the end of the header.

This solves my problem of shifted fields when reading the CSV. I was diving into the hashing solution, but the SCHEMA is sent only once, so there is no need for schema housekeeping and hashing.
Downside: you need to check that the current header is a superset of the current record's fields, and when new fields are found the whole file (the header, actually) is rewritten.

I'll make a PR for my solution so you can check it out.
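Option 3 can be sketched as follows (an in-memory simulation with plain string joins; a real target would rewrite the header of the file on disk and use the csv module for proper quoting). Because new fields are only ever appended to the end of the header, earlier rows are simply shorter than the header, which CSV readers treat as missing trailing fields:

```python
header = []
lines = []  # simulated file contents, one CSV line per entry

def write_record(record):
    """Append a row; extend the header (and rewrite only its line) on new fields."""
    new_fields = [k for k in record if k not in header]
    if new_fields:
        header.extend(new_fields)        # option 3: append to the end of the header
        if not lines:
            lines.append("")             # placeholder for the header line
        lines[0] = ",".join(header)      # rewrite just the header line
    lines.append(",".join(str(record.get(k, "")) for k in header))

write_record({"id": 1})
write_record({"id": 2, "logic_path__1": "7624040", "logic_path__2": "7624106"})

print("\n".join(lines))
```

After the second record, the header has grown to three columns while the first data row keeps its single `id` column.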
