
CSV Target assumes first record has same headers as the rest #3

Open
anthonyp opened this issue Apr 9, 2017 · 2 comments

Comments


anthonyp commented Apr 9, 2017

This target uses the flattened keys from the first record as the headers for the entire CSV file. However, in some cases a tap will produce records with varying keys (for example, this happens with many streams in the HubSpot tap). When this occurs, the data rows in the CSV will not match the headers.
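The failure mode is easy to reproduce with the standard library's csv.DictWriter (an illustrative sketch, not this target's actual code): when fieldnames come from the first record only, a later record with an extra flattened key is rejected.

```python
import csv
import io

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com", "logic_path__1": "7624040"},  # extra flattened key
]

out = io.StringIO()
# Fieldnames are taken from the first record only -- the root of the bug.
writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
writer.writeheader()

error = None
try:
    for record in records:
        writer.writerow(record)
except ValueError as exc:  # DictWriter rejects keys missing from fieldnames
    error = exc

print(error)
```

A target that writes raw value lists instead of using DictWriter would fail more quietly: the row would simply be misaligned with the header.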

@timvisher

I see two separate solutions here.

  1. If a tap emits schemas for its records, this target should use the
    schema to generate the header for the file, and all records should go
    to that file. If the records are truly non-rectangular (meaning the
    schema is a superset of the various shapes of data returned from the
    API), then missing columns for a given record should be marked with
    _SINGER_MISSING_COLUMN in the resulting CSV to disambiguate between
    real NULL values in a record and records that were of a different
    shape than other records in the same stream.

  2. If a tap does not emit schemas, then this target should cut a new
    file with new headers matching the given record each time it
    encounters records of a different shape than what it had seen already
    for that stream. As @micaelbergeron suggested, it would probably be
    good for the target to track header configurations as it proceeds so
    that it can append to an existing file if the API is returning
    records of differing shapes in an interleaved fashion.

    Micaël Bergeron [3:06 PM] I would probably keep a schema -> file
    reference somewhere so you can append back
    if it comes back to the old schema

    md5(schema) -> file, bingo
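The sentinel idea in solution 1 maps neatly onto csv.DictWriter's restval parameter (a minimal sketch; the flattened schema dict below is hypothetical, and a real target would derive it from the stream's SCHEMA message):

```python
import csv
import io

# Hypothetical already-flattened schema for the stream.
schema_properties = {"id": {}, "email": {}, "logic_path__1": {}}
fieldnames = list(schema_properties)

out = io.StringIO()
# restval fills columns a record never had, so a real NULL (empty cell)
# stays distinguishable from a column that was missing from the record.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="_SINGER_MISSING_COLUMN")
writer.writeheader()
writer.writerow({"id": 1, "email": "a@example.com"})                    # no logic_path__1 key at all
writer.writerow({"id": 2, "email": None, "logic_path__1": "7624040"})   # email is a real NULL

print(out.getvalue())
```

The first data row ends in `_SINGER_MISSING_COLUMN`, while the second shows a genuinely empty `email` cell.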

Note: No solution that requires buffering the entire resultset in
memory should be considered acceptable. We still need to stream. :)

PRs are very welcome for either or both of those solutions.


abij commented Mar 16, 2020

Running into the same issue. The main problem in my case is nested fields which may or may not have content, resulting in new keys:

"logic_path": { "1": "7624040", "2": "7624106" }

which flatten to:

logic_path__1, logic_path__2

The first record does not have those fields.

I made a nice workaround; call it option 3: add missing fields to the end of the header.

This solves my problem of shifted fields when reading the CSV. I was diving into the hashing solution, but the SCHEMA is sent only once, so there is no need for schema housekeeping and hashing.
Downside: you need to check that the current header is a superset of the current record's fields, and when new fields are found the whole file (the header, actually) is rewritten.

I'll make a PR for my solution so you can check it out.
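Option 3 can be sketched as follows (an in-memory simulation with plain string joins; a real target would rewrite the header of the file on disk and use the csv module for proper quoting). Because new fields are only ever appended to the end of the header, earlier rows are simply shorter than the header, which CSV readers treat as missing trailing fields:

```python
header = []
lines = []  # simulated file contents, one CSV line per entry

def write_record(record):
    """Append a row; extend the header (and rewrite only its line) on new fields."""
    new_fields = [k for k in record if k not in header]
    if new_fields:
        header.extend(new_fields)        # option 3: append to the end of the header
        if not lines:
            lines.append("")             # placeholder for the header line
        lines[0] = ",".join(header)      # rewrite just the header line
    lines.append(",".join(str(record.get(k, "")) for k in header))

write_record({"id": 1})
write_record({"id": 2, "logic_path__1": "7624040", "logic_path__2": "7624106"})

print("\n".join(lines))
```

After the second record, the header has grown to three columns while the first data row keeps its single `id` column.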
