
Why does the target convert to Avro? #85

Open · judahrand opened this issue Feb 21, 2022 · 5 comments

judahrand (Contributor) commented Feb 21, 2022

BigQuery is capable of batch-ingesting newline-delimited JSON files, and the Singer spec is essentially a stream of newline-delimited JSON messages.

Is there any reason not to extract the record field from each Singer message, dump the records to a file, upload it to GCS, and let BigQuery sort out ingesting the data in parallel?

This would simplify the code in this target significantly and possibly even be faster.
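For illustration, a minimal sketch of that flow using the google-cloud-bigquery client, assuming the JSONL files have already been staged in GCS (the bucket, table, and schema handling here are hypothetical, not code from this target):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # a real target would derive the schema from the Singer SCHEMA message
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Hypothetical staging path and destination table.
load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/singer-records-*.jsonl",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until BigQuery finishes the (internally parallel) load
```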

One downside would be the inability to directly support nested data as the target does now. However, given that the FastSync implementation already uses CSVs, which support neither nested nor repeated data, and given the recent introduction of BigQuery's JSON data type, this may be acceptable. Until the JSON data type is supported by batch load jobs, CSVs may be a reasonable stop gap (one possible approach is sketched below).
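One way that stop gap could work, purely as a sketch: serialize any nested or repeated values to JSON strings before writing the CSV, so they survive a CSV load as STRING columns (flatten_for_csv and the sample record are hypothetical):

```python
import csv
import json

def flatten_for_csv(record: dict) -> dict:
    """Serialize nested dicts/lists to JSON strings so a CSV load can carry them as STRINGs."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }

records = [{"id": 1, "meta": {"source": "tap"}, "tags": ["a", "b"]}]
with open("/tmp/records.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "meta", "tags"])
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_for_csv(record))
```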

jmriego (Owner) commented Feb 21, 2022

That was because, when I tried doing that two years ago, I had lots of problems ingesting it and it was much slower than Avro. If I recall correctly, BigQuery complains about things like empty dictionaries, so several changes had to be made to the records before ingesting them. Also, BigQuery can only load uncompressed JSON in parallel, so it required more network usage.
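For example, the kind of cleanup needed might look like this hypothetical helper, which recursively strips the empty dictionaries that reportedly failed to load (a sketch, not code from this target):

```python
def drop_empty_dicts(value):
    """Recursively strip empty dicts, which BigQuery's JSON loader reportedly rejected."""
    if isinstance(value, dict):
        cleaned = {k: drop_empty_dicts(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v != {}}
    if isinstance(value, list):
        return [v for v in (drop_empty_dicts(item) for item in value) if v != {}]
    return value

# e.g. {"a": {}, "b": {"c": {}}, "d": [1, {}]}  ->  {"d": [1]}
```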

Some of these things might have changed since then anyway, so it might be worth checking. I totally agree with you that Avro makes the code more complex.

What I wouldn't use for now is CSV, because that would break nested and possibly array types. Actually, if you check the older commits you can see that was the first approach. As you mention, FastSync has a similar issue, but not everyone using this target is using FastSync, so I think it's best not to remove that capability and instead improve FastSync if we can.

Now that BigQuery has added the JSON data type, some of these things might be easier to do.
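As a sketch of what that could look like once JSON columns are loadable, assuming a google-cloud-bigquery version that accepts the JSON field type (the table and column names are hypothetical):

```python
from google.cloud import bigquery

# Hypothetical destination table whose nested Singer payload lands in a JSON column.
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("payload", "JSON"),  # the JSON type was newly introduced at the time
]
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
bigquery.Client().create_table(table)
```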

judahrand (Author) commented Feb 21, 2022

> That was because, when I tried doing that two years ago, I had lots of problems ingesting it and it was much slower than Avro. If I recall correctly, BigQuery complains about things like empty dictionaries, so several changes had to be made to the records before ingesting them.

Interesting. I'll have a bit of a play with this.

> Also, BigQuery can only load uncompressed JSON in parallel, so it required more network usage.

Although this is a bit annoying, I'd expect that in most cases people using this target would be able to run it in the same GCP region as their BigQuery dataset? That would make bandwidth less of an issue, I'd have thought. There's pretty high bandwidth available internally in GCP!

> As you mention, FastSync has a similar issue, but not everyone using this target is using FastSync, so I think it's best not to remove that capability and instead improve FastSync if we can.

That's fair. Though it does seem odd that FULL_TABLE (which would use FastSync) and INCREMENTAL will have different behaviour. I also suspect LOG_BASED behaves differently in FastSync and Singer modes even now?

These inconsistencies may be enough for us to just maintain our own, simpler fork that uses JSONL, depending on the results of some initial performance investigations.

jmriego (Owner) commented Feb 21, 2022

As far as I know, LOG_BASED will first trigger a FastSync, and on subsequent runs it will trigger a normal run. The main gotcha is that if LOG_BASED is pointing to a log that no longer exists (say you didn't run PipelineWise for a whole day and the log retention time of your database is just 12 hours), the tap will fail. You'd have to force a full rerun by, for example, deleting the state.json.

If you already have some code for the JSONL approach, I'm happy to do some testing there. If we can use that instead of Avro and there are no issues, I'm all for it.

judahrand (Author) commented

The issue with LOG_BASED I'm getting at is that if you have a RECORD column in your database, FastSync will convert it to a STRING, but the Singer implementation can parse it as a RECORD.

judahrand (Author) commented Feb 21, 2022

@jmriego I had a go at implementing the current behaviour with JSONL. Seems to be fine. Keen to get your thoughts on it. I also implemented the dump to GCS that Wise uses for Snowflake (well, they dump to S3).
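For reference, staging the JSONL files in GCS ahead of the load job could be as simple as this sketch with the google-cloud-storage client (the bucket and paths are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-staging-bucket")  # hypothetical staging bucket

# Upload a locally buffered batch of Singer records, ready for a BigQuery load job.
blob = bucket.blob("singer/records-00001.jsonl")
blob.upload_from_filename("/tmp/records-00001.jsonl")
```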
