
Why does the target convert to Avro? #85

Open · judahrand opened this issue Feb 21, 2022 · 5 comments

judahrand (Contributor) commented Feb 21, 2022

BigQuery is capable of batch-ingesting newline-delimited JSON files, and the Singer spec is essentially a stream of newline-delimited JSON messages.

Is there any reason not to extract the record field from each Singer message, dump the records to a file, upload it to GCS, and let BigQuery sort out ingesting the data in parallel?

This would simplify the code in this target significantly and possibly even be faster.
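For illustration, a minimal sketch of that flow using the google-cloud-bigquery client, assuming the JSONL files have already been staged in GCS (the bucket, table, and schema handling here are hypothetical, not code from this target):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # a real target would derive the schema from the Singer SCHEMA message
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Hypothetical staging path and destination table.
load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/singer-records-*.jsonl",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until BigQuery finishes the (internally parallel) load
```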

One downside would be the inability to directly support nested data as the target does now. However, given that the FastSync implementation already uses CSVs, which support neither nested nor repeated data, and given the recent introduction of BigQuery's JSON data type, this may be acceptable. Until the JSON data type is supported by batch load jobs, CSVs may be a reasonable stop gap (one possible approach is sketched below).
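One way that stop gap could work, purely as a sketch: serialize any nested or repeated values to JSON strings before writing the CSV, so they survive a CSV load as STRING columns (flatten_for_csv and the sample record are hypothetical):

```python
import csv
import json

def flatten_for_csv(record: dict) -> dict:
    """Serialize nested dicts/lists to JSON strings so a CSV load can carry them as STRINGs."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }

records = [{"id": 1, "meta": {"source": "tap"}, "tags": ["a", "b"]}]
with open("/tmp/records.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "meta", "tags"])
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_for_csv(record))
```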

jmriego (Owner) commented Feb 21, 2022

That was because, when I tried doing that two years ago, I had lots of problems ingesting it and it was much slower than Avro. If I recall correctly, BigQuery complains about things like empty dictionaries, so several changes had to be made to the records before ingesting them. Also, BigQuery can only load uncompressed JSON in parallel, so it required more network usage.
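For example, the kind of cleanup needed might look like this hypothetical helper, which recursively strips the empty dictionaries that reportedly failed to load (a sketch, not code from this target):

```python
def drop_empty_dicts(value):
    """Recursively strip empty dicts, which BigQuery's JSON loader reportedly rejected."""
    if isinstance(value, dict):
        cleaned = {k: drop_empty_dicts(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v != {}}
    if isinstance(value, list):
        return [v for v in (drop_empty_dicts(item) for item in value) if v != {}]
    return value

# e.g. {"a": {}, "b": {"c": {}}, "d": [1, {}]}  ->  {"d": [1]}
```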

Some of these things might have changed since then anyway, so it might be worth checking. I totally agree with you that Avro makes the code more complex.

What I wouldn't use for now is CSV, because that would break nested and possibly array types. Actually, if you check the older commits you can see that was the first approach. As you mention, FastSync has a similar issue, but not everyone using this target is using FastSync, so I think it's best not to remove that capability and instead improve FastSync if we can.

Now that BigQuery has added the JSON data type, some of these things might be easier to do.
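As a sketch of what that could look like once JSON columns are loadable, assuming a google-cloud-bigquery version that accepts the JSON field type (the table and column names are hypothetical):

```python
from google.cloud import bigquery

# Hypothetical destination table whose nested Singer payload lands in a JSON column.
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("payload", "JSON"),  # the JSON type was newly introduced at the time
]
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
bigquery.Client().create_table(table)
```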

judahrand (Author) commented Feb 21, 2022

> That was because, when I tried doing that two years ago, I had lots of problems ingesting it and it was much slower than Avro. If I recall correctly, BigQuery complains about things like empty dictionaries, so several changes had to be made to the records before ingesting them.

Interesting. I'll have a bit of a play with this.

> Also, BigQuery can only load uncompressed JSON in parallel, so it required more network usage.

Although this is a bit annoying, I'd expect that in most cases people using this target would be able to run it in the same GCP region as their BigQuery dataset? That would make bandwidth less of an issue, I'd have thought. There's pretty high bandwidth available internally in GCP!

> As you mention, FastSync has a similar issue, but not everyone using this target is using FastSync, so I think it's best not to remove that capability and instead improve FastSync if we can.

That's fair. Though it does seem odd that FULL_TABLE (which would use FastSync) and INCREMENTAL will have different behaviour. I also suspect LOG_BASED behaves differently in FastSync and Singer modes even now?

These inconsistencies may be enough for us to just maintain our own, simpler fork that uses JSONL, depending on the results of some initial performance investigations.

jmriego (Owner) commented Feb 21, 2022

As far as I know, LOG_BASED will first trigger a FastSync, and on subsequent runs it will trigger a normal run. The main gotcha is that if LOG_BASED is pointing to a log that no longer exists (say you didn't run PipelineWise for a whole day and the log retention time of your database is just 12 hours), the tap will fail. You'd have to force a full rerun by, for example, deleting the state.json.

If you already have some code for the JSONL approach, I'm happy to do some testing there. If we can use that instead of Avro and there are no issues, I'm all for it.

judahrand (Author) commented

The issue with LOG_BASED I'm getting at is that if you have a RECORD column in your database, FastSync will convert it to a STRING, but the Singer implementation can parse it as a RECORD.

judahrand (Author) commented Feb 21, 2022

@jmriego I had a go at implementing the current behaviour with JSONL. Seems to be fine. Keen to get your thoughts on it. I also implemented the dump to GCS that Wise uses for Snowflake (well, they dump to S3).
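For reference, staging the JSONL files in GCS ahead of the load job could be as simple as this sketch with the google-cloud-storage client (the bucket and paths are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-staging-bucket")  # hypothetical staging bucket

# Upload a locally buffered batch of Singer records, ready for a BigQuery load job.
blob = bucket.blob("singer/records-00001.jsonl")
blob.upload_from_filename("/tmp/records-00001.jsonl")
```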
