Why does the target convert to Avro? #85
Comments
That was because when I tried doing that two years ago I had lots of problems ingesting it, and it was much slower than Avro. If I recall correctly, BigQuery complains if you have things like empty dictionaries, so there were several changes to be made before ingesting it. Also, only uncompressed JSON data can be loaded in parallel, so it required more network usage. Some of these things might have changed since then, so it might be worth checking. I totally agree with you that Avro makes the code more complex.

What I wouldn't use for now is CSV, because that would break nested and possibly array types. Actually, if you check the older commits you can see that was the first approach. As you mention, FastSync has a similar issue, but not everyone using this target uses FastSync, so I think it's best not to remove that capability and instead improve FastSync if we can.
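For illustration, here is a minimal sketch of the kind of preprocessing described above (dropping empty dictionaries before writing newline-delimited JSON). The helper and the exact rules are hypothetical, not the target's actual code:

```python
import json


def clean_record(value):
    """Recursively drop empty dicts, which the comment above reports BigQuery's
    JSON loader complaining about. Illustrative only: the rules a real target
    needs depend on the schema (None handling, empty RECORD fields, etc.)."""
    if isinstance(value, dict):
        cleaned = {k: clean_record(v) for k, v in value.items()}
        # Remove keys whose cleaned value is an empty dict.
        return {k: v for k, v in cleaned.items() if v != {}}
    if isinstance(value, list):
        return [clean_record(v) for v in value]
    return value


record = {"id": 1, "meta": {}, "name": "x"}
print(json.dumps(clean_record(record)))  # {"id": 1, "name": "x"}
```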
Interesting. I'll have a bit of a play with this.
Although this is a bit annoying, I'd expect that in most cases people using this target would be able to run it in the same GCP region as their BQ dataset? That would make bandwidth less of an issue, I'd have thought; there's pretty high bandwidth available internally in GCP!
That's fair. Though it does seem odd that FastSync and the regular target path use different formats. These inconsistencies may be enough for us to just maintain our own simpler fork which uses JSONL, depending on the results of some initial performance investigations.
If you already have some code for the JSONL approach, I'm happy to do some testing there. If we can use that instead of Avro and there are no issues, I'm all for it.
The issue with
@jmriego I had a go at implementing the current behaviour with JSONL. Seems to be fine. Keen to get your thoughts on it. I also implemented the dump to GCS that Wise uses for Snowflake (well, they dump to S3).
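For reference, a rough sketch of what a GCS-plus-batch-load path can look like with the Google Cloud client libraries; the bucket, file, and table names are placeholders, and this is not the code from the implementation mentioned above:

```python
from google.cloud import bigquery, storage

# Placeholders: replace with real bucket/dataset/table names.
BUCKET = "my-etl-staging-bucket"
GCS_PATH = "batches/stream-001.jsonl"
TABLE_ID = "my-project.my_dataset.my_table"

# 1. Upload the newline-delimited JSON batch file to GCS.
storage.Client().bucket(BUCKET).blob(GCS_PATH).upload_from_filename("stream-001.jsonl")

# 2. Ask BigQuery to batch-load it; BigQuery parallelises the ingest itself.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,  # a real target would pass an explicit schema instead
)
load_job = bigquery.Client().load_table_from_uri(
    f"gs://{BUCKET}/{GCS_PATH}", TABLE_ID, job_config=job_config
)
load_job.result()  # wait for completion and surface any load errors
```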
BigQuery is capable of batch ingesting newline-delimited JSON files; the Singer spec is essentially exactly this. Is there any reason not to extract the record field from each Singer message, dump the records to a file, upload it to GCS, and let BigQuery sort out ingesting the data in parallel? This would simplify the code in this target significantly and possibly even be faster.

One downside would be the inability to directly support nested data as the target does now. However, given that the FastSync implementation already uses CSVs, which do not support nested or repeated data, and the recent introduction of BigQuery's JSON data type, this may be acceptable? That said, until the JSON data type is supported by JSON batch load jobs, CSVs may be a reasonable stop gap?
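A minimal sketch of the proposal above, assuming Singer messages arrive on stdin; the output file name is arbitrary, and batching, schema handling, and STATE messages are deliberately ignored:

```python
import json
import sys


def messages_to_jsonl(lines, out):
    """Write the "record" payload of each Singer RECORD message as one line
    of newline-delimited JSON. SCHEMA and STATE messages are skipped here."""
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            out.write(json.dumps(msg["record"]) + "\n")


if __name__ == "__main__":
    with open("batch.jsonl", "w", encoding="utf-8") as out:
        messages_to_jsonl(sys.stdin, out)
```

The resulting file is exactly the kind of input a GCS upload plus a NEWLINE_DELIMITED_JSON load job (as sketched earlier in the thread) would consume.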