Introduce schema evolution via the -S flag #164
base: main
Commits on Jan 9, 2024
Demonstrate a bug for schema "validation" when writing JSON to a table
I stumbled into this while pilfering code from kafka-delta-ingest for another project and discovered that the code in `write_values` which does `record_batch.schema() != arrow_schema` doesn't do what we think it does.

Basically if `Decoder` "works" the schema it's going to return is just the schema passed into it. It has no bearing on whether the JSON has the same schema. Don't ask me why.

Using the reader's `infer_json_schema_*` functions can provide a Schema that is useful for comparison:

    let mut value_iter = json_buffer.iter().map(|j| Ok(j.to_owned()));
    let json_schema = infer_json_schema_from_iterator(value_iter.clone()).expect("Failed to infer!");
    let decoder = Decoder::new(Arc::new(json_schema), options);
    if let Some(batch) = decoder.next_batch(&mut value_iter).expect("Failed to create RecordBatch") {
        assert_eq!(batch.schema(), arrow_schema_ref, "Schemas don't match!");
    }

What's even more interesting is that after a certain number of fields are removed, the Decoder no longer pretends it can Decode the JSON. I am baffled as to why.
Commit a860dcb

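The idea in the first commit message (compare a schema inferred from the data itself, rather than the schema that was handed to the decoder) can be sketched without Arrow. Everything below is a hypothetical stand-in: `FieldType`, `Value`, `Schema`, `infer_schema`, and `conforms` are illustrative names and are not part of arrow or kafka-delta-ingest.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a column type; not the arrow crate's `DataType`.
#[derive(Debug, Clone, PartialEq, Eq)]
enum FieldType {
    Boolean,
    Number,
    Text,
}

/// Toy stand-in for a JSON value (flat records only, an assumption).
#[derive(Debug, Clone)]
enum Value {
    Bool(bool),
    Num(f64),
    Str(String),
}

/// Field name -> type. A BTreeMap makes the comparison independent of field order.
type Schema = BTreeMap<String, FieldType>;
type Record = Vec<(String, Value)>;

/// Infer a schema from the records themselves instead of trusting the
/// schema that was passed to the decoder.
fn infer_schema(records: &[Record]) -> Schema {
    let mut schema = Schema::new();
    for record in records {
        for (name, value) in record {
            let ty = match value {
                Value::Bool(_) => FieldType::Boolean,
                Value::Num(_) => FieldType::Number,
                Value::Str(_) => FieldType::Text,
            };
            // First observed type wins; a real implementation would reconcile conflicts.
            schema.entry(name.clone()).or_insert(ty);
        }
    }
    schema
}

/// True only when the schema inferred from the data equals the table's schema.
fn conforms(records: &[Record], table_schema: &Schema) -> bool {
    infer_schema(records) == *table_schema
}
```

With this shape, a batch that carries an extra field no longer passes the check silently, which is the failure mode the commit message describes.
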
Begin introducing schema conformance testing ahead of more substantial refactor
The intention here is to enable more consistent schema handling within the writers.
Sponsored-by: Raft LLC
Commit 93e3d73

Introduce DeserializedMessage for carrying schema information into the writers
The DeserializedMessage carries optional inferred schema information along with the message itself. This is useful for understanding whether schema evolution should happen "later" in the message processing pipeline. The downside of this behavior is that there will be a performance impact as arrow_json does schema inference.
Sponsored-by: Raft LLC
Commit f49b4f7

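A minimal sketch of what "a message plus optional inferred schema" could look like, assuming a flat key/value message and a schema reduced to field names. Only the name `DeserializedMessage` comes from the commit message; its fields, `deserialize`, `needs_evolution`, and the type aliases are hypothetical. The `schema_evolution` flag mirrors the later commit in this PR that skips inference when evolution is disabled.

```rust
/// Hypothetical stand-ins for the real message and schema types.
type Message = Vec<(String, String)>;
type InferredSchema = Vec<String>; // field names only, for illustration

/// A message together with the schema inferred from it, if inference ran.
#[derive(Debug, Clone, PartialEq)]
struct DeserializedMessage {
    message: Message,
    schema: Option<InferredSchema>,
}

/// Wrap a message, paying for schema inference only when evolution is enabled.
fn deserialize(message: Message, schema_evolution: bool) -> DeserializedMessage {
    let schema: Option<InferredSchema> = if schema_evolution {
        Some(message.iter().map(|(key, _)| key.clone()).collect())
    } else {
        None
    };
    DeserializedMessage { message, schema }
}

impl DeserializedMessage {
    /// True when the inferred schema has a field the table does not know about.
    /// Without an inferred schema there is nothing to compare, so this is false.
    fn needs_evolution(&self, table_fields: &[String]) -> bool {
        match &self.schema {
            Some(fields) => fields.iter().any(|f| !table_fields.contains(f)),
            None => false,
        }
    }
}
```

Carrying the schema as an `Option` lets a later stage of the pipeline decide about evolution without re-reading the message, and keeps the non-evolving path free of inference cost.
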
Put avro dependencies behind the `avro` feature flag
Turning avro off drops about 50 crates from the default build, which is useful for development, but the code would need to be cleaned up to remove this from the default features list. See #163
Commit f8eb561

Commit 4ed2c8a

Remove avro from the default features and feature gate its code
This change is a little wrapped up in the introduction of DeserializedMessage, but the trade-off for development targeting S3 is that I am linking 382 crates every cycle as opposed to 451.
Fixes #163
Commit 72aff33

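For readers unfamiliar with feature gating, here is a small sketch of the pattern, assuming a Cargo feature named `avro` as in the commit titles. The function `supported_formats` is hypothetical and not taken from the project; the point is only that code behind `#[cfg(feature = "avro")]` is compiled, and its dependencies linked, only when the feature is enabled.

```rust
/// Built only with `--features avro`: Avro appears in the list.
#[cfg(feature = "avro")]
fn supported_formats() -> Vec<&'static str> {
    vec!["json", "avro"]
}

/// Default build (feature off): the Avro code path does not exist at all,
/// so none of its dependencies need to be compiled or linked.
#[cfg(not(feature = "avro"))]
fn supported_formats() -> Vec<&'static str> {
    vec!["json"]
}
```

Unlike a runtime check, the disabled variant is removed at compile time, which is what makes the smaller crate count possible.
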
Relocate DataWriter in the writer.rs file for easier readability
I don't know why the impl was way down there 😄
Commit 071a6a2

Commit baf9696

Commit 72c2f0b

Introduce schema evolution into the IngestProcessor runloop
This commit introduces some interplay between the IngestProcessor and DataWriter, the latter of which needs to keep track of whether or not it has a changed schema. What should be done with that changed schema must necessarily live in IngestProcessor since that will perform the Delta transaction commits at the tail end of batch processing. There are some potential mismatches between the schema in storage and what the DataWriter has, so this change tries to run the runloop again if the current schema and the evolved schema are incompatible.
Closes #131
Sponsored-by: Raft LLC
Commit e4c0f1e

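The interplay described in this commit can be sketched roughly as follows. This is not the project's code: `Table`, `Outcome`, `commit_batch`, and the methods on this toy `DataWriter` are hypothetical, the schema is reduced to a list of field names, and "compatible" is assumed here to mean that the evolved schema only appends fields to the stored one.

```rust
/// Simplified schema: an ordered list of field names (an assumption).
type Schema = Vec<String>;

/// Stand-in for the writer: remembers an evolved schema when it sees new fields.
struct DataWriter {
    schema: Schema,
    evolved: Option<Schema>,
}

impl DataWriter {
    fn new(schema: Schema) -> Self {
        DataWriter { schema, evolved: None }
    }

    /// Accept a record's field names and track any field the schema lacks.
    fn write(&mut self, fields: &[&str]) {
        let mut candidate = self.evolved.clone().unwrap_or_else(|| self.schema.clone());
        let mut changed = false;
        for field in fields {
            if !candidate.iter().any(|known| known.as_str() == *field) {
                candidate.push(field.to_string());
                changed = true;
            }
        }
        if changed {
            self.evolved = Some(candidate);
        }
    }

    /// Hand the evolved schema (if any) to the processor, clearing it here.
    fn take_evolved_schema(&mut self) -> Option<Schema> {
        self.evolved.take()
    }
}

/// Stand-in for the table in storage.
struct Table {
    schema: Schema,
    version: u64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Committed(u64),
    Retry,
}

/// Assumed rule: the evolved schema must start with every stored field, in order.
fn is_compatible(stored: &Schema, evolved: &Schema) -> bool {
    evolved.starts_with(stored)
}

/// Tail end of one loop pass: commit, or resync the writer and ask for a retry.
fn commit_batch(table: &mut Table, writer: &mut DataWriter) -> Outcome {
    if let Some(evolved) = writer.take_evolved_schema() {
        if !is_compatible(&table.schema, &evolved) {
            // The writer was working from a stale schema; reload and rerun.
            writer.schema = table.schema.clone();
            return Outcome::Retry;
        }
        table.schema = evolved.clone();
        writer.schema = evolved;
    }
    table.version += 1;
    Outcome::Committed(table.version)
}
```

The writer only records that its schema changed; the decision to commit or retry stays with the processor, matching the division of responsibility the commit message describes.
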
Disable schema inference when schema evolution is disabled
This will ensure the non-evolution case stays relatively speedy!
Sponsored-by: Raft LLC
Commit 50e81da

Commits on Jul 8, 2024
Commit 1983bb0

Commit 144513a

Commit 706163c