Introduce schema evolution via the -S flag #164

Open
wants to merge 14 commits into
base: main

Commits on Jan 9, 2024

  1. Demonstrate a bug for schema "validation" when writing JSON to a table

    I stumbled into this while pilfering code from kafka-delta-ingest for another
    project and discovered that the code in `write_values` which does
    `record_batch.schema() != arrow_schema` doesn't do what we think it does.
    
    Basically, if `Decoder` "works", the schema it returns is just the schema
    that was passed into it. It has no bearing on whether the JSON actually has
    that schema. Don't ask me why.
    
    Using the reader's `infer_json_schema_*` functions can provide a Schema that is
    useful for comparison:
    
            // Infer a schema from the JSON values instead of trusting the one handed to Decoder
            let mut value_iter = json_buffer.iter().map(|j| Ok(j.to_owned()));
            let json_schema = infer_json_schema_from_iterator(value_iter.clone()).expect("Failed to infer!");
            let decoder = Decoder::new(Arc::new(json_schema), options);
            if let Some(batch) = decoder.next_batch(&mut value_iter).expect("Failed to create RecordBatch") {
                // The batch carries the inferred schema, so this comparison is meaningful
                assert_eq!(batch.schema(), arrow_schema_ref, "Schemas don't match!");
            }
    
    What's even more interesting is that after a certain number of fields are
    removed, the Decoder no longer pretends it can decode the JSON. I am baffled
    as to why.
    rtyler committed Jan 9, 2024
    a860dcb
  2. Begin introducing schema conformance testing ahead of more substantial refactor
    
    The intention here is to enable more consistent schema handling within
    the writers
    
    Sponsored-by: Raft LLC
    rtyler committed Jan 9, 2024
    93e3d73
  3. Introduce DeserializedMessage for carrying schema information into the writers
    
    The DeserializedMessage carries optional inferred schema information
    along with the message itself. This is useful for understanding whether
    schema evolution should happen "later" in the message processing
    pipeline.
    
    The downside of this behavior is that there will be a performance impact
    as arrow_json does schema inference.
    
    Sponsored-by: Raft LLC
    rtyler committed Jan 9, 2024
    f49b4f7
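
A minimal sketch of the idea in commit 3, not the code from this PR: a message type that carries an optional inferred schema so later stages can tell whether evolution is needed. `SimpleSchema` (a name-to-type map) and the `new_fields` helper are hypothetical stand-ins chosen to keep the example dependency-free; the real type would presumably hold an Arrow schema.

```rust
use std::collections::BTreeMap;

/// Stand-in for an Arrow schema: field name -> type name (an assumption made
/// to keep the sketch dependency-free).
type SimpleSchema = BTreeMap<String, String>;

/// A message together with the schema inferred from it, when inference ran.
#[derive(Debug, Clone)]
struct DeserializedMessage {
    /// Raw JSON text standing in for the decoded value.
    message: String,
    /// `None` means no schema was inferred for this message.
    schema: Option<SimpleSchema>,
}

impl DeserializedMessage {
    fn new(message: &str, schema: Option<SimpleSchema>) -> Self {
        DeserializedMessage {
            message: message.to_string(),
            schema,
        }
    }

    /// Names of inferred fields that the table schema does not have yet.
    /// An empty result means no evolution is needed, or nothing was inferred.
    fn new_fields(&self, table: &SimpleSchema) -> Vec<String> {
        match &self.schema {
            Some(inferred) => inferred
                .keys()
                .filter(|name| !table.contains_key(*name))
                .cloned()
                .collect(),
            None => Vec::new(),
        }
    }
}
```

Keeping the schema as an `Option` lets callers skip the comparison entirely when no inference was performed.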
  4. Put avro dependencies behind the avro feature flag

    Turning avro off drops about 50 crates from the default build, which is
    useful for development, but the code would need to be cleaned up before
    avro can be removed from the default features list.
    
    See #163
    rtyler committed Jan 9, 2024
    f8eb561
  5. Remove unused dependencies

    Identified by `cargo +nightly udeps`
    rtyler committed Jan 9, 2024
    4ed2c8a
  6. Remove avro from the default features and feature gate its code

    This change is a little wrapped up in the introduction of
    DeserializedMessage but the trade-off for development targeting S3 is
    that I am linking 382 crates every cycle as opposed to 451.
    
    Fixes #163
    rtyler committed Jan 9, 2024
    72aff33
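
Feature gating of the kind described in commits 4 and 6 might look roughly like the sketch below. The `MessageFormat` enum, the `avro_support` module, and `deserializer_for` are hypothetical names for illustration; only the `avro` feature name comes from the commits. It assumes `avro` is declared as a non-default feature in `Cargo.toml`.

```rust
/// Message formats the ingest path can be asked to decode (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum MessageFormat {
    Json,
    Avro,
}

/// Avro-specific code only exists when built with `--features avro`, so its
/// dependencies are not compiled or linked otherwise.
#[cfg(feature = "avro")]
mod avro_support {
    pub fn describe() -> &'static str {
        "avro deserializer"
    }
}

/// Pick a deserializer, failing at runtime if the format was compiled out.
fn deserializer_for(format: MessageFormat) -> Result<&'static str, String> {
    match format {
        MessageFormat::Json => Ok("json deserializer"),
        #[cfg(feature = "avro")]
        MessageFormat::Avro => Ok(avro_support::describe()),
        #[cfg(not(feature = "avro"))]
        MessageFormat::Avro => Err("built without the `avro` feature".to_string()),
    }
}
```

Returning an error for the compiled-out format is one possible design; hiding the `Avro` variant behind the same `cfg` would turn the failure into a compile-time one instead.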
  7. Relocate DataWriter in the writer.rs file for easier readability

    I don't know why the impl was way down there 😄
    rtyler committed Jan 9, 2024
    071a6a2
  8. baf9696
  9. 72c2f0b
  10. Introduce schema evolution into the IngestProcessor runloop

    This commit introduces some interplay between the IngestProcessor and
    DataWriter, the latter of which needs to keep track of whether or not it
    has a changed schema.
    
    What should be done with that changed schema must necessarily live in
    IngestProcessor since that will perform the Delta transaction commits at
    the tail end of batch processing.
    
    There are some potential mismatches between the schema in storage and
    what the DataWriter has, so this change tries to run the runloop again
    if the current schema and the evolved schema are incompatible.
    
    Closes #131
    
    Sponsored-by: Raft LLC
    rtyler committed Jan 9, 2024
    e4c0f1e
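
The interplay described in commit 10 could be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `SimpleSchema`, `schema_of`, `compatible`, `Outcome`, and `finish_batch` are hypothetical, and this `DataWriter` only tracks schema changes without writing any data. The end-of-batch step stands in for what IngestProcessor would do before a Delta commit.

```rust
use std::collections::BTreeMap;

/// Stand-in for an Arrow schema: field name -> type name (hypothetical).
type SimpleSchema = BTreeMap<String, String>;

fn schema_of(fields: &[(&str, &str)]) -> SimpleSchema {
    fields.iter().map(|(n, t)| (n.to_string(), t.to_string())).collect()
}

/// Tracks the schema it writes with and whether that schema has changed.
struct DataWriter {
    schema: SimpleSchema,
    schema_changed: bool,
}

impl DataWriter {
    fn new(schema: SimpleSchema) -> Self {
        DataWriter { schema, schema_changed: false }
    }

    /// Accept a record's schema, widening the writer schema with new fields.
    fn write(&mut self, record_schema: &SimpleSchema) {
        for (name, ty) in record_schema {
            if !self.schema.contains_key(name) {
                self.schema.insert(name.clone(), ty.clone());
                self.schema_changed = true;
            }
        }
    }

    /// Hand the evolved schema to the caller once, then clear the flag.
    fn take_evolved_schema(&mut self) -> Option<SimpleSchema> {
        if self.schema_changed {
            self.schema_changed = false;
            Some(self.schema.clone())
        } else {
            None
        }
    }
}

/// Compatible here means: every shared field has the same type.
fn compatible(a: &SimpleSchema, b: &SimpleSchema) -> bool {
    a.iter().all(|(name, ty)| b.get(name).map_or(true, |other| other == ty))
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Committed,
    CommittedWithNewSchema,
    Retry,
}

/// End-of-batch step: decide what to do with a changed schema.
fn finish_batch(writer: &mut DataWriter, table_schema: &mut SimpleSchema) -> Outcome {
    match writer.take_evolved_schema() {
        None => Outcome::Committed,
        Some(evolved) if compatible(&*table_schema, &evolved) => {
            *table_schema = evolved;
            Outcome::CommittedWithNewSchema
        }
        Some(_) => {
            // Stored and evolved schemas disagree: reset the writer to the
            // stored schema and ask the caller to run the loop again.
            writer.schema = table_schema.clone();
            Outcome::Retry
        }
    }
}
```

The writer only records that something changed; the decision about committing or retrying stays with the caller, which mirrors the split of responsibilities the commit message describes.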
  11. Disable schema inference when schema evolution is disabled

    This will ensure the non-evolution case stays relatively speedy!
    
    Sponsored-by: Raft LLC
    rtyler committed Jan 9, 2024
    50e81da
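
One way to picture commit 11 is to gate the inference step on the evolution setting, so the cost is only paid when `-S` is in effect. Everything below is a hypothetical sketch: `IngestOptions`, `Inferrer`, and `maybe_infer` are invented for illustration, and the call counter exists only to make the skipped work visible.

```rust
/// Options relevant to this sketch; `schema_evolution` mirrors the `-S` flag.
struct IngestOptions {
    schema_evolution: bool,
}

/// Stand-in for the schema inference step: collects field names and counts
/// how often it runs.
struct Inferrer {
    calls: usize,
}

impl Inferrer {
    fn infer(&mut self, record: &[(&str, &str)]) -> Vec<String> {
        self.calls += 1;
        record.iter().map(|(name, _)| name.to_string()).collect()
    }
}

/// Infer field names only when schema evolution is enabled.
fn maybe_infer(
    opts: &IngestOptions,
    inferrer: &mut Inferrer,
    record: &[(&str, &str)],
) -> Option<Vec<String>> {
    if opts.schema_evolution {
        Some(inferrer.infer(record))
    } else {
        None
    }
}
```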

Commits on Jul 8, 2024

  1. 1983bb0
  2. 144513a
  3. 706163c