-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reported and actual arrow schema of the table can be different #813
Comments
I see two ways of resolving this
|
So we have the table schema, and we have the file schema, and we must assume that they are different. This can be in the case that you mentioned above, where there is a different arrow type, but more obvious is in the case of schema evolution. Iceberg is lazy, and will not rewrite the historical data until it needs to update the Parquet file. For the data itself, we should project it into the schema of the table. For the filtering situation, we want to cast the type to the physical type. This ties also in with the second part of #777 (comment) |
I was also able to replicate this issue here: iceberg-rust/crates/integration_tests/tests/read_evolved_schema.rs Lines 72 to 74 in 74a85e7
|
Ok, I've also taken a liberty of opening a PR with a pyiceberg -> iceberg-rust -> datafusion integration test #825. Besides showing that the basics work, it also reproduces this issue. |
Is something like this close to what you had in mind /// Convert Iceberg Datum to Arrow Datum.
-pub(crate) fn get_arrow_datum(datum: &Datum) -> Result<Box<dyn ArrowDatum + Send>> {
+pub(crate) fn get_arrow_datum(
+ datum: &Datum,
+ arrow_type: &DataType,
+) -> Result<Box<dyn ArrowDatum + Send>> {
match (datum.data_type(), datum.literal()) {
(PrimitiveType::Boolean, PrimitiveLiteral::Boolean(value)) => {
Ok(Box::new(BooleanArray::new_scalar(*value)))
}
- (PrimitiveType::Int, PrimitiveLiteral::Int(value)) => {
- Ok(Box::new(Int32Array::new_scalar(*value)))
- }
+ (PrimitiveType::Int, PrimitiveLiteral::Int(value)) => match arrow_type {
+ DataType::Int8 => Ok(Box::new(Int8Array::new_scalar(*value as i8))),
+ DataType::Int16 => Ok(Box::new(Int16Array::new_scalar(*value as i16))),
+ DataType::Int32 => Ok(Box::new(Int32Array::new_scalar(*value))),
+ _ => Err(Error::new(
+ ErrorKind::DataInvalid,
+ format!("Can't convert {datum} to type {arrow_type}"),
+ )),
+ },
(PrimitiveType::Long, PrimitiveLiteral::Long(value)) => {
Ok(Box::new(Int64Array::new_scalar(*value)))
and then for if let Some(idx) = self.bound_reference(reference)? {
- let literal = get_arrow_datum(literal)?;
-
Ok(Box::new(move |batch| {
let left = project_column(&batch, idx)?;
+ let literal = get_arrow_datum(literal, left.data_type())?;
lt_eq(&left, literal.as_ref())
})) I found that this also resolves the reported problem. Though arguably less general than just casting the batches at the arrow/parquet level, it is a less invasive fix. |
Even more liberally, one could do without changing if let Some(idx) = self.bound_reference(reference)? {
- let literal = get_arrow_datum(literal)?;
+ let mut literal = get_arrow_datum(literal)?;
Ok(Box::new(move |batch| {
let left = project_column(&batch, idx)?;
+ if left.data_type() != literal.get().0.data_type() {
+ literal = Box::new(Scalar::new(cast(literal.get().0, left.data_type())?));
+ }
lt_eq(&left, literal.as_ref())
})) |
This is related to #783.
Namely what happens is
pyiceberg
to create an Iceberg table from a Parquet file.DataType::Int16
(required int32 c1 (INTEGER(16,true)) = 1;
).Int8
andInt16
toPrimitiveType::Int
#783 we now upcast that to the native 32-bit Int type and can read it.TableProvider::schema
.DataType::Int16
, leading to reported and actual schema mismatch.SELECT c1 FROM t where c1 <= 2
crashing withInvalid comparison operation: Int16 <= Int32
2
literal) intoDataType::Int32
(from the reported schema) the comparison will be fine.The text was updated successfully, but these errors were encountered: