Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an ExtensionType to DataType enum #4472

Open
wjones127 opened this issue Jun 30, 2023 · 17 comments · May be fixed by #5822
Open

Add an ExtensionType to DataType enum #4472

wjones127 opened this issue Jun 30, 2023 · 17 comments · May be fixed by #5822
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@wjones127
Copy link
Member

wjones127 commented Jun 30, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Extension types are annotated in field metadata. This works well with record batches, but when exporting/importing an array over the C data interface, the extension type metadata is lost.

The C++ implementation solves this by having an ExtensionType class and always exports that metadata over C data interface here:

https://github.com/apache/arrow/blob/b9aec9ad2b655817b8925462e4e2dd6973807e23/cpp/src/arrow/c/bridge.cc#L243-L252

Describe the solution you'd like

I'd propose adding a new enum variant to DataType:

struct ExtensionType {
   name: String,
   metadata: String,
   storage_type: Box<DataType>,
}

enum DataType {
    ...
    ExtensionType(ExtensionType)
}

Then make sure the C data interface implementation handles exporting and importing this type.

Describe alternatives you've considered

We could add an extension type registry like C++, but that seems heavier than we really need.

Additional context

https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays

Previous discussions:

@wjones127 wjones127 added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels Jun 30, 2023
@tustvold
Copy link
Contributor

tustvold commented Jun 30, 2023

but when exporting/importing an array over the C data interface, the extension type metadata is lost

I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?

I'd propose adding a new enum variant to DataType:

My major objection to this approach is that it undermines the transparency of extension types. I feel quite strongly that only codepaths explicitly concerned with extension types should need concern themselves with them, for example the take or arithmetic kernel should not need to know about extension types. However, adding a DataType::Extension would instead force kernels to "see through" DataType::Extension when downcasting or performing operations such as extracting decimal precision. I would much prefer an approach that does not require this, by instead propagating extension metadata out-of-band.

@wjones127
Copy link
Member Author

I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?

This works well for RecordBatch, but not for an individual array transported independent from any batch. Basically, arrays themselves have no way to be tagged as an extension array, since those don't contain a field where that metadata is stored; they are only extension arrays in the context of a batch.

I feel quite strongly that only codepaths explicitly concerned with extension types should need concern themselves with them, for example the take or arithmetic kernel should not need to know about extension types.

I definitely agree, and don't want to make these operations more complex than they ought to be.

If we can think of another place to put this information, I'm open to that.

(A bit of a tangent, but...) In my ideal world, there would be a logical type enum and a physical type enum. Physical types would be the current DataType. Then logical types would be things like String (just one, regardless of offset size and encoding) and then a generic ExtensionType variant. Sort of like what Sasha was talking about a long time ago: https://lists.apache.org/thread/357z4587dczho4x1257ttf0b4o9302co

@wjones127
Copy link
Member Author

FWIW my workaround for now is to just wrap it in a record batch and unwrap on the other side. But it would be nice to find a way to not have to do that.

https://github.com/lancedb/lance/blob/3e1ed67acf7d1de336b6d211647d9581c64c3bed/python/src/arrow.rs#L53-L67

@tustvold
Copy link
Contributor

tustvold commented Jul 1, 2023

If we can think of another place to put this information, I'm open to that.

I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level. Whilst it isn't a true logical vs physical separation, I think that is a helpful way to conceptualize it.

But it would be nice to find a way to not have to do that.

This feels like a limitation of the way the python conversion is implemented

impl ToPyArrow for ArrayData {
    fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
        let array = FFI_ArrowArray::new(self);
        let schema = FFI_ArrowSchema::try_from(self.data_type()).map_err(to_py_err)?;

        let module = py.import("pyarrow")?;
        let class = module.getattr("Array")?;
        let array = class.call_method1(
            "_import_from_c",
            (
                addr_of!(array) as Py_uintptr_t,
                addr_of!(schema) as Py_uintptr_t,
            ),
        )?;
        Ok(array.to_object(py))
    }
}

In particular the schema is inferred from the array's data type. If instead there were a way to provide a Field then this would allow propagating not only extension metadata, but also nullability, dictionary ordering, etc... It would also potentially allow performing the schema conversion once and using the result for multiple arrays.

@wjones127
Copy link
Member Author

I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level.

I think that could be a decent approach, although I'm still trying to understand what that would look like. It sounds like the arrow-rs type system is closed, but can be wrapped in a higher-level type system. (whereas, the C++ kernels are extension-aware).

So it sounds like the point to add extension types is when building extensions in datafusion. Eventually, I think it would be nice to have an example in arrow-datafusion showing how to add support for a simple extension type (such as UUID) in the engine. Basically the end result would be showing that

SELECT gen_random_uuid()

outputs

+--------------------------------------+
| gen_random_uuid()                    |
+--------------------------------------+
| eeccb8c5-9943-b2bb-bb5e-222f4e14b687 |
+--------------------------------------+

Showing that you can add methods that output extension types (gen_random_uuid()) and that you can control how those extension types are displayed.

@yukkit
Copy link
Contributor

yukkit commented Aug 30, 2023

I encountered the same trouble when trying to add a custom type. Although the extension type can be marked through the metadata of FIeld, the metadata will be lost in the scene of array processing, for example: udf in datafusion

@kylebarron
Copy link
Contributor

kylebarron commented Oct 5, 2023

I'm taking a stab at migrating my geoarrow-rs crate (which implements the GeoArrow extension array spec) from arrow2 to arrow-rs, and wanted to add that I'm feeling the pain of omission of a DataType::Extension variant in arrow-rs.

In particular, a geospatial algorithm would have to return a Field with every operation, because the physical layout of a LineStringArray is exactly the same as that of a MultiPointArray (and PolygonArray/MultiLineStringArray). Maybe this is nitpicking, but it I've liked the level of abstraction of having the extension metadata on the DataType, because the operations on the array are separate from a named column in a table.

Edit: If I'm understanding correctly, it's also impossible to implement

impl TryFrom<&dyn Array> for GeometryArray

like I could in arrow2, because dyn Array never has any extension type information, so I wouldn't be able to know what type of geometries the array is holding...

@alamb
Copy link
Contributor

alamb commented Oct 25, 2023

@yukkit is contemplating User Defined Types in DataFusion, and the arrow extension type mechanism is the obvious implementation I think -- see apache/datafusion#7923

I personally think Field is the correct location for such metadata,

@tustvold are you proposing something like the following?

enum DataType { 
....
   List(FieldRef),
   /// Extension type, with potentially embedded metadata in the field reference
   Extension(FieldRef)
}

This proposal runs afoul of how DataType::List works today (where the field name is mostly irrelevant ("item")), but I don't really have any better ideas.

I think this structure would allow @kylebarron to implement

impl TryFrom<&dyn Array> for GeometryArray {
  fn try_from(arr: &dyn Array>) -> Result<Self> {
    match arr.data_type() {
      DataType::Extension(field) if is_geo_type(field.metadata()) => {
        .... do the conversion ...
      }
      dt => Err("Unsupported datatype: {dt}
   }
}

@tustvold
Copy link
Contributor

tustvold commented Oct 25, 2023

are you proposing something like the following?

No I'm proposing not making changes to DataType and using the Field metadata that already exists. This way we avoid conflating physical and logical type information. This is the same mechanism we use in IOx to encode the logical notions of tag vs field columns

@alamb
Copy link
Contributor

alamb commented Oct 25, 2023

No I'm proposing not making changes to DataType and using the Field metadata that already exists. This way we avoid conflating physical and logical type information.

So how would we implement @kylebarron 's use case? Perhaps via a RecordBatch (with a single column)?

@kylebarron
Copy link
Contributor

My interpretation of this is that it's a "zero-sum" architecture decision, in the sense that if you don't want to conflate logical and physical types in the DataType enum, then there's intentionally no way to implement From on &dyn Array; instead it's only possible to implement it on (&dyn Array, &FieldRef)

@tustvold
Copy link
Contributor

tustvold commented Oct 25, 2023

You would need whatever performs the kernel selection to have the Field, most likely via Schema, I'm not sure you necessarily need this information simultaneously with the Array?

For example, the DF PhysicalExpr could have already extracted the necessary metadata at plan time (although I think it has access to the schema anyway).

@wjones127
Copy link
Member Author

@tustvold do you think RecordBatch is something that end-users should be seeing? Or do you imagine this should be a hidden implementation detail in all cases?

If it's the latter, I think I can understand the position to keep extension type separate. But if it's the former, it's hard to see how we can provide a decent UX without bringing the extension type into the array itself. For example, if we return a RecordBatches with a UUID, a user might reasonable surprised that the UUID column prints as the raw bytes and not a hyphenated UUID string.

@tustvold
Copy link
Contributor

RecordBatch/Schema is but one way that users might choose to expose logical type information, they might also define their own array abstractions that wrap arrow arrays, or their own schema abstraction at plan time, etc...

As @kylebarron rightly states it's a zero-sum API challenge, either all of arrow must become aware of extension types, or it is confined to the areas that actually care. It seems odd to me to optimize the design here for things that are not present in the specification at the expense of everything else, further it seems unfortunate to optimise for one particular way of encoding logical type information.

I don't have all the answers here, I don't know what a general purpose logical type abstraction looks like, if such a thing even exists, but it does seem that the core library shouldn't be opinionated in this regard

@kylebarron
Copy link
Contributor

This feels like a limitation of the way the python conversion is implemented

For Python conversion specifically, this might be solved with the new PyCapsule interface (ref #5067) because the ArrowSchema FFI struct is generated by pyarrow itself, and so it doesn't have to be inferred from array.type. (I haven't verified how it works with extension arrays yet)

@alamb
Copy link
Contributor

alamb commented Nov 14, 2023

FYI @yukkit has created a PR showing how LogicalTypes might work in DataFusion: apache/datafusion#8143. It is a pretty neat idea.

jleibs added a commit to rerun-io/rerun that referenced this issue Sep 9, 2024
### What
Basic type conversions from TransportChunk to RecordBatch and back.

Adding the round-trip test turned up an interesting issue.

TransportChunk <-> RecordBatch fails to round-trip successfully because
we lose the ExtensionType encapsulation that used to be encoded by
arrow2.

While on the surface this isn't immediately problematic, as we don't
care about ExtensionTypes, the discussion indicates there are in fact
going to be very real pain points when it comes to writing semantic data
processing engines using arrow-rs. This is because the metadata is
attached to the FIELD, not the DATATYPE, and there exist many processing
contexts where the context of that field itself is lost.
apache/arrow-rs#4472

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
* Using examples from latest `main` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7355?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
* Using full set of examples from `nightly` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7355?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release
checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] If have noted any breaking changes to the log API in
`CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7355)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot
full-check`.
@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

@mbrobbel has a PR that I think implements the consensus on this ticket (and is a pleasure to read)

I also renamed this ticket to reflect adding support for extension types (rather than the specific idea of ExtensionType to DataType)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants