
Spec: add variant type #10831

Open · wants to merge 6 commits into `main` from `variant-type-spec`
Conversation

aihuaxu
Contributor

@aihuaxu aihuaxu commented Jul 31, 2024

Help: #10392

Spec: add variant type

Proposal: https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit

This is to lay out the spec for the variant type. The specs are placed in the Parquet project (see the variant spec and the shredding spec).

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jul 31, 2024
@aihuaxu
Contributor Author

aihuaxu commented Jul 31, 2024

cc @rdblue, @RussellSpitzer and @flyrain

@RussellSpitzer
Member

I do want to make sure we don't do a hostile fork of the spec from Spark here, so let's make sure we get support from them to move the spec here before we merge. At the same time, we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than on GitHub, IMHO.

@sfc-gh-aixu

I do want to make sure we don't do a hostile fork of the spec from Spark here, so let's make sure we get support from them to move the spec here before we merge. At the same time, we should start going through the wording and continue to discuss the specs. I still think that would be easier to do in a public Google Doc than on GitHub, IMHO.

Definitely. It's not ready for merge yet. I'm mostly trying to get the comments in place. Makes sense to move that to a Google Doc and link it here.

@aihuaxu aihuaxu marked this pull request as ready for review October 9, 2024 20:59
@RussellSpitzer
Member

This needs some notes in Partition Transforms; I think we should explicitly disallow identity.

For Appendix B, we should define something or state explicitly that we don't define it for variant.

Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.

Under Sort Orders we should probably note that you cannot sort on a variant?

Appendix D (Single Value Serialization) needs an entry; we can probably write "Not Supported" for now. JSON needs an entry too.


@aihuaxu
Contributor Author

aihuaxu commented Oct 18, 2024

This needs some notes in Partition Transforms; I think we should explicitly disallow identity.

For Appendix B, we should define something or state explicitly that we don't define it for variant.

Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.

Under Sort Orders we should probably note that you cannot sort on a variant?

Appendix D (Single Value Serialization) needs an entry; we can probably write "Not Supported" for now. JSON needs an entry too.

Thanks @RussellSpitzer, I missed those sections and just updated.

I marked Partition Transforms, sorting, and hashing as not supported/allowed for now.
For Appendix C, I think it should be just `variant`, similar to a primitive type, since it's the Iceberg schema, as I understand the section.
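For context, a sketch of what this suggestion would mean for Appendix C: Iceberg's JSON schema serialization writes primitive types as plain type-name strings, so a variant field would presumably serialize like the following (the field name and IDs here are illustrative, not from the PR):

```json
{
  "id": 1,
  "name": "event_payload",
  "required": false,
  "type": "variant"
}
```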

format/spec.md
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

Note:

1. The ability to sort `variant` columns and the specific sort order is determined by the engines.
Contributor


Do we need this? I think anything we don't specify is up to engines already.

Contributor Author


OK, I will remove that then. Do we need to call out that "Variant values cannot be present in an Iceberg sort order"?

Member


I think we should specifically forbid sort orders containing a variant. I think the spec is actually underdetermined here.

We have the following checks in the reference implementation:

```java
ValidationException.check(
    sourceType != null, "Cannot find source column for sort field: %s", field);
ValidationException.check(
    sourceType.isPrimitiveType(),
    "Cannot sort by non-primitive source field: %s",
    sourceType);
ValidationException.check(
    field.transform().canTransform(sourceType),
    "Invalid source type %s for transform: %s",
    sourceType,
    field.transform());
```

So currently, even though we don't specify this here, you cannot create a sort order with an array or map. I think we should explicitly call this out and add variant as well. My real concern is that we add the ability to sort on something but don't define what that sorting actually looks like.
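A self-contained sketch of the rule being discussed, extended so that variant is rejected alongside the other non-primitive types. The `Type` enum and exception type below are hypothetical stand-ins, not Iceberg's actual API:

```java
// Hypothetical sketch mirroring the reference-implementation checks, with
// variant treated like struct/list/map: not a valid sort-order source.
public class SortFieldCheck {
    enum Type {
        INT, STRING, STRUCT, LIST, MAP, VARIANT;

        boolean isPrimitive() {
            // Only scalar types qualify; variant is deliberately excluded.
            return this == INT || this == STRING;
        }
    }

    static void checkSortSource(Type sourceType, String fieldName) {
        if (sourceType == null) {
            throw new IllegalArgumentException(
                "Cannot find source column for sort field: " + fieldName);
        }
        if (!sourceType.isPrimitive()) { // rejects struct, list, map, and variant
            throw new IllegalArgumentException(
                "Cannot sort by non-primitive source field: " + sourceType);
        }
    }

    public static void main(String[] args) {
        checkSortSource(Type.INT, "id"); // a primitive source is accepted
        boolean rejected = false;
        try {
            checkSortSource(Type.VARIANT, "payload");
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println("variant rejected: " + rejected);
    }
}
```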

@rdblue
Contributor

rdblue commented Oct 24, 2024

Oops. I didn't mean to close this.

@aihuaxu
Contributor Author

aihuaxu commented Oct 25, 2024

@aihuaxu, I think there are a couple of things missing:

  • The Avro appendix should be updated to state that a Variant is stored as a Record with two fields, a required binary metadata and a required binary value. Shredding is not supported in Avro.
  • The ORC appendix should be updated to state that a Variant is stored as a struct with two fields, a required binary metadata and a required binary value. Type attribute should be iceberg.struct-type=variant. Shredding is not supported in ORC.

Thanks @rdblue. I thought we would make these changes when we start to work on Avro/ORC, but I have added that.

I don't have much context for the JSON conversion. Not sure if we need to add more info.
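The Avro representation described in the quoted comment could look roughly like the following schema sketch. Only the `metadata` and `value` field names come from the comment; the record name is hypothetical, and no field IDs are assigned to the two fields:

```json
{
  "type": "record",
  "name": "variant_value",
  "fields": [
    {"name": "metadata", "type": "bytes"},
    {"name": "value", "type": "bytes"}
  ]
}
```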

@aihuaxu aihuaxu force-pushed the variant-type-spec branch 2 times, most recently from d953b6e to 67df611 Compare October 29, 2024 06:24
@@ -1436,6 +1457,7 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | Not supported |
| **`list`** | Not supported |
| **`map`** | Not supported |
| **`variant`** | Not supported |
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024


I do agree this should be not supported for now. Then, when shredding is included, we could say something like: for shredded variants only, a binary concatenation of metadata and value plus a separator byte, or something. We can figure that out with the shredding addition, though.

Contributor


If we don't include Variant here, then we don't need to include it in the JSON section either.

Contributor Author


It seems the binary representation is used for lower and upper bounds, and the JSON single-value serialization is used for default values. Looks like they are defined independently? But since there is no default value for variant, I will remove it from the JSON section.

@aihuaxu aihuaxu force-pushed the variant-type-spec branch 2 times, most recently from c8f9e7e to 0550fcf Compare November 5, 2024 17:58
@@ -178,6 +178,21 @@ A **`list`** is a collection of values with some element type. The element field

A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

#### Semi-structured Types

A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.
Contributor


This link should be to main/master rather than a specific SHA, right?

Member


This is what I was worried about, since aren't we syncing with a specific iteration of the file?

Contributor Author


Basically, we link to the specific version to be implemented in Iceberg. Then later, e.g. when we add additional data types, we should also update the link here.
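To illustrate the definition under discussion, the shape of a variant value may differ from row to row within the same column. Below, `payload` is a hypothetical variant column holding an object, an array, and a string across three rows (example data, not from the PR):

```json
[
  {"id": 1, "payload": {"temperature": 21.5}},
  {"id": 2, "payload": [1, 2, 3]},
  {"id": 3, "payload": "free-form text"}
]
```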

format/spec.md
@@ -444,7 +459,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ

| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`identity`** | Source value, unmodified | Any other than `variant` | Source type |
Contributor


What about "except for"? I think that is clearer than "other than".

Contributor


Should we also fix the bucket function definition by using an except for list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Let me fix that in a follow-up PR.

Contributor

@rdblue rdblue left a comment


Looks good to me other than a few minor comments.

@@ -1154,6 +1169,7 @@ Maps with non-string keys must use an array representation with the `map` logica
|**`struct`**|`record`||
|**`list`**|`array`||
|**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.|
|**`variant`**|`record` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs. |Shredding is not supported in Avro.|
Contributor


We should probably be consistent here and note that field IDs should not be assigned to these fields across all formats?

Contributor Author


I added the same note for the Parquet and ORC formats. I see ORC uses ICEBERG_ID_ATTRIBUTE to track the field ID, but the concept seems similar.

Labels: Specification
7 participants