The ContentHash implementation for std::Option is inconsistent with the documentation #3051

emesterhazy · 2024-02-14T22:44:00Z

The documentation for the ContentHash trait states that:

Enums should hash a 32-bit little-endian encoding of the ordinal number of the enum variant, then the variant's fields in lexical order.

This does not match the implementation for std::Option, which actually hashes the enum variant's ordinal number as u8, presumably with architecture specific endianness.

https://github.com/martinvonz/jj/blob/2e64bf83fd9b8abf4c9880482ea4ce19492f3139/lib/src/content_hash.rs#L76-L86

In state.update(&[0]), the &[0] is inferred as a &[u8] due to the function signature of digest::Update::update.

I'm not sure what the implication is of changing the implementation after the fact. Presumably things will break. I think therefore we should update the documentation to say that the ordinal number should be hashed as a LE u8, and we should write the implementation as state.update(&0u8.to_le_bytes()) to ensure portability.
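To make the mismatch concrete, here is a minimal sketch of the two encodings. `ByteSink` is a hypothetical stand-in for the hasher state: it only records the bytes it is fed and is not part of jj or the `digest` crate. Its `update` method merely mirrors the `&[u8]` parameter mentioned above.

```rust
/// Hypothetical stand-in for a hasher: records every byte it is fed.
#[derive(Default)]
struct ByteSink(Vec<u8>);

impl ByteSink {
    // Takes `&[u8]`, so a literal like `&[0]` is inferred as a one-byte slice.
    fn update(&mut self, data: &[u8]) {
        self.0.extend_from_slice(data);
    }
}

/// Ordinal 0 fed as a single byte, as the thread describes the current impl.
fn ordinal_as_u8() -> Vec<u8> {
    let mut state = ByteSink::default();
    state.update(&[0]);
    state.0
}

/// Ordinal 0 fed as a 32-bit little-endian integer, as the documentation asks.
fn ordinal_as_u32_le() -> Vec<u8> {
    let mut state = ByteSink::default();
    state.update(&0u32.to_le_bytes());
    state.0
}
```

The first function feeds one byte and the second feeds four, so a real hasher would see different input and produce different digests for the same variant.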

emesterhazy · 2024-02-14T22:47:44Z

I should add that if we make this change, enums that implement ContentHash will be limited to 256 variants. That is probably fine and better than breaking backwards compatibility.

martinvonz · 2024-02-15T01:28:06Z

FYI, @Ralith (who wrote the current version of the macro)

emesterhazy · 2024-02-15T02:01:49Z

I'm not sure this is an issue with the current macro actually since I don't think it supports enum types. It's only an issue for the hand-coded implementation for std::Option. That gives me an idea actually. We could implement the proc macro per the "spec" and just allow the divergence for std::Option. I don't think there should be any ill effects from allowing this divergence unless we need a guarantee that different enum types with the same layouts and contents will have equivalent hashes.
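One way to picture that guarantee: if every enum follows the documented rule, two unrelated enum types whose variants share an ordinal and payload feed identical bytes to the hasher, while an `Option` with a one-byte ordinal does not. The sketch below uses free functions and a plain `Vec<u8>` in place of a real hasher. `Maybe`, `feed_maybe`, `feed_option_spec` and `feed_option_u8` are hypothetical names for illustration only, and the little-endian encoding of the `u32` payload is an assumption.

```rust
/// Hypothetical enum with the same shape as `Option<u32>`.
enum Maybe {
    Absent,
    Present(u32),
}

/// Spec-style encoding: u32 little-endian ordinal, then the variant's field.
fn feed_maybe(value: &Maybe, out: &mut Vec<u8>) {
    match value {
        Maybe::Absent => out.extend_from_slice(&0u32.to_le_bytes()),
        Maybe::Present(x) => {
            out.extend_from_slice(&1u32.to_le_bytes());
            out.extend_from_slice(&x.to_le_bytes());
        }
    }
}

/// `Option` encoded per the documented rule (u32 ordinal).
fn feed_option_spec(value: &Option<u32>, out: &mut Vec<u8>) {
    match value {
        None => out.extend_from_slice(&0u32.to_le_bytes()),
        Some(x) => {
            out.extend_from_slice(&1u32.to_le_bytes());
            out.extend_from_slice(&x.to_le_bytes());
        }
    }
}

/// `Option` encoded with a one-byte ordinal (the divergent form).
fn feed_option_u8(value: &Option<u32>, out: &mut Vec<u8>) {
    match value {
        None => out.push(0),
        Some(x) => {
            out.push(1);
            out.extend_from_slice(&x.to_le_bytes());
        }
    }
}
```

With these definitions, `Maybe::Present(7)` and `Some(7)` produce the same byte stream under the spec-style encoding, and a different one under the one-byte form.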

yuja · 2024-02-15T03:36:48Z

> I'm not sure what the implication is of changing the implementation after the fact. Presumably things will break.

iirc, changing the ContentHash value won't cause problems. New serialized view files may be created with the same content, but that's okay. We don't use the hash value to verify the file content.

martinvonz · 2024-02-15T04:58:14Z

> I'm not sure this is an issue with the current macro actually since I don't think it supports enum types.

Oh, I meant to CC @Ralith on #3054 as the overall tracking issue, in case he would find any of the linked PRs interesting.

emesterhazy · 2024-02-15T16:04:11Z

RE: #3051 (comment), commit 2447dfe suggests that you're right and changing the way a type hashes won't break anything. Perhaps we should document above the trait that the hash value should be computed once and persisted and not used for integrity checking or re-verification.

Assuming this is correct, then I'd suggest we just change the implementation for std::Option to match the documentation and use a u32 of the ordinals instead of u8.

This is probably a question for @Ralith, but I'm wondering why we can't simply use the built in Hash trait and need ContentHash instead. Is it just for more control over the hash algorithm?

emesterhazy · 2024-02-15T22:58:18Z

This also affects the ContentHash impl for MergedTreeId, which is hand-coded. We can replace this with the new proc macro as well.

https://github.com/martinvonz/jj/blob/a9278f50b133bdd9fc2855456ee9cbe60f50352c/lib/src/backend.rs#L113-L126

emesterhazy · 2024-02-15T23:09:40Z

It looks like the impl for TreeValue actually does use a u32:

https://github.com/martinvonz/jj/blob/a9278f50b133bdd9fc2855456ee9cbe60f50352c/lib/src/backend.rs#L249-L276

Ralith · 2024-02-16T00:11:13Z

> why we can't simply use the built in Hash trait and need ContentHash instead. Is it just for more control over the hash algorithm?

The main thing is that std::hash::Hash doesn't guarantee stability or portability:

> Serialization formats intended to be portable between platforms or compiler versions should either avoid encoding hashes or only rely on Hash and Hasher implementations that provide additional guarantees.

If we can't use off-the-shelf Hash implementations, then a custom trait is easier, because we have to implement a trait regardless, but at least there's no newtypes.
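The "no newtypes" point can be sketched as follows. Because the trait is defined locally, it can be implemented directly on foreign types such as `u32` and `Vec<T>`, which would not be possible when overriding `std::hash::Hash` for those types. `StableHash` is a hypothetical name, the `Vec<u8>` sink stands in for a real hasher, and the `u64` length prefix is an assumption of this sketch rather than anything stated in the thread.

```rust
/// Hypothetical local trait with a pinned-down byte encoding.
trait StableHash {
    fn stable_hash(&self, out: &mut Vec<u8>);
}

// The trait is local, so implementing it on a foreign type is allowed
// without wrapping that type in a newtype.
impl StableHash for u32 {
    fn stable_hash(&self, out: &mut Vec<u8>) {
        // Fixed little-endian encoding, independent of the host platform.
        out.extend_from_slice(&self.to_le_bytes());
    }
}

impl<T: StableHash> StableHash for Vec<T> {
    fn stable_hash(&self, out: &mut Vec<u8>) {
        // Assumed convention: u64 little-endian length, then each element.
        out.extend_from_slice(&(self.len() as u64).to_le_bytes());
        for item in self {
            item.stable_hash(out);
        }
    }
}
```

Every byte that reaches the sink is chosen explicitly here, which is the stability property that `std::hash::Hash` does not promise.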

Ralith · 2024-02-16T00:12:34Z

Also, +1 to normalizing the assorted enum impls.

emesterhazy · 2024-02-16T02:05:13Z

I'll mail a change to fix the u8 / u32 issue first, and then replace the custom impls with the macro afterwards.

How important is the stability and portability currently? Portability seems broken since we're using the hardware endianness (although most systems are little-endian). It sounds like consistency isn't that important currently either since we can apparently change the hash impl for enums without breaking anything.

Edit: I'm not suggesting we use std::hash::Hash instead, I'm just curious for my own understanding what the current requirements are and how they might change.

yuja · 2024-02-16T02:23:33Z

> How important is the stability and portability currently?

At minimum, the hash generated by using the same jj codebase should be stable across platforms. Otherwise the tests would break. In addition to that, it's probably better to not change the output unless necessary (to save disk space).

> Portability seems broken since we're using the hardware endianness

Don't we use to_le_bytes()?

The `ContentHash` documentation specifies that implementations for enums should hash the ordinal number of the variant contained in the enum as a 32-bit little-endian number and then hash the contents of the variant, if any. The current implementations for `std::Option` and `MergedTreeId` are non-conformant since they hash the ordinal number as a u8 with platform specific endianness. Fixes #3051

emesterhazy · 2024-02-16T03:43:18Z

> Don't we use to_le_bytes()?

As far as I can tell this is only used for the impl for TreeValue. It's not used for std::Option or MergedTreeId, which are the only other enums that implement ContentHash as far as I can tell.

I mailed out #3061 to fix this in the current implementations, which will be superseded by the proc macro once it's submitted.

@yuja I think your explanation makes sense. If we didn't have consistency and portability the snapshot tests would break on other platforms even if jj would continue to work fine for actual repos.

Ralith · 2024-02-16T10:14:55Z

> Portability seems broken since we're using the hardware endianness

All multi-byte integers' implementations should be calling to_le_bytes.

> As far as I can tell this is only used for the impl for TreeValue. It's not used for std::Option or MergedTreeId, which are the only other enums that implement ContentHash as far as I can tell.

These are currently (erroneously) using u8 discriminants, whose representation is not sensitive to endianness.

emesterhazy · 2024-02-16T12:47:48Z

> These are currently (erroneously) using u8 discriminants, whose representation is not sensitive to endianness.

Right, there's only a single byte, so the platform endianness doesn't matter.
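A quick way to check this with the standard library: a `u8` has the same byte representation in either byte order, while a multi-byte integer such as `u32` does not, which is why `to_le_bytes` matters only for the latter. The helper names below are hypothetical.

```rust
/// True if the little- and big-endian encodings of `n` coincide.
/// For a single byte they always do.
fn u8_is_endian_neutral(n: u8) -> bool {
    n.to_le_bytes() == n.to_be_bytes()
}

/// Returns the (little-endian, big-endian) encodings of `n`.
fn u32_encodings(n: u32) -> ([u8; 4], [u8; 4]) {
    (n.to_le_bytes(), n.to_be_bytes())
}
```

For `n = 1`, the `u32` encodings are `[1, 0, 0, 0]` and `[0, 0, 0, 1]`, so hashing the native-endian bytes of a `u32` would not be portable, whereas a `u8` ordinal is unaffected.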

…d RemoteRefState

The `ContentHash` documentation specifies that implementations for enums should hash the ordinal number of the variant contained in the enum as a 32-bit little-endian number and then hash the contents of the variant, if any. The current implementations for `std::Option`, `MergedTreeId`, and `RemoteRefState` are non-conformant since they hash the ordinal number as a u8 with platform specific endianness. Fixes #3051

emesterhazy added a commit that referenced this issue Feb 14, 2024

WIP: Derive ContentHash for Enums

dd64ee0

ISSUE=#3051

emesterhazy added a commit that referenced this issue Feb 14, 2024

Add support for deriving ContentHash for Enums

6ac1940

This commit also updates the documentation regarding how ContentHash should be implemented for enums to address the concerns raised in #3051.

This was referenced Feb 14, 2024

Add support for deriving ContentHash for Enums #3052

Merged

FR: Procedural macro to derive ContentHash for structs and enums #3054

Closed

emesterhazy changed the title ~~The ContentHash implementation for std::Optional<T> is inconsistent with the documentation~~ The ContentHash implementation for std::Option is inconsistent with the documentation Feb 15, 2024

emesterhazy mentioned this issue Feb 16, 2024

Fix the ContentHash implementations for std::Option and MergedTreeId and RemoteRefState #3061

Merged

emesterhazy closed this as completed in #3061 Feb 16, 2024
