Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The ContentHash implementation for std::Option is inconsistent with the documentation #3051

Closed
emesterhazy opened this issue Feb 14, 2024 · 15 comments · Fixed by #3061
Closed

Comments

@emesterhazy
Copy link
Contributor

The documentation for the ContentHash trait states that:

Enums should hash a 32-bit little-endian encoding of the ordinal number of the enum variant, then the variant's fields in lexical order.

This does not match the implementation for std::Option, which actually hashes the enum variant's ordinal number as u8, presumably with architecture specific endianness.

https://github.com/martinvonz/jj/blob/2e64bf83fd9b8abf4c9880482ea4ce19492f3139/lib/src/content_hash.rs#L76-L86

In state.update(&[0]), the &[0] is inferred as a &[u8] due to the function signature of digest::Update.update).

image

I'm not sure what the implication is of changing the implementation after the fact. Presumably things will break. I think therefore we should update the documentation to say that the ordinal number should be hashed as a LE u8, and we should write the implementation as state.update(&0.as_le_bytes()) to ensure portability.

@emesterhazy
Copy link
Contributor Author

I should add, if we make this change enums that implement ContentHash will be limited to 256 variants. That is probably fine and better than breaking backwards compatibility.

emesterhazy added a commit that referenced this issue Feb 14, 2024
emesterhazy added a commit that referenced this issue Feb 14, 2024
This commit also updates the documentation regarding how ContentHash should be
implemented for enums to address the concerns raised in #3051.
@martinvonz
Copy link
Member

FYI, @Ralith (who wrote the current version of the macro)

@emesterhazy
Copy link
Contributor Author

I'm not sure this is an issue with the current macro actually since I don't think it supports enum types. It's only an issue for the hand coded implementation for std::Optional. That gives me an idea actually. We could implement the proc macro per the "spec" and just allow the divergence for std::Optional. I don't think there should be any ill effects from allowing this divergence unless we need a guarantee that different enum types with the same layouts and contents will have equivalent hashes.

@yuja
Copy link
Contributor

yuja commented Feb 15, 2024

I'm not sure what the implication is of changing the implementation after the fact. Presumably things will break.

iirc, changing the ContentHash value won't cause problems. New serialized view files may be created with the same content, but that's okay. We don't use the hash value to verify the file content.

@martinvonz
Copy link
Member

I'm not sure this is an issue with the current macro actually since I don't think it supports enum types.

Oh, I meant to CC @Ralith on #3054 as the overall tracking issue, in case he would find any of the linked PRs interesting.

@emesterhazy emesterhazy changed the title The ContentHash implementation for std::Optional<T> is inconsistent with the documentation The ContentHash implementation for std::Option is inconsistent with the documentation Feb 15, 2024
@emesterhazy
Copy link
Contributor Author

RE: #3051 (comment), commit 2447dfe suggests that you're right and changing the way a type hashes won't break anything. Perhaps we should document above the trait that the hash value should be computed once and persisted and not used for integrity checking or re-verification.

Assuming this is correct, then I'd suggest we just change the implementation for std::Option to match the documentation and use a u32 of the ordinals instead of u8.

This is probably a question for @Ralith, but I'm wondering why we can't simply use the built in Hash trait and need ContentHash instead. Is it just for more control over the hash algorithm?

@emesterhazy
Copy link
Contributor Author

This also affects the ContentHash impl for MergedTreeId, which is hand coded. We can replace this with the new proc macro as well.

https://github.com/martinvonz/jj/blob/a9278f50b133bdd9fc2855456ee9cbe60f50352c/lib/src/backend.rs#L113-L126

@emesterhazy
Copy link
Contributor Author

It looks like the impl for TreeValue actually does use a u32:

https://github.com/martinvonz/jj/blob/a9278f50b133bdd9fc2855456ee9cbe60f50352c/lib/src/backend.rs#L249-L276

@Ralith
Copy link
Contributor

Ralith commented Feb 16, 2024

why we can't simply use the built in Hash trait and need ContentHash instead. Is it just for more control over the hash algorithm?

The main thing is that std::hash::Hash doesn't guarantee stability or portability:

Serialization formats intended to be portable between platforms or compiler versions should either avoid encoding hashes or only rely on Hash and Hasher implementations that provide additional guarantees.

If we can't use off-the-shelf Hash implementations, then a custom trait is easier, because we have to implement a trait regardless, but at least there's no newtypes.

@Ralith
Copy link
Contributor

Ralith commented Feb 16, 2024

Also, +1 to normalizing the assorted enum impls.

@emesterhazy
Copy link
Contributor Author

emesterhazy commented Feb 16, 2024

I'll mail a change to fix the u8 / u32 issue first, and then replace the custom impls with the macro afterwards.

How important is the stability and portability currently? Portability seems broken since we're using the hardware endianness (although most systems are le). It sounds like consistency isn't that important currently either since we can apparently change the hash impl for enums without breaking anything.

Edit: I'm not suggesting we use std::hash::Hash instead, I'm just curious for my own understanding what the current requirements are and how they might change.

@yuja
Copy link
Contributor

yuja commented Feb 16, 2024

How important is the stability and portability currently?

At minimum, the hash generated by using the same jj codebase should be stable across platforms. Otherwise the test would break. In addition to that, it's probably better to not change the output unless necessary (to save disk space.)

Portability seems broken since we're using the hardware endianness

Don't we use to_le_bytes()?

emesterhazy added a commit that referenced this issue Feb 16, 2024
The `ContentHash` documentation specifies that implementations for enums should
hash the ordinal number of the variant contained in the enum as a 32-bit
little-endian number and then hash the contents of the variant, if any.

The current implementations for `std::Option` and `MergedTreeId` are
non-conformant since they hash the ordinal number as a u8 with platform
specific endianness.


Fixes #3051
emesterhazy added a commit that referenced this issue Feb 16, 2024
The `ContentHash` documentation specifies that implementations for enums should
hash the ordinal number of the variant contained in the enum as a 32-bit
little-endian number and then hash the contents of the variant, if any.

The current implementations for `std::Option` and `MergedTreeId` are
non-conformant since they hash the ordinal number as a u8 with platform
specific endianness.


Fixes #3051
@emesterhazy
Copy link
Contributor Author

emesterhazy commented Feb 16, 2024

Don't we use to_le_bytes()?

As far as I can tell this is only used for the impl for TreeValue. It's not used for std::Option or MergedTreeId, which are the only other enums that implement ContentHash as far as I can tell.

I mailed out #3061 to fix this in the current implementations, which will be superseded by the proc macro once it's submitted.

@yuja I think your explanation makes sense. If we didn't have consistency and portability the snapshot tests would break on other platforms even if jj would continue to work fine for actual repos.

emesterhazy added a commit that referenced this issue Feb 16, 2024
The `ContentHash` documentation specifies that implementations for enums should
hash the ordinal number of the variant contained in the enum as a 32-bit
little-endian number and then hash the contents of the variant, if any.

The current implementations for `std::Option` and `MergedTreeId` are
non-conformant since they hash the ordinal number as a u8 with platform
specific endianness.


Fixes #3051
@Ralith
Copy link
Contributor

Ralith commented Feb 16, 2024

Portability seems broken since we're using the hardware endianness

All multi-byte integers' implementations should be calling to_le_bytes.

As far as I can tell this is only used for the impl for TreeValue. It's not used for std::Option or MergedTreeId, which are the only other enums that implement ContentHash as far as I can tell.

These are currently (erroneously) using u8 discriminants, whose representation is not sensitive to endianness.

@emesterhazy
Copy link
Contributor Author

These are currently (erroneously) using u8 discriminants, whose representation is not sensitive to endianness.

Right.. there's only a single byte so the platform endianness doesn't matter.

emesterhazy added a commit that referenced this issue Feb 16, 2024
…d RemoteRefState

The `ContentHash` documentation specifies that implementations for enums should
hash the ordinal number of the variant contained in the enum as a 32-bit
little-endian number and then hash the contents of the variant, if any.

The current implementations for `std::Option`, `MergedTreeId`, and
`RemoteRefState` are non-conformant since they hash the ordinal number as a u8
with platform specific endianness.


Fixes #3051
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants