Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Rust built-in UTF-16 conversion #8

Open
sffc opened this issue Nov 16, 2019 · 14 comments
Open

Use Rust built-in UTF-16 conversion #8

sffc opened this issue Nov 16, 2019 · 14 comments
Labels
discuss before start The issues isn't quite ready to be fixed, there are open questions to be discussed enhancement New feature or request help wanted Extra attention is needed

Comments

@sffc
Copy link

sffc commented Nov 16, 2019

Split from #6

The code currently calls u_strToUTF8. It would be better to use Rust standard library functions like std::str::encode_utf16. You would avoid having to go across the FFI boundary when performing the conversion.

@filmil filmil added the enhancement New feature or request label May 8, 2020
@filmil filmil added the help wanted Extra attention is needed label Aug 27, 2020
bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 2, 2020
bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 2, 2020
bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 8, 2020
@bdragon
Copy link
Contributor

bdragon commented Oct 8, 2020

I have been working on this issue (diff here). On the whole it is a straightforward change, mainly involving the replacement of some unsafe code with str::encode_utf16 and String::from_utf16 and making rote adjustments as needed downstream (the main one due to the shift from impl TryFrom<&str> for UChar to From, as converting a &str to a Vec<u16> is infallible). Tests are passing with the exception of those in rust_icu_umsg, and I believe I see why.

In most places where we pass a ustring::UChar across the FFI boundary, we also specify its length. This allows us to generally avoid having to worry about null terminators. What is unique about rust_icu_umsg is that it is the only place in this project where I have seen a va_list used. Because the ICU message formatting APIs inherently expect a variable number of arguments, we have no means of specifying the explicit length of any UChar * passed across the boundary. The underlying ICU code finds a UChar * and expects it to be null-terminated. Of course Rust's str::encode_utf16() does not add a null terminator and there's no guarantee that ustring::UChars created with From<Vec<u16>> will include a trailing 0 either, and so we end up passing a pointer to memory with an undefined length. I feel fairly confident that this is the problem because if I push a 0 onto the end of any ustring::UChars (with make_z) in the va_list first, the issue seems to go away.

I wanted to reach out and get thoughts on this before moving forward.

@filmil
Copy link
Member

filmil commented Oct 9, 2020 via email

@filmil filmil added the discuss before start The issues isn't quite ready to be fixed, there are open questions to be discussed label Oct 9, 2020
@filmil
Copy link
Member

filmil commented Oct 9, 2020

I realized we were missing a label that says "this issue has open questions to be resolved", so I added the label and added it to this issue.

@bdragon
Copy link
Contributor

bdragon commented Oct 10, 2020

I'm not hard set on making this change. When I was first getting acquainted with rust_icu_ustring, it occurred to me that we could do the UTF-16 conversion in Rust, so this issue stood out when I looked at the backlog. I learned a few things along the way so I won't consider it wasted effort.

Do you have any suggestions for a better issue to pick up? Perhaps one of the missing ICU4X APIs (#138 is an example)?

@filmil
Copy link
Member

filmil commented Oct 12, 2020

A couple of notes:

  • As for the UTF-16 conversion, perhaps a good way forward would be to switch it on using feature?
  • As for other worthwhile work, take a look at this table: https://github.com/google/rust_icu/projects
    • The goal is to fill the matrix out with checkmarks.
    • But before starting on any particular cell in that table, please check out the currently active and dormant bugs: https://github.com/google/rust_icu/projects/1 and see if you can assign one to yourself ahead of time, and we can discuss each bug where appropriate. This way we'll know who's working on what and will hopefully avoid duplicated/wasted work.

@filmil
Copy link
Member

filmil commented Nov 4, 2020

Just checking in, @bdragon, Is there something else from the list above that piqued your interest?

@bdragon
Copy link
Contributor

bdragon commented Nov 6, 2020

@filmil sorry for letting this collect dust. I will take another look at the UTF-16 work and reply in more detail this weekend. Regarding the remaining ECMA 402 tasks, I would be happy to work on any of them, really, though I don't believe I can assign one to myself. To be less indifferent, #134 (Implement Intl.Collator) is appealing.

@filmil
Copy link
Member

filmil commented Nov 6, 2020

That's all OK. I just wanted to understand if there is a task that piques your interest.

No need to be super-formal about assignment, especially if there are permission issues -- just drop a note on the bug that you intend to take and I'll assign you to it.

@bdragon
Copy link
Contributor

bdragon commented Nov 10, 2020

@filmil I can see a path for making this an optional feature.

Conversion to UTF-16 with ICU via FFI is obviously fallible; conversion in Rust with str::encode_utf16 is not. Because of this, I opted to replace the TryFrom<&str> implementation for UChar with From<&str>. This meant simplifying (albeit minimally) a few handfuls of call sites throughout rust_icu.

Offering conversion in Rust as an optional feature would enable this slightly friendlier API for end users, but I think having one fallible and one infallible API, depending on how the conversion is performed, would burden those call sites within rust_icu with cfg/feature-conditional logic (could this fail or not?).

Therefore I think it would be best to continue using TryFrom and leave the UChar::try_from(&str)? call sites as they are throughout rust_icu. When the feature is enabled, we could use std::convert::Infallible as the error type in the TryFrom implementation and then additionally offer a From<&str> implementation for end users.

What do you think?

Looking now at my initial concern, since the problem seems to be scoped to rust_icu_umsg, maybe the solution is to ensure that any string arguments passed to message_format! are null-terminated (when the Rust conversion feature is enabled)?

@filmil
Copy link
Member

filmil commented Nov 11, 2020 via email

@bdragon
Copy link
Contributor

bdragon commented Nov 11, 2020

Sounds good—no rush on my end.

@filmil
Copy link
Member

filmil commented Nov 21, 2020

@filmil I can see a path for making this an optional feature.

Therefore I think it would be best to continue using TryFrom and leave the UChar::try_from(&str)? call sites as they are throughout rust_icu. When the feature is enabled, we could use std::convert::Infallible as the error type in the TryFrom implementation and then additionally offer a From<&str> implementation for end users.

What do you think?

This avenue seems to me like the least amount of complication for the end user. A single constructor is used throughout. I am not super-worried about the fallibility of the constructor remaining even if the underlying code is infallible.

Here are a few alternatives, none spectacularly better:

  • Introduce both TryFrom and From implementations, but implement them differently depending on whether the underlying machinery is fallible or not.
    • Pros:
      • Uniform, the APIs are the same regardless of which feature is used
    • Cons:
      • From will have to panic in case of an error.
  • Implement either TryFrom or From, depending on which feature is active
    • Pros:
      • Enables us to say what we mean
    • Cons:
      • Probably confusing UX to have some implementations disappear based on the feature set. It may also make it awkward
        for two libraries to use different feature combinations.

Thanks for your patience.

@filmil
Copy link
Member

filmil commented Nov 23, 2020

Note, I hope to have answered your question here, let me know if I can be of more help.

@bdragon
Copy link
Contributor

bdragon commented Nov 26, 2020

Yes, thank you—I think I have enough to get this wrapped up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss before start The issues isn't quite ready to be fixed, there are open questions to be discussed enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants