Use Rust built-in UTF-16 conversion #8

sffc · 2019-11-16T05:51:26Z

Split from #6

The code currently calls u_strToUTF8. It would be better to use Rust standard library functions like std::str::encode_utf16. You would avoid having to go across the FFI boundary when performing the conversion.

The text was updated successfully, but these errors were encountered:

bdragon · 2020-10-08T22:42:44Z

I have been working on this issue (diff here). On the whole it is a straightforward change, mainly involving the replacement of some unsafe code with str::encode_utf16 and String::from_utf16 and making rote adjustments as needed downstream (the main one due to the shift from impl TryFrom<&str> for UChar to From, as converting a &str to a Vec<u16> is infallible). Tests are passing with the exception of those in rust_icu_umsg, and I believe I see why.

In most places where we pass a ustring::UChar across the FFI boundary, we also specify its length. This allows us to generally avoid having to worry about null terminators. What is unique about rust_icu_umsg is that it is the only place in this project where I have seen a va_list used. Because the ICU message formatting APIs inherently expect a variable number of arguments, we have no means of specifying the explicit length of any UChar * passed across the boundary. The underlying ICU code finds a UChar * and expects it to be null-terminated. Of course Rust's str::encode_utf16() does not add a null terminator and there's no guarantee that ustring::UChars created with From<Vec<u16>> will include a trailing 0 either, and so we end up passing a pointer to memory with an undefined length. I feel fairly confident that this is the problem because if I push a 0 onto the end of any ustring::UChars (with make_z) in the va_list first, the issue seems to go away.

I wanted to reach out and get thoughts on this before moving forward.

filmil · 2020-10-09T16:27:42Z

I am not super convinced that this change is actually necessary for `rust_icu` in particular. But please note that my opinion on this is biased, and a possible way forward is laid out below. I feel that some background is in order to explain my opinion. The folks who started `rust_icu` are mostly working on Fuchsia OS. If you're an operating system, it pays to use just one library to work with Unicode, because you get to reuse it many times. So for us, using Rust code to do something that ICU can already do is not essential. This tradeoff is different for other uses. Most "regular" users are not super concerned with binary size, so it makes sense to them to simplify the APIs as you did. In a world where I could make a wish, I'd leave both approaches, possibly hidden behind a feature switch, so that there is an option to use pure ICU or a hybrid approach. There will be wrinkles as you noted, where the ICU API is not self-consistent: it's not clear to me why for msgfmt in particular they departed from the "pascal" array passing convention (size+data buffer). I think that were the built-in UTF-16 conversion hidden behind a flag, you could simply use the conversion that copies the bytes. This way you could get a choice between using ICU only, without the conversion; and rust+icu which is a better API but slightly inefficient. Thoughts? I apologize if this isn't exactly the feedback you may have hoped for. We should have discussed this issue ahead of time to avoid wasted effort. F

filmil · 2020-10-09T16:30:13Z

I realized we were missing a label that says "this issue has open questions to be resolved", so I added the label and added it to this issue.

bdragon · 2020-10-10T00:05:06Z

I'm not hard set on making this change. When I was first getting acquainted with rust_icu_ustring, it occurred to me that we could do the UTF-16 conversion in Rust, so this issue stood out when I looked at the backlog. I learned a few things along the way so I won't consider it wasted effort.

Do you have any suggestions for a better issue to pick up? Perhaps one of the missing ICU4X APIs (#138 is an example)?

filmil · 2020-10-12T19:56:07Z

A couple of notes:

As for the UTF-16 conversion, perhaps a good way forward would be to switch it on using feature?
As for other worthwhile work, take a look at this table: https://github.com/google/rust_icu/projects
- The goal is to fill the matrix out with checkmarks.
- But before starting on any particular cell in that table, please check out the currently active and dormant bugs: https://github.com/google/rust_icu/projects/1 and see if you can assign one to yourself ahead of time, and we can discuss each bug where appropriate. This way we'll know who's working on what and will hopefully avoid duplicated/wasted work.

filmil · 2020-11-04T00:34:17Z

Just checking in, @bdragon, Is there something else from the list above that piqued your interest?

bdragon · 2020-11-06T04:50:42Z

@filmil sorry for letting this collect dust. I will take another look at the UTF-16 work and reply in more detail this weekend. Regarding the remaining ECMA 402 tasks, I would be happy to work on any of them, really, though I don't believe I can assign one to myself. To be less indifferent, #134 (Implement Intl.Collator) is appealing.

filmil · 2020-11-06T08:34:58Z

That's all OK. I just wanted to understand if there is a task that piques your interest.

No need to be super-formal about assignment, especially if there are permission issues -- just drop a note on the bug that you intend to take and I'll assign you to it.

bdragon · 2020-11-10T05:50:50Z

@filmil I can see a path for making this an optional feature.

Conversion to UTF-16 with ICU via FFI is obviously fallible; conversion in Rust with str::encode_utf16 is not. Because of this, I opted to replace the TryFrom<&str> implementation for UChar with From<&str>. This meant simplifying (albeit minimally) a few handfuls of call sites throughout rust_icu.

Offering conversion in Rust as an optional feature would enable this slightly friendlier API for end users, but I think having one fallible and one infallible API, depending on how the conversion is performed, would burden those call sites within rust_icu with cfg/feature-conditional logic (could this fail or not?).

Therefore I think it would be best to continue using TryFrom and leave the UChar::try_from(&str)? call sites as they are throughout rust_icu. When the feature is enabled, we could use std::convert::Infallible as the error type in the TryFrom implementation and then additionally offer a From<&str> implementation for end users.

What do you think?

Looking now at my initial concern, since the problem seems to be scoped to rust_icu_umsg, maybe the solution is to ensure that any string arguments passed to message_format! are null-terminated (when the Rust conversion feature is enabled)?

filmil · 2020-11-11T19:57:27Z

I'm a bit busy at the moment, but I have this on my TODO list, and will respond to this by the end of the week. Thank you for your patience. In general, the direction of (1) using a feature and (2) making a minimally invasive change would be most welcome. I still need to look in detai.

bdragon · 2020-11-11T21:45:01Z

Sounds good—no rush on my end.

filmil · 2020-11-21T00:23:06Z

@filmil I can see a path for making this an optional feature.

Therefore I think it would be best to continue using TryFrom and leave the UChar::try_from(&str)? call sites as they are throughout rust_icu. When the feature is enabled, we could use std::convert::Infallible as the error type in the TryFrom implementation and then additionally offer a From<&str> implementation for end users.

What do you think?

This avenue seems to me like the least amount of complication for the end user. A single constructor is used throughout. I am not super-worried about the fallibility of the constructor remaining even if the underlying code is infallible.

Here are a few alternatives, none spectacularly better:

Introduce both TryFrom and From implementations, but implement them differently depending on whether the underlying machinery is fallible or not.
- Pros:
  - Uniform, the APIs are the same regardless of which feature is used
- Cons:
  - From will have to panic in case of an error.
Implement either TryFrom or From, depending on which feature is active
- Pros:
  - Enables us to say what we mean
- Cons:
  - Probably confusing UX to have some implementations disappear based on the feature set. It may also make it awkward
    for two libraries to use different feature combinations.

Thanks for your patience.

filmil · 2020-11-23T22:37:19Z

Note, I hope to have answered your question here, let me know if I can be of more help.

bdragon · 2020-11-26T03:01:04Z

Yes, thank you—I think I have enough to get this wrapped up.

filmil added the enhancement New feature or request label May 8, 2020

filmil added the help wanted Extra attention is needed label Aug 27, 2020

bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 2, 2020

Perform UTF-8 <-> UTF-16 conversions in Rust google#8

8ffd298

bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 2, 2020

Perform UTF-8 <-> UTF-16 conversions in Rust google#8

02f0a3b

bdragon added a commit to bdragon/rust_icu that referenced this issue Oct 8, 2020

Perform UTF-8 <-> UTF-16 conversions in Rust google#8

2374505

filmil added the discuss before start The issues isn't quite ready to be fixed, there are open questions to be discussed label Oct 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Rust built-in UTF-16 conversion #8

Use Rust built-in UTF-16 conversion #8

sffc commented Nov 16, 2019

bdragon commented Oct 8, 2020 •

edited

Loading

filmil commented Oct 9, 2020 via email

filmil commented Oct 9, 2020

bdragon commented Oct 10, 2020

filmil commented Oct 12, 2020

filmil commented Nov 4, 2020

bdragon commented Nov 6, 2020

filmil commented Nov 6, 2020

bdragon commented Nov 10, 2020

filmil commented Nov 11, 2020 via email

bdragon commented Nov 11, 2020

filmil commented Nov 21, 2020

filmil commented Nov 23, 2020

bdragon commented Nov 26, 2020

Use Rust built-in UTF-16 conversion #8

Use Rust built-in UTF-16 conversion #8

Comments

sffc commented Nov 16, 2019

bdragon commented Oct 8, 2020 • edited Loading

filmil commented Oct 9, 2020 via email

filmil commented Oct 9, 2020

bdragon commented Oct 10, 2020

filmil commented Oct 12, 2020

filmil commented Nov 4, 2020

bdragon commented Nov 6, 2020

filmil commented Nov 6, 2020

bdragon commented Nov 10, 2020

filmil commented Nov 11, 2020 via email

bdragon commented Nov 11, 2020

filmil commented Nov 21, 2020

filmil commented Nov 23, 2020

bdragon commented Nov 26, 2020

bdragon commented Oct 8, 2020 •

edited

Loading