Encode wwstrings as utf8 #1220
Hi @reznikmm, I think I need your wisdom with this one. I'm trying to get Alire itself on the
Here is where I hit a situation I'm not fully understanding, and the GNAT docs are somewhat sparse on the exact details: I hear you that
So, I think that we need the "default" Text_IO behavior, so I then applied
Note that that output is being done through I'm going to try to use plain In conclusion, what I would want to do is:
Thanks a lot for any help. PS: I blindly tried
(Note the failing 🛈 before "If you experience...") PS2: PS3: I'm now using this minimal example (edit: with
And using
This is exactly what I wanted to avoid with unconditional |
I guess we're victims of not having used It's possible Ada_TOML is relying on unwritten assumptions about encoding and needs tweaking too? I saw some low-level looking things, e.g. here. Perhaps it should use Stream_IO rather than Text_IO here? I guess the low-level parsing is because TOML mandates UTF-8. That Also I don't have special knowledge of how OSes receive/provide Unicode strings. As plain 32-bit code sequences? In UTF-8? Governed by the locale? I guess the Ada IO subsystem is abstracting that part, so going to raw stream IO may not be a good idea anyway. |
I tried switching Text_IO to Stream_IO in ada-toml, given that it seems to be processing raw bytes, and with that things seem to fall into place when using only |
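For illustration, a minimal sketch of the byte-oriented approach mentioned above: read the file through Ada.Streams.Stream_IO so that no text-mode translation is applied, then decode the bytes as UTF-8 explicitly. `Read_UTF8_File` is a hypothetical helper written for this example, not the actual ada-toml change, and it assumes the whole file fits on the stack.

```ada
with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

--  Hypothetical helper (not part of ada-toml): returns the contents of
--  Path decoded from UTF-8 into a Wide_Wide_String.
function Read_UTF8_File (Path : String) return Wide_Wide_String is
   package SIO renames Ada.Streams.Stream_IO;

   File : SIO.File_Type;
begin
   SIO.Open (File, SIO.In_File, Path);

   declare
      Size   : constant Stream_Element_Count :=
        Stream_Element_Count (SIO.Size (File));
      Buffer : Stream_Element_Array (1 .. Size);
      Last   : Stream_Element_Offset;
      Bytes  : Ada.Strings.UTF_Encoding.UTF_8_String (1 .. Natural (Size));
   begin
      --  Raw read: the bytes arrive exactly as stored on disk.
      SIO.Read (File, Buffer, Last);
      SIO.Close (File);

      --  Copy each byte into a Character without interpreting it.
      for I in 1 .. Last loop
         Bytes (Natural (I)) := Character'Val (Buffer (I));
      end loop;

      --  Decode raises Encoding_Error on malformed UTF-8 input.
      return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode
        (Bytes (1 .. Natural (Last)));
   end;
end Read_UTF8_File;
```

The point of the design is that the encoding is decided by the parser (TOML mandates UTF-8) rather than by whatever the text I/O layer is configured to do.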
So it seems these are the minimal changes needed to move forward and to pass our test suite short of a full refactoring of "text" strings into wide strings or other separate type. As a learning exercise it's been quite interesting.
This was because I forgot a
This was because of the forgotten All in all, I'm now more confident that going full Unicode is doable and should not be confusing for new crates. There will probably be some trouble when using legacy code with Text_IO in place of WW_Text_IO. The sooner all crates are properly Unicode-ready, the shorter the pain, I guess. |
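As a rough sketch of what the PR title refers to, assuming the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is what performs the conversion: a Wide_Wide_String is encoded into a UTF-8 byte string before being handed to a byte-oriented output routine. `Print_UTF8` and its sample text are made up for this example, and the use of GNAT.IO here is an assumption based on the thread, not a description of Alire's actual code.

```ada
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with GNAT.IO;

--  Hypothetical demo procedure, not taken from Alire.
procedure Print_UTF8 is
   --  "café", with the last character given by code point so the example
   --  does not depend on the source file encoding or on -gnatW switches.
   Text : constant Wide_Wide_String :=
     "caf" & Wide_Wide_Character'Val (16#E9#);

   --  Four code points become five bytes: U+00E9 encodes as C3 A9.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text);
begin
   --  The already-encoded bytes are written through a String-based routine.
   GNAT.IO.Put_Line (Encoded);
end Print_UTF8;
```

Whether the terminal renders the result correctly still depends on its own encoding settings.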
That's a lot of backwards incompatibility. We were supposed to find little to no problems, yet on Alire alone there are 4 dependencies that require changes.
Does that mean adding |
There are two separate sources of issues here though: the use of "fancy" symbols that are easily detected during compilation, and then the use of Ada.Text_IO/GNAT.IO. It's also possible my incomplete understanding plus the unexpected ada-toml obstacle have painted an excessively troublesome situation.
In Alire we have hit what is arguably a perfect storm of corner cases: Once
Then I suspect that GNAT.IO is bypassing any/some transformations applied by these switches; but as it is preelaborable, it may become a problem to eradicate in some code bases. Also, it has no Wide_Wide_IO alternative. Our logging depended on it, for example, and it still does during elaboration time. And I don't see a solution short of buffering log messages for a second task to output them, which may or may not be a problem depending on how critical it is for output to be emitted ASAP. Then again, this could be chalked up to Alire conflating logging and output for users. I suspect this can be quite common though.
Then, ada-toml is doing its own parsing of Unicode but at the same time relying on Text_IO to read "bytes", which might be wrong. Once I fixed (?) it to use Stream_IO, everything seems alright from that side.
The larger questions, though, are: are there better alternatives? Is not addressing Unicode at all better? Part of the issue is that string literals worked well by default with UTF-8-encoded sources. So
I don't understand exactly why the resulting strings were any different. It would expose their existence in bodies though.
I think they're left to their own devices. Please feel free to correct any wrong assumptions in here. I'm starting to get confused by all the things I tried. To clear the air, I'm going to prepare a sample crate with dependencies to try to isolate all combos, if only for my sake. Also I plan to run |
I've been experimenting a bit more. I think my brain is fried because my isolated tests here https://github.com/mosteo/unitest match Maxim's explanations but I was unable to replicate the differences between specs and bodies for example. One conclusion is that adding |
Sorry, I'm on vacation and not paying much attention to e-mails, so I missed your issue. I hope we can make the TOML parser independent of |
Since GNAT.IO is safe to use with utf8
We have decided with @Fabien-Chouteau to merge this one for the time being. For the changed dependencies, I'll open PRs bumping the major version just in case. |