Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode wwstrings as utf8 #1220

Merged
merged 5 commits into from
Oct 24, 2022
Merged

Encode wwstrings as utf8 #1220

merged 5 commits into from
Oct 24, 2022

Conversation

mosteo
Copy link
Member

@mosteo mosteo commented Oct 10, 2022

No description provided.

@mosteo
Copy link
Member Author

mosteo commented Oct 10, 2022

Hi @reznikmm, I think I need your wisdom with this one.

I'm trying to get Alire itself on the -gnatW8 boat after our discussions in #1207. What I've done is:

  1. Add -gnatW8 to Alire and dependencies where non-ASCII strings appear.
  2. Fix all compilation errors by encoding the wwstrings into UTF-8 strings.

Here is where I hit a situation I'm not fully understanding, and the GNAT docs are somewhat sparse on the exact details:

I hear you that -gnatW8 changes the behavior of Ada.Text_IO.Put_Line. Alas, it also changes Get-related subprograms, because now our TOML parser fails when trying to load UTF-8 files, e.g. (there's some umlaut):

$ alr search hell
error: Invalid TOML contents in file:       
error:    invalid UTF-8 encoding at /home/user/prog/alire-index/index/ad/ada_fuse/ada_fuse-1.0.1.toml:8:20

So, I think that we need the "default" Text_IO behavior, so I then applied -Wb as you suggested (here's were documentation gets really sparse). But, after forcing some non-ASCII output:

$ alr search hell
["26A0"] Index 'local' version (1.2.0) is older than the newest supported by alr (1.2.1)
["26A0"] You can disable this warning with configuration key 'warning.old_index'

Note that that output is being done through Ada.Wide_Wide_Text_IO using decoded UTF8 strings (and the 26A0 matches what I'm trying to print). At this point I don't understand why Ada.Wide_Wide_Text_IO would fail? Or is the decoding faulty? (No, see at the end.)

I'm going to try to use plain Text_IO with UTF-8 strings, but that seems an inferior solution as you're presuming the system encoding.

In conclusion, what I would want to do is:

  1. Keep default Text_IO behavior so Ada_TOML doesn't break.
  2. Print plain WWStrings through Wide_Wide_Text_IO so whatever encoding is used by the OS is honored (I'm guessing this is the way to do it properly, but I'm far from certain).

Thanks a lot for any help.

PS: I blindly tried -W8 instead of -Wb and this results in other bizarre trouble, as some strings look correct while other don't, and anyway TOML remains broken:

$ alr search hell
⚠ Index 'local' version (1.2.0) is older than the newest supported by alr (1.2.1)
⚠ You can disable this warning with configuration key 'warning.old_index'
â If you experience any problems loading this index, you may need to reset the community index with 'alr index --reset-community'. Note that this operation will delete any local changes to the community index.
error: Invalid TOML contents in file:       
error:    invalid UTF-8 encoding at /home/jano/prog/alire-index/index/ad/ada_fuse/ada_fuse-1.0.1.toml:8:20

(Note the the failing 🛈 before "If you experience...")

PS2: -gnatW8 also seems to affect GNAT's detection of modified files, since now if I immediately recompile without touching anything, almost all files are compiled again, always.

PS3: I'm now using this minimal example (edit: with -gnatW8):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;

procedure Main is
begin
  Put_Line ("Raw: 🛈⚠");
  Put_Line ("Enc/dec: " & Decode (UTF_8_String'(Encode ("🛈⚠"))));
end Main;

And using -Wb indeed breaks it with output:

Raw: ["01F6C8"]["26A0"]
Enc/dec: ["01F6C8"]["26A0"]

@Fabien-Chouteau
Copy link
Member

This is exactly what I wanted to avoid with unconditional -gnatW8. The impact is much more important than what was discussed in #919.

@mosteo
Copy link
Member Author

mosteo commented Oct 10, 2022

I guess we're victims of not having used -gnatW8 since the very beginning. What I would like to understand well now is what set of switches + practices should adopt an entirely new development based on Unicode, without legacy concerns. My understanding is that it would be -gnatW8, not using -W with the binder, and using Wide_Wide_Text_IO everywhere.

It's possible Ada_TOML is relying on unwritten assumptions about encoding and it needs tweaking too? I saw some low-level looking things. e.g. here. Perhaps it should use Stream_IO rather than Text_IO here? I guess the low-level parsing is because TOML mandates UTF8.

That Wide_Wide_Text_IO is changing behavior is surprising to me, but it seems expected given https://gcc.gnu.org/onlinedocs/gnat_rm/Wide_005fWide_005fText_005fIO.html#Wide_005fWide_005fText_005fIO . So perhaps decoding to use WWTIO instead of TIO with UTF8 is pointless.

Also I don't have special knowledge of how OSes receive/provide Unicode strings. As plain 32-bit code sequences? in UTF8? Governed by the locale? I guess the Ada IO subsystem is abstracting that part, so going to raw stream IO may be not a good idea anyway.

@mosteo
Copy link
Member Author

mosteo commented Oct 10, 2022

I tried switching Text_IO to Stream_IO in ada-toml, given that it seems to be processing raw bytes, and with that things seems to fall in place when using only -gnatW8. I'm still getting some mixed utf8/latin1 output but I guess that will be some stray use of GNAT.IO/Text_IO.

@mosteo
Copy link
Member Author

mosteo commented Oct 10, 2022

So it seems these are the minimal changes needed to move forward and to pass our test suite short of a full refactoring of "text" strings into wide strings or other separate type. As a learning exercise it's been quite interesting.

PS2: -gnatW8 also seems to affect GNAT's detection of modified files, since now if I immediately recompile without touching anything, almost all files are compiled again, always.

This was because I forgot a -gnatW8 in one of the dependencies; now it's back to normal.

some strings look correct while other don't

This was because of the forgotten -gnatW8, so strings in the dependency specs were properly translated but strings in bodies were still latin1-encoded.

All in all, I'm now more confident that going full Unicode is doable and should not be confusing for new crates. There will be probably some trouble when using legacy code with Text_IO in place of WW_Text_IO. The sooner all crates are properly Unicode-ready the shortest the pain I guess.

@mosteo mosteo marked this pull request as ready for review October 10, 2022 22:04
@Fabien-Chouteau
Copy link
Member

Fabien-Chouteau commented Oct 11, 2022

That's a lot of backwards incompatibility. We were supposed to find little to no problems, and on Alire alone there are 4 dependencies that require changes, that's a lot.

This was because I forgot a -gnatW8 in one of the dependencies; now it's back to normal.

Does that mean adding -gnatW8 to the gprbuild command line directly will yield better result?
What about crates that are built in post-fetch? (gtkada)

@mosteo
Copy link
Member Author

mosteo commented Oct 11, 2022

That's a lot of backwards incompatibility. We were supposed to find little to no problems, and on Alire alone there are 4 dependencies that require changes, that's a lot.

There are two separate sources of issues here though: the use of "fancy" symbols that are easily detected during compilation, and then the use of Ada.Text_IO/GNAT.IO. It's also possible my incomplete understanding plus the unexpected ada-toml obstacle have painted an excessively troublesome situation.

In Alire we have hit what is arguably a perfect storm of corner cases:

Once -gnatW8 is in effect, any legacy use of Text_IO with UTF8 strings will result in mangled output without -Wb. But, with -Wb, any code trying to do "the right thing" and using Text_Text_IO will see its output mangled (I think? It's starting to get fuzzy now.) So all dependencies must be on the same page [not] to presume use of -Wb.

Then I suspect that GNAT.IO is bypassing any/some transformations applied by these switches; but as it is preelaborable it may become a problem to eradicate in some code bases. Also, it has no Wide_Wide_IO alternative. Our logging depended on it for example, and it still does during elaboration time. And I don't see a solution short of buffering log messages for a second task to output them, which may be or not a problem depending on how critical is for output to be emitted ASAP. Then again, this could be chalked to Alire conflating logging and output for users. I suspect this can be quite common though.

Then, ada-toml is doing its own parsing of Unicode but at the same time relying on Text_IO to read "bytes", which might be wrong. Once I fixed (?) it to use Stream_IO, everything seems alright from that side.

The larger questions are though: is there better alternatives? Is not addressing Unicode at all better?

Part of the issue is that string literals worked well by default with UTF-8-encoded sources. So -gnatW8 is more tailored to source code entities using non-ASCII letters. Some could consider an acceptable sacrifice to restrict Ada sources to the ASCII set?

Does that mean adding -gnatW8 to the gprbuild command line directly will yield better result?

I don't understand exactly why the resulting strings were any different. It would expose their existence in bodies though.

What about crates that are built in post-fetch? (gtkada)

I think they're left to their own devices.

Please feel free to correct any wrong assumptions in here. I'm starting to get confused by all the things I tried. To clear the air, I'm going to prepare a sample crate with dependencies to try to isolate all combos, if only for my sake.

Also I plan to run alr test locally on all currently succeeding crates to see how many of them break down with -gnatW8.

@mosteo
Copy link
Member Author

mosteo commented Oct 11, 2022

I've been experimenting a bit more. I think my brain is fried because my isolated tests here https://github.com/mosteo/unitest match Maxim's explanations but I was unable to replicate the differences between specs and bodies for example.

One conclusion is that adding -gnatW8 to gprbuild in the command-line would unnecessarily break compilations in more cases (e.g. isolated strings in bodies). Another one seems to be that GNAT.IO, instead of being a problem, is preferable as it always works properly (because it doesn't attempt to remap anything?). Finally, I couldn't find a way for TIO and WWTIO to work properly at the same time, even within the same library, which is just plain messy as in any situation mixing them one of them is always going to fail (??? I must be missing something). I've perused https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/the_implementation_of_standard_i_o.html trying to find something relevant without success.

@reznikmm
Copy link
Contributor

reznikmm commented Oct 12, 2022

Sorry, I'm on vacation and don't pay much attention to e-mails, so I've missed your issue. I hope we can make TOML parser be independent on -W/-gnatW8 switches by adding Form => "WCEM=b" to Ada.Text_IO.Open call in toml-file_io.adb. This parameter sets desired encoding explicitly for given file. But if you already has Input_Stream interface implementation over Stream_IO then your implementation is better from my point of view.

Since GNAT.IO is safe to use with utf8
@mosteo
Copy link
Member Author

mosteo commented Oct 24, 2022

We have decided with @Fabien-Chouteau to merge this one for the time being. For the changed dependencies, I'll open PRs bumping the major version just in case.

@mosteo mosteo merged commit 40062db into master Oct 24, 2022
@mosteo mosteo deleted the fix/utf8 branch October 24, 2022 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants