-
I know everybody uses UTF-8 locales these days, but, at least on GNU/Linux, wouldn't it make sense to switch the behaviour depending on the value of the |
-
Disclaimer: I do not know much about encoding and codepages and such. But the discussion rang a bell about databases; |
-
I voted for UTF-8 support for all Ada sources in Alire, but not without pause. Personally, I stick to ASCII only in my personal projects. It is my impression that one of the original design goals of the Ada language was for the code to look as much as possible like normal English prose. Just look at the reserved keywords: they are all English words. Thus, introducing UTF-8 into source code that is meant to look like English prose doesn't seem like a natural course of action. Being able to use identifiers expressed in Spanish, Japanese or Farsi in one and the same code base won't help maintainability. Then again, it is possible to put Farsi in the source code even when the code is ASCII only, for example "function Payam return String;". From the compiler's perspective, in general, identifiers only need to be unique no matter what language is used. And a static code analysis tool could check that an Ada code base only uses English identifiers, regardless of whether the source is ASCII or UTF-8.

Why else would one want to use UTF-8 in the code base? Perhaps to put emojis in the comments. Is that the main motivation for allowing UTF-8 in one's source code? No, thinking about it, I would say the main motivation is to be able to embed UTF-8 strings. When sticking to ASCII only, all the UTF-8 strings need to be put in files separate from the Ada source code. It is more convenient to embed them directly in the source. So there is a use case where UTF-8 is more convenient, but not necessary.

Another reason I avoid UTF-8 in the source code is the existence of different code points that look the same (https://util.unicode.org/UnicodeJsps/confusables.jsp?a=test&r=None). In my personal projects I don't want to define a function Hello and then have the compiler tell me it cannot be called because there is no function Hello, only because the "e" in the definition and the "e" in the call are different code points that look identical to the naked eye. It would not take me long to figure out what's wrong, but it is a source of error that I am happy to be without.

Yet another reason against UTF-8 Ada source code is security. Allowing UTF-8 in the source code enables dangerous character sequences (https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/). For Go code, for example, there is a static analysis tool that detects these: https://github.com/breml/bidichk

The main motivation for allowing UTF-8 is to make the transition easier for newcomers to the Ada language. If people have the culture and tradition of putting UTF-8 in the source code of their Go, Rust, Java, C# applications etc., it may be off-putting if Ada source code had to be ASCII only.
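To make the Trojan Source concern above a bit more concrete, here is a minimal sketch in Ada of the kind of check such a tool might perform. It is not the logic of bidichk: the names Bidi_Scan and Has_Bidi_Control are hypothetical, and the ranges U+202A..U+202E and U+2066..U+2069 are the bidirectional controls as I understand them from the Trojan Source write-up, so treat them as an assumption. The source itself is ASCII only.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Bidi_Scan is

   --  Hypothetical helper: True if a UTF-8 encoded line contains one of
   --  the bidirectional control characters (assumed set, see lead-in).
   function Has_Bidi_Control (UTF8_Line : String) return Boolean is
      Decoded : constant Wide_Wide_String :=
        Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (UTF8_Line);
   begin
      for C of Decoded loop
         if C in Wide_Wide_Character'Val (16#202A#)
                   .. Wide_Wide_Character'Val (16#202E#)
           or else C in Wide_Wide_Character'Val (16#2066#)
                          .. Wide_Wide_Character'Val (16#2069#)
         then
            return True;
         end if;
      end loop;
      return False;
   end Has_Bidi_Control;

   --  U+202E (right-to-left override) encoded as UTF-8: E2 80 AE.
   RLO : constant String :=
     Character'Val (16#E2#) & Character'Val (16#80#) & Character'Val (16#AE#);

begin
   --  Plain ASCII must not be flagged.
   if Has_Bidi_Control ("return Hello;") then
      raise Program_Error with "false positive on plain ASCII";
   end if;

   --  A comment hiding an override character must be flagged.
   if not Has_Bidi_Control ("--  comment " & RLO & "hidden") then
      raise Program_Error with "bidi control not detected";
   end if;

   Ada.Text_IO.Put_Line ("bidi scan checks passed");
end Bidi_Scan;
```

Note that Decode raises Ada.Strings.UTF_Encoding.Encoding_Error on malformed input, so a real checker would have to decide how to treat files that are not valid UTF-8.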
-
Consider the following simple Ada program that reads the current directory and prints the file names. This Ada program only uses the ASCII character set, no UTF-8 in it:

    with Ada.Directories;
    with Ada.Text_IO;
    procedure Readdir is
       --  List ordinary files and directories, skip special files.
       Search_Filter : constant Ada.Directories.Filter_Type
         := (Ada.Directories.Ordinary_File => True,
             Ada.Directories.Directory => True,
             Ada.Directories.Special_File => False);
       Search : Ada.Directories.Search_Type;
       Ent : Ada.Directories.Directory_Entry_Type;
    begin
       Ada.Directories.Start_Search (Search, Directory => ".",
                                     Pattern => "*", Filter => Search_Filter);
       while Ada.Directories.More_Entries (Search) loop
          Ada.Directories.Get_Next_Entry (Search, Ent);
          --  The name is printed exactly as returned by the file system.
          Ada.Text_IO.Put_Line (Ada.Directories.Simple_Name (Ent));
       end loop;
       Ada.Directories.End_Search (Search);
    end Readdir;

Create a file that shows the issue:
If you compile it with If you compile it with The same issue will arise if you get your UTF-8 string from a database or from an i18N database such as Please, stay away from that broken compiler option until it is really fixed. |
-
ARM 2.1 16 says:
Without |
-
As @reznikmm pointed out, the Ada String is a sequence of Latin-1 which to my understanding means that:
Any other advice?
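As a small illustration of the point that an Ada String is a sequence of Latin-1 characters, the sketch below converts between a Latin-1 String and its UTF-8 encoding with the standard Ada.Strings.UTF_Encoding packages (Ada 2012). The procedure name Latin1_Vs_Utf8 and the sample word are my own choices, not anything from the thread, and the source is kept ASCII only so that it does not depend on any compiler encoding switch.

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Latin1_Vs_Utf8 is

   --  "cafe" with an e-acute as a Latin-1 String: last Character is U+00E9.
   Latin1 : constant String := "caf" & Character'Val (16#E9#);

   --  The same text encoded as UTF-8, still stored in a String.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin1);

   --  Decoding into Wide_Wide_String gives one element per code point.
   Decoded : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Encoded);

begin
   --  Four characters in Latin-1, five bytes in UTF-8, four code points.
   if Latin1'Length /= 4
     or else Encoded'Length /= 5
     or else Decoded'Length /= 4
   then
      raise Program_Error with "unexpected lengths";
   end if;

   --  Decoding the UTF-8 form gives back the original Latin-1 String.
   if Ada.Strings.UTF_Encoding.Strings.Decode (Encoded) /= Latin1 then
      raise Program_Error with "round trip failed";
   end if;

   Ada.Text_IO.Put_Line ("Latin-1 / UTF-8 round trip checks passed");
end Latin1_Vs_Utf8;
```

The same String type holds both forms, which is why nothing in the type system tells a reader (or Ada.Text_IO) whether a given String is Latin-1 text or UTF-8 bytes.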
-
I guess the program text encoding is best stored as program text metadata in the Alire crate.
-
Unfortunately this won't work with the current GNAT compiler, because it reads all sources in a single encoding. For instance, if you specify -gnatW8 for your library, the compiler will read every dependency as UTF-8 as well.
-
Lots of interesting discussion. It reinforces my impressions though that without mandated The hard sad truth seems to be that without I don't see how moving forward without Of course you can use String to store UTF-8... as long as you deal with the IO in some raw form. EDIT: and let's not forget that even without |
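For the remark above about using String to store UTF-8 as long as the IO is done in some raw form, here is a minimal sketch of one way that could look, using the standard Ada.Text_IO.Text_Streams package. The procedure name Raw_Utf8_Output is hypothetical. That the bytes reach standard output unchanged regardless of compiler switches is my understanding of how stream attributes behave, not something confirmed in this thread.

```ada
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;

procedure Raw_Utf8_Output is

   --  e-acute (U+00E9) encoded as UTF-8 is the two bytes C3 A9.
   E_Acute : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  UTF-8 bytes stored in a plain String, one byte per Character.
   Word : constant String := "caf" & E_Acute;

   --  Stream view of standard output.
   Output : constant Ada.Text_IO.Text_Streams.Stream_Access :=
     Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);

begin
   --  Three ASCII bytes plus a two-byte sequence.
   if Word'Length /= 5 then
      raise Program_Error with "expected five bytes";
   end if;

   --  Write the Characters through the stream instead of Put_Line,
   --  so no Text_IO character conversion is involved (assumption).
   String'Write (Output, Word);
   Character'Write (Output, ASCII.LF);
end Raw_Utf8_Output;
```

The trade-off is that the program, not the runtime, is then responsible for knowing that Word holds UTF-8, and Text_IO column and line bookkeeping is bypassed.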
-
Is there an 'easy way' to configure which I read somewhere about a person (from Finland, Hakkinen? perhaps) who wrote his own standard with - among other things - another definition of Update: |
-
We are again discussing what, if anything, Alire should do about Unicode. My last attempt at documenting the situation is at #1332 (see unicode.md and AEP-0004.md for full details). In a way I see this as the same pain that arose in the Python2 -> Python3 transition.
I want to get more feedback from experts and a sense of the community feeling. There's a poll below.
The gist of the conundrum is:
* With -gnatW8, GNAT presumes UTF-8 sources and changes Text_IO and Wide_Wide_Text_IO default behavior globally.
* Ada.Text_IO will behave differently depending on -gnatW8 being active.
* Without -gnatW8, Ada.Text_IO prints bytes as-is.
* With -gnatW8, Ada.Text_IO applies a Latin1 --> UTF8 conversion which garbles UTF-8 strings.
* Ada.Text_IO works "as expected" for UTF-8 sources without -gnatW8.
* Ada.Wide_Wide_Text_IO doesn't work by default without -gnatW8 for non-Latin1 characters (prints brackets encoding).

There's no good default that will work for all cases for both unaware and aware users, which is the main point of contention.
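The garbling mentioned above can be modelled without touching any compiler switch. The sketch below is only a model of a Latin1 --> UTF8 conversion applied to bytes that are already UTF-8; it does not call into or reproduce the GNAT runtime. The procedure name Double_Encoding is hypothetical, and the conversion is done with the standard Ada.Strings.UTF_Encoding.Strings.Encode function.

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Double_Encoding is

   --  U+00E9 (e-acute) already encoded as UTF-8: bytes C3 A9.
   Utf8 : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  Treat each byte as a Latin-1 character and encode it again.
   Garbled : constant String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Utf8);

   --  U+00C3 becomes C3 83 and U+00A9 becomes C2 A9.
   Expected : constant String :=
     Character'Val (16#C3#) & Character'Val (16#83#)
     & Character'Val (16#C2#) & Character'Val (16#A9#);

begin
   --  Two bytes in, four bytes out: one character now shows as two.
   if Garbled /= Expected then
      raise Program_Error with "unexpected double encoding result";
   end if;

   Ada.Text_IO.Put_Line ("double-encoding check passed");
end Double_Encoding;
```

A UTF-8 terminal would render those four bytes as a capital A with tilde followed by a copyright sign, which is the familiar mojibake pattern for text that has been encoded twice.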
Reading past debates at comp.lang.ada (mainly this one) it seems this is beyond repair from the ARG side, and there are proponents of both extremes, even of avoiding it completely by staying in ASCII in sources.
In the poll below you'll find the following options:
* [...] -gnatW8).
* [...] -gnatW8) are more likely to find trouble from non-gnatW8 crates.
* [...] Text_IO don't really behave as they should, even if this counterintuitively works.
* [...] -gnatW8 the default (this is currently enabled in the master branch)
* [...] Ada.Text_IO is unclear until runtime.
* [...] Ada.Text_IO garbling UTF-8 strings.

27 votes