Unicode is 1 byte per character: the Latin characters and the other symbols found on a standard keyboard.
Multibyte is Latin, Greek, Russian and everything else that exceeds the initial 256 symbols.
Is that how it works?
|
I thought a C++ char has the size of one byte. How can something that is greater than 1 byte (Unicode, Multibyte) fit into a char?
|
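It cannot. Text that needs more than one byte per character is stored using an “encoding” that spreads a single character over several char elements, the most popular by far being UTF-8.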
Mircea
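To make the point about encodings concrete, here is a minimal C++ sketch. It assumes the string holds valid UTF-8, and the function name count_code_points is made up for this illustration. It counts characters (code points) by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Assumption: the input is valid UTF-8; no validation is attempted.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "Neac\xC5\x9Fu" (the letter 'ş', U+015F, is the two bytes 0xC5 0x9F in UTF-8), size() reports 7 char elements while count_code_points reports 6 characters.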
|
Mircea Neacsu wrote: the most popular by far being UTF-8.
I'd say that the most popular file storage format is UTF-8.
As a working format, in RAM, UTF-16 is very common. E.g. it is the format used by all Windows APIs, which is more or less to say all Windows programs. Java uses UTF-16 in RAM, as do a lot of other modern languages.
It must be said that not all software that claims to use UTF-16 handles it fully. Much of it supports only the BMP ("Basic Multilingual Plane"), where every supported character fits in one 16-bit code unit. The BMP didn't have space for latecomer alphabets, such as those of a number of African and Asian languages. Most developers said "But my program isn't aimed at such markets, so I'll ignore the UTF-16 surrogates, which encode such characters as two 16-bit code units. I can treat text as if all characters are of equal width, 16 bits".
But a new situation has arisen: Emojis have procreated to a number far exceeding the number of WinDings. They do not all fit in the BMP, so a number of them have been allocated in planes other than the BMP. Don't expect the end user to know which emojis are defined in which planes and refrain from using non-BMP emojis! If you are not prepared for them, your code may mess up the text badly.
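As an illustration of what the surrogates do, here is a small sketch of how one code point becomes one or two UTF-16 code units. The helper name encode_utf16 is hypothetical, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate value).

```cpp
#include <string>

// Encodes a single Unicode code point as UTF-16.
// Assumption: cp is a valid scalar value; no range checking is done.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP character: exactly one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the remaining 20 bits over two surrogates.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

With this sketch, U+20AC (the euro sign) comes out as one code unit, while U+1F600 (a grinning-face emoji) comes out as the pair 0xD83D 0xDE00. Code that assumes one code unit per character will see that emoji as two "characters".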
Writing your own complete UTF-16 interpreter is not recommended. Use library functions! There is more to UTF-16 than just alternative planes: Some character codes are nonspacing, or combining (typically an accent and a character). So you cannot deduce the number of print positions from the length of the UTF-16 string - not even after considering control characters.
For "trivial" strings limited to Western alphabets, there usually is a fairly close correspondence between the number of UTF-16 code units and the number of positions. You can pretend that it is exact, but look out for cases that need to be treated as exceptions. I suspect that is what a lot of programmers do. 99.9% of Western text is free of exceptional cases, so the fixed-code-width assumption holds. Until, of course, emojis become common, e.g. in file names. Note that UTF-32 does not provide an ultimate solution to all problems: You still may have to deal with nonspacing or combining characters!
Religious freedom is the freedom to say that two plus two make five.
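The point about combining characters can be shown even with UTF-32, where every code point is one element. The following is only a rough sketch: it treats just the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is a big simplification, and the name count_base_characters is invented here. A real program would ask a Unicode library (ICU, for example) for grapheme clusters instead.

```cpp
#include <cstddef>
#include <string>

// Counts code points that are not combining marks.
// Simplification: only U+0300..U+036F is treated as combining; real Unicode
// has many more combining and nonspacing characters than this one block.
std::size_t count_base_characters(const std::u32string& text) {
    std::size_t count = 0;
    for (char32_t cp : text) {
        const bool combining = (cp >= 0x0300 && cp <= 0x036F);
        if (!combining) {
            ++count;
        }
    }
    return count;
}
```

A string holding 'e' followed by U+0301 (combining acute accent) has length 2 in UTF-32, yet it displays as the single character 'é', and the sketch counts 1.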
|
I have to confess that I am a convert to the UTF-8 religion as preached in the UTF-8 Everywhere manifesto. So much so that I've written a series of articles on CP about using UTF-8 in Windows.
Some of your assertions are open to interpretation:
Quote: So you cannot deduce the number of print positions from the length of the UTF-16 string - not even after considering control characters.
Why would that be interesting from a programming point of view? From a typographical point of view, sure, but as programmers we don't usually concern ourselves with such minutiae.
The subject of emojis is another pet peeve of mine, so allow me a bit of a detour. Some evolutionary solutions have been reinvented many times: flight has been reinvented by insects, birds, mammals, you name it. However, there are some crucial points in evolution that happened only once. Photosynthesis and eukaryotic cells are prime examples, but so is alphabetic writing. Moving from pictographic writing, where a symbol represented a whole word, to one where a symbol represented a sound, was a magnificent achievement of the human spirit that opened the path to what we now call Western civilization. Now, if you buy at least some of my arguments, you can see how disappointed I am when this whole evolutionary path is turned back by the spread of emojis. No longer do we need the magic words of a Shakespearean sonnet when we can just put a heart and a smiley face. Bleah!
Mircea
|
Mircea Neacsu wrote: Quote: So you cannot deduce the number of print positions from the length of the UTF-16 string - not even after considering control characters. Why would that be interesting from a programming point of view?
If you code anything that is to be presented to a user, you will frequently have to deal with the physical space available, whether a 16 char single line display on an embedded device, or a field in a form on a desktop PC.
If you just send it the entire string, leaving it to the display unit to discard what won't fit, then first: You may upset the display device. Second: Maybe it is obvious to you that the first 'n' characters are displayed, but don't trust it: Many small-display devices display a rolling text, so the last 'n' characters are displayed. In either case, your customer may be less than satisfied with your solution. If you present floating point values with the number of decimal positions less than the internal precision (which is almost always the case), you may want to consider rounding the last displayed digit - don't expect a pure UI module to have any concept of floating point rounding! (Besides, it may want the values as separate digits, not as an FP value.)
Even if a value is not presented to a human user, it may be exchanged with another software module in textual format. The receiver may provide a limited size text buffer, or may require a minimum number of (valid) characters (possibly converted to 7-bit ASCII with zero parity, if it is an old *nix application!)
If your software has nothing at all to do with a user interface, you may still be handling data that you hand over to some software doing the UI. This software may put restrictions on the lengths of both prompt strings and data values. You may have to make decisions about what to display, either by some form of abbreviation (initial only, ellipsis, ...), leaving (semi-)optional parts out, etc.
I certainly can think of specific programming tasks that are completely unrelated to character string length. But to me, those are special cases. The main rule is that the printable length, both in number of positions and the typographical length (when using variable width fonts) can be essential, and you should be prepared to handle it. You ask a Unicode handling library function for the number of positions when you need it. You ask a UI typography library function for the typographical length if that is what you need e.g. to shorten the string to fit into a field.
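As a small illustration of shortening a string to fit a field, here is a sketch that cuts a UTF-8 string after a given number of code points without splitting a multi-byte sequence. It assumes valid UTF-8, the name truncate_utf8 is made up for this example, and it counts code points only. It does not handle combining characters or typographical width, which is exactly why a library function is the better choice in real code.

```cpp
#include <cstddef>
#include <string>

// Returns the longest prefix of `utf8` holding at most `max_code_points`
// code points, never cutting inside a multi-byte sequence.
// Assumption: the input is valid UTF-8.
std::string truncate_utf8(const std::string& utf8, std::size_t max_code_points) {
    std::size_t seen = 0;  // code points accepted so far
    std::size_t i = 0;     // byte index
    while (i < utf8.size()) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        const bool starts_code_point = (byte & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == max_code_points) {
                break;  // the next code point would exceed the limit
            }
            ++seen;
        }
        ++i;
    }
    return utf8.substr(0, i);
}
```

Cutting "Neac\xC5\x9Fu" to 5 code points keeps both bytes of 'ş', whereas a plain substr(0, 5) would cut the character in half and leave an invalid byte sequence.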
|
My remark was made mostly tongue-in-cheek (hence the smiley after it). Of course the length of the rendered text is of interest in many/most applications. It's just that, luckily, I don't have to worry about it because people who write the nitty-gritty of UI have taken care of it. For instance, in Windows, I can just call the GetTextExtentPoint32 function to have the text measured.
However, this has nothing to do with UTF-8 vs UTF-16. I remain of the opinion that UTF-16 has no particular advantage when compared with UTF-8. (If there are other readers of this conversation, please don't start a flame war now - this is just a personal opinion.) I see UTF-16 as a stepping stone from the time when the computing world needed to move away from ASCII, but in this day and age, it has served its purpose and we can move on to something better.
Mircea
|
Mircea Neacsu wrote: However this has nothing to do with UTF-8 vs UTF-16
That is certainly true.
Mircea Neacsu wrote: I remain of the opinion that UTF-16 has no particular advantage when compared with UTF-8.
I am leaning towards agreeing with you.
Mostly, I am observing - and have been observing for 40+ years - that people strive for non-Einstein solutions, the Einstein solution being "Make it as simple as possible, but no simpler". People want to do it simpler! For years, I heard lots of people say that 32 bits is overkill, Unicode will never grow beyond the first plane, the BMP - there aren't anywhere close to 65,536 different characters! And for a number of years, they were right: Unicode did manage with the basic plane only.
That is when people started using 16 bit characters, although I am not sure that the name UTF-16 was known that early. With the BMP only, most simple(r than possible) developers thought it quite simple; a string of 16-bit characters was just like a string of 8-bit characters, only with more characters. (Look at the History section of Wikipedia: Unicode - even the initial developers of Unicode argued the same!)
If it had ended up that way, it would have been significantly simpler: You can count the number of characters as easily in 16 bit as in 8 bit character code. You can index character 23 by string8[23] or string16[23]. In other words: I can fully understand why Windows NT (1993) and Java (1995) went for 16 bit characters. (At the time of Windows NT release, UTF-8 had been proposed, but was not yet accepted as a standard - anyway, you don't change the system character encoding from 16 bits fixed to n*8 bits a few weeks before the release of a new OS!)
As we all know now, the solution was simpler than possible. Several of my coworkers were highly surprised when BMP overflowed, but didn't worry: We are never going to encounter those characters in the entire lifetime of our software! I think that they for at least ten more years continued to access character 23 by string16[23]. I can understand them. Until we got emojis in other planes, they were essentially right.
But it was a too simple solution. When you were forced to handle multiple planes, and maybe you at the same time discovered combining and non-spacing codes, then the simplicity disappeared. You have all the same issues with UTF-8; it is not any worse with UTF-16, and in Western text, the special cases occur rarely. Most of the time, UTF-16 is more straightforward, but you have to be prepared for the exceptions. With UTF-8, you can never relax; you handle variable length characters all the time! (At least if you regularly write non-English text, which is the common case in most European countries.)
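To show what goes wrong with string16[23] once surrogates appear, here is a sketch that fetches the n-th code point from a UTF-16 string by walking it from the start. The name code_point_at is hypothetical, the input is assumed to be well-formed UTF-16, and returning U+FFFD for an out-of-range index is just a choice made for this example.

```cpp
#include <cstddef>
#include <string>

// Returns the code point at position `index`, counting code points rather
// than 16-bit code units. Returns U+FFFD if `index` is out of range.
// Assumption: the input is well-formed UTF-16.
char32_t code_point_at(const std::u16string& text, std::size_t index) {
    std::size_t i = 0;
    while (i < text.size()) {
        char32_t cp = text[i];
        std::size_t units = 1;
        const bool high = (cp >= 0xD800 && cp <= 0xDBFF);
        if (high && i + 1 < text.size()) {
            const char32_t low = text[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                units = 2;
            }
        }
        if (index == 0) {
            return cp;
        }
        --index;
        i += units;
    }
    return 0xFFFD;  // index out of range
}
```

For a string made of the pair 0xD83D 0xDE00 followed by 'A', plain indexing with text[1] yields the lone low surrogate 0xDE00, while code_point_at(text, 1) yields 'A'. The price is that access by position is no longer constant time, which is the same situation as with UTF-8.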
If UTF-8 didn't exist, I would be happy with sending UTF-16 memory strings straight to file. Having UTF-8 as an alternative in-memory format creates trouble; I want one single unambiguous string format. Now that Windows, Java and C# all use UTF-16, I am not going to start using UTF-8 in-memory.
But I also want to have one single unambiguous file format. UTF-8 is established, UTF-16 is not. So UTF-8 wins. I am stressing: Don't waste your time trying to process UTF-xxx yourself; use library functions. So when I read text from or write text to file, I let library functions process the strings. Each format has its use.
After all, I guess I really disagree with you: If we start with a tabula rasa, but we are to select The One And Only Encoding, UTF-16 and UTF-8 are equally good. But that isn't the situation in memory: Windows and numerous other essential tools/subsystems have based themselves on UTF-16. Given that, using UTF-8 in my application strings would introduce a lot of complexities. So accepting the realities of life, my programs will continue to use UTF-16 strings.
Until, of course, I start working with an OS having UTF-8 as its system string encoding and languages/tools that use UTF-8 as their in-memory string encoding.
|
trønderen wrote: After all, I guess I really disagree with you
The world would be too boring if we didn't have different opinions.
trønderen wrote: I want one single unambiguous string format.
You aren't going to get it, or at least not in this lifetime. If you go to Linux or Mac worlds, everything is UTF-8. In Windows world it's UTF-16 with a sprinkle of UTF-8.
trønderen wrote: But I also want to have one single unambiguous file format. UTF-8 is established, UTF-16 is not. So UTF-8 wins.
If I understand you correctly, you suggest having UTF-8 files converted to UTF-16 on entry, processed as UTF-16 inside the application and converted back to UTF-8 on output. That would complicate things very much if you target different OS-es. It would also be inefficient if your app doesn't require the UTF-16 parts of the OS (the ReadFile and WriteFile functions in Windows work with any encoding).
My strategy is almost a mirror image of that: Everything is UTF-8 until it needs to call certain OS functions when a thin wrapper converts all inputs to UTF-16 and all results back to UTF-8.
Mircea
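For readers wondering what such a thin wrapper has to do, here is a portable sketch of the two conversions. The function names utf8_to_utf16 and utf16_to_utf8 are invented for this example, and both assume well-formed input with no validation or error handling. On Windows one would more likely call MultiByteToWideChar and WideCharToMultiByte with CP_UTF8 rather than hand-rolled code like this.

```cpp
#include <cstddef>
#include <string>

// UTF-8 -> UTF-16. Assumption: input is well-formed UTF-8.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// UTF-16 -> UTF-8. Assumption: input is well-formed UTF-16.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A wrapper around a UTF-16 OS call would then convert its string arguments with utf8_to_utf16 on the way in and any returned text with utf16_to_utf8 on the way out, so the rest of the application only ever sees UTF-8.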
|
Mircea Neacsu wrote: If I understand you correctly, you suggest having UTF-8 files converted to UTF-16 on entry, processed as UTF-16 inside the application and converted back to UTF-8 on output.
A simple UTF-8/16 conversion filter (included by default) in the StreamReader/Writer, or whatever your IO classes are named, certainly does not complicate things for the developer.
For interaction with other systems, whether they use UTF-8, UTF-32 or UTF-16 as a working format, UTF-8 is the lingua franca, the Esperanto of textual information. It is The File Character Encoding. No application needs to be concerned about it.
Quote: That would complicate things very much if you target different OS-es.
If your alternative is to reject any OS that does not use UTF-8 as its system character encoding, and all programming languages, libraries and development environments that do not use UTF-8, you are most certainly right. In that situation, I would certainly go for UTF-8 in-memory as well.
For a great number of developers, in-memory UTF-8 isn't a viable option. UTF-16-oriented OSes, languages and tools are a fact of life. I say: When in Rome, roam with the Romans (or however the saying goes). Even though a lot of developers of drivers and interrupt handlers and low-level network protocols with only rudimentary textual output work in *nix-like environments, the great majority of those communicating textually, with users and others, roam Rome in UTF-16 environments.
If you are talking about making applications that can be ported between UTF-16 and UTF-8 oriented OSes without a single source code change in the string handling, with your code assuming UTF-8 strings even under APIs expecting UTF-16, then you are overly optimistic. You will have to do a lot of adaptations to UTF-16-oriented library and system functions. Or wrap every single one of them in a two-way-conversion wrapper.
The best way to handle it is to leave all string handling to library functions, with your application knowing nothing about the encoding under the hood. (Your floating point application would probably run fine on a machine with an FP format different from IEEE-754!) Treat a string as a string, regardless of encoding. Make sure that when you handle characters individually, you use a 32-bit character type such as char32_t to hold them.
Quote: If you go to Linux or Mac worlds, everything is UTF-8.
Most certainly not. Well, I never worked with Mac, but in Linux there is loads of software that can't handle anything but 8 bit character sets. You can even come across software that cannot handle 8 bit, but only 7 bit characters. A few years ago, I was editing a configuration file on a Unix system; the network module crashed immediately because I had added a comment which contained a non-ASCII 8859 character (in the name of one maintainer).
A number of RFC-822 based internet protocols still cannot handle ISO 8859 (it would be against RFC-822). There still is a real need for QP encoding, backslash or ampersand encoding, etc.
You may rightfully say that *nix/Mac apps written to handle UTF-8 do handle UTF-8. Big surprise. You may claim that languages/tools specifying a UTF-16 string representation don't use it when running under *nix/Mac - that all strings are converted to UTF-8 when ported to these OSes, both in source code and library APIs. I doubt very much that that is the case.
|
Mircea Neacsu wrote: My strategy is almost a mirror image of that: Everything is UTF-8 until it needs to call certain OS functions when a thin wrapper converts all inputs to UTF-16 and all results back to UTF-8.
Mine is exactly the same, which I thought I had explained earlier. I guess not clearly enough.
|
Sorry Richard, indeed I didn't notice that. It's been a rather busy weekend.
Mircea
|
trønderen wrote: If you just send it the entire string, leaving to the display unit to discard what won't fit, for one: You may upset the display device.
To be fair, however, your example suggests that the developer has no idea what the business space even is.
So for example, if I expect the users to use a PC with a larger display, and they use a phone, then it is not going to work very well. But if I expect them to use a phone as well, then I should account for that and test/develop for that also.
trønderen wrote: This software may put restrictions on the lengths of both prompt strings and data values.
That isn't quite right: such data is always limited. Nothing in computing is unlimited, no matter how it is used. If a developer does not consider that from initial creation, it will cause problems. Display problems are just one case. It is similar to a very naive developer who designs a database and makes every text field a blob just on the chance that the extra space might be needed.
|
I read your article on the use of UTF-8, but I have a point of (very slight) disagreement. The majority of Windows API calls have both UNICODE and ASCII versions, and the C and C++ runtimes are very much UTF-8 biased. So there is no need to make your projects UNICODE based, as there are only a few instances when you are forced to use Unicode strings. I used to write my applications so that I could easily build them either way, but gave that up and now manage the conversions for the odd functions that force me to use Unicode text.
|
Richard MacCutchan wrote: The majority of Windows API calls have both UNICODE and ASCII versions,
Indeed, but the question is: what happens when you call an ANSI function, let's say MessageBoxA, with a text that contains characters outside the 0-127 range? The result depends on the active code page (ACP) setting. Until recently (2019) the programmer had no control over which ACP the user had selected. He could only query this setting using the GetACP function.
In newer versions of Windows, you can declare the code page you want to use in the app manifest. If you declare the UTF-8 code page, you can use UTF-8 with the ANSI functions.
For more details see Use UTF-8 code pages in Windows apps - Windows apps | Microsoft Learn
Mircea
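To illustrate why the result depends on the code page, here is a toy sketch, not a real decoder. The enum CodePage and the function decode_byte are invented for this example, and only one byte value above 0x7F is covered. The byte 0xBA means 'ş' (U+015F) in Windows-1250 but 'º' (U+00BA) in Windows-1252.

```cpp
// Hypothetical names, for illustration only.
enum class CodePage { Windows1250, Windows1252 };

// Maps one byte to a Unicode code point under the given code page.
// Only ASCII and the single byte 0xBA are handled; anything else
// returns U+FFFD because this sketch does not carry the full tables.
char32_t decode_byte(unsigned char byte, CodePage page) {
    if (byte < 0x80) {
        return byte;  // ASCII is identical in both code pages
    }
    if (byte == 0xBA) {
        return page == CodePage::Windows1250 ? 0x015F   // 'ş'
                                             : 0x00BA;  // 'º'
    }
    return 0xFFFD;  // not covered by this sketch
}
```

So the same char buffer passed to an ANSI function can show different text on two machines. Bytes below 0x80 are safe because every one of these code pages agrees with ASCII there.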
|
Mircea Neacsu wrote: what happens when ...
Assuming the developers understand the customers' requirements* this will have been catered for.
*which is probably on the toss of a coin
|
Mircea Neacsu wrote: Indeed, but the question is: what happens when you call an ANSI function, let's say MessageBoxA with a text that contains characters outside the 0-127 range?
ANSI always was 8-bit, 0-255, ISO 8859-x, 'x' varying with the region. Almost all the Western world had their needs covered by 8859-1. (Lapps were one exception; they used 8859-8, if my memory is correct.)
|
trønderen wrote: Almost all the Western world had their needs covered by 8859-1
Funny you say that: I cannot even properly write my last name in 8859-1. It is written "Neacşu". I have to go to Windows-1250 for that. The Wikipedia page for ISO/IEC 8859-1 also lists other languages that are not fully covered.
Anyway, at this point, I think we should agree to disagree as I'll continue to keep everything in UTF-8 inside my programs.
Mircea
|
trønderen wrote: But a new situation has arisen: Emojis have procreated to a number far exceeding the number of WinDings. They do not all fit in BMP
I did not even consider that as a possibility.
So I should make sure I add something that says "no emojis!".
|
I understand. Thank you, guys.
|
You are welcome.
"In testa che avete, Signor di Ceprano?"
-- Rigoletto
|
Did *nix, or C under any other OS, ever run on a machine whose word size is not a power of two, like 12, 18 or 36 bits? Such as DEC-10 / DEC-20 or Univac 1100 series. I never heard of any, but I'd expect that it was done. Univac could either operate with 6 bit FIELDATA bytes or 9 bit bytes. DEC put five 7 bit bytes into each word, with one bit to spare.
If C was run on such machines, how was char handled? Enforcing 8 bits, making the strings incompatible with other programming languages? Or did C surrender to 6 or 7 bit bytes? If they stuck to 8 bits, did each word on a 36 bit machine have 4 bits to spare? (2 bits on an 18 bit machine.) Or did they fit 4.5 char per word - 9 char per two words? (On Univac, you could address 6 or 9 bit bytes semi-directly, using string instructions, without the need for shifting and masking.)
By the old definition, both architectures had a char size of one byte. The old understanding of 'byte' was the space required to store a single character; it could vary from 5 bits (Baudot code) to at least 9 bits (Univac and others). Lots of international standards (developed outside of internet environments!) use the term 'octet' for 8 bits to avoid confusion with byte sizes other than 8 bits.
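On the question of how char relates to the byte, standard C and C++ keep the old meaning in one respect: a char is one byte by definition, and the number of bits in that byte is given by CHAR_BIT, which must be at least 8 but is not required to be exactly 8. The sketch below only checks those two guarantees; the helper bits_in_int is a made-up name.

```cpp
#include <climits>

// sizeof(char) is 1 by definition; CHAR_BIT tells how wide that one byte is.
static_assert(sizeof(char) == 1, "a char is one byte by definition");
static_assert(CHAR_BIT >= 8, "the standards require at least 8 bits per char");

// Number of bits in an int, computed without assuming 8-bit bytes.
constexpr int bits_in_int() {
    return static_cast<int>(sizeof(int)) * CHAR_BIT;
}
```

This means that 6 or 7 bit chars are ruled out for a conforming implementation, while a 9-bit char on a 36-bit machine (four chars per word, no bits to spare) would be allowed by these rules.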