C / C++ / MFC

 
Answer: Re: Character set (CPallini, 12-Sep-24 23:19)
Re: Character set (Calin Negru, 12-Sep-24 23:46)
Re: Character set (Mircea Neacsu, 13-Sep-24 0:06)
Re: Character set (trønderen, 13-Sep-24 9:33)
Re: Character set (Mircea Neacsu, 13-Sep-24 17:17)
Re: Character set (trønderen, 14-Sep-24 7:25)
Re: Character set (Mircea Neacsu, 14-Sep-24 15:10)

Re: Character set (trønderen, 14-Sep-24 17:27)
Mircea Neacsu wrote:
However this has nothing to do with UTF-8 vs UTF-16
That is certainly true.
Mircea Neacsu wrote:
I remain of the opinion that UTF-16 has no particular advantage when compared with UTF-8.
I am leaning towards agreeing with you.

Mostly, I am observing - and have been observing for 40+ years - that people strive for non-Einstein solutions. The rule is "Make it as simple as possible, but no simpler". People want to make it simpler still! For years, I heard lots of people say that 32 bits is overkill: Unicode will never grow beyond the first plane, the BMP, because there isn't anywhere close to 65,536 different characters! And for a number of years, they were right: Unicode did manage with the basic plane only.

That is when people started using 16-bit characters, although I am not sure that the name UTF-16 was known that early. With the BMP only, most simple(r than possible) developers thought it quite simple: a string of 16-bit characters was just like a string of 8-bit characters, only with more characters. (Look at the History section of the Wikipedia article on Unicode - even the initial developers of Unicode argued the same!)

If it had ended up that way, it would have been significantly simpler: You can count the number of characters as easily in a 16-bit as in an 8-bit character code. You can index character 23 by string8[23] or string16[23]. In other words: I can fully understand why Windows NT (1993) and Java (1995) went for 16-bit characters. (At the time of the Windows NT release, UTF-8 had been proposed, but was not yet accepted as a standard - anyway, you don't change the system character encoding from 16 bits fixed to n*8 bits a few weeks before the release of a new OS!)

As we all know now, the solution was simpler than possible. Several of my coworkers were highly surprised when the BMP overflowed, but didn't worry: We are never going to encounter those characters in the entire lifetime of our software! I think they continued to access character 23 by string16[23] for at least ten more years. I can understand them. Until we got emojis in other planes, they were essentially right.
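To make the string16[23] problem concrete, here is a minimal C++ sketch. It is not from the discussion itself; the function names (`decode_pair`, `count_code_points`, `code_point_at`) are made up for illustration, and the handling of unpaired surrogates (counting each as one code point) is an assumption of the sketch, not something any standard library promises.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// High (first) and low (second) halves of a UTF-16 surrogate pair.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a surrogate pair into the code point it encodes (U+10000 and up).
inline char32_t decode_pair(char16_t hi, char16_t lo) {
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
}

// True if a complete surrogate pair starts at code unit i.
inline bool pair_at(const std::u16string& s, std::size_t i) {
    return is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1]);
}

// Number of code points; an unpaired surrogate counts as one (assumption).
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n)
        if (pair_at(s, i)) ++i;  // step over the low surrogate as well
    return n;
}

// Code point number `index` (0-based), or U+FFFD if index is out of range.
// Note that this is a linear scan, unlike s[index].
char32_t code_point_at(const std::u16string& s, std::size_t index) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        const bool pair = pair_at(s, i);
        if (n == index) return pair ? decode_pair(s[i], s[i + 1]) : char32_t(s[i]);
        if (pair) ++i;
    }
    return 0xFFFD;
}
```

For the string `u"a\U0001F600b"`, `size()` reports 4 code units while `count_code_points` reports 3, and `s[2]` is the low surrogate 0xDE00 rather than the letter b. As long as the text stays inside the BMP, the two views coincide, which is why plain indexing appeared to work for so long.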

But the solution was too simple. When you were forced to handle multiple planes, and maybe at the same time discovered combining and non-spacing codes, the simplicity disappeared. You have all the same issues with UTF-8; it is not any worse with UTF-16, and in Western text, the special cases occur rarely. Most of the time, UTF-16 is more straightforward, but you have to be prepared for the exceptions. With UTF-8, you can never relax; you handle variable length characters all the time! (At least if you regularly write non-English text, which is the common case in most European countries.)
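The "never relax" point for UTF-8 can be sketched the same way. This is an illustration only: the helper names (`utf8_sequence_length`, `utf8_offset_of`, `utf8_count_code_points`) are hypothetical, the input is assumed to be well-formed UTF-8, and overlong or otherwise invalid sequences are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with this lead byte,
// or 0 if the byte cannot start a sequence (e.g. a continuation byte).
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Byte offset of code point number `index` (0-based), or s.size() if the
// string has fewer code points. Assumes well-formed UTF-8.
std::size_t utf8_offset_of(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (pos < s.size() && index > 0) {
        const std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[pos]));
        assert(len != 0 && "input is assumed to be well-formed UTF-8");
        pos += len;
        --index;
    }
    return pos < s.size() ? pos : s.size();
}

// Number of code points: every byte that is not a continuation byte
// (10xxxxxx) starts a new one.
std::size_t utf8_count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

With the UTF-8 bytes of "trønderen" (`"tr\xC3\xB8nderen"`), the string holds 10 bytes but 9 code points, and the fourth code point (the n) sits at byte offset 4, not 3. Combining codes add a further layer in either encoding: `"e\xCC\x81"` (e followed by U+0301) is 3 bytes and 2 code points, yet displays as a single character.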

If UTF-8 didn't exist, I would be happy with sending UTF-16 memory strings straight to file. Having UTF-8 as an alternative in-memory format creates trouble; I want one single unambiguous string format. Now that Windows, Java and C# all use UTF-16, I am not going to start using UTF-8 in-memory.

But I also want to have one single unambiguous file format. UTF-8 is established, UTF-16 is not. So UTF-8 wins. I am stressing: Don't waste your time trying to process UTF-xxx yourself; use library functions. So when I read text from or write text to file, I let library functions process the strings. Each format has its use.

After all, I guess I really disagree with you: If we start with a tabula rasa, but we are to select The One And Only Encoding, UTF-16 and UTF-8 are equally good. But that isn't the situation in memory: Windows and numerous other essential tools/subsystems have based themselves on UTF-16. Given that, using UTF-8 in my application strings would introduce a lot of complexity. So, accepting the realities of life, my programs will continue to use UTF-16 strings.

Until, of course, I start working with an OS having UTF-8 as its system string encoding and languages/tools that use UTF-8 as their in-memory string encoding.
Religious freedom is the freedom to say that two plus two make five.

Re: Character set (Mircea Neacsu, 15-Sep-24 2:26)
Re: Character set (trønderen, 15-Sep-24 12:16)
Re: Character set (Richard MacCutchan, 15-Sep-24 21:08)
Re: Character set (Mircea Neacsu, 16-Sep-24 2:41)
Re: Character set (jschell, 17-Sep-24 12:29)
Re: Character set (Richard MacCutchan, 14-Sep-24 21:34)
Re: Character set (Mircea Neacsu, 15-Sep-24 2:45)
Re: Character set (Richard MacCutchan, 15-Sep-24 2:58)
Re: Character set (trønderen, 15-Sep-24 12:19)
Re: Character set (Mircea Neacsu, 17-Sep-24 13:49)
Re: Character set (jschell, 17-Sep-24 12:22)
Re: Character set (CPallini, 13-Sep-24 0:06)
Re: Character set (Calin Negru, 13-Sep-24 1:32)
Re: Character set (CPallini, 13-Sep-24 1:34)
Re: Character set (trønderen, 14-Sep-24 6:48)
Re: Character set (Calin Negru, 14-Sep-24 9:18)
Re: Character set (k5054, 14-Sep-24 10:15)
