Hi,

I know that in a normal language a char is 1 byte in size. What is the maximum size of a char in languages such as Chinese? Is wchar_t its corresponding representation?

Thanks in advance

Take a look at ICU (International Components for Unicode).

The ICU libraries provide first-class support for Unicode, while wchar_t only provides a wider character type for wide-character encodings.

You can encode Chinese characters in UTF-8, UTF-16, UTF-32 (UCS-4), GB 18030, Code page 936, and others.

Best regards
Espen Harlinn
 
 
Comments
Sergey Alexandrovich Kryukov 15-Jul-11 20:31pm    
Good reference, my 5.
Actually, coding is not a problem just yet; the problem is that OP makes a false statement from the very beginning and does not understand how encodings work.

Please see my answer.
--SA
Espen Harlinn 16-Jul-11 4:36am    
Thank you, Sergey!
thatraja 15-Jul-11 23:24pm    
Bookmarked, 5!
Espen Harlinn 16-Jul-11 4:36am    
Thank you, thatraja!
What you "know" is not true!

First, how many bytes a character takes is not purely a characteristic of the language. And of course, there is no such thing as a "normal" language. Core Unicode does not define how many bytes each character has; it defines the set of code points: a correspondence between characters as cultural phenomena, abstracted from their concrete glyphs, and a set of integer values understood in their mathematical sense, abstracted from their computer representation.

Encodings called UTFs define how to represent each code point in bytes. Only UTF-32 uses a fixed 4 bytes per code point. Byte-oriented UTF-8 uses an interesting algorithm in which a code point takes 1, 2, 3 or 4 bytes; the length depends on the value of the code point and is announced by the leading byte of the sequence. UTF-16 is not a 16-bit code (!): a character takes either 16 or 32 bits (code points outside the Basic Multilingual Plane (BMP) are expressed as a surrogate pair, that is, two 16-bit words). Also, the UTF-16 and UTF-32 encodings can be little-endian or big-endian.

Now, about "normal" languages. Which language do you want to consider? American English, perhaps? All expressed in ASCII, code points 0 to 127, right? Think again! It depends on what you consider a "language". How about the full-fledged punctuation used in this language? Consider, for example, the typographically correct dashes and quotation marks: –, —, “ and ”. Try to type them on your keyboard. Their code points are 0x2013, 0x2014, 0x201C and 0x201D. Try to squeeze them into one byte. Good luck!

See http://unicode.org/ and http://unicode.org/faq/utf_bom.html.

Please don't make false statements; understand things yourself first.

—SA
 
 
Comments
thatraja 15-Jul-11 23:23pm    
Unicode rocks, 5!
Sergey Alexandrovich Kryukov 16-Jul-11 15:18pm    
Thank you, Raja.
--SA
The standard C functions have always worked for me when working with Chinese (Big5), Japanese, and Korean encoded byte streams. They depend on locale and OS support.

#include <cstdlib>
#include <climits>
MB_LEN_MAX // Maximum size in bytes of a multibyte character (any supported locale), from <climits>
MB_CUR_MAX // Maximum size in bytes of a multibyte character in the current locale, from <cstdlib>
mblen()    // length of MB character
mbtowc()   // MB character to WC character
wctomb()   // WC character to MB character
mbstowcs() // MB string to WC string
wcstombs() // WC string to MB string
 
 
Comments
Sergey Alexandrovich Kryukov 15-Jul-11 20:30pm    
This is made obsolete by the introduction of Unicode. This is not the problem; the problem is that OP makes a false statement from the very beginning and does not understand how encodings work.

Please see my answer.
--SA
John R. Shaw 16-Jul-11 12:13pm    
Actually, they are not obsolete. I know Unicode encoding in depth (BMP, surrogates, etc.); I have done internationalization. Encodings such as Big5 for Traditional Chinese, as well as Shift-JIS, are still around and need conversion into Unicode. The newer C standard (see Draft N1494) does provide new Unicode-specific functions in clause 7.27, and under the hood the STL usually calls the C functions.
Sergey Alexandrovich Kryukov 18-Jul-11 23:35pm    
Well, I did not mean that Big5 and Shift-JIS do not exist, but they are morally obsolete; as you say, they "need conversion to Unicode". Not vice versa, right? Obsolete is obsolete; this is an accurate term.
--SA
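Char size depends entirely on the encoding.
For example:
If it is UNICODE in the Windows sense (UTF-16), a code unit is two bytes, so most characters take two bytes and characters outside the BMP take four. It can be either big-endian or little-endian.
If it is UTF-8, it is less simple: a character takes from one to four bytes (the original design allowed up to six, but the current standard limits it to four). For a better picture, read the design section of the UTF-8 article on Wikipedia.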
 
 
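You can use the type wchar_t to hold Unicode text, but its size is platform-dependent: 2 bytes (a UTF-16 code unit) on Windows and typically 4 bytes on Linux and other Unix-like systems. A 2-byte wchar_t covers only the 65,536 code points of the BMP in a single unit; characters outside it need a surrogate pair. This representation is well supported by the common development environments.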
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


