I am trying to read a Unicode (UTF-8) file one character at a time, but I don't know how to read a single character. What is the easiest way to do that?
Updated 9-Jan-12 2:09am
Comments
Mohibur Rashid 9-Jan-12 8:09am    
Correcting Title

Maybe this article will get you started in the right direction.
 
 
Comments
Emilio Garavaglia 7-Jan-12 13:16pm    
:-O
There are several options, depending on the type of stream you're using: fgetc, ReadFile, fstream's operator>>, etc.
 
 
After reading the good article referenced above by DrBones69, you can also use the sample code from this thread: Read unicode file into wstring
 
 
Because UTF-8 encoded characters have a variable length, you have to check each byte read. A possible solution (using a file handle opened in binary mode) would be:

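C++
#include <cstdio>

typedef struct {
    int nLen;
    unsigned char cByte[6];
} utf8char_t;

// Read UTF-8 char into struct
// Return number of UTF-8 bytes read (0 upon EOF, -1 upon invalid codes)
int read_utf8_char(FILE *f, utf8char_t& tChar)
{
    tChar.nLen = 0;
    // Test the result of fgetc() itself: feof() only becomes true after a read has failed
    int ch = fgetc(f);
    if (ch == EOF)
        return 0;
    unsigned char c = tChar.cByte[0] = static_cast<unsigned char>(ch);
    if (c & 0x80)
    {
        // The number of leading 1 bits in the first byte is the sequence length
        while (c & 0x80)
        {
            ++tChar.nLen;
            c <<= 1;
        }
        // A single leading 1 bit marks a continuation byte, which cannot start a character
        if (tChar.nLen < 2 || tChar.nLen >= 6)
            return -1;
        for (int i = 1; i < tChar.nLen; ++i)
        {
            ch = fgetc(f);
            if (ch == EOF)
                return 0;
            tChar.cByte[i] = static_cast<unsigned char>(ch);
            // Continuation bytes have the form 10xxxxxx
            if ((tChar.cByte[i] & 0xC0) != 0x80)
                return -1;
        }
    }
    else
        tChar.nLen = 1;
    return tChar.nLen;
}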


Please note that this example does not check for all possible invalid UTF-8 sequences.
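Once the raw bytes of a character have been read, they usually need to be combined into a Unicode code point. The following is a minimal sketch of that step, not part of the answer's code: the helper name utf8_to_codepoint is hypothetical, it assumes the bytes have already been validated as a lead byte plus continuation bytes, and it only handles the 1 to 4 byte forms allowed by current UTF-8.

```cpp
#include <cstdint>

// Hypothetical helper: combines the bytes of one already-validated UTF-8
// sequence (1 to 4 bytes) into a Unicode code point.
// Returns 0xFFFFFFFF if the length is out of range.
std::uint32_t utf8_to_codepoint(const unsigned char* bytes, int len)
{
    if (len < 1 || len > 4)
        return 0xFFFFFFFFu;
    if (len == 1)
        return bytes[0];                       // ASCII: the value is the byte itself

    // Lead byte: mask off the length marker bits, keep the payload bits.
    std::uint32_t cp = bytes[0] & (0xFFu >> (len + 1));
    // Each continuation byte contributes its low 6 bits.
    for (int i = 1; i < len; ++i)
        cp = (cp << 6) | (bytes[i] & 0x3Fu);
    return cp;
}
```

A byte array and length such as the cByte and nLen members of a utf8char_t could be passed to it directly. Comparing the result against the smallest value each length may encode (0x80, 0x800, 0x10000) is one way to detect overlong encodings.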
 
 
Comments
Mohibur Rashid 8-Jan-12 22:24pm    
Dude OP Said Unicode

Unicode is 2 bytes long, UTF-8 is variable length
Jochen Arndt 9-Jan-12 4:03am    
He said Unicode in the title and stated more precisely UTF-8 in the question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)