I have a file which contains some data. That data is encoded in UTF-8 (without a BOM).

Those bytes are usually no problem to handle. Yet now that file contains a byte sequence whose meaning I don't know (and I couldn't find any information about it either).

To examine the data I opened the file in a hex editor. There were UTF-8 byte sequences which were perfectly normal (C3 BC for ü, C3 B6 for ö, etc.).

Yet then there was the following sequence, and I don't know how it maps to the expected char:

C3 83 EF BF BF

From the context I can gather that it should represent the character ü. Yet I've no idea how you could possibly get to that sequence...


Example of how this looks in the file (Hex View):
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 83 EF BF BF 20 69 74 20
73 68 6F 75 6C 64 20 70 72 6F 62 61 62 6C 79 20
72 65 70 72 65 73 65 6E 74 20 74 68 65 20 63 68
61 72 20 C3 BC


Actual text (UTF-8):

Test with char "ue" -> 

Now that strange sequence: Ã it should probably represent the char ü

(Well, it looks like CP won't let me display the decoded value of EF BF BF ;) )

I've highlighted the corresponding sections in the Hex View and in the text representation.

Now the question:

What should C3 83 EF BF BF represent? I suppose C3 83 translates fine to Ã, but what is EF BF BF? The only thing I found was that if you convert the char 0xFFFF to UTF-8, EF BF BF is the byte sequence that you get. But still: what exactly should it represent?
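
A quick way to check what the five bytes decode to is a few lines of Python. This is only a sketch; the variable names are illustrative and not taken from any program involved here.

```python
# Decode the suspicious byte sequence as UTF-8 and list the code points.
data = bytes.fromhex("C3 83 EF BF BF")
text = data.decode("utf-8")

# Format every decoded character as U+XXXX.
codepoints = [f"U+{ord(ch):04X}" for ch in text]
print(codepoints)  # ['U+00C3', 'U+FFFF']
```

So the sequence is valid UTF-8 for two code points: U+00C3 (Ã) followed by U+FFFF.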

1 solution

I think your sequence C3 83 EF BF BF is the result of a second UTF-8 encoding applied to the byte sequence C3 BC after it was read as single-byte "ANSI" characters.

Let me explain:
1) When the "ANSI" char C3 is converted to UTF-8, you get C3 83.
2) If BC is not defined in the codepage, the Unicode result might be FF FF.
3) Encoding that Unicode result to UTF-8 generates EF BF BF.

In conclusion:
C3 BC is converted to Unicode using a codepage (I don't know which one, but not UTF-8).
This results in C3 00 FF FF, i.e. U+00C3 U+FFFF as UTF-16 little-endian bytes (because BC is not known in the codepage used).
This result is then encoded from Unicode to UTF-8 as
C3 83 EF BF BF

I think the error is in the program generating your source file.
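
The conversion chain described above can be sketched in Python. The codepage actually used is unknown, so `misdecode` is a hypothetical stand-in: it maps each byte to the code point with the same value and treats the bytes listed in `unknown` as undefined, substituting U+FFFF for them. That substitution is an assumption made to match the observed bytes, not the documented behaviour of any particular codepage or library.

```python
def misdecode(raw, unknown):
    """Hypothetical single-byte decoder: one byte becomes one code point.

    Bytes listed in `unknown` stand in for values the (unidentified)
    codepage does not define; they become U+FFFF (assumption).
    """
    return "".join("\uffff" if b in unknown else chr(b) for b in raw)

original = "ü".encode("utf-8")                      # correct UTF-8: C3 BC
wrong_text = misdecode(original, unknown={0xBC})    # U+00C3 U+FFFF
utf16_bytes = wrong_text.encode("utf-16-le")        # intermediate Unicode form
double_encoded = wrong_text.encode("utf-8")         # second UTF-8 encoding

print(original.hex(" ").upper())        # C3 BC
print(utf16_bytes.hex(" ").upper())     # C3 00 FF FF
print(double_encoded.hex(" ").upper())  # C3 83 EF BF BF
```

Under that assumption the sketch reproduces both the intermediate C3 00 FF FF and the final C3 83 EF BF BF found in the file.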
   
Comments
Nicholas Marty 4-Dec-13 7:53am
   
Yeah, most likely the program generating that file is at fault somewhere.

My missing link was that I didn't know (or maybe also forgot) that when converting an unknown character this might result in "FF FF". (I was at least pretty close in finding out that FF FF translates to EF BF BF in UTF-8 ;) )

Your explanation would very well explain the problem here. So thanks for that :)
Pascal-78 4-Dec-13 9:01am
   
In fact, "U+FFFD" is the specific code for the "replacement character", which is used to replace an unknown or unrepresentable character, and "FF FF" (U+FFFF) is a noncharacter that is not allowed as a Unicode character in exchanged text. Maybe it's another mistake of the program generating the file.
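
To make the difference visible, here is a small Python sketch comparing the UTF-8 bytes of the two code points. It also shows that Python's own decoder, when asked to replace invalid input, emits U+FFFD rather than U+FFFF; other tools may behave differently.

```python
# UTF-8 bytes of the replacement character and of the noncharacter U+FFFF.
replacement = "\ufffd".encode("utf-8")
nonchar = "\uffff".encode("utf-8")
print(replacement.hex(" ").upper())  # EF BF BD
print(nonchar.hex(" ").upper())      # EF BF BF

# A lone BC byte is invalid UTF-8; errors="replace" substitutes U+FFFD.
decoded = b"\xbc".decode("utf-8", errors="replace")
print(f"U+{ord(decoded):04X}")       # U+FFFD
```

The two encodings differ only in the last byte (BD versus BF), so the file really contains U+FFFF and not the usual replacement character.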

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



