Unicode Buddy

yetibrain

Nov 6, 2016

CPOL

Unicode Buddy is a tool for inspecting Unicode files. It can detect orphaned surrogates and invalid UTF-8 sequences, and it can show how a given code point is encoded and decoded. It is a viewer, not an editor.

Introduction

Once the bit had been invented, the nibble and the octet (byte) soon followed. A byte could now be a processor opcode, a parameter, a number or an ASCII code. Because people needed to view characters and strings, i.e. text, ASCII was invented with a 1:1 relation between byte and character. ASCII itself uses only 7 bits, though. The highest bit was left for other purposes, so the language- and country-specific code pages later placed their additional characters in the range 128-255.

For ASCII, the 128 codes from 0 to 127 (including the non-printable control characters) were sufficient to produce, store and print text. The characters a-z, A-Z, @,<>{}[]()/\;,.: etc. are still used around the world today. However, other countries complained that their own alphabets and characters could not be represented in ASCII, which offers only the unaccented Latin letters.

So people invented Unicode. At first we might think that this is just a change from a byte to a word, i.e. from 8 bits to 16 bits, but that is not the case: Unicode assigns numbers (code points) to characters in the range 0 to 0x10FFFF. Some of these numbers are reserved for special purposes (the surrogates, 0xD800 to 0xDFFF). The ASCII assignments were not changed, so ASCII characters still have the numbers 0 to 127, while the numbers from 128 to 0x10FFFF (except the surrogates) are available for all other characters and symbols, although not every one of them has been assigned yet. Best of all, characters from any script can be mixed within a single file or document.

Of course, characters above 255 cannot be stored in a byte, so we might use a 16-bit word to store them. That is what is commonly done, especially in memory: where strings used to be byte arrays, word arrays are now widely used. Put differently, the char array has become a wchar array. It is not that the character itself is "wide"; the number of a Unicode code point can simply be much higher than 255, so it does not fit into a single byte. Be aware that characters above 0xFFFF do not fit into a 16-bit word or wchar either! This is why surrogates were invented. We will come to them later.

Because Unicode code points go up to 0x10FFFF, one might think it best to store text as an array of 32-bit DWORDs. That would give us the same old 1:1 relation as in ASCII times: every DWORD simply holds the number of the character. But people who mostly use ASCII characters would waste 3 bytes for every ASCII character that way, so the Unicode Transformation Formats were invented: UTF-32, UTF-16 and UTF-8. All three can store the entire Unicode range, but they use different encoding/decoding algorithms to store the number of a Unicode character. To tell which encoding a file actually uses, the BOM (byte order mark) was invented. It is a signature of 3 bytes (UTF-8), 2 bytes (UTF-16) or 4 bytes (UTF-32) at the very beginning of the file that acts like a "header" for the file, and I really recommend placing such a BOM in a Unicode file.
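To make the size trade-off between the three transformation formats concrete, here is a minimal Python sketch. It is only an illustration and has nothing to do with how Unicode Buddy is implemented; the function name `encoded_sizes` is made up for this example, and it relies on Python's built-in codecs.

```python
def encoded_sizes(code_point: int) -> dict:
    """Return how many bytes each UTF needs for a single code point."""
    ch = chr(code_point)
    return {
        # The explicit little-endian codecs are used so that Python
        # does not prepend a BOM to the encoded bytes.
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, a Latin-1 letter, the euro sign and an emoji.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", encoded_sizes(cp))
```

For the ASCII letter U+0041 this reports 1, 2 and 4 bytes, which is exactly the "3 wasted bytes" of UTF-32 mentioned above. For U+1F600 all three formats need 4 bytes.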

To reveal information about text files with different encodings, people usually open them in a hex editor, because it displays each byte as its hexadecimal value from 0x00 to 0xFF. That makes it easy to spot, for example, the 3-byte UTF-8 BOM (EF BB BF) at the beginning of a file.
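What the eye does in a hex editor can also be done in a few lines of code. The following sketch checks the first bytes of a buffer against the known BOM signatures. The name `detect_bom` is an assumption of this example, not a function of Unicode Buddy; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longer signatures must be tested first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfHello"))  # UTF-8 signature followed by text
```

A file without a BOM yields `(None, 0)`, in which case the encoding has to be guessed or supplied by the user.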

Background

Many hex editors still show a 1:1 relation between byte and character, even when the file's encoding has been detected successfully by examining the BOM. But there is no longer a 1:1 relation between a byte and a character if, for example, the file is encoded as UTF-8. UTF-8 keeps single bytes for characters <= 127, but everything > 127 is stored as a 2-byte, 3-byte or 4-byte sequence. The bytes of such a sequence no longer map 1:1 to the Unicode character; the sequence has to be decoded. In UTF-16 there is likewise no 1:1 relation between a byte and a character. Instead there is a 1:1 relation between a 16-bit word and a character, but even this holds only for characters of the BMP (Basic Multilingual Plane), i.e. code points up to 0xFFFF. Characters beyond the BMP are encoded as surrogate pairs. Last but not least, UTF-32 offers a 1:1 relation between a DWORD and the Unicode code point.

With all these encodings, a hex editor should not show characters in its text column that map 1:1 to the bytes of the opened file; it should show the characters as decoded with the underlying encoding. This is what MadEdit does, and it inspired me very much to create something similar. I did not end up with a hex editor, though, but with a tool I called Unicode Buddy from the start. Unicode Buddy does not allow editing the bytes of the file; it is purely a viewer application. However, it shows in detail how a character is encoded.
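The decoding step described above is plain bit arithmetic. The sketch below shows how a code point is turned into a UTF-16 surrogate pair and into a UTF-8 byte sequence. It is only an illustration of the encoding rules, not the code used inside Unicode Buddy, and all function names are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a surrogate pair")
    offset = code_point - 0x10000        # remaining 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a valid (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_bytes(cp: int) -> bytes:
    """Encode one code point as a UTF-8 sequence of 1 to 4 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx followed by 3 x 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")        # D83D DE00
print("U+20AC  ->", utf8_bytes(0x20AC).hex(" "))  # e2 82 ac
```

The start byte of a UTF-8 sequence carries the sequence length in its leading bits, and every following byte begins with the bits 10. This is what lets a viewer tell start bytes and following bytes apart.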

Points of Interest

Sometimes developers and users have problems getting the proper characters of a file displayed. Often there is a mismatch between the encoding of the file and the decoding used by the displaying application. Sometimes it is just the font that lacks the glyph for the Unicode character in question, so a default glyph is displayed instead (an empty rectangle, a question mark etc.). Developers often have to look at the BOM to find out the proper encoding of a file, and even when there is a BOM, the file may still be corrupt. A UTF-8 encoded file can, for example, contain invalid sequences. If an ANSI file gets interpreted as UTF-8, the first byte above 127 will usually cause problems, because UTF-8 treats it as a start byte or a following byte, not as a single character. Surrogates are not allowed in UTF-8 and UTF-32, because there is no point in using them there.

Unicode Buddy can reveal this kind of information: it shows surrogate pairs and orphaned surrogates, it detects invalid UTF-8 sequences, and it has built-in filter functions to view sequences only, surrogates only etc. There is also a small statistics panel that can be helpful in getting information about the content of the file. Be aware that Unicode Buddy is currently limited to a certain file size, because it uses the Windows ListView control, which can only display a limited number of rows. Still, Unicode Buddy can help to reveal information about encoded Unicode characters and to learn how the individual encodings and decodings work. To view characters that need unusual fonts, Unicode Buddy can be configured so that special fonts are assigned to specific Unicode blocks.
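The two kinds of corruption mentioned here can be checked with very little code. The following Python sketch is only meant to illustrate the idea; it is not taken from Unicode Buddy, and both function names are hypothetical. The first function scans a list of 16-bit UTF-16 code units for orphaned surrogates, the second relies on Python's strict UTF-8 decoder to report the offset of the first invalid byte.

```python
def find_orphaned_surrogates(units):
    """Return the indexes of surrogates that are not part of a valid pair."""
    orphans = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                      # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                                 # valid pair, skip both
                continue
            orphans.append(i)                          # no low surrogate follows
        elif 0xDC00 <= u <= 0xDFFF:                    # low surrogate without a high one
            orphans.append(i)
        i += 1
    return orphans


def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")                           # strict decoding
    except UnicodeDecodeError as err:
        return err.start
    return None


# 'A', a valid pair, then a lone low and a lone high surrogate.
print(find_orphaned_surrogates([0x41, 0xD83D, 0xDE00, 0xDE00, 0xD83D]))

# An ANSI (Windows-1252) file read as UTF-8 fails at the first byte > 127.
ansi = "café".encode("cp1252")
print(first_invalid_utf8_offset(ansi))
```

Python's strict decoder also rejects surrogates that were encoded as UTF-8 bytes (for example ED A0 80), which matches the rule that surrogates are not allowed in UTF-8.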

History

First version with the features described above. More may follow.