[Answering a follow-up question about encoding]
Text files come with different
encodings. When you read a file in one encoding while assuming a different one, you may or may not corrupt the content. It depends on the actual set of code points used in the file: for example, if all code points lie in the ASCII range, UTF-8 and an "ANSI" code page give byte-for-byte identical results, provided UTF-8 without a BOM is used. Encoding issues are handled by the class
System.Text.Encoding
and its derived classes; see
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx[^].
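To illustrate the point about ASCII surviving an encoding mismatch while other text does not, here is a small sketch. It is in Python only for illustration (the answer itself discusses .NET, where System.Text.Encoding and its GetBytes/GetString methods play the analogous role):

```python
# Decoding bytes with the wrong encoding garbles non-ASCII text,
# while pure ASCII survives, because UTF-8 is ASCII-compatible.

ascii_text = "plain ASCII"
accented = "café"  # contains U+00E9, outside the ASCII range

# Pure ASCII: UTF-8 bytes decoded as a Latin-1 ("ANSI"-style) code page
# come back identical.
assert ascii_text.encode("utf-8").decode("latin-1") == ascii_text

# Non-ASCII: the same round trip produces mojibake, because the two
# UTF-8 bytes of 'é' are read as two separate Latin-1 characters.
garbled = accented.encode("utf-8").decode("latin-1")
print(garbled)  # "cafÃ©"
```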
Encodings are roughly subdivided into the Unicode UTFs and the encodings used before Unicode. .NET supports Unicode. Unicode at its core does not define encodings. It defines a one-to-one correspondence between characters as cultural entities, regardless of fonts, glyphs, etc., and a set of abstract mathematical numbers (code points), totally abstracted from their computer representation, such as size, little- or big-endian byte order, etc. Instead, a set of different UTF encodings is defined on top of the code points.
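The separation between abstract code points and concrete UTF encodings can be seen directly. Again, a Python sketch for illustration only; .NET exposes the same encodings through Encoding.UTF8, Encoding.Unicode, and Encoding.UTF32:

```python
# One abstract code point, several concrete byte representations.
# U+1F600 lies above U+FFFF, so it needs 4 bytes in UTF-8 and a
# surrogate pair (also 4 bytes) in UTF-16.
ch = "\U0001F600"
print(hex(ord(ch)))                 # 0x1f600 — the code point itself
print(len(ch.encode("utf-8")))      # 4 bytes in UTF-8
print(len(ch.encode("utf-16-le")))  # 4 bytes in UTF-16 (surrogate pair)
print(len(ch.encode("utf-32-le")))  # 4 bytes in UTF-32 (always fixed width)
print(len("A".encode("utf-8")))     # 1 byte — an ASCII-range code point
```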
Unicode is not a 16-bit encoding! It standardizes many more code points than 2^16. Even so, UTF-8 and UTF-16 support them all (so a character is not one byte or two bytes: in UTF-8 it takes 1 to 4 bytes, and in UTF-16 one or two 16-bit code units). The different UTFs are convenient in different situations. A text file may or may not start with a Byte Order Mark (BOM), which helps to recognize the UTF encoding before reading the rest of the file.
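BOM detection amounts to comparing the first few bytes of the file against the known BOM signatures. A minimal Python sketch (the function name `sniff_bom` is mine; .NET's StreamReader does this automatically when `detectEncodingFromByteOrderMarks` is true):

```python
# Sketch of BOM detection. Only files that actually carry a BOM can be
# identified this way; a BOM-less file requires heuristics or prior knowledge.
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    # Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
    # UTF-16 LE BOM (FF FE), so it must be checked first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no BOM: the encoding must be guessed or known in advance

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8-sig
print(sniff_bom(b"\xff\xfeh\x00i\x00"))       # utf-16-le
print(sniff_bom(b"plain bytes"))              # None
```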
For more information, see:
http://unicode.org/[^],
http://unicode.org/faq/utf_bom.html[^].
—SA