Handling simple text files in C/C++

11 Feb 2012, CPOL, 2 min read
Recent questions on reading ANSI versus Unicode text prompted the following tip.
This code takes a block of text read from a text file in one of several encodings and preprocesses it into the form required by a program. The source text can be ANSI, UTF-8, UTF-16 LE (what Windows calls "Unicode") or UTF-16 BE ("Unicode Big Endian"). The code below converts the text to UTF-16 when the project is compiled for Unicode, or to UTF-8 when it is compiled for MBCS support.

The encoding of the source text is identified by the Byte Order Mark (BOM) at the beginning of the buffer. In the absence of a BOM the data is assumed to be pure ANSI, although there are other tools and Win32 API functions that can help in the determination, as described in the following
MSDN link: Unicode and Character Sets
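
When there is no BOM, one cheap check is whether the buffer holds only 7-bit bytes: such text reads identically as ASCII, ANSI or UTF-8, so the "pure ANSI" assumption is harmless. The following is a minimal sketch of that check, not part of the original tip; `IsSevenBitAscii` is a hypothetical name, and a `false` result only means that this test cannot tell which encoding is in use.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the article's code).
// Returns true when every byte is in the range 0x00 to 0x7F, in which case
// the text means the same thing as ASCII, as ANSI and as UTF-8.
bool IsSevenBitAscii(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    for (std::size_t i = 0; i < size; ++i)
    {
        if (data[i] > 0x7F)
        {
            return false;   // high bit set: could be ANSI or UTF-8
        }
    }
    return true;
}
```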

Character types and Byte Order Marks are defined as follows:

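  1. ANSI
    No signature, single byte characters. Only the range 0x00 to 0x7F (ASCII) is the same in every code page; because the code below decodes unmarked text as UTF-8 in a Unicode build, bytes from 0x80 to 0xFF in a true ANSI file will not convert as intended.
  2. UTF-8
    Signature = 3 bytes: 0xEF 0xBB 0xBF
    followed by single-byte and multi-byte sequences, as described in the following link
    UTF Information.
  3. UTF-16 LE (Little Endian), the native wide-character format on Windows, where it is typically just called "Unicode".
    Signature = 2 bytes: 0xFF 0xFE (or 1 word 0xFEFF)
    followed by words:
    0x0000 to 0x007F for normal 0-127 ASCII chars.
    0x0080 to 0xFFFF for all other characters, with 0xD800 to 0xDFFF reserved for the surrogate pairs that encode characters above U+FFFF.
  4. UTF-16 BE (Big Endian). This is historically associated with big-endian platforms such as older Macintosh systems.
    Signature = 2 bytes: 0xFE 0xFF (or 1 word 0xFFFE when read on a little-endian machine)
    followed by words as UTF-16 but with bytes reversed.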
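
The signatures listed above can be checked byte by byte, which avoids any dependence on the byte order of the machine doing the check. This is a minimal sketch under that approach rather than the method used in the function below; `TextEncoding` and `DetectBom` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names: neither the enum nor the function is part of the article's code.
enum class TextEncoding { NoBom, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
// The size checks ensure that no byte beyond the end of the buffer is read.
TextEncoding DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    {
        return TextEncoding::Utf8;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    {
        return TextEncoding::Utf16LE;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    {
        return TextEncoding::Utf16BE;
    }
    return TextEncoding::NoBom;     // treated as ANSI by this tip
}
```

Like the tip itself, this sketch ignores UTF-32, whose little-endian signature also begins with 0xFF 0xFE.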


Following comments from MilanA, I have modified the code to always
return a newly allocated buffer, even when no conversion has taken place.

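The input buffer into which the text is read must be followed by two null bytes to mark the end of the text block (even if it is ANSI). Note that big-endian input is byte-swapped in place, so the input buffer is modified. The calling routine is responsible for disposing of both buffers, the input one and the returned one, when they are no longer required; the returned buffer is allocated with new[] and so must be released with delete[].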
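
One way to meet the two-null-byte requirement is to read the file in binary mode and append the terminator before calling the function. The sketch below is an assumption about how a caller might do this, using only the standard library; `ReadTextFile` is a hypothetical helper, not part of the original tip.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file as raw bytes and append the two
// null bytes that the conversion routine expects at the end of the text.
// If the file cannot be opened, the result holds only the two null bytes.
std::vector<unsigned char> ReadTextFile(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);     // binary: no newline translation
    std::vector<unsigned char> buffer;
    if (file)
    {
        buffer.assign(std::istreambuf_iterator<char>(file),
                      std::istreambuf_iterator<char>());
    }
    buffer.push_back(0);    // two null bytes terminate ANSI, UTF-8
    buffer.push_back(0);    // and UTF-16 text alike
    assert(buffer.size() >= 2);
    return buffer;
}
```

On Windows the result could then be passed as `Normalise(buffer.data())`. The vector releases the input buffer automatically, and holding the returned pointer in a `std::unique_ptr<TCHAR[]>` would do the same for the output buffer.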

C++
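#include <windows.h>    // PTSTR, PBYTE, MultiByteToWideChar, WideCharToMultiByte
#include <string.h>     // strlen, memcmp, memcpy_s
#include <wchar.h>      // wcslen

PTSTR Normalise(PBYTE pBuffer)
{
    PTSTR   ptText;     // pointer to the text char* or wchar_t* depending on UNICODE setting
    PWSTR   pwStr;      // pointer to a wchar_t buffer
    int     nLength;    // length of the text, including the terminator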
    
    // obtain a wide character pointer to check BOMs
    pwStr = reinterpret_cast<PWSTR>(pBuffer);
    
    // check if the first word is a Unicode Byte Order Mark
    if (*pwStr == 0xFFFE || *pwStr == 0xFEFF)
    {
        // Yes, this is Unicode data
        if (*pwStr++ == 0xFFFE)
        {
            // BOM says this is Big Endian so we need
            // to swap bytes in each word of the text
            while (*pwStr)
            {
                // swap bytes in each word of the buffer
                WCHAR	wcTemp = *pwStr >> 8;
                wcTemp |= *pwStr << 8;
                *pwStr = wcTemp;
                ++pwStr;
            }
            // point back to the start of the text
            pwStr = reinterpret_cast<PWSTR>(pBuffer + 2);
        }
#if !defined(UNICODE)
        // This is a non-Unicode project so we need
        // to convert wide characters to multi-byte
        
        // get calculated buffer size
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, NULL, 0, NULL, NULL);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to multi-byte characters
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, ptText, nLength, NULL, NULL);
#else
        nLength = static_cast<int>(wcslen(pwStr)) + 1;  // if Unicode, then copy the input text
        ptText = new WCHAR[nLength];                    // to a new output buffer
        nLength *= sizeof(WCHAR);                       // adjust to size in bytes
        memcpy_s(ptText, nLength, pwStr, nLength);
#endif
    }
    else
    {
        // The text data is UTF-8 or Ansi
#if defined(UNICODE)
        // This is a Unicode project so we need to convert
        // multi-byte or Ansi characters to Unicode.

        // skip the UTF-8 BOM, if present, so that it is not
        // converted into a leading U+FEFF character
        PCSTR pszSource = reinterpret_cast<PCSTR>(pBuffer);
        if (pBuffer[0] == 0xEF && pBuffer[1] == 0xBB && pBuffer[2] == 0xBF)
        {
            pszSource += 3;
        }
        // get calculated buffer size
        nLength = MultiByteToWideChar(CP_UTF8, 0, pszSource, -1, NULL, 0);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to Unicode characters
        nLength = MultiByteToWideChar(CP_UTF8, 0, pszSource, -1, ptText, nLength);
#else
        // This is a non-Unicode project so we just need
        // to skip the UTF-8 BOM, if present.
        // The bytes are tested one at a time so that the check
        // stops at the terminator of a very short buffer.
        if (pBuffer[0] == 0xEF && pBuffer[1] == 0xBB && pBuffer[2] == 0xBF)
        {
            // UTF-8
            pBuffer += 3;
        }
        nLength = static_cast<int>(strlen(reinterpret_cast<PSTR>(pBuffer))) + 1;  // if UTF-8/ANSI, then copy the input text
        ptText = new char[nLength];                             // to a new output buffer
        memcpy_s(ptText, nLength, pBuffer, nLength);
#endif
    }
    
    // return pointer to the (possibly converted) text buffer.
    return ptText;
}
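
The function above swaps the bytes of big-endian text in place and relies on the host being little-endian, as Windows PCs are. For comparison, here is a minimal portable sketch that builds each 16-bit code unit from its two bytes explicitly, so it leaves the input untouched and does not depend on host byte order. `DecodeUtf16Units` is a hypothetical name, and the sketch assumes the BOM has already been skipped and that the byte count is even.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turn raw UTF-16 bytes into 16-bit code units.
// bigEndian selects which byte of each pair is the high-order byte.
std::vector<std::uint16_t> DecodeUtf16Units(const unsigned char* bytes,
                                            std::size_t size,
                                            bool bigEndian)
{
    assert(size % 2 == 0);      // UTF-16 text is a whole number of 16-bit units
    std::vector<std::uint16_t> units;
    units.reserve(size / 2);
    for (std::size_t i = 0; i + 1 < size; i += 2)
    {
        const std::uint16_t first = bytes[i];
        const std::uint16_t second = bytes[i + 1];
        units.push_back(bigEndian
            ? static_cast<std::uint16_t>((first << 8) | second)
            : static_cast<std::uint16_t>((second << 8) | first));
    }
    return units;
}
```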

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Retired
United Kingdom
I was a Software Engineer for 40+ years starting with mainframes, and moving down in scale through midi, UNIX and Windows PCs. I started as an operator in the 1960s, learning assembler programming, before switching to development and graduating to COBOL, Fortran and PLUS (a proprietary language for Univac systems). Later years were a mix of software support and development, using mainly C, C++ and Java on UNIX and Windows systems.

Since retiring I have been learning some of the newer (to me) technologies (C#, .NET, WPF, LINQ, SQL, Python ...) that I never used in my professional life, and am actually able to understand some of them.

I still hope one day to become a real programmer.
