Question about Hebrew ANSI to Unicode

I have a Hebrew ANSI text file that I need to convert to a Unicode Hebrew file. The conversion runs, but I am not getting the expected output. Please let me know how to do it.

What I have tried:

C++

//code page
int nlanguageCodePage = this->GetCodepage(lpszOldFileName);

while (fgets(chAnsiBuff, NMLANG_MaxNBuf, pFile) != NULL)
{
    sUnicodeBuff = chAnsiBuff;

    //CONVERTING TO UNICODE
    nSize = MultiByteToWideChar(nlanguageCodePage, 0, sUnicodeBuff, -1, NULL, NULL);
    MultiByteToWideChar(nlanguageCodePage, 0, sUnicodeBuff, -1, chUniocodeBuff, nSize);

    // bom at starting
    if (nBOM == 0) { arcOut.Write(&bom, 2); }
    arcOut.WriteString(chUniocodeBuff);

    nBOM++;
}

Posted 15-Nov-16 22:13pm

Member 12677926

Updated 15-Nov-16 22:52pm

OriginalGriff

v3


Jochen Arndt · Answer 1 · 2016-11-15T22:52:00

Solution 1

~~You are using the same buffer for input and output. That won't work.~~ See the documentation for the MultiByteToWideChar function (Windows).

It should be like this:

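// First call: pass no output buffer (size 0) to get the required size in wide characters,
// including the terminating null because the input length is -1
int nSize = MultiByteToWideChar(nlanguageCodePage, 0, chAnsiBuff, -1, NULL, 0);
LPWSTR sUnicodeBuff = new WCHAR[nSize];
// Second call: perform the conversion into the allocated buffer
MultiByteToWideChar(nlanguageCodePage, 0, chAnsiBuff, -1, sUnicodeBuff, nSize);
// Use sUnicodeBuff here
delete [] sUnicodeBuff;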

However, when the ANSI input buffer has a fixed size, a fixed-size output buffer with the same number of elements can be used as well, because the Unicode string will never have more wide characters than the number of ANSI characters in the input string:

C++

// Output buffer with as many wide characters as the ANSI buffer has bytes
WCHAR wUnicodeBuff[NMLANG_MaxNBuf];
while (fgets(chAnsiBuff, NMLANG_MaxNBuf, pFile) != NULL)
{
    MultiByteToWideChar(nlanguageCodePage, 0, chAnsiBuff, -1, wUnicodeBuff, NMLANG_MaxNBuf);

    // bom at starting
    if (nBOM == 0) { arcOut.Write(&bom, 2); }
    arcOut.WriteString(wUnicodeBuff);

    nBOM++;
}

That should work. If the result is not as expected, check your other involved functions like arcOut.WriteString(), if the BOM is correct, and if your input file is really encoded with the code page nlanguageCodePage.
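The BOM check mentioned above can be done on the first bytes of the output file. This is a minimal sketch, assuming the file is meant to be UTF-16LE (what Windows calls "Unicode"); the helper name `hasUtf16LeBom` is hypothetical and not part of the original code.

```cpp
#include <vector>

// Hypothetical helper: returns true when the buffer starts with the
// UTF-16LE byte order mark, i.e. U+FEFF stored as the bytes 0xFF 0xFE.
bool hasUtf16LeBom(const std::vector<unsigned char>& bytes)
{
    return bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE;
}
```

If this returns false for the first bytes of the converted file, the `bom` variable or the `arcOut.Write(&bom, 2)` call is the first thing to look at.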

[EDIT]
Another possible source of the problem is the arcOut.WriteString() call, if it converts the Unicode string back to ANSI. In that case, use a binary write instead:

C++

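// With an input length of -1, the returned length includes the terminating null
int len = MultiByteToWideChar(nlanguageCodePage, 0, chAnsiBuff, -1, wUnicodeBuff, NMLANG_MaxNBuf);

// bom at starting
if (nBOM == 0) { arcOut.Write(&bom, 2); }
// Write the characters only, not the terminating null
if (len > 1)
    arcOut.Write(wUnicodeBuff, (len - 1) * sizeof(WCHAR));

nBOM++;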

[/EDIT]
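The binary-write idea can be tried out independently of the archive class. The following is only a sketch: it uses standard C++ streams in place of `arcOut` (whose class is not shown in the question), and the names `Utf16LeWriter` and `encodeUtf16Le` are hypothetical. It writes the BOM once and then each UTF-16 code unit as two bytes, low byte first, without any terminating null.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for arcOut: writes UTF-16LE to a binary std::ostream.
class Utf16LeWriter
{
public:
    explicit Utf16LeWriter(std::ostream& out) : out_(out), bomWritten_(false) {}

    void writeLine(const std::u16string& line)
    {
        // The BOM goes out exactly once, before the first line.
        if (!bomWritten_) {
            out_.put('\xFF');
            out_.put('\xFE');
            bomWritten_ = true;
        }
        // Each code unit becomes two bytes, low byte first (little-endian).
        // Only line.size() units are written, so no terminating null is stored.
        for (char16_t unit : line) {
            out_.put(static_cast<char>(unit & 0xFF));
            out_.put(static_cast<char>((unit >> 8) & 0xFF));
        }
    }

private:
    std::ostream& out_;
    bool bomWritten_;
};

// Convenience wrapper for experiments: encode several lines into one byte string.
std::string encodeUtf16Le(const std::vector<std::u16string>& lines)
{
    std::ostringstream out(std::ios::binary);
    Utf16LeWriter writer(out);
    for (const std::u16string& line : lines)
        writer.writeLine(line);
    return out.str();
}
```

Writing the bytes explicitly, rather than dumping the in-memory buffer, keeps the output little-endian regardless of the machine the code runs on.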

Posted 15-Nov-16 22:52pm

Jochen Arndt

Updated 16-Nov-16 0:06am

v3

Comments

Member 12677926 16-Nov-16 5:34am

Still not working. Could you please put together a sample and let me know if it works?

Member 12677926 16-Nov-16 5:34am

Thanks, and please give any samples.

Jochen Arndt 16-Nov-16 6:01am

You should give a more detailed problem description ("not working as expected" does not tell others anything).

I have used MultiByteToWideChar quite often and never had problems (but not with Hebrew so far). So I expect the error source somewhere else.

Is MultiByteToWideChar() returning an error or is the file content not as expected?

What is the value of nlanguageCodePage?

How is WriteString() defined? Is it a library function or written by you?
I'm asking this because that function might convert the passed string back to ANSI using the current code page. If so, use a binary write:
arcOut.Write(wUnicodeBuff, length_returned_by_MultiByteToWideChar * sizeof(WCHAR));
I will update my answer regarding this.

Finally you may give a short example (a Hebrew text line, the corresponding hex dump, and the hex dump from the output file).

Member 12677926 16-Nov-16 6:06am

I ported the translator from ANSI to Unicode. Previously, they kept the locale set to Hebrew, and the characters shown that way are different from the text shown in the Unicode file.

Member 12677926 16-Nov-16 6:10am

I have an ANSI file that contains some Hebrew characters, as shown below:

ID_TEST ="כיצד מתחברים ניתן לקבוע חריץ באם נחוץ"

When I converted it to Unicode, it shows the same characters as the ANSI file, meaning the ANSI and Unicode characters above are the same. But previously, when they kept the locale as Hebrew, it showed something different. The text shown with the Hebrew locale and the string converted from ANSI to Unicode are not the same.

Member 12677926 16-Nov-16 6:14am

Now that I have ported to Unicode, there is no need to keep the locale. When converting from ANSI to Unicode, will the Unicode file show the correct characters (that is, different from the ANSI ones) or the same ones?

ID_TEST = How does your connect? You can specify a dado or adjustment if needed
This is my English test.

Member 12677926 16-Nov-16 6:14am

My doubt is: when converting from ANSI to Unicode, will the same text appear in the Unicode file, or the correct text? Please help me out.

Jochen Arndt 16-Nov-16 6:25am

Appear where?
From within your program you can print out the Unicode text using a Unicode-aware print function (or any print function if your application is built as Unicode).

The created file will (or shall) be Unicode. So you have to use a viewer or editor that supports Unicode files.

To check the files, use a hex editor. ANSI and Unicode files are quite different for the same text.
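If no hex editor is at hand, a few lines of code can produce the dump. This is a rough sketch; the helper names `readFileBytes` and `hexDump` are hypothetical, and the file paths are whatever your input and output files are called.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file as raw bytes (binary mode, no translation).
std::vector<unsigned char> readFileBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> raw((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    return std::vector<unsigned char>(raw.begin(), raw.end());
}

// Hypothetical helper: format up to maxBytes leading bytes as two-digit
// uppercase hex values separated by single spaces.
std::string hexDump(const std::vector<unsigned char>& bytes, std::size_t maxBytes = 16)
{
    std::string text;
    for (std::size_t i = 0; i < bytes.size() && i < maxBytes; ++i) {
        char cell[4];
        std::snprintf(cell, sizeof cell, "%02X", static_cast<unsigned>(bytes[i]));
        if (i != 0)
            text += ' ';
        text += cell;
    }
    return text;
}
```

Calling `hexDump(readFileBytes(path))` on the ANSI file and on the converted file shows the difference directly: the ANSI file has one byte per Hebrew letter, while the Unicode file should start with the BOM and have two bytes per letter.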

Member 12677926 16-Nov-16 6:34am

I am getting the data for both ANSI and Unicode. I am converting from ANSI to Unicode; is code page 1255 correct?

Member 12677926 16-Nov-16 6:15am

I am determining the ANSI file's language code page simply from the file name.

Jochen Arndt 16-Nov-16 6:31am

So what is the number?

For Hebrew I would expect 1255 (Windows) or 28598/ 38598 (ISO).

But you have to know which application created the file (e.g. the Windows system code page that was selected when the file was created). There is no reliable way to detect the code page of an ANSI file. It must be known.

Member 12677926 16-Nov-16 6:33am

Yes, I hard-coded that part. It is coming out as 1255, but I still have the same problem.

Member 12677926 16-Nov-16 6:43am

please help me out

Jochen Arndt 16-Nov-16 7:02am

Please post the first hex bytes of the input and output file. Seeing these should help to know what is going wrong.
Example:
The HEBREW LETTER ALEF has code 0xE0 in code page 1255.
The corresponding Unicode code point is U+05D0, which is stored in UTF-16LE as the bytes 0xD0 0x05.
So if you create an input file with a single byte of 0xE0, the output file should contain 0xFF 0xFE 0xD0 0x05 (the BOM followed by the letter).

To find out where the failure occurs, create such an input file (or use a constant string for testing instead of reading from file), perform the conversion, check the content of the Unicode string (bytes 0xD0 and 0x05), and finally check the contents of the output file.
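The single-letter test can also be written down as a small self-contained check. The sketch below deliberately does not call MultiByteToWideChar, so that it can run anywhere. Instead it uses a hypothetical function `cp1255ToUtf16` that handles only ASCII and the Hebrew letters 0xE0 to 0xFA, which code page 1255 maps to U+05D0 to U+05EA. It is not a full code page 1255 table, and `toUtf16LeFile` is likewise an assumed helper name.

```cpp
#include <string>
#include <vector>

// Hypothetical, partial stand-in for MultiByteToWideChar(1255, ...):
// ASCII passes through, Hebrew letters 0xE0..0xFA map to U+05D0..U+05EA,
// and every other byte becomes U+FFFD because this sketch has no table for it.
std::u16string cp1255ToUtf16(const std::string& ansi)
{
    std::u16string wide;
    for (unsigned char ch : ansi) {
        if (ch < 0x80)
            wide += static_cast<char16_t>(ch);
        else if (ch >= 0xE0 && ch <= 0xFA)
            wide += static_cast<char16_t>(0x05D0 + (ch - 0xE0));
        else
            wide += static_cast<char16_t>(0xFFFD);
    }
    return wide;
}

// Serialize as the bytes of a UTF-16LE file: BOM first, then each
// code unit as low byte followed by high byte.
std::vector<unsigned char> toUtf16LeFile(const std::u16string& wide)
{
    std::vector<unsigned char> bytes = {0xFF, 0xFE};
    for (char16_t unit : wide) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));
        bytes.push_back(static_cast<unsigned char>(unit >> 8));
    }
    return bytes;
}
```

Feeding the one-byte string `"\xE0"` through both functions gives a reference result to compare against what the real conversion and the real output file contain at each stage.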

Member 12677926 16-Nov-16 7:12am

For the input file I got E0.