Forcing record fields to be treated as character data

Question

I've got a frustrating problem that I hope folks here can help with. I have an Excel file (.xlsx format) that I have to read with a C++ application. I have no control over the Excel file itself. This used to work when the Excel file was produced with Office 2003 (.xls), but the producer of the file now uses Office 2007 (.xlsx), and that is when this problem cropped up.

The interesting columns in the data file are the first two, a 'part number' and a 'support band'. A sample of the first few records is below.

201493-B21  699
201723R-B22 5J0
201723R-B22 6FG
218231R-B22 562
218231R-B22 699

When I read this, I no longer get the second field as a string, I get "699.0" as if it was interpreted as a number. The 2nd record does not return "5J0" but a blank field "" (presumably because it is an illegal number). I've tried lots of things but it always works out this way.

I access the file with

C++

sDsn.Format("ODBC;DRIVER={%s};DSN=\"\";DBQ=%s;ReadOnly=True", sDriver, FileHelper::GetFullPathspec(filename));

for (ii=0; ii<dim(ExcelTabs); ii++)
{
    try
    {
        database.Open(NULL, false, false, sDsn);
        CRecordset recset(&database);
        sSql.Format("SELECT * FROM [%s$]", ExcelTabs[ii]);
        recset.Open(CRecordset::forwardOnly, sSql, CRecordset::readOnly);
        recset.GetFieldValue((short)0, thePart);
        recset.GetFieldValue((short)1, theBand);
        ...
        ... yada yada yada

I've also tried using CDBVariant values, but the string fields there are not right either.

So, without changing the content of the .xlsx file, which comes from an internal web site, how can I force the 2nd field to be interpreted as text or characters instead of numeric? Is there something I can do with the SQL statement that fetches the data? Or with the database.Open() call? I'm at a loss here. Google / Bing / Yahoo have been some help, particularly with using the newer xlsx ODBC driver, but they gave no other clues.

Thanks in advance,
------------------------------------------------
Edit 10/14
------------------------------------------------
After some investigation and experimentation, it appears that the xlsx ODBC driver ("Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)") is actually honoring the cell formatting for the column in question. I have asked the person who creates the file to format the cells as 'text' in their master copy and that should correct the problem.

Just FYI, the reason I am struggling to read the Excel data as it comes from the supplier, rather than saving it in a different format or making other modifications (like using CSV files and reading the data however I see fit), is that these files are part of a very large (over 100 input files, some text, some Excel) and complex process that is automated. If I were to add changing data formats to the 'setup step' of the process, that would introduce places where mistakes could be made. I am the architect / implementor of the process and applications, but an intern or other such 'administrative assistant' type person would be doing the work, and complicating their instructions / steps is not desirable.

So, I'm going to stop chasing this for a while and see if the producer can make the format changes.

Thanks.

Posted 13-Oct-11 7:42am

Chuck O'Toole

Updated 14-Oct-11 4:11am

v2

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

André Kraak · Accepted Answer · 2011-10-13T08:01:00

I found this article: Running SQL against Excel file.

In the "Common Excel Import problems" section it talks about how the type of each column is determined and how columns with mixed data may cause your problem.
It suggests using "TypeGuessRows=0" and "IMEX=1" in your DSN string.
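As a rough sketch of what that could look like, the helper below builds the same kind of connection string as the question and appends the two options. The function name buildExcelConnectionString is made up for illustration, it uses std::string instead of CString so it stands alone, and whether the xlsx ODBC driver actually honors IMEX and TypeGuessRows when they are passed this way is an assumption (the quoted article below describes TypeGuessRows as a registry setting).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the original code: assembles an ODBC
// connection string like the one in the question and appends the two
// options suggested above. The driver may ignore them when supplied here.
std::string buildExcelConnectionString(const std::string& driver,
                                       const std::string& workbookPath)
{
    // Both parts are required to form a usable connection string.
    assert(!driver.empty() && !workbookPath.empty());

    std::string dsn = "ODBC;DRIVER={" + driver + "};DSN=\"\";DBQ=" +
                      workbookPath + ";ReadOnly=True";

    // Assumption: these keywords are accepted in the connection string.
    dsn += ";IMEX=1;TypeGuessRows=0";
    return dsn;
}
```

The result would take the place of sDsn in the database.Open(NULL, false, false, sDsn) call from the question.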

Common Excel Import problems

ADO/ODBC must determine the data type for each column in your Excel worksheet or range. (This is not affected by Excel cell formatting settings.) This is done by scanning the number of rows defined by the registry setting TypeGuessRows (default value is 8). Quite often there are numeric values mixed with text values in the same column.

For example, sorted financial coding structures often have numbers at the beginning of the list (001-999), then text (AAA-XXX).

Both the Jet and the ODBC Provider return the data of the majority type, but return NULL (empty) values for the minority data type. If the two types are equally mixed in the column, the provider chooses numeric over text.
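To make the majority rule concrete, here is a small stand-alone model of the behavior described above. It is only an illustration of the rule as the article states it, not the driver's actual code; the names looksNumeric, guessColumnType and readColumn are invented for this sketch, and the numeric check via strtod is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

enum class ColumnType { Numeric, Text };

// Simplified check: true if the whole cell parses as a number via strtod.
bool looksNumeric(const std::string& cell)
{
    if (cell.empty()) return false;
    char* end = nullptr;
    std::strtod(cell.c_str(), &end);
    return end == cell.c_str() + cell.size();
}

// Scans the first scanRows cells (8 mirrors the TypeGuessRows default) and
// picks the majority type; a tie goes to numeric, as the article describes.
ColumnType guessColumnType(const std::vector<std::string>& column,
                           std::size_t scanRows = 8)
{
    assert(scanRows > 0);  // this model does not cover TypeGuessRows=0
    const std::size_t limit = std::min(scanRows, column.size());
    std::size_t numeric = 0, text = 0;
    for (std::size_t i = 0; i < limit; ++i)
        (looksNumeric(column[i]) ? numeric : text)++;
    return numeric >= text ? ColumnType::Numeric : ColumnType::Text;
}

// Returns the column as the provider would, with minority-type cells
// replaced by an empty string standing in for NULL.
std::vector<std::string> readColumn(const std::vector<std::string>& column,
                                    ColumnType type)
{
    std::vector<std::string> out;
    for (const std::string& cell : column) {
        const bool keep = (type == ColumnType::Numeric) == looksNumeric(cell);
        out.push_back(keep ? cell : std::string());
    }
    return out;
}
```

Fed the five 'support band' values from the question (699, 5J0, 6FG, 562, 699), this model picks numeric, since three of the five scanned cells are numbers, and blanks out 5J0 and 6FG. That matches the empty field the question reports for the second record.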

Also, when the values in the first rows are shorter than 255 characters, the driver truncates all data to 255 characters, even if cell values further down are longer.

One of the ways of avoiding this problem is using Import Mode (IMEX=1). This forces mixed data to be converted to text. However, it only works when the first TypeGuessRows rows have mixed values.
If all of those values are numeric, then setting IMEX=1 will not convert the default datatype to Text; it will remain numeric.

The best combination to avoid problems is TypeGuessRows=0 + IMEX=1.
Setting TypeGuessRows=0 forces the driver to read all data to determine the field type.
Unfortunately, our own experience shows that quite often it does not work.
And when it does work, it slows everything down.

So the only solution is not to use mixed values and to be prepared for the data being truncated to 255 characters.