Click here to Skip to main content
15,921,179 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
I've got a frustrating problem, I hope folks here can help out. I have an Excel file (.xlsx format) that I have to read with a C++ application. I have no control over the Excel file itself. This used to work when the Excel file was done with Office 2003 (.xls) but the producer of the file is now using Office 2007 (.xlsx) and this problem cropped up.

The interesting columns in the data file are the first two, a 'part number' and a 'support band'. A sample of the first few records is below.
<small>201493-B21  699<br />
201723R-B22 5J0<br />
201723R-B22 6FG<br />
218231R-B22 562<br />
218231R-B22 699</small>

When I read this, I no longer get the second field as a string, I get "699.0" as if it was interpreted as a number. The 2nd record does not return "5J0" but a blank field "" (presumably because it is an illegal number). I've tried lots of things but it always works out this way.

I access the file with
C#
sDsn.Format("ODBC;DRIVER={%s};DSN=\"\";DBQ=%s;ReadOnly=True", sDriver, FileHelper::GetFullPathspec(filename));

for (ii=0; ii<dim(ExcelTabs); ii++)
{
    try
    {
        database.Open(NULL, false, false, sDsn);
        CRecordset recset(&database);
        sSql.Format("SELECT * FROM [%s$]", ExcelTabs[ii]);
        recset.Open(CRecordset::forwardOnly, sSql, CRecordset::readOnly);
 	recset.GetFieldValue((short)0, thePart);
	recset.GetFieldValue((short)1, theBand);
        ...
        ... yada yada yada


and I've tried using the CBVariant records but the string fields there are not right either.

So, without changing the content of the .xlsx file, which comes from an internal web site, how can I force the interpretation of the 2nd field to be text or characters intead of numeric? Is there something I can do with the SQL statement that fetches the data? Or to the database.Open()? I'm at a loss here. Google / Bing / Yahoo have been some help, particularly with using the newer xlsx ODBC driver, but no other clues.

Thanks in advance,
------------------------------------------------
Edit 10/14
------------------------------------------------
After some investigation and experimentation, it appears that the xlsx ODBC driver ("Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)") is actually honoring the cell formatting for the column in question. I have asked the person who creates the file to format the cells as 'text' in their master copy and that should correct the problem.

Just FYI, the reason for struggling to read the Excel data as it comes from the supplier rather than just save it in a differnent format or other modifications, like using CSV files and reading the data however I see fit, is that these files are part of a very large (over 100 input files, some text, some Excel) and complex process that is automated. If I were to add changing data formats to the 'setup step' of the process, that would introduce places where mistakes could be made. I am the architect / implementor of the process an applications but an intern or other such 'administrative assistant' type person would be doing the work and complicating their instructions / steps is not desirible.

So, I'm going to stop chasing this for a while and see if the producer can make the format changes.

Thanks.
Posted
Updated 14-Oct-11 4:11am
v2

1 solution

I found this Running SQL against Excel file[^].

In the "Common Excel Import problems" section is talks about how the type of each column is determined and how columns with mixed data may cause your problem.
It suggest using "TypeGuessRows=0" and "IMEX=1" in your DSN string.

Common Excel Import problems

ADO/ODBC must determine data type for each column in your Excel worksheet or range (This is not affected by Excel cell formatting settings.) This is done by scanning number of rows defined by registry setting TypeGuessRows (default value is 8). Quite often there are numeric values mixed with text values in the same column,

For example sorted financial coding structures often have numbers at the beginning of the list 001-999 than text AAA-XXX

Both the Jet and the ODBC Provider return the data of the majority type, but return NULL (empty) values for the minority data type. If the two types are equally mixed in the column, the provider chooses numeric over text.

Plus when first rows are less than 255 characters long it will truncate all data to 255 characters even if cell values below are longer

On of the ways of avoiding this problem is using Import Mode (IMEX=1). This forces mixed data to be converted to text. However it only works when first TypeGuessRows Rows have mixed values.
If all values are numeric than setting IMEX=1 will not convert the default datatype to Text, it will remain numeric.

The best combination to avoid problems is TypeGuessRows=0 + IMEX=1.
Setting TypeGuessRows=0 forces driver to read all data to determine field type.
Unfortunately our own experience shows that quite often it does not work.
And when it does work it slows everything down

So the only solution is not to use mixed values and be prepared for the data being truncated to 255 characters
 
Share this answer
 
Comments
Chuck O'Toole 13-Oct-11 15:10pm    
Thanks for the reference. I looked into this further and the TypeGuessRows is a registry setting (in the Wow6432Node section for Windows 7) and it controls how many rows of data the engine uses to "guess" at the data type. Setting it to 0 causes it to look at "all" the data (rather than none) so the "majority data type" still rules. Setting IMEX to 1 doesn't override the guess (nor does the registry setting for 'ImportMixedTypes' (defaults to Text) in the same Key. Still, lots of more places to look, the quest continues.
Ali_100 11-Dec-13 2:42am    
did you get the answer? i am facing the same issue.
André Kraak 11-Dec-13 4:58am    
I think you intended to place comment on the question instead my answer.
I am not sure whether the Chuck received this message.

To be sure I suggest you post it again using the "Having a Question or Comment" right beneath the question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900