Click here to Skip to main content
15,891,529 members
Articles / Programming Languages / C#
Article

Fast Excel file reader with basic functionality

Rate me:
Please Sign up or sign in to vote.
4.88/5 (104 votes)
1 Jul 20063 min read 1.6M   14K   211   173
A set of managed classes for native reading of data from MS Excel files

Introduction

When I was writing one of the projects (multi-format data parser) I had to deal with Microsoft Excel files.

First of all I tried to read data via COM interop, but this way was very slow (and sometimes input files have up to 100,000 rows and even more) and required Excel to be installed on the target machine.

Then I tried to read data via Excel's OleDb driver. When DataReader ran through data rows without reading them (i.e. while (rd.Read());) it was quite fast, but retrieving data was still very slow.

When I had tried out one of commercial products (TMS FlexCel Studio), it showed rather good results (several seconds for loading 17 MB file, 2 sheets, ~ 80 000 rows), but access to them was not very fast (at least not suitable for displaying data using DataGridView in virtual mode).

So, I decided to write my own reader for Excel files.

Background

First of all I decided that the reader will produce a DataSet object, with sheets represented as DataTables within it. Access to DataTable is rather fast, and its TableName property is suitable for sheet title.

To read an Excel file, we need to get BIFF stream from xls OLE2 container. Working with OLE2 container is implemented in the following classes:

  • XlsHeader (file header representation)
  • XlsFat (support for file system in OLE2 container)
  • XlsRootDirectory (directory tree representation)
  • XlsDirectoryEntry, and
  • XlsStream (file stream)

Note that implementation of OLE2 container object model is limited (e.g. I didn't write any MiniFAT support code) and Excel-specific (XlsHeader supports only MS Excel header field values, any other will produce InvalidHeaderException).

Now we can parse BIFF stream with spreadsheet data. For handy reading data from stream we have a XlsBiffStream class derived from XlsStream. It provides a set of methods for reading XlsBiffRecord common structures. And then, each structure is converted to the appropriate type. Note that only a few record types are supported. All other structures are read as basic XlsBiffRecord and are ignored.

My implementation of reader ignores any text formatting, graphics, embedded objects and charts. Only cell values are read. When a cell contains a formula, the last calculated formula value is used as the cell value. Also, only indexed sheets are read (i.e. sheets with index data) - I didn't write not-indexed sheets support due to my laziness (I didn't come across any file without an index yet).

Using the code

Now, it's very easy to use the parser. All you need is a Stream with an Excel file.

C#
FileStream fs = new FileStream(@"c:\file.xls", 
        FileMode.Open, FileAccess.Read);
ExcelDataReader rd = new ExcelDataReader(fs);
fs.Close();

// Now we can access data:
DataSet data = rd.WorkbookData;

Points Of Interest

Latest versions of Excel use SST (Shared String Table) to store string information, and text cells only reference string by index. Also, the size of one BIFF section is limited, and SST can be rather big, so it uses the so-called CONTINUE sections. Every string in SST can be saved as Unicode (2 byte) or as ANSI (single byte per symbol). But when the string is broken with the continue section, it can change its encoding from one to another! This took me a lot of debugging, and I wish nobody has to do the same again...

Results

This implementation of reader is rather nonoptimal (I was pressed for time), but it showed a good speed. It surely beats any interop or OLE method and even some commercial analogs.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Web Developer
Russian Federation Russian Federation
Alex
.NET Developer
Russian Federation
rakemaker@gmail.com

Comments and Discussions

 
GeneralThanks for the work Pin
Brian Leach5-Jul-06 8:36
Brian Leach5-Jul-06 8:36 
GeneralXML or excel spread sheet? [modified] Pin
Keith Vinson5-Jul-06 5:35
Keith Vinson5-Jul-06 5:35 
GeneralRe: XML or excel spread sheet? Pin
Messir5-Jul-06 5:56
Messir5-Jul-06 5:56 
GeneralRe: XML or excel spread sheet? Pin
Keith Vinson5-Jul-06 6:09
Keith Vinson5-Jul-06 6:09 
GeneralExcel Data Writer Pin
BradleyC4-Jul-06 4:44
BradleyC4-Jul-06 4:44 
GeneralRe: Excel Data Writer Pin
Messir4-Jul-06 6:25
Messir4-Jul-06 6:25 
GeneralHyperlink Pin
kuerbis2-Jul-06 10:41
kuerbis2-Jul-06 10:41 
GeneralRe: Hyperlink Pin
Messir2-Jul-06 18:14
Messir2-Jul-06 18:14 
Hi, Thomas!

Well, hyperlink stream isn't documented, but I tried to guess somewhat about it.

Hyperlink record has signature 0x01B8, then size of structure comes (general BIFF record format), then four 2-byte fields: first row of the hyperlink, last row of the hyperlink, first column of the hyperlink, last column of the hyperlink. Structure ends with variable-size hyperlink stream.

Stream starts with 16 bytes (D0 C9 EA 79 F9 BA CE 11 8C 82 00 AA 00 4B A9 0B, or {79EAC9D0-BAF9-11CE-828C-00AA004BA90B} in GUID form) which correspond to StdHlink ClassID. Then comes 4-byte value (I think, it's version number, for all documents it was equal to 0x00000002). Then 4-byte type flags (Could not guess all meanings, but it's equal to 0x00000017 for standard links, 0x0000001C - for links to Excel cells and 0x00000117 for local network links). Then 4-byte length of link text (in characters, including terminating zero), then variable-length unicode text of link. After that:

1) For standard links (type = 0x00000017) - ClassID of corresponding moniker: E0 C9 EA 79 F9 BA CE 11 8C 82 00 AA 00 4B A9 0B (or {79EAC9E0-BAF9-11CE-828C-00AA004BA90B} in GUID form) for Internet links (URL Moniker) or 03 03 00 00 00 00 00 00 C0 00 00 00 00 00 00 46 (or {00000303-0000-0000-C000-000000000046} in GUID form) for File links (FileMoniker). May be, there are other types of links used by Excel, but I didn't meet one yet. After moniker CLSID comes specific data.
For URL moniker it is 4-byte length of URL (in bytes, including terminating zero) followed by variable-length unicode link address.
For File moniker it is 2-byte field (always was equal to zero), then 4-byte length of non-unicode path (including terminating zero), then variable-length single-byte string with DOS-format path; Then comes 4-byte signature (0xDEADFFFF, something devilish Wink | ;) )followed by 20 zeros. Then comes 4-byte length of unicode section. Unicode section contains 4-byte length of unicode path (in bytes, without terminating zero), 2-byte field (typically with value 0x0003 and mysterious meaning) and variable-length unicode string (without termination zero).

2) For other types of links - 4-byte length of link target (in characters, including terminating zero) followed by variable-length unicode string.

If there is a custom tooltip for link, it follows after link description. It starts with 2-byte value 0x0800, then - 2-byte length of tooltip data. Tooltip data also starts with 0x0800, then 8 bytes (I could not guess, what they mean), then - unicode string with tooltip text.

* Note that all strings are stored with terminating zero, excepting unicode path for FileMoniker.

If you could find out something more, I'll be happy if you share your knowledge with me Smile | :)

-- modified at 0:15 Monday 3rd July, 2006
GeneralNice Article Pin
Paul Conrad1-Jul-06 17:58
professionalPaul Conrad1-Jul-06 17:58 
GeneralRe: Nice Article Pin
Messir1-Jul-06 18:48
Messir1-Jul-06 18:48 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.