Click here to Skip to main content
15,867,453 members
Articles / Programming Languages / C# 4.0
Tip/Trick

A Robust CSV Reader

Rate me:
Please Sign up or sign in to vote.
3.52/5 (6 votes)
14 Jan 2020CPOL6 min read 34.5K   82   22   15
The routines in this library can parse any string that I can throw at it, including Common Name strings read from X.509 Digital Certificates.

Introduction

In the course of writing a program to generate a report about the properties of the contents of a Windows certificate store, I discovered that none of the CSV readers that I had at my disposal could process the unusual CSV strings found in many digital certificates. Turning to The Code Project, I found A Fast CSV Reader, which I eagerly downloaded, built, and tested.

Regretfully, I quickly discovered that it couldn't handle these types of strings, either, and set about to roll yet another CSV reader of my own.

Background

The type of string that broke every CSV reader at which I threw it looks like the following actual example from a production store of Trusted Root Certificates.

CN=RapidSSL CA, O="GeoTrust, Inc.", C=US

Specifically, the middle substring, O="GeoTrust, Inc.", is nonstandard, because the opening quotation mark that is intended to protect the comma in the organization name, GeoTrust, Inc. is the fourth character in the substring. This is by design, since the Common Name string is a series of name-value pairs, as are many other certificate properties.

Although every other CSV library in The Code Project and elsewhere that I reviewed mimics a base class, such as System.IO.StreamReader, I chose instead to adhere to the Single Responsibility Principle, and focus entirely on parsing strings, leaving you free to determine how to acquire and manage them.

Getting the Code

Although this library has shown itslf to be quite stable, in anticipation that it might need maintenance,
the source code is in one of my oldest (and most stable) GitHub repositories, at https://github.com/txwizard/AnyCSV. More recently, I added MSDN-style
documentation, which is published
at https://txwizard.github.io/AnyCSV/.

There is also a NuGet package available at .https://www.nuget.org/packages/WizardWrx.AnyCSV/.

Using the Code

The working code is in a class library, which I built against Microsoft .NET Framework 2.0, and it can be used in projects that target any newer version of the framework. For your convenience, I included both debug and release builds of the library and the test program.

The package includes a test program, AnyCSVTestStand.exe, which is built from a single source file, Program.cs, and a set of string resources. Since it was intended to prove the robustness of the algorithm, the test program confines itself to the static methods.

Class library WizardWrx.AnyCSV.dll exposes a single class, Parser, which has nine constructor overloads, all of which are optional, and one overloaded static Parse method that does the real work. Since the real worker method is static, and everything it needs can be either passed into the method or allowed to take default values, the only reason to use an instance is when you want to specify non-default values for one or more of the arguments that have corresponding properties on the class in advance, so that you can omit them from the argument list when you call the Parse method in a loop to process a set of strings.

The Parse method has four overloads, all of which return a string [ ] containing the substrings. I shall next summarize the overloads, starting with the fourth, and simplest.

  1. The simplest overload takes two arguments, the string to be parsed and the delimiter, a pchrDelimiter.
  2. Adding a tad of complexity, the third overload adds a second char argument, pchrProtector, which specifies the character that protects a pchrDelimiter that appears in the middle of a substring, which must be ignored.
  3. A third overload adds penmGuardDisposition, which uses a member of the GuardDisposition enumeration to govern disposition of guard characters. Its valid values are straightforward and self-explanatory: Keep and Strip.
  4. The most complex overload adds the final parameter, penmTrimWhiteSpace, which uses the TrimWhiteSpace enumeration to specify four possible options to dispose of leading and trailing blanks in a substring.
Value Outcome
TrimWhiteSpace
Leave Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimLeading Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimTrailing Trim trailing white space. This option is especially useful with CSV files generated by Microsoft Excel, which often have long runs of meaningless white space, especially when a worksheet has blank rows or columns in its UsedRange.
TrimBoth Given that TrimLeading and TrimTrailing are required use cases, trimming both ends is essentially free. This flag is implemented such that it can be logically processed as TrimLeading | TrimTrailing.

One final static method, StandardCSVParse, is dedicated to parsing a true Comma Separated Values string. As such, it has a single string argument, pstrAnyCSV. The delimiter character is a comma, and the guard character is the double quotation mark.

Points of Interest

Everything that counts happens in a single method, the most complex of the four static Parse methods. Everything else, including the instance Parse method, calls upon it, specifying default values for omitted arguments or, in the case of the instance method, the corresponding instance properties.

The Parse method gets its robustness from a simple state machine that uses a pair of simple Boolean variables, fInProgress and fProtectDelimiters, as state variables. The key point is that when fProtectDelimiters is TRUE, indicating that a guard character has been found, but its mate has yet to be found, delimiter characters are ignored. Strings are assembled by appending characters to a StringBuilder that is initialized with a size sufficient to hold a degenerate case string - one that is devoid of delimiters. The objective of reserving such a large amount of memory is that the StringBuilder never needs to expand; hence, it need never move a partially built string to a bigger buffer.

To simplify routine use, the class exposes common delimiter and guard characters as public constants. If you prefer to avoid using raw characters, you can specify them in terms of a pair of enumerations, DelimiterChar and GuardChar. The constants and enumerations are especially useful for specifying guard characters, since quotation mark literals are rather fussy.

Translating the DelimiterChar and GuardChar enumerations is aided by arrays of DelimiterMap and GuardMap structures, which are populated by a static Parser constructor.

The instance properties use s_objSyncLock, a private generic object, to make themselves thread safe. I borrowed this proven concept from other classes that use such an object as part of their implementation of the Singleton design pattern.

Room for Improvement

I anticipate that others will find numerous ways to improve this library. One that occurs to me is the addition of a static StringBuilder that grows, but never shrinks during its lifetime. At a minimum, this would require the Parse method to test for its existence and size, and use the s_objSyncLock object to make itself thread-safe.

History

  • Friday, 01 May 2015 - Initial publication
  • Monday, 29 April 2019 - Add "Getting the Code" section, since the code lives in a GitHub repsoitory
  • Tuesday, 14 January 2020 - Remove stray markup from the text.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States United States
I deliver robust, clean, adaptable, future-ready applications that are properly documented for users and maintainers. I have deep knowledge in multiple technologies and broad familiarity with computer and software technologies of yesterday, today, and tomorrow.

While it isn't perceived as sexy, my focus has always been the back end of the application stack, where data arrives from a multitude of sources, and is converted into reports that express my interpretation of The Fundamental Principle of Tabular Reporting, and are the most visible aspect of the system to senior executives who approve the projects and sign the checks.

While I can design a front end, I prefer to work at the back end, getting data into the system from outside sources, such as other computers, electronic sensors, and so forth, and getting it out of the system, as reports to IDENTIFY and SOLVE problems.

When presented with a problem, I focus on identifying and solving the root problem for the long term.

Specialties: Design: Relational data base design, focusing on reporting; organization and presentation of large document collections such as MSDS libraries

Development: Powerful, imaginative utility programs and scripts for automated systems management and maintenance

Industries: Property management, Employee Health and Safety, Services

Languages: C#, C++, C, Python, VBA, Visual Basic, Perl, WinBatch, SQL, XML, HTML, Javascript

Outside Interests: Great music (mostly, but by no means limited to, classical), viewing and photographing sunsets and clouds, traveling by car on small country roads, attending museum exhibits (fine art, history, science, technology), long walks, especially where there is little or no motor traffic, reading, especially nonfiction and thoughtfully written, thought provoking science fiction

Comments and Discussions

 
QuestionProject AnyCSVTestStand has a missing DLL Pin
Member 25759441-May-19 10:40
Member 25759441-May-19 10:40 
QuestionDownload source Pin
lafffer_11918-May-15 18:54
lafffer_11918-May-15 18:54 
AnswerRe: Download source Pin
soho13-Jan-16 23:33
soho13-Jan-16 23:33 
GeneralRe: Download source Pin
David A. Gray28-Apr-19 20:30
David A. Gray28-Apr-19 20:30 
QuestionCleanup the code ZIP Pin
Matt T Heffron4-May-15 13:20
professionalMatt T Heffron4-May-15 13:20 
QuestionCertificates store Pin
Dr Gadgit2-May-15 3:54
Dr Gadgit2-May-15 3:54 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 19:05
David A. Gray1-Jun-15 19:05 
GeneralRe: Certificates store Pin
Dr Gadgit4-Jul-15 4:12
Dr Gadgit4-Jul-15 4:12 
GeneralRe: Certificates store Pin
Dr Gadgit19-Jul-15 4:19
Dr Gadgit19-Jul-15 4:19 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 19:05
David A. Gray1-Jun-15 19:05 
QuestionNo Source or Sample? Pin
Gary Noble2-May-15 1:30
Gary Noble2-May-15 1:30 
AnswerRe: No Source or Sample? Pin
Matt T Heffron4-May-15 13:20
professionalMatt T Heffron4-May-15 13:20 
Look "under" the "Browse Code" link above in the left side "menu".
A positive attitude may not solve every problem, but it will annoy enough people to be worth the effort.

GeneralRe: No Source or Sample? Pin
David A. Gray28-Apr-19 20:32
David A. Gray28-Apr-19 20:32 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.