Click here to Skip to main content
15,881,882 members
Articles / Programming Languages / C# 4.0
Tip/Trick

A Robust CSV Reader

Rate me:
Please Sign up or sign in to vote.
3.52/5 (6 votes)
14 Jan 2020CPOL6 min read 34.7K   82   22   15
The routines in this library can parse any string that I can throw at it, including Common Name strings read from X.509 Digital Certificates.

Introduction

In the course of writing a program to generate a report about the properties of the contents of a Windows certificate store, I discovered that none of the CSV readers that I had at my disposal could process the unusual CSV strings found in many digital certificates. Turning to The Code Project, I found A Fast CSV Reader, which I eagerly downloaded, built, and tested.

Regretfully, I quickly discovered that it couldn't handle these types of strings, either, and set about to roll yet another CSV reader of my own.

Background

The type of string that broke every CSV reader at which I threw it looks like the following actual example from a production store of Trusted Root Certificates.

CN=RapidSSL CA, O="GeoTrust, Inc.", C=US

Specifically, the middle substring, O="GeoTrust, Inc.", is nonstandard, because the opening quotation mark that is intended to protect the comma in the organization name, GeoTrust, Inc. is the fourth character in the substring. This is by design, since the Common Name string is a series of name-value pairs, as are many other certificate properties.

Although every other CSV library in The Code Project and elsewhere that I reviewed mimics a base class, such as System.IO.StreamReader, I chose instead to adhere to the Single Responsibility Principle, and focus entirely on parsing strings, leaving you free to determine how to acquire and manage them.

Getting the Code

Although this library has shown itslf to be quite stable, in anticipation that it might need maintenance,
the source code is in one of my oldest (and most stable) GitHub repositories, at https://github.com/txwizard/AnyCSV. More recently, I added MSDN-style
documentation, which is published
at https://txwizard.github.io/AnyCSV/.

There is also a NuGet package available at .https://www.nuget.org/packages/WizardWrx.AnyCSV/.

Using the Code

The working code is in a class library, which I built against Microsoft .NET Framework 2.0, and it can be used in projects that target any newer version of the framework. For your convenience, I included both debug and release builds of the library and the test program.

The package includes a test program, AnyCSVTestStand.exe, which is built from a single source file, Program.cs, and a set of string resources. Since it was intended to prove the robustness of the algorithm, the test program confines itself to the static methods.

Class library WizardWrx.AnyCSV.dll exposes a single class, Parser, which has nine constructor overloads, all of which are optional, and one overloaded static Parse method that does the real work. Since the real worker method is static, and everything it needs can be either passed into the method or allowed to take default values, the only reason to use an instance is when you want to specify non-default values for one or more of the arguments that have corresponding properties on the class in advance, so that you can omit them from the argument list when you call the Parse method in a loop to process a set of strings.

The Parse method has four overloads, all of which return a string [ ] containing the substrings. I shall next summarize the overloads, starting with the fourth, and simplest.

  1. The simplest overload takes two arguments, the string to be parsed and the delimiter, a pchrDelimiter.
  2. Adding a tad of complexity, the third overload adds a second char argument, pchrProtector, which specifies the character that protects a pchrDelimiter that appears in the middle of a substring, which must be ignored.
  3. A third overload adds penmGuardDisposition, which uses a member of the GuardDisposition enumeration to govern disposition of guard characters. Its valid values are straightforward and self-explanatory: Keep and Strip.
  4. The most complex overload adds the final parameter, penmTrimWhiteSpace, which uses the TrimWhiteSpace enumeration to specify four possible options to dispose of leading and trailing blanks in a substring.
Value Outcome
TrimWhiteSpace
Leave Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimLeading Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimTrailing Trim trailing white space. This option is especially useful with CSV files generated by Microsoft Excel, which often have long runs of meaningless white space, especially when a worksheet has blank rows or columns in its UsedRange.
TrimBoth Given that TrimLeading and TrimTrailing are required use cases, trimming both ends is essentially free. This flag is implemented such that it can be logically processed as TrimLeading | TrimTrailing.

One final static method, StandardCSVParse, is dedicated to parsing a true Comma Separated Values string. As such, it has a single string argument, pstrAnyCSV. The delimiter character is a comma, and the guard character is the double quotation mark.

Points of Interest

Everything that counts happens in a single method, the most complex of the four static Parse methods. Everything else, including the instance Parse method, calls upon it, specifying default values for omitted arguments or, in the case of the instance method, the corresponding instance properties.

The Parse method gets its robustness from a simple state machine that uses a pair of simple Boolean variables, fInProgress and fProtectDelimiters, as state variables. The key point is that when fProtectDelimiters is TRUE, indicating that a guard character has been found, but its mate has yet to be found, delimiter characters are ignored. Strings are assembled by appending characters to a StringBuilder that is initialized with a size sufficient to hold a degenerate case string - one that is devoid of delimiters. The objective of reserving such a large amount of memory is that the StringBuilder never needs to expand; hence, it need never move a partially built string to a bigger buffer.

To simplify routine use, the class exposes common delimiter and guard characters as public constants. If you prefer to avoid using raw characters, you can specify them in terms of a pair of enumerations, DelimiterChar and GuardChar. The constants and enumerations are especially useful for specifying guard characters, since quotation mark literals are rather fussy.

Translating the DelimiterChar and GuardChar enumerations is aided by arrays of DelimiterMap and GuardMap structures, which are populated by a static Parser constructor.

The instance properties use s_objSyncLock, a private generic object, to make themselves thread safe. I borrowed this proven concept from other classes that use such an object as part of their implementation of the Singleton design pattern.

Room for Improvement

I anticipate that others will find numerous ways to improve this library. One that occurs to me is the addition of a static StringBuilder that grows, but never shrinks during its lifetime. At a minimum, this would require the Parse method to test for its existence and size, and use the s_objSyncLock object to make itself thread-safe.

History

  • Friday, 01 May 2015 - Initial publication
  • Monday, 29 April 2019 - Add "Getting the Code" section, since the code lives in a GitHub repsoitory
  • Tuesday, 14 January 2020 - Remove stray markup from the text.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States United States
I deliver robust, clean, adaptable, future-ready applications that are properly documented for users and maintainers. I have deep knowledge in multiple technologies and broad familiarity with computer and software technologies of yesterday, today, and tomorrow.

While it isn't perceived as sexy, my focus has always been the back end of the application stack, where data arrives from a multitude of sources, and is converted into reports that express my interpretation of The Fundamental Principle of Tabular Reporting, and are the most visible aspect of the system to senior executives who approve the projects and sign the checks.

While I can design a front end, I prefer to work at the back end, getting data into the system from outside sources, such as other computers, electronic sensors, and so forth, and getting it out of the system, as reports to IDENTIFY and SOLVE problems.

When presented with a problem, I focus on identifying and solving the root problem for the long term.

Specialties: Design: Relational data base design, focusing on reporting; organization and presentation of large document collections such as MSDS libraries

Development: Powerful, imaginative utility programs and scripts for automated systems management and maintenance

Industries: Property management, Employee Health and Safety, Services

Languages: C#, C++, C, Python, VBA, Visual Basic, Perl, WinBatch, SQL, XML, HTML, Javascript

Outside Interests: Great music (mostly, but by no means limited to, classical), viewing and photographing sunsets and clouds, traveling by car on small country roads, attending museum exhibits (fine art, history, science, technology), long walks, especially where there is little or no motor traffic, reading, especially nonfiction and thoughtfully written, thought provoking science fiction

Comments and Discussions

 
QuestionProject AnyCSVTestStand has a missing DLL Pin
Member 25759441-May-19 10:40
Member 25759441-May-19 10:40 
QuestionDownload source Pin
lafffer_11918-May-15 18:54
lafffer_11918-May-15 18:54 
AnswerRe: Download source Pin
soho13-Jan-16 23:33
soho13-Jan-16 23:33 
GeneralRe: Download source Pin
David A. Gray28-Apr-19 20:30
David A. Gray28-Apr-19 20:30 
QuestionCleanup the code ZIP Pin
Matt T Heffron4-May-15 13:20
professionalMatt T Heffron4-May-15 13:20 
QuestionCertificates store Pin
Dr Gadgit2-May-15 3:54
Dr Gadgit2-May-15 3:54 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 19:05
David A. Gray1-Jun-15 19:05 
GeneralRe: Certificates store Pin
Dr Gadgit4-Jul-15 4:12
Dr Gadgit4-Jul-15 4:12 
GeneralRe: Certificates store Pin
Dr Gadgit19-Jul-15 4:19
Dr Gadgit19-Jul-15 4:19 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 19:05
David A. Gray1-Jun-15 19:05 
First, please accept my apology for my tardy response; I have been preoccupied with other matters, and I haven't visited this article since the day after I published it.

Though I don't remember where they are stored, as I recall, they live in deeply hidden, heavily guarded Registry keys. At any rate, there are several levels of certificate authorities, with Trusted Root Certificates forming the trunks of the trees that inhabit the certificate forest.

The certificate that protects SSL connections to an individual Web server, which maps to its IP address, are the leaves of the trees, and it's easy enough to order a certificate from a CA that matches its IP address. When you order such a certificate, the order, itself, originates from the server for which it is intended, by way of a Certificate Signing Request, and the CA encodes the IP address from which the request originated into the certificate before they sign it. At the same time, they also encode the fully qualified hostname (e. g., www.google.com), unless the certificate is a "wild card," which is encoded to be valid for any host at the Internet domain on whose behalf it is issued (e. g., *.MyDomain.com). Such certificates have no business in the store of Trusted Root Certificates; when you install one, it goes into one of the other stores.

Personal is the generic name of the default store. For example, if you install a certificate from a Web server through Internet Explorer, that's where it lands unless you override the default.

Trusted People refers to certificates from people (or organizations) that you have elected to trust. Like the Personal store, this is a per user store.

Trusted Publishers refers to code signing certificates that you designate as having been issued to software publishers whom you trust.

Enterprise Trust is the other major machine wide certificate store, which holds server and code signing certificates that the owner of the machine designates as trustworthy.

As I recall from when I studied all of this in 2006, when I was working on an application that required a robust Public Key Infrastructure, to be admitted to any of the trusted stores, a certificate must either have an unbroken Chain of Trust that leads all the way back to a Trusted Root Certificate, or be self signed.

I'm not certain who handles it on behalf of open source operating systems such as Linux, but I know that in the case of Windows, the store is managed by Microsoft, and only they can add certificates to it.

All the fuss a couple of years ago about certificates being booted from the Trusted Root Certificates store included several out of cycle updates that removed certificates from it that had been discovered to be fraudulent. Those certificates were issued on behalf of Commodo, and other established members of the club of Trusted Root CAs, by rogue licensees. You don't hear as much about them any more because Microsoft and others quickly made the revocation process more highly automated, so that the updates happen without fanfare, and the operators of the Trusted Root CAs imposed tighter restrictions on who, and under what circumstances, the good name of a Trusted Root CA can be attached to a certificate.

Based on this information, I seldom give much thought to the trustworthiness of the root certificates. The audit that you surmised that I must be doing had a more limited scope, yet it required me to enumerate every certificate on the machine. I was investigating why a certificate that was issued to me by GlobalSign, a member of the Trusted Root CA club, was inexplicably being marked as invalid on scores of machines in a client's offices in several US cities. Since I have a copy installed without its private key on two machines at my office, both of which are treated as valid, I was trying to determine whether the certificates that were classified as expired were in the wrong store on the client's machines.
QuestionNo Source or Sample? Pin
Gary Noble2-May-15 1:30
Gary Noble2-May-15 1:30 
AnswerRe: No Source or Sample? Pin
Matt T Heffron4-May-15 13:20
professionalMatt T Heffron4-May-15 13:20 
GeneralRe: No Source or Sample? Pin
David A. Gray28-Apr-19 20:32
David A. Gray28-Apr-19 20:32 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.