fastCSV - A Tiny, Fast, Standard Compliant CSV Reader Writer

Mehdi Gholam

5.00/5 (30 votes)

Jan 1, 2020

CPOL

7 min read

82375

2282

With the increased prominence of machine learning and ingesting large datasets with the CSV format for this purpose, I decided to write a CSV parser which met my requirements of being small, fast and easy to use.

Download fastCSV_v2.0.8.zip

Introduction

Most libraries I looked at didn't really meet my requirements and fastCSV was born.

Also CSV allows you to load tabular (2-dimensional) data into memory very quickly as opposed to other serializers like fastJSON.

Features

Fully CSV standard compliant
- Multi-line
- Quoted columns
- Keeps spaces between delimiters
Really fast reading and writing of CSV files (see performance)
Tiny 8kb DLL compiled to net40 or netstandard20
Ability to get a typed list of objects from a CSV file
Ability to filter a CSV file while loading
Ability to specify a custom delimiter

CSV Standard

You can read the CSV RFC here : https://tools.ietf.org/html/rfc4180 as a summary a CSV file can :

be multi lined if the values in a column contains new lines
a column must be quoted if it contains a new line, delmiter or quote character
- quotes must be quoted
spaces between the delimiter are considered part of the column

Below is an example of a complex standard compliant CSV file from the wiki page https://en.wikipedia.org/wiki/Comma-separated_values :

Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00

as you can see some rows are multi lined and contain quotes and commas, which will give the table below:

Year

Make

Model

Description

Price

1997

Ford

E350

F150

ac, abc,

moon

3000.00

1999

Chevy

Venture "Extended Edition"

4900.00

1999

Chevy

Venture "Extended Edition, Very Large"

5000.00

1996

Jeep

Grand Cherokee

MUST SELL!

air,

",moon,"

roof, loaded

4799.00

1999

BMW

used

14900.00

1999

Toyota

Corolla

7000.00

as you can see some columns are multi line, and " Toyota" column of the last row starts with a space.

Performance Benchmarks

Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip (585Mb) file which has 4,496,263 rows on my machine as a relative comparison to other libraries:

fastCSV : 11.20s 639Mb used
NReco.CSV : 6.76s 800Mb used
fastCSV string.Split() : 11.50s 638Mb used
TinyCSVparser : 34s 992Mb used

As a comparison of a baseline of what is possible on the same dataset:

File.ReadAllBytes() : 1.5s 573Mb used
File.ReadAllLines() with no processing : 3.7s 1633Mb used
File.ReadLines() with no processing : 1.9s
File.ReadLines() + string.Split() no return list : 7.5s

The difference from 1 to 2 is the overhead of converting the bytes to Unicode strings : 2.2s

The difference between 2 and 3 is the memory overhead of creating string[] : 1.8s

The difference from 4 to fastCSV is the overhead of creating T objects and adding to a list : 4s

Roads not taken

Loading chunks in a buffer :
- while initially I tried this route, it proved too complex and I couldn't get it to work properly. Judging by other libraries which did this, it is slow in comparison to the current implementation anyway.
StringBuilder character by character :
- using this option proved too slow for parsing columns out of a line.

Using the code

Below are some examples of how to use fastCSV:

public class car
{
    // you can use fields or properties
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}

// listcars = List<car>
var listcars = fastCSV.ReadFile<cars>(
    "csvstandard.csv", // filename
    true,              // has header
    ',',               // delimiter
    (o, c) =>          // to object function o : car object, c : columns array read
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        // add to list
        return true;
    });

fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",   // filename
    new string[] { "WBAN", "Date", "SkyCondition" }, // headers defined or null
    '|',               // delimiter
    list,              // list of LocalWeatherData to save
    (o, c) =>          // from object function 
	{
    	c.Add(o.WBAN);
    	c.Add(o.Date.ToString("yyyyMMdd"));
    	c.Add(o.SkyCondition);
	});

Helper functions for performance

fastCSV has the following helper functions:

int ToInt(string s) creates an int from a string
int ToInt(string s, int index, int count) creates an int from a substring
DateTime ToDateTimeISO(string value, bool UseUTCDateTime) creates an ISO standard DateTime i.e. yyyy-MM-ddTHH:mm:ss ( optional part.nnnZ)

public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        o.WBAN = c[0];
        // c[1] data is in "20150301" format
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4), 
                              fastCSV.ToInt(c[1], 4, 2), 
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        return add;
    });

Usage Scenarios

Filtering CSV while loading
- In your map function while loading you can write conditions on your loaded line data and filter out lines you don't want by using return false;
Reading CSV to import to other systems
- in your map function you can send the line data to another system and return false;
- or process the entire file and use the List<T> returned
Processing/aggregating data while loading
- You can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read

Inside the code

Essentially the reading is a loop through parsing a line, creating a generic element for a list, handing of the object created and the columns extracted from the line to the user defined map function and adding it to the list for return (if the map function says so) :

var c = ParseLine(line, delimiter, cols);
T o = new T();
var b = mapper(o, c);
if (b)
   list.Add(o);

Now the CSV standard complexity comes from handling multi lines correctly which is done by counting if there are odd number of quotes in a line, hence it's multi line and reading the lines until the quotes are even, which is done in the ReadFile() function.

The beauty of this approach is that it is simple, does no reflection and is really fast, with the control being in the users hands.

All the reading code is below :

public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;
    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // actual col count
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }
            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            // if multiline add line to sb and continue
            if (multiline)
            {
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}

private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        while (index++ < len)
        {
            char ch = *(s + index);
            if (ch == c)
                count++;
        }
    }
    return count;
}

private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    //return line.Split(delimiter);
    int col = 0;
    int linelen = line.Length;
    int index = 0;

    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // non quoted
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted string change "" -> "
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                // find matching quote until delim or EOL
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }

    return columns;
}

ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.

And the writing code is just :

public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);
                for (int i = 0; i < cols.Count; i++)
                {
                    // qoute string if needed -> \" \r \n delim 
                    var str = cols[i].ToString();
                    bool quote = false;

                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");
                    }

                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}

Sample Use cases

Splitting data sets for testing and training

In data science you generally split your data into training and testing sets, and in the example below every 3rd row is for testing ( you could make the splitting more elaborate) :

var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        line++;
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        if (line % 3 == 0)
        {
            add = false;
            testing.Add(o);
        }
        return add;
    });

Appendix v2.0

With apologies to NReco.CSV, I messed up the timings for that library which was pointed out to me by KPixel on GitHub, this prompted me to go back to the drawing board and redo the internals of fastCSV for more speed.

The new code is even faster than a naïve roll-your-own File.ReadLines() and string.Split() which can't handle multi-lines.

Performance

The new performance numbers are below which in comparison to the v1 code is nearly 2x faster at the expense of a little more memory usage on the same 4,496,263 rows dataset.

fastCSV .net4 : 6.27s 753Mb used
fastCSV core : 6.51s 669Mb used

Interestingly on .net core, the library uses less memory.

Changes Made

First up was creating a buffered StreamReader instead of relying on File.ReadLines() this is handled by the BufReader class. This class also handles multi-lines without resorting to a StringBuilder and the possible case that a line is larger than the buffer being used to read it.
While testing I discovered that using IL to create the generic list objects is faster than using new T()
Poor mans (.net 4) Span with the MGSpan class which just passes around the buffer, start and length of the data and delays the creation of strings until they are actually used in the object filling delegate.
ReadFile<T>() looks cleaner now and does not use StringBuilder for multi-lines.

Some Weird Stuff

The first pass of FillBuffer() used _tr.BaseStream.Seek() to go back in the CSV data file when the end of the buffer being read was reached and a line was not completed, this was fine at first but failed non ASCII data and UTF8 encoded files, the reason being that some characters are 2 bytes not 1 and a char in .net while looking like a byte in fact can be 2 bytes. This messes the offset computation when seeking which results in reading incorrect lines.

To remedy this I used Array.Copy() to copy the characters to the start of the buffer and read the rest from the file until the buffer is full, interestingly again you can't use Buffer.BlockCopy() for the same reason as above.

A simple change of struct MGSpan instead of class MGSpan results in a very distinct speed up, probably because the objects are passed around on the stack and don't stay around long which faster than using the heap.

Previous Versions

History

Initial release : 1^st January 2020
Update v2.0 : 27^th January 2020
- article corrections
- speed ups
Update v2.0.1 : 5^th September 2020
- new overloads
- ignore escaped characters at the end of a quoted line (thanks to Denis Samuel)
Update v2.0.2 : 6^th September 2020
- ToString() null check (thanks to Martijn)
Update v2.0.8 : 6^th December 2020
- added unit tests
- bug fix trailing new line sometimes
- bug fix empty trailing data columns showing prev row data
- switched to TextReader
- more ReadStream() overloads
- bug fix reading first line when hasheader = false