Posted 1 Jan 2020

fastCSV - A Tiny, Fast, Standard Compliant CSV Reader Writer

6 Dec 2020 · CPOL · 7 min read
With the increased prominence of machine learning and of ingesting large datasets in the CSV format, I decided to write a CSV parser which met my requirements of being small, fast and easy to use.
In this article, I look at the CSV standard and a summary of what a CSV file can do, performance benchmarks, examples of how to use fastCSV, helper functions for performance, usage scenarios, a quick peek inside the code, and some sample use cases.

Introduction

Most libraries I looked at didn't really meet my requirements, so fastCSV was born.

CSV also allows you to load tabular (2-dimensional) data into memory very quickly, as opposed to other serializers like fastJSON.

Features

  • Fully CSV standard compliant

    • Multi-line
    • Quoted columns
    • Keeps spaces between delimiters
  • Really fast reading and writing of CSV files (see performance)

  • Tiny 8kb DLL compiled to net40 or netstandard20

  • Ability to get a typed list of objects from a CSV file

  • Ability to filter a CSV file while loading

  • Ability to specify a custom delimiter

CSV Standard

You can read the CSV RFC here: https://tools.ietf.org/html/rfc4180. As a summary, in a CSV file:

  • a row can be multi-line if a value in a column contains new lines
  • a column must be quoted if it contains a new line, delimiter or quote character
    • quote characters within a quoted column must themselves be doubled ("quoted")
  • spaces between the delimiters are considered part of the column

Below is an example of a complex standard compliant CSV file from the wiki page https://en.wikipedia.org/wiki/Comma-separated_values :

Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00   

As you can see, some rows are multi-line and contain quotes and commas, which gives the table below:

Year | Make    | Model                                  | Description                                 | Price
-----|---------|----------------------------------------|---------------------------------------------|---------
1997 | Ford    | E350 ⏎ F150                            | ac, abs, ⏎ moon                             | 3000.00
1999 | Chevy   | Venture "Extended Edition"             |                                             | 4900.00
1999 | Chevy   | Venture "Extended Edition, Very Large" |                                             | 5000.00
1996 | Jeep    | Grand Cherokee                         | MUST SELL! ⏎ air, ⏎ ",moon," ⏎ roof, loaded | 4799.00
1999 | BMW     | Z3                                     | used                                        | 14900.00
1999 |  Toyota | Corolla                                |                                             | 7000.00

(⏎ marks a line break inside a cell.)

As you can see, some columns are multi-line, and the " Toyota" value in the last row starts with a space.

Performance Benchmarks

Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip file (585 MB, 4,496,263 rows) on my machine, as a relative comparison to other libraries:

  • fastCSV : 11.20s 639Mb used
  • NReco.CSV : 6.76s 800Mb used
  • fastCSV string.Split() : 11.50s 638Mb used
  • TinyCSVparser : 34s 992Mb used

As a comparison, here is a baseline of what is possible on the same dataset:

  1. File.ReadAllBytes() : 1.5s 573Mb used
  2. File.ReadAllLines() with no processing : 3.7s 1633Mb used
  3. File.ReadLines() with no processing : 1.9s
  4. File.ReadLines() + string.Split() no return list : 7.5s

The difference between 1 and 2 is the overhead of converting the bytes to Unicode strings: 2.2s.

The difference between 2 and 3 is the memory overhead of creating the string[] arrays: 1.8s.

The difference between 4 and fastCSV is the overhead of creating T objects and adding them to a list: 4s.
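For reference, the baselines above can be reproduced with a small harness along these lines (a sketch: the generated temp file below is a stand-in so the code is self-contained; point the path at the QCLCD dataset for real numbers, and expect timings to vary by machine):

```csharp
using System;
using System.Diagnostics;
using System.IO;

class BaselineBench
{
    public static double Time(string label, Action work)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        Console.WriteLine($"{label} : {sw.Elapsed.TotalSeconds:0.00}s");
        return sw.Elapsed.TotalSeconds;
    }

    static void Main()
    {
        // Stand-in file; replace with the 585 MB QCLCD201503 file to reproduce.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "a,b,c", "1,2,3", "4,5,6" });

        Time("1. ReadAllBytes     ", () => File.ReadAllBytes(path));
        Time("2. ReadAllLines     ", () => File.ReadAllLines(path));
        Time("3. ReadLines        ", () => { foreach (var _ in File.ReadLines(path)) { } });
        Time("4. ReadLines + Split", () => { foreach (var l in File.ReadLines(path)) l.Split(','); });

        File.Delete(path);
    }
}
```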

Roads not taken

  • Loading chunks into a buffer:

    • While I initially tried this route, it proved too complex and I couldn't get it to work properly. Judging by other libraries which did this, it is slower than the current implementation anyway.
  • StringBuilder character by character:

    • This option proved too slow for parsing columns out of a line.

Using the code

Below are some examples of how to use fastCSV:

C#
public class car
{
    // you can use fields or properties
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}

// listcars is a List<car>
var listcars = fastCSV.ReadFile<car>(
    "csvstandard.csv", // filename
    true,              // has header
    ',',               // delimiter
    (o, c) =>          // to object function o : car object, c : columns array read
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        // add to list
        return true;
    });

fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",   // filename
    new string[] { "WBAN", "Date", "SkyCondition" }, // headers defined or null
    '|',               // delimiter
    list,              // list of LocalWeatherData to save
    (o, c) =>          // from object function
    {
        c.Add(o.WBAN);
        c.Add(o.Date.ToString("yyyyMMdd"));
        c.Add(o.SkyCondition);
    });

Helper functions for performance

fastCSV has the following helper functions:

  • int ToInt(string s) creates an int from a string
  • int ToInt(string s, int index, int count) creates an int from a substring
  • DateTime ToDateTimeISO(string value, bool UseUTCDateTime) creates a DateTime from an ISO standard string, i.e. yyyy-MM-ddTHH:mm:ss (with an optional .nnnZ part)
C#
public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        o.WBAN = c[0];
        // c[1] data is in "20150301" format
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4), 
                              fastCSV.ToInt(c[1], 4, 2), 
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        return add;
    });

Usage Scenarios

  • Filtering CSV while loading

    • In your map function, you can write conditions on the loaded line data and filter out lines you don't want by using return false;
  • Reading CSV to import into other systems

    • In your map function, you can send the line data to another system and return false;
    • or you can process the entire file and use the List<T> returned
  • Processing/aggregating data while loading

    • You can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read
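The aggregate-while-loading pattern can be sketched like this. The Load() helper below is a simplified stand-in (no quoting or multi-line handling) so the example runs without the library; with fastCSV you would pass the same delegate to fastCSV.ReadFile<T>():

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

class AggregateWhileLoading
{
    // Same callback shape as fastCSV's map function: (object, columns) -> keep?
    delegate bool ToObj<T>(T o, string[] columns);

    // Minimal stand-in loop (naive Split, no quote handling) so the sketch runs
    // without the library.
    static List<T> Load<T>(IEnumerable<string> lines, ToObj<T> mapper) where T : new()
    {
        var list = new List<T>();
        foreach (var line in lines)
        {
            var cols = line.Split(',');
            var o = new T();
            if (mapper(o, cols)) list.Add(o);
        }
        return list;
    }

    class Row { }

    static void Main()
    {
        double sum = 0, max = double.MinValue;
        int rows = 0;
        var kept = Load<Row>(new[] { "a,3000.00", "b,4900.00", "c,5000.00" }, (o, c) =>
        {
            double price = double.Parse(c[1], CultureInfo.InvariantCulture);
            sum += price; rows++;
            if (price > max) max = price;
            return false; // aggregate only -- keep nothing in the list
        });
        Console.WriteLine($"kept={kept.Count} rows={rows} avg={sum / rows:0.00} max={max}");
    }
}
```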

Inside the code

Essentially, the reading is a loop which parses a line, creates a generic element for the list, hands off the created object and the columns extracted from the line to the user-defined map function, and adds the object to the list for return (if the map function says so):

C#
var c = ParseLine(line, delimiter, cols);
T o = new T();
var b = mapper(o, c);
if (b)
   list.Add(o);

Now, the complexity in the CSV standard comes from handling multi-line values correctly. This is done by counting the quotes in a line: an odd number of quotes means the row is multi-line, and subsequent lines are read and appended until the quote count is even. This is handled in the ReadFile() function.
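The odd/even quote-count rule can be illustrated in isolation (a standalone sketch, not the library's actual code):

```csharp
using System;

static class MultiLineRule
{
    // A record is complete when the accumulated text contains an even number
    // of quote characters; an odd count means a quoted value is still open
    // and the next physical line belongs to the same record.
    public static bool RecordIsComplete(string accumulated)
    {
        int quotes = 0;
        foreach (char ch in accumulated)
            if (ch == '"') quotes++;
        return quotes % 2 == 0;
    }
}
```

For the Ford row of the sample file, `RecordIsComplete("1997,Ford,\"E350")` is false (one quote, the value is still open), and it only becomes true once the line containing the closing quote has been appended.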

The beauty of this approach is that it is simple, does no reflection and is really fast, with the control being in the user's hands.

All the reading code is below :

C#
public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;
    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // actual col count
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }
            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            // if multiline add line to sb and continue
            if (multiline)
            {
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}

private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        while (++index < len)   // visit indexes 0..len-1 only
        {
            char ch = *(s + index);
            if (ch == c)
                count++;
        }
    }
    return count;
}

private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    //return line.Split(delimiter);
    int col = 0;
    int linelen = line.Length;
    int index = 0;

    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // non quoted
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted string change "" -> "
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                // find matching quote until delim or EOL
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }

    return columns;
}

ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.

And the writing code is just :

C#
public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);
                for (int i = 0; i < cols.Count; i++)
                {
                    // quote string if needed -> \" \r \n delim 
                    var str = cols[i].ToString();
                    bool quote = false;

                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");
                    }

                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}

Sample Use cases

Splitting data sets for testing and training

In data science, you generally split your data into training and testing sets; in the example below, every 3rd row goes to testing (you could make the splitting more elaborate):

C#
var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        line++;
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        if (line % 3 == 0)
        {
            add = false;
            testing.Add(o);
        }
        return add;
    });

Appendix v2.0

With apologies to NReco.CSV: I messed up the timings for that library, which was pointed out to me by KPixel on GitHub. This prompted me to go back to the drawing board and redo the internals of fastCSV for more speed.

The new code is even faster than a naïve roll-your-own File.ReadLines() plus string.Split(), which can't handle multi-line values anyway.

Performance

The new performance numbers are below; compared to the v1 code, this is nearly 2x faster at the expense of a little more memory, on the same 4,496,263-row dataset.

  • fastCSV .net4 : 6.27s  753Mb used
  • fastCSV core : 6.51s  669Mb used

Interestingly on .net core, the library uses less memory.

Changes Made

  • First up was creating a buffered StreamReader instead of relying on File.ReadLines(); this is handled by the BufReader class. This class also handles multi-line values without resorting to a StringBuilder, as well as the case where a line is larger than the buffer being used to read it.
  • While testing, I discovered that using IL to create the generic list objects is faster than using new T().
  • A poor man's (.net 4) Span, the MGSpan class, which just passes around the buffer, start and length of the data and delays the creation of strings until they are actually used in the object-filling delegate.
  • ReadFile<T>() looks cleaner now and does not use StringBuilder for multi-lines.
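The MGSpan idea can be sketched roughly as follows (an illustrative struct under assumed names; the real type's layout and members may differ):

```csharp
using System;

// A minimal "poor man's Span": refer to a slice of a shared char buffer and
// delay the string allocation until the value is actually asked for.
struct CharSlice
{
    public readonly char[] Buffer;
    public readonly int Start;
    public readonly int Length;

    public CharSlice(char[] buffer, int start, int length)
    {
        Buffer = buffer; Start = start; Length = length;
    }

    // The only allocation happens here, and only if the caller needs it.
    public override string ToString() => new string(Buffer, Start, Length);
}
```

Usage: `new CharSlice("1999,BMW,Z3".ToCharArray(), 5, 3).ToString()` yields "BMW", while columns the delegate never touches cost no string allocation at all.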

Some Weird Stuff

The first pass of FillBuffer() used _tr.BaseStream.Seek() to go back in the CSV data file when the end of the buffer was reached and a line was not complete. This was fine at first, but it failed on non-ASCII data and UTF-8 encoded files: some characters are encoded as 2 bytes, not 1, so a .NET char, while looking like a byte, can in fact be 2 bytes. This messes up the offset computation when seeking, which results in reading incorrect lines.

To remedy this, I used Array.Copy() to copy the remaining characters to the start of the buffer and read the rest from the file until the buffer is full; interestingly, you can't use Buffer.BlockCopy() here for the same reason as above.
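The char-versus-byte mismatch is easy to demonstrate:

```csharp
using System;
using System.Text;

class CharsVsBytes
{
    static void Main()
    {
        string s = "naïve";                            // 5 chars in .NET
        int byteCount = Encoding.UTF8.GetByteCount(s); // but 6 UTF-8 bytes: 'ï' encodes as 2
        Console.WriteLine($"chars={s.Length} utf8bytes={byteCount}"); // prints "chars=5 utf8bytes=6"
        // Seeking back by the char count would land one byte off,
        // which is exactly the bug described above.
    }
}
```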

A simple change from class MGSpan to struct MGSpan results in a very distinct speed-up, probably because the objects are passed around on the stack and don't stay around long, which is faster than using the heap.


History

  • Initial release : 1st January 2020
  • Update v2.0 : 27th January 2020
    • article corrections
    • speed ups
  • Update v2.0.1 : 5th September 2020
    • new overloads
    • ignore escaped characters at the end of a quoted line (thanks to Denis Samuel)
  • Update v2.0.2 : 6th September 2020
    • ToString() null check (thanks to Martijn)
  • Update v2.0.8 : 6th December 2020
    • added unit tests
    • bug fix trailing new line sometimes
    • bug fix empty trailing data columns showing prev row data
    • switched to TextReader
    • more ReadStream() overloads
    • bug fix reading first line when hasheader = false

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Mehdi Gholam
Architect -
United Kingdom United Kingdom
Mehdi first started programming when he was 8, on a BBC+ 128k machine in 6512 processor language; after various hardware and software changes, he eventually came across .NET and C#, which he has been using since v1.0.
He is formally educated as a systems analyst and industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan '12)
* Mehdi is the 3rd person to get 7 out of 7 Platinums on CodeProject (26th Aug '16)
