Click here to Skip to main content
15,885,707 members
Articles / Programming Languages / C#

Auto-detect CSV Separator

Rate me:
Please Sign up or sign in to vote.
4.98/5 (111 votes)
27 Jul 2011CPOL4 min read 96.9K   2.3K   143   14
Explains how to detect which character is used as a separator in CSV file

auto-detect-csv-separator/AutoDetectCsvSeparator.png

Introduction

CSV files are very popular for storing tabular data because they are simple textual files with a very few rules. This makes them very interoperable because CSV readers and writers are relatively easy to implement. Interoperability is, probably, the first reason why someone would choose to save the data in CSV format.

Although rules for writing and reading CSV files, which are explained in the next chapter, are relatively known and widely accepted, one rule is an exception – determining a character that will be used as a separator. CSV files, as the name Comma Separated Values says, should use comma [,] as the separator but there are many CSV files that use semicolon [;] or horizontal tab [\t] as a separator.

So, in order to build a generic CSV reader that will read CSV file regardless of the separator, the reader must first figure out which character is used as a separator. This article gives one possible solution to this problem.

CSV Format

Rules for writing CSV files are pretty simple:

  • If value contains separator character or new line character or begins with a quote – enclose the value in quotes.
  • If value is enclosed in quotes – any quote character contained in the value should be followed by an additional quote character.

These two simple rules enable us to write CSV writers easily, in just few minutes. Implementing CSV reader is much more problematic because CSV stream has to be parsed sequentially, character by character and additional state storage has to be provided – which effectively makes CSV reader a state machine. There are a lot of CSV readers out there that have wrong implementation because they do not follow the rules stated above.

Implementation

Now that we have defined the rules for CSV files, we can implement CSV reader that is able to find out which character is used as a separator. 

Here is an entire C# source code of the method that detects separator in CSV stream: 

C#
public static char Detect(TextReader reader, int rowCount, IList<char> separators)
{
    IList<int> separatorsCount = new int[separators.Count];

    int character;

    int row = 0;

    bool quoted = false;
    bool firstChar = true;

    while (row < rowCount)
    {
        character = reader.Read();

        switch (character)
        {
            case '"':
                if (quoted)
                {
                    if (reader.Peek() != '"') // Value is quoted and 
			// current character is " and next character is not ".
                        quoted = false;
                    else
                        reader.Read(); // Value is quoted and current and 
				// next characters are "" - read (skip) peeked qoute.
                }
                else
                {
                    if (firstChar) 	// Set value as quoted only if this quote is the 
				// first char in the value.
                        quoted = true;
                }
                break;
            case '\n':
                if (!quoted)
                {
                    ++row;
                    firstChar = true;
                    continue;
                }
                break;
            case -1:
                row = rowCount;
                break;
            default:
                if (!quoted)
                {
                    int index = separators.IndexOf((char)character);
                    if (index != -1)
                    {
                        ++separatorsCount[index];
                        firstChar = true;
                        continue;
                    }
                }
                break;
        }

        if (firstChar)
            firstChar = false;
    }

    int maxCount = separatorsCount.Max();

    return maxCount == 0 ? '\0' : separators[separatorsCount.IndexOf(maxCount)];
}

CSV stream is represented with reader parameter that is used for reading characters from CSV stream, parameter rowCount tells the method how many rows should be read before determining separator and separators parameter is a list of characters that tells the method which characters are possible separators.

Method maintains internal state with these parameters:

  • separatorsCount – used for counting the number of occurrences of possible separator as a separator in CSV stream
  • character – last character that was read from the CSV stream
  • row – index of the currently processing row in the CSV stream
  • quotedtrue if characters that are read next are enclosed in quotes, otherwise false
  • firstChartrue if next character that is to be read is the first character of the next entry in CSV stream. This parameter is needed because we consider a value to be enclosed in quotes only if opening quote is the first character of the CSV entry.

When rowCount rows are read or CSV stream is read to the end, method returns first of the possible separators that has maximum number of occurrences as a separator in CSV stream. If any of the possible separators never occurred as a separator in CSV stream, ‘\0’ is returned.

Method takes care when reading quotes, separators and new line characters that are part of the quoted value. In this case, if a quote is read, method will peek into CSV stream to see if the next character is also a quote, otherwise it will consider this quote to be a closing quote. New line and separator characters are ignored if contained in a quoted value.

For example, in the following Employees.csv file:

Name,Surname,Salary
John,Doe,"$2,130"
Fred;Nurk;"$1,500"
Hans;Meier;"$1,650"
Ivan;Horvat;"$3,200"

Method detects that CSV separator is [;] although total number of occurrences of [;] is 6 and total number of occurrences of [,] is 8. That is because last 4 occurrences of [,] are enclosed in quotes so they don’t qualify as a possible separators. So total number of occurrences of [,] as separators is 4 and total number of occurrences of [;] as separators is 6, which makes [;] the most probable CSV separator.

Bundled with this article is a WPF solution that demonstrates auto detection of CSV separator in action. Solution can be downloaded here. Application is located in bin/Release folder.

Alternative

It shouldn’t be too hard to derive the entire CSV reader from the code presented in this article, but tabular data can come in many different formats and implementing a reader and a writer for each of them may not be so easy and could really hurt your productivity.

For that reason, you could use some third party component which supports various file formats. This will probably cost you some money, but formats like XLS, XLSX, CSV, ODS, HTML are likely to be supported within the same API, so your application will be able to target different file formats using the same code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) GemBox Ltd
United Kingdom United Kingdom
Josip Kremenic works as a developer at GemBox Software.
He works on:

  • GemBox.Spreadsheet - a C# / VB.NET Excel component for reading and/or writing XLS, XLSX, CSV, HTML, PDF, XPS and ODS files.
  • GemBox.Document - a C# / VB.NET Word component for reading and/or writing DOCX, DOC, HTML, PDF, XPS, RTF and TXT files.
  • GemBox.Presentation- a C# / VB.NET PowerPoint component for reading and/or writing PPTX, PPT, PDF and XPS files.
  • GemBox.Email - a C# / VB.NET Email component for composing, receiving and sending MSG, EML and MHTML email messages using IMAP, POP and SMTP.
  • GemBox.Pdf - a C# / VB.NET Pdf component for for reading and/or writing PDF files.

Comments and Discussions

 
Praiseexcellent Pin
Southmountain12-Apr-22 17:47
Southmountain12-Apr-22 17:47 
QuestionMaybe counting per row... Pin
NickUser1233-Nov-15 7:31
NickUser1233-Nov-15 7:31 
QuestionCan you provide equivalent java code for Detect method ? Pin
ronitkheni17-Sep-15 8:38
ronitkheni17-Sep-15 8:38 
GeneralGreat solution Pin
Stepan Kobzey9-Mar-15 9:19
Stepan Kobzey9-Mar-15 9:19 
GeneralMy vote of 5 Pin
Manoj Kumar Choubey12-Feb-12 23:58
professionalManoj Kumar Choubey12-Feb-12 23:58 
GeneralExcellent idea Pin
Gustav Brock2-Aug-11 3:17
professionalGustav Brock2-Aug-11 3:17 
GeneralMy vote of 4 Pin
YvesDaoust1-Aug-11 8:58
YvesDaoust1-Aug-11 8:58 
Questionnice one Pin
mayur csharp G29-Jul-11 0:37
mayur csharp G29-Jul-11 0:37 
QuestionThoughts Pin
PIEBALDconsult26-Jul-11 18:29
mvePIEBALDconsult26-Jul-11 18:29 
AnswerRe: Thoughts Pin
JosipK26-Jul-11 21:47
professionalJosipK26-Jul-11 21:47 
GeneralRe: Thoughts Pin
PIEBALDconsult27-Jul-11 2:35
mvePIEBALDconsult27-Jul-11 2:35 
GeneralRe: Thoughts Pin
JosipK27-Jul-11 3:30
professionalJosipK27-Jul-11 3:30 
GeneralRe: Thoughts Pin
PIEBALDconsult27-Jul-11 14:26
mvePIEBALDconsult27-Jul-11 14:26 
GeneralRe: Thoughts Pin
jsc421-Aug-11 23:20
professionaljsc421-Aug-11 23:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.