Click here to Skip to main content
15,077,494 members
Articles / General Programming / Parser
Article
Posted 11 Jul 2021

Stats

3.8K views
133 downloads
10 bookmarked

CSV Interface - A CSV Parser for VBA. Stable, Efficient, Flexible and Standards-compliant

Rate me:
Please Sign up or sign in to vote.
4.43/5 (7 votes)
11 Jul 2021GPL39 min read
To cover the need to import information from a CSV file directly into RAM, in an efficient, secure and stable way, from VBA, specialized libraries are needed to ensure the ease of use and integrity of the data.
In this article, we will take a look at the CSV specifications exposed in RFC-4180 and a brief explanation of the structure that a file of this type can have. I will address some examples of how to use CSV interface, some extreme cases in which the vast majority of parsers fail and I will detail all the features that the library under study has.

Introduction

For many users, design a CSV library is something unfortunate, premise that is based on the existence of tools such as Power Query for Excel and the powerful SQL language for Microsoft Access. However, if our need is to read the information from a CSV file and store the data read in RAM, using Power Query in Excel would result in the steps: establish a physical connection to the file, dump the data into a spreadsheet, copy the data to a VBA array, delete the data dumped to the sheet and delete the connection to the file. Clearly, for the aforementioned case, the best option would be to use a library that allows us to dump the data directly from the CSV file to the RAM memory. That is the need that has led me to design the CSV interface.

Capabilities

  • RFC-4180 specs compliant and more. The parser is adjusted to a variety of situations although it is preferable to follow the specifications: supports quoted and multi-line fields, user can skip commented and blank records.
  • Stable. Fully Test Driven Developed (TDD) library, (64/64 test passed), that includes 650+ line of code for testing. See VBA test library by Tim Hall.
  • Memory-friendly. CSV/TSV files are processed using a custom stream technique, only 0.5MB are in memory at a time.
  • Robust. Parser and writer accept Unix-style quotes escape sequences.
  • Easy to use. A few lines of code can do the work!
  • Automatic delimiter guesser. Don't worry if you forgot the file configuration. The interface has a solid strategy for guessing delimiters!
  • Highly Configurable. User can configure the parser to work with a wide range of CSV files.
  • CSV data filtering. Only save the CSV data that fill the specified requirements.
  • Like SQL queries on CSV files. Add your own logic to mimic SQL queries and filter data by criteria (=, <>, >=, <=, AND, OR).
  • Flexible. Import only certain range of records from the given file, import fields (columns) by indexes or names, read records in sequential mode.
  • Dynamic Typing support. Turn CSV data field to a desired VBA data type.
  • Data sorting. Sort CSV imported data using the hyper-fast Yaroslavskiy Dual-Pivot Quicksort like Java.
  • Microsoft Access compatible. The library has a version for those who feel in comfort working through DAO databases, download from here.

RFC-4180 Specs

Currently, there is no defined standard for CSV files, there are a multitude of implementations and variants. Notwithstanding the foregoing, there are specifications such as RFC-4180 that define the basic structure of these files, while variants of this standard are defined in the specifications of the Library Of Congress.

According to the specifications, a CSV can contain:

  • A field (column) containing the fields delimiter character, the record delimiter character (multiline fields), or the escape character (usually double quotation marks). This type of field must be encapsulated ("escaped") using the escape character. Literal escape characters in escaped fields must also be escaped by duplicating each escape character or preceding each escape character with a backslash (Unix style).
  • A commented or empty record (row/line).
  • The tab character (\t) or semicolon (;) as a field delimiter and one of the characters CRLF (\r\n), CR (\r), and LF (\n) as a record delimiter. In exceptional cases, we can find CSV files with a mixture of record delimiters.

The library supports the apostrophe (') as an escape character and, with the limitation of the specifications, a field containing the slang "isn't" should be stored in the CSV file as "isn''t", which is confusing. In these cases, users may choose to use the Unix escape mechanism and store the slang as "isn\'t", which is a more human-readable option.

An example of a CSV with a structure that conforms to the RFC-4180 specifications would be as follows:

"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4

The above example should produce the fields and records shown in the following table (headers added for demonstration purposes):

Column 1 Column 2 Column 3 Column 4
rec1, fld1  

rec1","fld3.1

",

fld3.2

rec1

fld4

rec2, fld1.1

 

fld1.2

rec2 fld2.1"fld2.2"fld2.3   rec2 fld4

In the table, field #3 of record #1 is multiline, the same is true for field #4 of the aforementioned record, and also for field #1 of record #2. According to RFC-4180 specifications, literal escape characters will be escaped by duplicating each of them, so field #3 of record #1 and field #2 of record #2 have escape characters that must be unescaped by the parser. On the other hand, many fields contain the field separator itself, which must be handled by the parser.

So this example is a perfect scenario for testing any CSV parser, and it is already included as a test case in the VBA CSV interface.

Solution Core

Many authors recommend avoiding loading chunks of files into a buffer in order to reduce the memory footprint, and the reason is that this alternative can be extremely complex. Nevertheless, buffering is the route that has been taken to achieve a parser capable of working with CSV files efficiently.

Using the Code

This section will attempt to analyze all the capabilities of the CSV interface.

Import whole CSV file:

VBScript
Sub CSVimport()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"        ' Full path to the file, including its extension.
        .fieldsDelimiter = ","         ' Columns delimiter
        .recordsDelimiter = vbCrLf     ' Rows delimiter
    End With
    With csvinf
        .ImportFromCSV .parseConfig    ' Import the CSV to internal object
    End With
End Sub

Now suppose from the file "Sample.csv" the user only requires to import a specific range of records. It is possible to write a code like the one shown below:

VBScript
Sub CSVimportRecordsRange()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"        ' Full path to the file, including its extension.
        .fieldsDelimiter = ","         ' Columns delimiter
        .recordsDelimiter = vbCrLf     ' Rows delimiter
        .startingRecord = 10           ' Start import on the tenth record
        .endingRecord = 20             ' End of importation in the 20th record
    End With
    With csvinf
        .ImportFromCSV .parseConfig    ' Import the CSV to internal object
    End With
End Sub

If the user wants to sort the imported data, a code like the following can be written:

VBScript
Sub CSVimportAndSort()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"               ' Full path to the file, including its extension.
        .fieldsDelimiter = ","                ' Columns delimiter
        .recordsDelimiter = vbCrLf            ' Rows delimiter
    End With
    With CSVint
        .ImportFromCSV .parseConfig           ' Import the CSV to internal object
        .Sort SortColumn:=1, Descending:=True ' Sort imported data on first column
    End With
End Sub

CSV data are mainly treated as text strings, what if the user wants to do some calculations on the data obtained from a given file? In this situation, the user can change the behavior of the parser to work in dynamic typing mode. Here's an example:

VBScript
Sub CSVimportAndTypeData()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"               ' Full path to the file, including its extension.
        .fieldsDelimiter = ","                ' Columns delimiter
        .recordsDelimiter = vbCrLf            ' Rows delimiter
        .dynamicTyping = True                 ' Enable dynamic typing mode
        '@---------------------------------------------------------
        ' Configure dynamic typing
        .DefineTypingTemplate TypeConversion.ToDate, _
                                TypeConversion.ToLong, _
                                TypeConversion.ToDouble
        .DefineTypingTemplateLinks 6, _
                                    7, _
                                    10
        ' The dynamic typing mode will perform the following:
        '      * Over column 6 ---> String To Date data Type conversion
        '      * Over column 7 ---> String To Long data Type conversion
        '      * Over column 10 ---> String To Double data Type conversion
    End With
    With CSVint
        .ImportFromCSV .parseConfig             ' Import the CSV to internal object
    End With
End Sub

The escape character can be defined as one of them, according to an enumeration:

VBScript
Sub SetEscapeChar()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .escapeToken = EscapeTokens.DoubleQuotes  ' 2 = ["] (Default)
        '.escapeToken = EscapeTokens.Apostrophe   ' 1 = [']
        '.escapeToken = EscapeTokens.Tilde        ' 3 = [~]
    End With
End Sub

Once the data is imported and saved to the internal object, the user can access it in the same way as a standard VBA array. An example would be:

VBScript
Sub LoopData(ByRef CSVint As CSVinterface)
    With CSVint
        Dim iCounter As Long
        Dim cRecord() As Variant              ' Records are stored as a one-dimensional array.
        Dim cField As Variant
        
        For iCounter = 0 To CSVint.count - 1
            cRecord() = .item(iCounter)       ' Retrieves a record
            cField = .item(iCounter, 2)       ' Retrieves the 2nd field of the current record
        Next
    End With
End Sub

However, it is sometimes disadvantageous to store data in containers other than VBA arrays. This becomes especially noticeable when it is required to write the information stored in Excel's own objects, such as spreadsheets, or VBA user forms, the case of list boxes, which allow to be filled in a single instruction using arrays. Then, the user can copy the information from the internal object using code like this:

VBScript
Sub DumpData(ByRef CSVint As CSVinterface)
    Dim oArray() As Variant
    With CSVint
        .DumpToArray oArray            ' Dump the internal data into a two-dimensional array
        .DumpToJaggedArray oArray      ' Dump the internal data into a jagged array
        .DumpToSheet                   ' Dump the internal data into a new sheet 
                                       ' using ThisWorkbook
        '@-------------------------------------------------------------------
        ' *NOTE: ONLY AVAILABLE FOR THE ACCESS VERSION OF THE CSV INTERFACE
        ' Dump the internal data into the Table1 in oAccessDB database.
        ' The method would create indexes in the 2nd and 3th fields.
        .DumpToAccessTable oAccessDB, _
                           "Table1", _
                            2, 3
    End With
End Sub

So far, in the examples addressed, the user has been allowed to choose between two actions:

  1. Import ALL records contained in a CSV file.
  2. Import a recordset, starting at record X and ending at record Y.

In both options, the user is obliged to import all fields (columns) present in the file. Most CSV file parsers only offer the first option, but what if the user wants to save only the information that is relevant to them? and what happens is intended to store in memory only the registers that meet a certain set of requirements?

An user may need to import 2 of 12 columns from a CSV file, in this case, the user can use something like:

VBScript
Sub CSVimportDesiredColumns()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"              ' Full path to the file, including its extension.
        .fieldsDelimiter = ","               ' Columns delimiter
        .recordsDelimiter = vbCrLf           ' Rows delimiter
    End With
    With CSVint
        .ImportFromCSV .parseConfig, _
                        1, "Revenue"         ' Import 1st and "Revenue" fields ONLY
    End With
End Sub

So, OK, let's imagine now that an user wants to apply some logic before saving the data, in which case they can step through the records in the CSV file one by one, using the sequential reader, as shown in the following example:

VBScript
Sub CSVsequentialImport()
    Dim CSVint As CSVinterface
    Dim csvRecord As ECPArrayList
    
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"             ' Full path to the file, including its extension.
        .fieldsDelimiter = ","              ' Columns delimiter
        .recordsDelimiter = vbCrLf          ' Rows delimiter
    End With
    With CSVint
        .OpenSeqReader .parseConfig, _
                        1, "Revenue"        ' Import the 1st and "Revenue" fields using 
                                            ' seq. reader
        Do
            Set csvRecord = .GetRecord
            '//////////////////////////////////////////////
            'Implement your logic here
            '//////////////////////////////////////////////
        Loop While Not csvRecord Is Nothing   ' Loop until the end of the file is reached
    End With
End Sub

Is there a way to sequentially fetch a set of records at a time instead of a single record? Currently, there is no built-in method to do that with a single instruction, as in the examples above, but with a few extra lines of code and the tools provided by the library, it is possible to achieve that goal. This is illustrated in the following example where the CSV file is streamed:

VBScript
Sub CSVimportChunks()
    Dim CSVint As CSVinterface
    Dim StreamReader As ECPTextStream
            
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .fieldsDelimiter = ","                              ' Columns delimiter
        .recordsDelimiter = vbCrLf                          ' Rows delimiter
    End With
    Set StreamReader = New ECPTextStream
    With StreamReader
        .endStreamOnLineBreak = True                        ' Instruct to find line breaks
        .OpenStream "C:\Sample.csv"                         ' Connect to CSV file
        Do
            .ReadText                                       ' Read a CSV chunk
            CSVint.ImportFromCSVString .bufferString, _
                                    CSVint.parseConfig, _
                                    1, "Revenue"            ' Import a set of records
            '//////////////////////////////////////
            'Implement your logic here
            '//////////////////////////////////////
        Loop While Not .atEndOfStream                       ' Continue until reach 
                                                            ' the end of the CSV file.
    End With
    Set CSVint = Nothing
    Set StreamReader = Nothing
End Sub

So far, it has been outlined the way in which you can import the records from a CSV file sequentially, the following example shows how to filter the records, in a like SQL way, according to whether they meet a criterion set by the user:

VBScript
Sub QueryCSV(path As String, ByVal keyIndex As Long, queryFilters As Variant)
    Dim CSVint As CSVinterface
    Dim CSVrecords As ECPArrayList
    
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","                              ' Columns delimiter
        .recordsDelimiter = vbCrLf                          ' Rows delimiter
    End With
    If path <> vbNullString Then
        '@-----------------------------------------------
        ' The following instruction will filter the data
        ' on the keyIndex(th) field.
        Set CSVrecords = CSVint.GetCSVsubset(path, _
                                            queryFilters, _
                                            keyIndex)
        CSVint.DumpToSheet DataSource:=CSVrecords           ' Dump result to new WorkSheet
        Set CSVint = Nothing
        Set CSVrecords = Nothing
    End If
End Sub

In some situations, we may encounter a CSV file with a combination of vbCrLf, vbCr and vbLf as record delimiters. This can happen for many reasons, but the most common is by adding data to an existing CSV file without checking the configuration of the previously stored information. These cases will break the logic of many robust CSV parsers, including the demo of the 737K weekly downloaded Papa Parse. The next example shows how an user can import CSV files with mixed line break as record delimiter, an option that uses the turnStreamRecDelimiterToLF property of the parseConfig object to work with these special CSV files.

VBScript
Sub ImportMixedLineEndCSV()
    Dim CSVint As CSVinterface
            
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Mixed Line Breaks.csv"
        .fieldsDelimiter = ","                ' Columns delimiter 
        .recordsDelimiter = vbCrLf            ' Rows delimiter
        .turnStreamRecDelimiterToLF = True    ' All delimiters will be turned into vbLf
    End With
    With CSVint
        .ImportFromCSV .parseConfig
    End With
    Set CSVint = Nothing
End Sub

In all the above examples, an implicit assumption has been made, and that is that the user knows the configuration of the CSV file to be imported, so the question arises: can it be possible that the user does not know the configuration of the file to be imported? It is certainly possible, so how can the CSV interface help in these cases?

The tool includes a utility to guess field delimiters, record delimiters and escape character. This can be done with code like the following:

VBScript
Sub DelimitersGuessing()
    Dim CSVint As CSVinterface

    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"           ' Full path to the file, including its extension.
    End With
    With CSVint
        .GuessDelimiters .parseConfig     ' Try to guess delimiters and save to internal
                                          ' parser configuration object.
        '@--------------------------------------------------------------
        ' *NOTE: the user can also create a custom configuration object
        '        and try to guess the delimiter with it.
    End With
End Sub

Points of Interest

Guessing the delimiters, or determining the dialect of the CSV file, may be seen as a simple problem, but it is definitely not. Even at the time of writing, there is open research on the subject. For example, G.J.J. van den Burg's research ended with the implementation of a robust system called CleverCSV which allows the dialect of CSV files to be determined. The solution proposed by the researcher determines the dialect of a CSV by evaluating the patterns of the rows, in order to check their uniformity, and the number of fields that can be converted to the data types established in the system. Then, your Achilles heel comes when you try to define the dialect in a file that makes use of unseated data, as exposed in its issue #37 and in the issue #35.

# This CSV has caused CleaverCSV issue 37
# The parser guess delimiter [:] instead of [,]
"{""fake"": ""json"", ""fake2"":""json2""}",13:31:38,06:00:04+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",22:13:29,14:20:11+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:37:27,22:04:28+03:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:25:28,23:12:53+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",21:04:15,08:23:58+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",10:37:03,11:06:42+05:30
"{""fake"": ""json"", ""fake2"":""json2""}",10:17:24,23:38:47+06:00
"{""fake"": ""json"", ""fake2"":""json2""}",00:02:51,20:04:45-06:00

For its part, Papa Parse is unable to disambiguate the field delimiter when faced with CSV files in which the comma (',') is used as a decimal separator and the semicolon (';') as a field’s delimiter. Here are some CSVs that prevent Matt Holt's powerful piece of code from properly determining delimiters:

Prüfung1;Prüfung2;Prüfung3
1,5;33,33;15,55
2,5;25,44;30,1
3,5;16,67;45,2
4,5;12;60,3
'Neroductions Group';£ 1,80;£ 9000,50
'Hatchworks Ltd.';£ 2,00;£ 100000,30
id;value
1;3,4,5
2;6,7,8
3;9,10,11
4;13,14,15
5;"15,16,17;also;that"

Does this mean that we can't determine the dialect or guess the delimiters of a CSV with complete certainty? The answer is that there is no foolproof solution to disambiguate tables generated by random configuration of delimiters and escape characters. Anyway, the method of statistical scoring of fields, records and tables implemented in CSV interface works perfectly with the files that caused problems for CleverCSV and Papa Parse.

In the case of CSV interface, the problem of disambiguation becomes insoluble when two dialects produce tables with the same congruence, a situation that can occur in CSV files without headers such as the one shown below:

1|2;3|4;5
3;a|c;6|6

The CSV interface scoring method will produce exactly the same result, both for the dialect that uses the vertical bar as a field delimiter and for the dialect that uses the semicolon for those purposes, there is no machine or human disambiguation possible for those particular cases.

History

  • 11th July, 2021: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Share

About the Author

W. García
Dominican Republic Dominican Republic
No Biography provided

Comments and Discussions

 
QuestionPregunta Pin
Pablo H 202112-Jul-21 18:11
MemberPablo H 202112-Jul-21 18:11 
AnswerRe: Pregunta Pin
W. García13-Jul-21 1:47
MemberW. García13-Jul-21 1:47 
GeneralMy vote of 5 Pin
JimCoffell12-Jul-21 6:55
professionalJimCoffell12-Jul-21 6:55 
PraiseRe: My vote of 5 Pin
W. García12-Jul-21 7:54
MemberW. García12-Jul-21 7:54 
I hope it is useful. If I can complete the V4.0.0 version, do not hesitate to share it with this great community!

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.