Cinchoo ETL - Split a Large JSON File Based on Deeply Nested Array Property

Cinchoo

5.00/5 (2 votes)

Jan 12, 2022

CPOL

3 min read

6963

Tip to split large JSON file based on deeply nested array property using Cinchoo ETL

1. Introduction

ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code based library for extracting data from multiple sources, transforming, and loading into your very own data warehouse in .NET environment. You can have data in your data warehouse in no time.

This article talks about splitting large JSON file based on deeply nested array property using Cinchoo ETL framework. This method helps to parse the large JSON file without going through out of memory exception by splitting them into multiple smaller files.

2. Requirement

This framework library is written in C# using .NET 4.5 / .NET Core 3.x Framework.

3. How to Use

3.1 Sample Data

Let's begin by looking into the sample JSON file below. Assuming the JSON file is large in size, in here, InstrumentData nodes are repeated and contains thousands of elements in it. The objective of this exercise is to produce split files containing all the root nodes along with one InstrumentData in each in them.

Listing 3.1.1. Input JSON file (sample.json)

{
  "Job": {
    "Keys": {
      "JobID": "test123",
      "DeviceID": "TEST01"
    },
    "Props": {
      "FileType": "Measurements",
      "InstrumentDescriptions": [
        {
          "InstrumentID": "1723007",
          "InstrumentType": "Actual1",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "InstrumentID": "2424009",
          "InstrumentType": "Actual2",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "Steps": [
      {
        "Keys": {
          "StepID": "START",
          "StepResult": "NormalEnd"
        },
        "InstrumentData": [
          {
            "Keys": {
              "InstrumentID": "1723007"
            },
            "Measurements": [
              {
                "DateTime": "2021-11-16 21:18:37.000",
                "Value": 540
              },
              {
                "DateTime": "2021-11-16 21:18:37.100",
                "Value": 539
              },
              {
                "DateTime": "2021-11-16 21:18:37.200",
                "Value": 540
              },
              {
                "DateTime": "2021-11-16 21:18:37.300",
                "Value": 540
              },
            ]
          },
          {
            "Keys": {
              "InstrumentID": "2424009"
            },
            "Measurements": [
              {
                "DateTime": "2021-11-16 21:18:37.000",
                "Value": 1333.22
              },
              {
                "DateTime": "2021-11-16 21:18:37.100",
                "Value": 1333.22
              },
            ]
          }
        ]
      }
    ]
  }
}

Step 1: In the above, Job.Keys and Job.Props is common factor for all elements, needs to be included all new split files.

Step 2: Since the input file comes with Job.Steps[*].InstrumentData[*] nodes, which requires 2 levels of parsing to split the files by InstrumentData level.

Step 3: Break the file by each Steps[*] node (aka. Steps_0.json, Steps_1.json, etc.)

Step 4. Then take each StepsFiles and break them by each InstrumentData[*] node level. (aka. InstrumentData_0.json, InstrumentData_1.json, etc.)

Final expected split files should look as below.

Listing 3.1.2. Output split JSON file (InstrumentData_0.json)

{
  "Job": {
    "Keys": {
      "JobID": "test123",
      "DeviceID": "TEST01"
    },
    "Props": {
      "FileType": "Measurements",
      "InstrumentDescriptions": [
        {
          "InstrumentID": "1723007",
          "InstrumentType": "Actual1",
          "Name": "U",
          "DataType": "Double",
          "Units": "degC"
        },
        {
          "InstrumentID": "2424009",
          "InstrumentType": "Actual2",
          "Name": "VG03",
          "DataType": "Double",
          "Units": "Pa"
        }
      ]
    },
    "Steps": {
      "Keys": {
        "InstrumentID": "1723007"
      },
      "Measurements": [
        {
          "DateTime": "2021-11-16 21:18:37.000",
          "Value": 540
        },
        {
          "DateTime": "2021-11-16 21:18:37.100",
          "Value": 539
        },
        {
          "DateTime": "2021-11-16 21:18:37.200",
          "Value": 540
        },
      ]
    }
  }
}

The first thing to do is to install ChoETL.JSON/ChoETL.JSON.NETStandard nuget package. To do this, run the following command in the Package Manager Console.

.NET Framework

Install-Package ChoETL.JSON

.NET Core

Install-Package ChoETL.JSON.NETStandard

Now add ChoETL namespace to the program.

using ChoETL;

3.2 Split Operation

As JSON file comes in large in size, we need to consider deserialize InstrumentData nodes in stream model rather than loading entire file in memory to avoid memory pressure.

First, split the file by Steps nodes as below:

Capture the value of Job.Keys node.
Capture the value of Job.Props node.
Then loop through each Job.Steps node, generate output split files (aka Steps_0.json, Steps_1.json, etc.) using ChoJObjectWriter (utility class to write json values in stream manner)
Finally, the method returns list of steps split filenames for consuming to split by InstrumentData node.

Listing 3.2.1. Split by Steps nodes

static string[] SplitBySteps(string inputFilePath)
{
    List<string> stepsFiles = new List<string>();
    
    dynamic keys = null;
    dynamic props = null;
    
    //Capture Keys
    using (var r = new ChoJSONReader(inputFilePath).WithJSONPath("$..Job.Keys"))
    {
        keys = r.FirstOrDefault();
    }
    
    //Capture props
    using (var r = new ChoJSONReader(inputFilePath).WithJSONPath("$..Job.Props"))
    {
        props = r.FirstOrDefault();
    }
    
    int fileCount = 0;
    //Loop thro Steps, write to individual files
    using (var r = ChoJSONReader.LoadText(json).WithJSONPath("$..Job.Steps")
           .NotifyAfter(1)
           .Setup(s => s.RowsLoaded += (o, e) => 
            $"Step Nodes loaded: {e.RowsLoaded} <- {DateTime.Now}".Print())

           //Callback used to hook up to loader, stream the nodes to file 
           //(this avoids loading to memory)
           .Configure(c => c.CustomJObjectLoader = (sr, s) =>
                      {
                          string outFilePath = $"Steps_{fileCount++}.json";
                          $"Writing to `{outFilePath}` file...".Print();

                          using (var topJo = new ChoJObjectWriter(outFilePath))
                          {
                              topJo.Formatting = Newtonsoft.Json.Formatting.Indented;
                              using (var jo = new ChoJObjectWriter("Job", topJo))
                              {
                                  jo.WriteProperty("Keys", keys);
                                  jo.WriteProperty("Props", props);
                                  jo.WriteProperty("Steps", sr);
                              }
                          }

                          //File.ReadAllText(outFilePath).Print();
                          //"".Print();

                          stepsFiles.Add(outFilePath);
                          
                          return ChoJSONObjects.EmptyJObject;
                      })
          )
    {
        r.Loop();
    }
    
    return stepsFiles.ToArray();
}

Next step is to consume the above generated steps split files one at a time, split them by InstrumentData node. Code below shows how.

Next, split the file by InstrumentData nodes as below:

Capture the value of Job.Keys node.
Capture the value of Job.Props node.
Capture the value of Job.Steps.Keys node.
Then loop through each Job.Steps.InstrumentData node, generate output split files (aka InstrumentData_0.json, InstrumentData_1.json, etc.) using ChoJObjectWriter.
Finally, the method returns list of InstrumentData split filenames.

Listing 3.2.2. Split by InstrumentData nodes

static string[] SplitByInstrumentData(string stepsFilePath)
{
    List<string> instrumentDataFiles = new List<string>();
    
    dynamic keys = null;
    dynamic props = null;
    dynamic stepsKeys = null;
    
    //Capture Keys
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Keys"))
    {
        keys = r.FirstOrDefault();
    }
    
    //Capture props
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Props"))
    {
        props = r.FirstOrDefault();
    }
            
    //Capture steps/keys
    using (var r = new ChoJSONReader(stepsFilePath).WithJSONPath("$..Job.Steps.Keys"))
    {
        stepsKeys = r.FirstOrDefault();
    }

    int fileCount = 0;
    //Loop thro InstrumentData, write to individual files
    using (var r = ChoJSONReader.LoadText(json).WithJSONPath("$..Job.Steps.InstrumentData")
           .NotifyAfter(1)
           .Setup(s => s.RowsLoaded += (o, e) => $"InstrumentData Nodes loaded: 
                                       {e.RowsLoaded} <- {DateTime.Now}".Print())

           //Callback used to hook up to loader, stream the nodes to file 
           //(this avoids loading to memory)
           .Configure(c => c.CustomJObjectLoader = (sr, s) =>
                      {
                          string outFilePath = $"InstrumentData_{fileCount++}.json";
                          $"Writing to `{outFilePath}` file...".Print();

                          using (var topJo = new ChoJObjectWriter(outFilePath))
                          {
                              topJo.Formatting = Newtonsoft.Json.Formatting.Indented;
                              using (var jo = new ChoJObjectWriter("Job", topJo))
                              {
                                  jo.WriteProperty("Keys", keys);
                                  jo.WriteProperty("Props", props);
                                  jo.WriteProperty("Steps", sr);
                              }
                          }

                          File.ReadAllText(outFilePath).Print();
                          "".Print();

                          instrumentDataFiles.Add(outFilePath);
                          
                          return ChoJSONObjects.EmptyJObject;
                      })
          )
    {
        r.Loop();
    }
    
    return instrumentDataFiles.ToArray();
}

Finally, use the above two methods to accomplish the split process by InstrumentData as below.

Listing 3.2.3. Main() method

public static void Main()
{
    var inputFilePath = "input.json";
    
    foreach (var stepsFilePath in SplitBySteps(inputFilePath))
    {
        SplitByInstrumentData(stepsFilePath);
        File.Delete(stepsFilePath);
    }
}

Sample fiddle: https://dotnetfiddle.net/j3Y03m

For more information about Cinchoo ETL, please visit the other CodeProject articles:

Cinchoo ETL - Articles

History

12^th January, 2022: Initial version