Demo Data

Kornfeld Eliyahu Peter

5.00/5 (11 votes)

Oct 30, 2017

CPOL

12 min read

10832

142

“Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!” - LEWIS CARROLL

Download source code - 3 MB

Introduction

When you are about to present an application/feature to a potential customer or future user, the data you are using is very important. Missing details, too few records, repetitive info or gibberish can create bad impression. Wrong display can ruin all your effort you put into your code, after all what people see is the surface...
In this article you will see a tool, that I used to randomize usable data...

Background

Most of my work is connected to data. Some customers have them already or have most concrete specifics on how to create them (and they mostly do it), both make it easy to prepare a demo. However, the majority has only dreams, but nothing specific enough to start with, or keep their real data confident, so you have to improvise. In all those cases it is me who have to define the necessary data scheme and supply some demo values, all this to ignite the imagination of the customer and show how capable I'm...

Over years I used different methods to create such demo data, and in time I created my own tool to do so. In this article I will show you how the core built, how to use and extend it.
The final result is actually a professional level tool, you can use to create any demo data for any language you have to deal with...

Code

The complete code can be found in the attachment only. The code segments in the articles are there to show the idea and not always work on their own...

The Core

The Difference Between Random And Random

First thing you have to understand is the different types of data you have to be able to randomize. And I do not mean all those data types you know from SQL and others, like integers and strings. My point of view is about how humans define a piece of data, how rules applied and how strict those rules are.

For instance, think about numbers. Numbers are well defined by strict rules a computer can re-create easily. All the options of scientific notation, thousand separators or monetary signs are relevant only when presenting those numbers and have no effect on how the data stored, which is the only relevant factor when talking about randomizing it.

On the other end think about names. Names for places or for persons. These things can not be randomly generated by the same approach as numbers. Until computers will reach the level of creativity needed to fabricate a name acceptable by humans, the only way to create them is picking from a list. The longer the list the richer the result.

And in between... Addresses are a mixture of the two. It has fairly strict rules about the order of the different parts, but those parts have different rules of randomization. But if you check every part either has strict rules, like numbers or has to be picked, like names.

Summarizing all these you can see, that demo data can be one of these:

True random
List based
Complex, a mixture of the two from above with some literals added

Note: To be sure, computers alone can not create real-true random, like we do when throwing a dice or spin the roulette wheel, The .NET framework uses a pseudo-random number generator as explained in the documentation:

Pseudo-random numbers are chosen with equal probability from a finite set of numbers. The chosen numbers are not completely random because a mathematical algorithm is used to select them, but they are sufficiently random for practical purposes. The current implementation of the Random class is based on a modified version of Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, third edition, 1997.

True Random

This is the most simple part. All you need is a random number generator, and .NET has it (same with most other modern/common languages). And using that random number generator we can create methods for all the data types with fixed rules. Let see some sample...

Any positive number of the specified length

public static string Number ( int MinLength = 0, int MaxLength = 0 )
{
    if ( MaxLength == 0 )
    {
        MaxLength = MinLength;
        MinLength = 0;
    }

    return (
        Convert.ToString(
            _Random.Next(
                Convert.ToInt32( "1".PadRight( MinLength, '0' ) ),
                Convert.ToInt32( "1".PadRight( MaxLength + 1, '0' ) )
            )
        )
    );
}

An alphanumeric value of the specified length

public static string Alpha ( int MinLength = 0, int MaxLength = 0 )
{
    if ( MaxLength == 0 )
    {
        MaxLength = MinLength;
        MinLength = 1;
    }

    int nLength = _Random.Next( MinLength, MaxLength + 1 );

    string szChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    string szResult = new string(
        Enumerable.Repeat( szChars, nLength )
                  .Select( szArray => szArray[_Random.Next( szArray.Length )] )
                  .ToArray( ) );

    return ( szResult );
}

The core library of my solution contains methods for these data types:

Numbers (signed/unsigned/in-range)
IP Address (v4/v6)
Latitude/Longitude
Date/Time/DateTime
GUID (as string/as hex)
Alphanumeric (length restricted)
Sequential ID

I picked them over time on a as-needed basis and may not represent all your need, however when we get to the extension/usage part you will understand that they cover 99.99999% of your needs anyway...

List Based Random

For simplicity I gave the name 'resources' for all the possible lists the core can handle. If you want a random string from one of your lists all you have to do is call a single function, like this:

Data.Resource("first");

Each call will return a single string from the list referenced by the name in the parameter- the 'resource' name.

public static string Resource ( string Name )
{
    if ( LoadResource( Name ) )
    {
        return ( _Resources[_CultureInfo.Name][Name][_Random.Next( _Resources[_CultureInfo.Name][Name].Length )] );
    }

    return ( string.Format( _MissingResource, Name, _CultureInfo.Name ) );
}

Looking at the code, you probably curious about the structure I use there and the meaning of all that. Be patient I will address that subject shortly in the part about the storage structure...

Complex Random

As I told before these are the most interesting randoms. A mixture of the two from above with possible literals added. To easily define (and later access) these complex randoms I created a JSON based structure (a kind of deflation language wrapped in JSON) that can hold the definition of a new method that can create that complex random.

{
    "inherit": "",
    "local": [
        {
            "name": "phone",
            "func": "<number(3)>-<number(7)>"
        },
        {
            "name": "firstname",
            "func": "[first]"
        },
        {
            "name": "mail",
            "func": "[first]@dummy.com"
        },
        {
            "name": "money",
            "func": "<sign()><number(2,5)>.<number(2)>"
        }
    ]
}

The exact explanation will follow at the code part, but even without that you can see a pattern of name-definition, where the definition has parts that enclosed with different brackets and parts that are not... The enclosed parts are either function calls or resource calls, where the non-enclosed are the literals I mentioned before...

For instance, in the function 'mail', [first] will pick a value from the resource (list) labeled 'name' and stick to it the literal value '@dummy.com'...

Storage Structure

What you have defined inside the angle brackets going to be functions, similarly to those pre-defined I mentioned in the 'True Random' part (about the how you will see in a moment). However the parts inside the square brackets should load a value picked from a list (as shown above); the storage is where all those lists - resources - are stored. To support multilingual demo data, I broke down the structure to folders, identified by culture codes, like en, en-US, fr, fr-CA, hu or he. This follows the ISO naming convention (language code alone or combined with country code). Language codes / Country codes.

In the sample above, you can see that the very same file names repeated. For instance 'first.json' holds a list for first names, once for English, once for French and once for Hebrew.
While naming your lists is entirely up to you, it is a good idea to use the same names in different culture, in which case the same command file can be run on different cultures without changes (and more about those command files later of course). Each of these files hold exactly one JSON array, from that array the code picks a value on request.

[ // content for first.json in English culture
    "Peter",
    "Noam",
    "Paul",
    "Wiliam",
    "Susan",
    "Topol"
]

It is very simple and extensible without code change of any kind - and that's where its power is...

Func.json

Beside all those JSON files representing the data for list based randoms, there is a special one, the 'func.json'... It used to define the culture specific methods to create complex randoms (as in the sample above).

The Code

This is the most interesting part of the whole, and it is so because I will not deal anymore with randoms. All the interesting parts of the randomization and its theory is already behind us and nothing left to say about. What I will deal with is how those definitions in 'func.json' and in the already mentioned command file are actually used...

In both cases (functions and commands) I used a JSON file to create some C# code and compile it - once to disk and once to memory. So most of our code experience here is how to interpret a definition 'language', create some C# code and compile that code to executable (DLL), and eventually execute it - all from within your code...

Compile The Functions

The function file represented by this structure:

internal struct Culture
{
    public struct Function
    {
        public string Name;
        public string Func;
    }

    public string Inherit;
    public Function[ ] Local;
}

And let see the explanation:

Inherit - Can define an other library to extend. The inheritance identified by the culture identifier of the other library. If it is gibberish for now, do not panic, it will be clear after the explanation about the storage structure. This property enables the reuse of common definitions, for instance between variants of Arabic or English.

Local - This part defines the new functions introduced by this culture...

Name - The name for the new method. If the method exist in the inherited library, the new one will override it, so be careful with your choice.

Func - the method body

< ... > - Encloses a method call in the classic method() form, where inside the parentheses you can add parameters in line with the method definition. Method can be any of the pre-created ones or any of the new ones too. Beware of recursion!

[ ... ] - Indicates a resource. Everywhere you put it, it will be replaced with a randomly choose value from the list by the name indicated inside the square brackets.

Everything not inside one of those bracket pairs will be interpreted as literal.

And here the sample to work with:

{
    "inherit": "",
    "local": [
        {
            "name": "name",
            "func": "[first] [last]"
        },
        {
            "name": "address",
            "func": "<number(3)> [street]"
        },
        {
            "name": "phone",
            "func": "<number(3)>-<number(7)>"
        },
        {
            "name": "email",
            "func": "[first]@dummy.com"
        },
        {
            "name": "ballance",
            "func": "<sign()><number(2,5)>.<number(2)>"
        }
    ]
}

And the code just before compilation:

using System.Linq;

namespace DemoData
{
    public class Dataen : Data
    {
        public static dynamic Name ( )
        {
            return ( string.Format( "{0} {1}", Resource( "first" ), Resource( "last" ) ) );
        }
        public static dynamic Address ( )
        {
            return ( string.Format( "{1} {0}", Resource( "street" ), Number( 3 ) ) );
        }
        public static dynamic Phone ( )
        {
            return ( string.Format( "{0}-{1}", Number( 3 ), Number( 7 ) ) );
        }
        public static dynamic Email ( )
        {
            return ( string.Format( "{0}@dummy.com", Resource( "first" ) ) );
        }
        public static dynamic Ballance ( )
        {
            return ( string.Format( "{0}{1}.{2}", Sign( ), Number( 2, 5 ), Number( 2 ) ) );
        }
    }
}

And here the code that generates that code above:

string szFile = string.Format( @"{0}\{1}\func.json", Helpers.CultureRoot, Culture );

Culture oCustom = JsonConvert.DeserializeObject<Culture>( File.ReadAllText( szFile ).ToLower( ) );
StringBuilder oCode = new StringBuilder( );

oCode.AppendLine( "using System.Linq;" );
oCode.AppendLine( "namespace DemoData {" );
oCode.AppendLine( string.Format( "public class Data{0} : Data{1} {{", Culture, oCustom.Inherit ) );

foreach ( Function oCustomFunction in oCustom.Local )
{
    List<string> oFunc = new List<string>( );
    int nIndex = 0;

    string szFinalFormat = oCustomFunction.Func;

    Match oMatch = Helpers.Resource.Match( oCustomFunction.Func );

    while ( oMatch.Success )
    {
        oFunc.Add( string.Format( "Resource({0})", Helpers.Replace( Helpers.SquareToQuoteRegex, Helpers.SquareToQuote, oMatch.Value ) ) );

        szFinalFormat = szFinalFormat.Replace( oMatch.Value, string.Format( "{{{0}}}", nIndex++ ) );

        oMatch = oMatch.NextMatch( );
    }

    oMatch = Helpers.Function.Match( szFinalFormat );

    while ( oMatch.Success )
    {
        oFunc.Add( Helpers.TextInfo.ToTitleCase( string.Format( "{0}", Helpers.Replace( Helpers.AngleToNothingRegex, Helpers.AngleToNothing, oMatch.Value ) ) ) );

        szFinalFormat = szFinalFormat.Replace( oMatch.Value, string.Format( "{{{0}}}", nIndex++ ) );

        oMatch = oMatch.NextMatch( );
    }

    string szResult = string.Format( "return( \"{0}\" );", szFinalFormat );

    if ( nIndex > 0 )
    {
        szResult = string.Format( "return( string.Format( \"{0}\", {1} ) );", szFinalFormat, string.Join( ", ", oFunc.ToArray( ) ) );
    }

    oCode.AppendLine( string.Format( "public static dynamic {0} () {{", Helpers.TextInfo.ToTitleCase( oCustomFunction.Name ) ) );
    oCode.AppendLine( szResult );
    oCode.AppendLine( "}" );
}

oCode.AppendLine( "}" );
oCode.AppendLine( "}" );

As you can see it is fairly simple. My steps are:

Loading the JSON file to the internal structure
Looping over the functions defined there
Identifying the function calls, the resource calls and what left (literals)
Building a single-line method with a String.Format statement to inject random values in-between the literals

So we have some - hopefully error free - C# code generated from the JSON file, all we need is to compile it and store to DLL...

CSharpCodeProvider oCodeProvider = new CSharpCodeProvider( );
CompilerParameters oParameters = new CompilerParameters( );

oParameters.ReferencedAssemblies.Add( "System.Core.dll" );
oParameters.ReferencedAssemblies.Add( string.Format( "Data{0}.dll", oCustom.Inherit ) );

oParameters.GenerateInMemory = false;
oParameters.GenerateExecutable = false;
oParameters.OutputAssembly = string.Format( "Data{0}.dll", Culture );
oParameters.MainClass = string.Format( "Data{0}", Culture );

CompilerResults oResults = oCodeProvider.CompileAssemblyFromSource( oParameters, szCode );

This code will create a 'Dataen.dll' (for English culture) from the code above and store it to the current working folder (the same from where the tool started) to later use... (for more details about compiler options see here: CompilerParameters)

At the lowest level all these DLLs are inherit from the hard-coded randoms mentioned in 'True Random' and 'List Based Random', so when it will come to the actual data rendering, you can use all of them...

Of course can be errors, in which case you will get them dumped to the console together with the code...

Now you have all the standard and special functions to create formatted data according to the desired culture (including language) and you can to move on to create actual table data using them...

Run The Commands

Command files (their names are not fixed, neither their location as they are or can be culture agnostic) are used to actually create data, using the functions (built-in and custom made) available for a culture; and that's the reason I talked about similar naming across different cultures. If you have the same resource names and function names for different cultures, you can create multilingual demo data using the same command file...

The flow of the command process is very similar to that of the compilation of the functions. There is only one main difference - the result is compiled into memory DLL and executed immediately (and discarded)...

Of course the JSON that defines the command structure is different according to its goal... The new JSON is a bit complicated and looks like this:

internal struct CommandList
{
    public struct Column
    {
        public string Name;
        public string Func;
    }

    public struct Relation
    {
        public string Parent;
        public string Child;
    }

    public struct Table
    {
        public string Name;
        public int Rows;
        public Relation[ ] Relations;
        public Table[ ] ChildTables;
        public Column[ ] Columns;
    }

    public bool Compile;
    public Format Output;
    public Table[ ] Tables;
}

And the explanations:

Compile - instructs the application to compile (or recompile) the functions for the specified culture (specified on command line) before compiling and running this command.

Output - the format of the output files - can be JSON or CSV.

Table - holds the definition for a table to be generated, and has these parts:

Name - a unique name for the table, that will be used as file name for the output values.

Rows - the number of rows to generate.

Relations - a list of column-to-column relations used to copy data between parent and child tables to enable foreign keys between result data (for instance if parent table has ID as key it can be declared - and copied - to any child table using this definitions).

Parent - column name from parent table.

Child - column name from child (this) table.

ChildTables - a recursive definition of the children of this table (there is no theoretic maximum to the levels you can create, but surely it will make things slower).

Columns - the list of columns defining this table in a form of name and function.

Name - a valid column name.

Func - a function definition in the same way we used to define in the function-definition file.

Now a sample command file to see how it works:

{
    "compile": false,
    "output": "csv",
    "tables": [
        {
            "name": "person",
            "rows": 1000,
            "childTables": [
                {
                    "name": "email",
                    "rows": 1,
                    "childTables": [
                        {
                            "name": "messages",
                            "rows": 7,
                            "relations": [
                                {
                                    "parent": "person",
                                    "child": "person"
                                },
                                {
                                    "parent": "id",
                                    "child": "email"
                                }
                            ],
                            "columns": [
                                {
                                    "name": "person",
                                    "func": ""
                                },
                                {
                                    "name": "email",
                                    "func": ""
                                },
                                {
                                    "name": "id",
                                    "func": "<sid()>"
                                },
                                {
                                    "name": "content",
                                    "func": "<alpha(4,12)>"
                                }
                            ]
                        }
                    ],
                    "relations": [
                        {
                            "parent": "id",
                            "child": "person"
                        }
                    ],
                    "columns": [
                        {
                            "name": "person",
                            "func": ""
                        },
                        {
                            "name": "id",
                            "func": "<sid()>"
                        },
                        {
                            "name": "email",
                            "func": "<alpha(7,15)>@lazy.com"
                        }
                    ]
                },
                {
                    "name": "action",
                    "rows": 3,
                    "relations": [
                        {
                            "parent": "id",
                            "child": "person"
                        }
                    ],
                    "columns": [
                        {
                            "name": "person",
                            "func": ""
                        },
                        {
                            "name": "id",
                            "func": "<sid()>"
                        },
                        {
                            "name": "value",
                            "func": "<ballance()>"
                        }
                    ]
                }
            ],
            "columns": [
                {
                    "name": "id",
                    "func": "<sid()>"
                },
                {
                    "name": "first_name",
                    "func": "[first]"
                },
                {
                    "name": "last_name",
                    "func": "[last]"
                },
                {
                    "name": "age",
                    "func": "<number(2,2)>"
                }
            ]
        }
    ]
}

The first thing you may recognize is those empty func definitions. Those are there to conform the class in the C# code and represent field that have values from parent table via relations...

This definition file will create code like this:

using System;
using System.Collections;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

namespace DemoData
{
    public class person
    {
        private static dynamic First;
        private static dynamic Last;

        private static void Next ( )
        {
            First = Datahe.Resource( "first" );
            Last = Datahe.Resource( "last" );
        }

        private static JObject Record ( JObject Parent = null )
        {
            Next( );
            return ( new JObject {
                { "Id", string.Format("{0}", Datahe.Sid())},
                {"First_Name", string.Format("{0}", First)},
                {"Last_Name", string.Format("{0}", Last)},
                {"Age", string.Format("{0}", Datahe.Number(2,2))},
            } );
        }

        public static void LoadData ( Dictionary<string, Stack> Storage, JObject Parent = null )
        {
            Data.PushSID( );
            for ( int i = 0; i < 1000; i++ )
            {
                Storage["person"].Push( Record( Parent ) );
                email.LoadData( Storage, ( JObject )Storage["person"].Peek( ) );
                action.LoadData( Storage, ( JObject )Storage["person"].Peek( ) );
            }
            Data.PopSID( );
        }
    }

    public class email
    {
        private static JObject Record ( JObject Parent = null )
        {
            return ( new JObject {
                { "Person", Parent["Id"].Value<string>()},
                {"Id", string.Format("{0}", Datahe.Sid())},
                {"Email", string.Format("{0}@lazy.com", Datahe.Alpha(7,15))},
            } );
        }

        public static void LoadData ( Dictionary<string, Stack> Storage, JObject Parent = null )
        {
            Data.PushSID( );
            for ( int i = 0; i < 1; i++ )
            {
                Storage["email"].Push( Record( Parent ) );
                messages.LoadData( Storage, ( JObject )Storage["email"].Peek( ) );
            }
            Data.PopSID( );
        }
    }

    public class messages
    {
        private static JObject Record ( JObject Parent = null )
        {
            return ( new JObject {
                { "Person", Parent["Person"].Value<string>()},
                {"Email", Parent["Id"].Value<string>()},
                {"Id", string.Format("{0}", Datahe.Sid())},
                {"Content", string.Format("{0}", Datahe.Alpha(4,12))},
            } );
        }

        public static void LoadData ( Dictionary<string, Stack> Storage, JObject Parent = null )
        {
            Data.PushSID( );
            for ( int i = 0; i < 7; i++ )
            {
                Storage["messages"].Push( Record( Parent ) );
            }
            Data.PopSID( );
        }
    }

    public class action
    {
        private static JObject Record ( JObject Parent = null )
        {
            return ( new JObject {
                { "Person", Parent["Id"].Value<string>()},
                {"Id", string.Format("{0}", Datahe.Sid())},
                {"Value", string.Format("{0}", Datahe.Ballance())},
            } );
        }

        public static void LoadData ( Dictionary<string, Stack> Storage, JObject Parent = null )
        {
            Data.PushSID( );
            for ( int i = 0; i < 3; i++ )
            {
                Storage["action"].Push( Record( Parent ) );
            }
            Data.PopSID( );
        }
    }

    public class Execute
    {
        public static void Run ( )
        {
            Dictionary<string, Stack> oStorage = new Dictionary<string, Stack>( );
            Data.Reset( "he" );
            oStorage.Add( "person", new Stack( ) );
            oStorage.Add( "email", new Stack( ) );
            oStorage.Add( "messages", new Stack( ) );
            oStorage.Add( "action", new Stack( ) );
            person.LoadData( oStorage );
            foreach ( KeyValuePair<string, Stack> oTable in oStorage )
            {
                Export.SetOutput( string.Format( @"C:\Users\peter\Source\Repos\demodata\DemoData\bin\Debug\results\{0}.CSV", oTable.Key ) );
                Export.ToCsv( Array.ConvertAll( oTable.Value.ToArray( ), oItem => ( JObject )oItem ) );
                Export.RestoreOutput( );
            }
        }
    }
}

You can see the repeating pattern of the classes - each represents a single table. The class has two parts, one for creating a single row and one for pushing it to the storage. The storage initialized at the top level, with place for every table included in the definition file and after the data generation stored each to its file.

One of the interesting thing is the usage of the resources - in the person class in our example. The way it was created ensures that using the same resource in the same row more than once (for instance using first name to create email address) will return the same value.

The last part is to execute this code to actually create the data. The code for that goes like this:

// ...

oParameters.GenerateInMemory = true;
oParameters.GenerateExecutable = false;
oParameters.MainClass = "Execute";

// ...

var oType = oResults.CompiledAssembly.GetType( "DemoData.Execute" );

oType.GetMethod( "Run" ).Invoke ( null, null );

The first difference from the previous compilation is that now I create an InMemory DLL, and the second that I execute it immediately.

If everything went well up to this point you will see the expected output files in a folder named 'results' under the folder of the executable. Those files can be imported to almost any DB (CSV) or used as is (JSON)... It is up to you...

The Code You Will Not See Here

There is code for handling the command line options, namely - listing cultures, compiling function definitions for culture and running command files. It is for you discover...

Summary

After learning - and maybe adjusting - this tool, you never will search for demo data online...