
A User-Friendly C# Descriptive Statistic Class

28 Jun 2008, CPOL, 3 min read
An article on the most commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

Introduction

The 80-20 rule applies: even with the advances in statistics, most of our work requires only univariate descriptive statistics, that is, the calculation of the mean, standard deviation, range, skewness, kurtosis, percentiles, quartiles, etc. This article describes a simple way to construct a set of classes that implement descriptive statistics in C#. The emphasis is on ease of use at the user's end.

Requirements

To run the code, you need to have the following:

  • .NET Framework 2.0 or above
  • Microsoft Visual Studio 2005, if you want to open the project files included in the download
  • NUnit 2.4, if you want to run the unit tests included in the download

The download included with this article is implemented as a class library. You will need to add a reference to the project to use its functionality.

The download also includes an NUnit test project in case you want to make changes to the code and run your own unit tests.

The Code

The goal of the code design is to simplify usage. We envisage that the user will write code like the following to get the desired results, in a simple three-step process:

  1. Instantiate a Descriptive object
  2. Invoke its .Analyze() method
  3. Retrieve results from its .Result object

Here is a typical user’s code:

C#
double[] x  = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);
desp.Analyze(); // analyze the data
Console.WriteLine("Result is: " + desp.Result.FirstQuartile.ToString());

Two classes are implemented:

  • DescriptiveResult
  • Descriptive

DescriptiveResult is the class that holds the analysis results; the .Result member of a Descriptive object is an instance of it. In our implementation, it is defined as follows:

C#
/// <summary>
/// The result class that holds the analysis results
/// </summary>
public class DescriptiveResult
{
    // sortedData is used to calculate percentiles
    internal double[] sortedData;

    /// <summary>
    /// DescriptiveResult default constructor
    /// </summary>
    public DescriptiveResult() { }

    /// <summary>
    /// Count
    /// </summary>
    public uint Count;
    /// <summary>
    /// Sum
    /// </summary>
    public double Sum;
    /// <summary>
    /// Arithmetic mean
    /// </summary>
    public double Mean;
    /// <summary>
    /// Geometric mean
    /// </summary>
    public double GeometricMean;
    /// <summary>
    /// Harmonic mean
    /// </summary>
    public double HarmonicMean;
    /// <summary>
    /// Minimum value
    /// </summary>
    public double Min;
    /// <summary>
    /// Maximum value
    /// </summary>
    public double Max;
    /// <summary>
    /// The range of the values
    /// </summary>
    public double Range;
    /// <summary>
    /// Sample variance
    /// </summary>
    public double Variance;
    /// <summary>
    /// Sample standard deviation
    /// </summary>
    public double StdDev;
    /// <summary>
    /// Skewness of the data distribution
    /// </summary>
    public double Skewness;
    /// <summary>
    /// Kurtosis of the data distribution
    /// </summary>
    public double Kurtosis;
    /// <summary>
    /// Interquartile range
    /// </summary>
    public double IQR;
    /// <summary>
    /// Median, or second quartile, or at 50 percentile
    /// </summary>
    public double Median;
    /// <summary>
    /// First quartile, at 25 percentile
    /// </summary>
    public double FirstQuartile;
    /// <summary>
    /// Third quartile, at 75 percentile
    /// </summary>
    public double ThirdQuartile;

    /// <summary>
    /// Sum of Error
    /// </summary>
    internal double SumOfError;

    /// <summary>
    /// The sum of the squares of errors
    /// </summary>
    internal double SumOfErrorSquare;

    /// <summary>
    /// Percentile
    /// </summary>
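    /// <param name="percent">Percentile, between 0 and 100</param>
    /// <returns>Percentile</returns>
    public double Percentile(double percent)
    {
        // The body is cut off in this listing; it presumably delegates to the
        // internal Descriptive.percentile function shown later in the article.
        return Descriptive.percentile(sortedData, percent);
    }
} // end of class DescriptiveResult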

For simplicity, most members are implemented as public fields. The only member function, Percentile, takes a percentage as its argument (e.g. 30 for 30%) and returns the corresponding percentile.

The following table lists the available results (assuming that the Descriptive object you use is named desp):

Result                           Result stored in variable
Number of data points            desp.Result.Count
Minimum value                    desp.Result.Min
Maximum value                    desp.Result.Max
Range of values                  desp.Result.Range
Sum of values                    desp.Result.Sum
Arithmetic mean                  desp.Result.Mean
Geometric mean                   desp.Result.GeometricMean
Harmonic mean                    desp.Result.HarmonicMean
Sample variance                  desp.Result.Variance
Sample standard deviation        desp.Result.StdDev
Skewness of the distribution     desp.Result.Skewness
Kurtosis of the distribution     desp.Result.Kurtosis
Interquartile range              desp.Result.IQR
Median (50% percentile)          desp.Result.Median
FirstQuartile: 25% percentile    desp.Result.FirstQuartile
ThirdQuartile: 75% percentile    desp.Result.ThirdQuartile
Percentile                       desp.Result.Percentile()*

* The argument of Percentile() is a value from 0 to 100 that indicates the desired percentile.
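For reference, the three means listed in the table can be sketched as standalone functions. This is a minimal sketch rather than the library's code: the class name MeanSketch and its method names are hypothetical, and the geometric mean is computed by averaging logarithms (a swapped-in technique that avoids overflow in a long running product), so it assumes every value is strictly positive.

```csharp
using System;

// Hypothetical helper for illustration; not part of the article's download.
public static class MeanSketch
{
    // Arithmetic mean: sum of the values divided by their count.
    public static double Arithmetic(double[] values)
    {
        double sum = 0.0;
        foreach (double v in values) sum += v;
        return sum / values.Length;
    }

    // Geometric mean via logarithms: exp(mean(log(x))).
    // Assumes all values are strictly positive.
    public static double Geometric(double[] values)
    {
        double logSum = 0.0;
        foreach (double v in values) logSum += Math.Log(v);
        return Math.Exp(logSum / values.Length);
    }

    // Harmonic mean: count divided by the sum of reciprocals.
    // Assumes no value is zero.
    public static double Harmonic(double[] values)
    {
        double reciprocalSum = 0.0;
        foreach (double v in values) reciprocalSum += 1.0 / v;
        return values.Length / reciprocalSum;
    }
}
```

For the values {1, 2, 4, 8}, these return 3.75, approximately 2.828, and approximately 2.133 respectively.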

Descriptive Class

The Descriptive class does all the analysis, and it is implemented as follows:

C#
/// <summary>
/// Descriptive class
/// </summary>
public class Descriptive
{
    private double[] data;
    private double[] sortedData;

    /// <summary>
    /// Descriptive results
    /// </summary>
    public DescriptiveResult Result = new DescriptiveResult();

    #region Constructors
    /// <summary>
    /// Descriptive analysis default constructor
    /// </summary>
    public Descriptive() { } // default empty constructor

    /// <summary>
    /// Descriptive analysis constructor
    /// </summary>
    /// <param name="dataVariable">Data array</param>
    public Descriptive(double[] dataVariable)
    {
       data = dataVariable;
    }
    #endregion //  Constructors

Note that we need a sortedData array to facilitate percentile- and quartile-related statistics. It stores a sorted copy of the user data.
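A separate array is needed because Array.Sort sorts in place, and the constructor keeps a reference to the caller's array rather than a copy; sorting data directly would reorder the user's own array. A minimal sketch of the copy-then-sort step, using a hypothetical helper name (SortedCopySketch) that is not part of the download:

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SortedCopySketch
{
    // Returns an ascending-sorted copy and leaves the input array untouched.
    public static double[] SortedCopy(double[] data)
    {
        double[] sorted = new double[data.Length];
        data.CopyTo(sorted, 0); // copy first, so the caller's order is preserved
        Array.Sort(sorted);     // in-place sort of the copy only
        return sorted;
    }
}
```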

The constructor of Descriptive class allows the user to assign the data array during the object instantiation:

C#
double[] x  = {1, 2, 4, 7, 8, 9, 10, 12};
Descriptive desp = new Descriptive(x);

Once the Descriptive object is instantiated, the user only needs to call the .Analyze() method to perform the analysis. Subsequently, the user can retrieve the analysis results from the .Result object in the Descriptive object.

The Analyze() method is implemented as follows:

C#
/// <summary>
/// Run the analysis to obtain descriptive information of the data
/// </summary>
public void Analyze()
{
    // initializations
    Result.Count = 0;
    Result.Min = Result.Max = Result.Range = Result.Mean =
        Result.Sum = Result.StdDev = Result.Variance = 0.0d;

    double sumOfSquare = 0.0d;
    double sumOfESquare = 0.0d; // must initialize

    double[] squares = new double[data.Length];
    double cumProduct = 1.0d; // to calculate geometric mean
    double cumReciprocal = 0.0d; // to calculate harmonic mean

    // First pass: min, max, sum, and the running terms for the means
    for (int i = 0; i < data.Length; i++)
    {
        if (i == 0) // first data point
        {
            Result.Min = data[i];
            Result.Max = data[i];
            Result.Mean = data[i];
            Result.Range = 0.0d;
        }
        else
        { // not the first data point
            if (data[i] < Result.Min) Result.Min = data[i];
            if (data[i] > Result.Max) Result.Max = data[i];
        }
        Result.Sum += data[i];
        squares[i] = Math.Pow(data[i], 2); //TODO: may not be necessary
        sumOfSquare += squares[i];

        cumProduct *= data[i];
        cumReciprocal += 1.0d / data[i];
    }

    Result.Count = (uint)data.Length;
    double n = (double)Result.Count; // use a shorter variable in double type
    Result.Mean = Result.Sum / n;
    Result.GeometricMean = Math.Pow(cumProduct, 1.0 / n);
    // see http://mathworld.wolfram.com/HarmonicMean.html
    Result.HarmonicMean = 1.0d / (cumReciprocal / n);
    Result.Range = Result.Max - Result.Min;

    // Second pass: sums of absolute, squared, cubed and fourth-power
    // deviations from the mean
    double m1 = 0.0d;
    double m2 = 0.0d;
    double m3 = 0.0d; // for skewness calculation
    double m4 = 0.0d; // for kurtosis calculation
    for (int i = 0; i < data.Length; i++)
    {
        double m = data[i] - Result.Mean;
        double mPow2 = m * m;
        double mPow3 = mPow2 * m;
        double mPow4 = mPow3 * m;

        m1 += Math.Abs(m);

        m2 += mPow2;

        // third-power term, for skewness
        m3 += mPow3; // Math.Pow((data[i] - mean), 3);

        // fourth-power term, for kurtosis
        m4 += mPow4; // Math.Pow((data[i] - mean), 4);
    }

    Result.SumOfError = m1;
    Result.SumOfErrorSquare = m2; // Added for Excel function DEVSQ
    sumOfESquare = m2;

    // sample variance and standard deviation (n - 1 in the denominator)
    Result.Variance = sumOfESquare / ((double)Result.Count - 1);
    Result.StdDev = Math.Sqrt(Result.Variance);

    // skewness, using the Excel approach
    double skewCum = 0.0d; // the cum part of SKEW formula
    for (int i = 0; i < data.Length; i++)
    {
        skewCum += Math.Pow((data[i] - Result.Mean) / Result.StdDev, 3);
    }
    Result.Skewness = n / (n - 1) / (n - 2) * skewCum;

    // kurtosis: see http://en.wikipedia.org/wiki/Kurtosis (heading: Sample Kurtosis)
    double m2_2 = Math.Pow(sumOfESquare, 2);
    Result.Kurtosis = ((n + 1) * n * (n - 1)) / ((n - 2) * (n - 3)) *
        (m4 / m2_2) -
        3 * Math.Pow(n - 1, 2) / ((n - 2) * (n - 3)); // second last formula for G2

    // calculate quartiles from a sorted copy of the data
    sortedData = new double[data.Length];
    data.CopyTo(sortedData, 0);
    Array.Sort(sortedData);

    // copy the sorted data to result object so that
    // user can calculate percentile easily
    Result.sortedData = new double[data.Length];
    sortedData.CopyTo(Result.sortedData, 0);

    Result.FirstQuartile = percentile(sortedData, 25);
    Result.ThirdQuartile = percentile(sortedData, 75);
    Result.Median = percentile(sortedData, 50);
    Result.IQR = Result.ThirdQuartile - Result.FirstQuartile;
} // end of method Analyze
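The variance, skewness and kurtosis formulas used in Analyze() can be checked in isolation. The following is a minimal sketch under stated assumptions, not the library's code: MomentSketch and Compute are hypothetical names, and the guard for fewer than four values is an addition, made because the (n - 2) and (n - 3) denominators are zero for n = 2 and n = 3, so the results are not meaningful there.

```csharp
using System;

// Hypothetical standalone check; names are illustrative only.
public static class MomentSketch
{
    // Returns { sample variance, skewness (Excel SKEW form),
    //           sample excess kurtosis (the G2 form used in Analyze) }.
    public static double[] Compute(double[] x)
    {
        if (x == null || x.Length < 4)
            throw new ArgumentException("At least four values are required.");

        double n = x.Length;

        double mean = 0.0;
        foreach (double v in x) mean += v;
        mean /= n;

        // Sums of squared, cubed and fourth-power deviations from the mean.
        double m2 = 0.0, m3 = 0.0, m4 = 0.0;
        foreach (double v in x)
        {
            double d = v - mean;
            m2 += d * d;
            m3 += d * d * d;
            m4 += d * d * d * d;
        }

        double variance = m2 / (n - 1);
        double stdDev = Math.Sqrt(variance);

        // n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)
        double skewness = n / ((n - 1) * (n - 2)) * (m3 / (stdDev * stdDev * stdDev));

        // (n+1) n (n-1) / ((n-2)(n-3)) * m4 / m2^2  -  3 (n-1)^2 / ((n-2)(n-3))
        double kurtosis = (n + 1) * n * (n - 1) / ((n - 2) * (n - 3)) * (m4 / (m2 * m2))
                          - 3 * (n - 1) * (n - 1) / ((n - 2) * (n - 3));

        return new double[] { variance, skewness, kurtosis };
    }
}
```

For the symmetric data set {1, 2, 3, 4, 5}, this gives a variance of 2.5, a skewness of 0 and a kurtosis of -1.2.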

The calculations of descriptive statistics are quite straightforward, except for the percentile function (and the subsequent quartile calculations), which is a little tricky. Therefore, I have a separate function to handle it, as follows:

C#
/// <summary>
/// Calculate percentile of a sorted data set
/// </summary>
/// <param name="sortedData">array of double values</param>
/// <param name="p">percentile, value 0-100</param>
/// <returns>The percentile value, linearly interpolated between neighbouring data points</returns>
internal static double percentile(double[] sortedData, double p)
{
    // algo derived from Aczel pg 15 bottom
    if (p >= 100.0d) return sortedData[sortedData.Length - 1];

    double position = (double)(sortedData.Length + 1) * p / 100.0;
    double leftNumber = 0.0d, rightNumber = 0.0d;

    double n = p / 100.0d * (sortedData.Length - 1) + 1.0d;

    if (position >= 1)
    {
        leftNumber = sortedData[(int)System.Math.Floor(n) - 1];
        rightNumber = sortedData[(int)System.Math.Floor(n)];
    }
    else
    {
        leftNumber = sortedData[0]; // first data
        rightNumber = sortedData[1]; // second data
    }

    if (leftNumber == rightNumber)
        return leftNumber;
    else
    {
        double part = n - System.Math.Floor(n);
        return leftNumber + part * (rightNumber - leftNumber);
    }
} // end of internal function percentile

The percentile algorithm is derived from Amir Aczel’s book "Complete Business Statistics".
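To see the interpolation at work, take the sample data {1, 2, 4, 7, 8, 9, 10, 12} with p = 25. The function computes n = 25 / 100 * (8 - 1) + 1 = 2.75, so the result lies three quarters of the way from the 2nd value (2) to the 3rd value (4), which is 2 + 0.75 * (4 - 2) = 3.5. The sketch below restates the same interpolation with a zero-based rank. It is an illustration under stated assumptions, not the article's function: PercentileSketch is a hypothetical name, and the argument checks and the single-element case are additions.

```csharp
using System;

// Hypothetical restatement for illustration; not the article's function.
public static class PercentileSketch
{
    // sorted must be in ascending order; p is a percentage from 0 to 100.
    public static double Percentile(double[] sorted, double p)
    {
        if (sorted == null || sorted.Length == 0)
            throw new ArgumentException("Data must contain at least one value.");
        if (p < 0.0 || p > 100.0)
            throw new ArgumentOutOfRangeException("p", "Percentile must be between 0 and 100.");

        // Zero-based fractional rank: 0 maps to the first value, 100 to the last.
        double rank = p / 100.0 * (sorted.Length - 1);
        int lower = (int)Math.Floor(rank);

        // At the top of the range (or with a single value) there is no right neighbour.
        if (lower >= sorted.Length - 1) return sorted[sorted.Length - 1];

        // Linear interpolation between the two neighbouring values.
        double fraction = rank - lower;
        return sorted[lower] + fraction * (sorted[lower + 1] - sorted[lower]);
    }
}
```

With the sample data, percentages of 25, 50 and 75 give 3.5, 7.5 and 9.25, so the interquartile range is 5.75.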

Conclusion

The descriptive statistics program presented here provides a simple way to obtain commonly used descriptive statistics, including standard deviations, skewness, kurtosis, percentiles, quartiles, etc.

History

  • 28th June, 2008: Initial post

About the Author

Jan Low, PhD, is a senior software architect at Foundasoft.com, Malaysia. He is also the author of various text analysis software, statistical libraries, image processing libraries, and security encryption components. He programs primarily in C#, C++ and VB.NET.
Occupation: Senior software architect
Location: Malaysia

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

