Click here to Skip to main content
15,886,362 members
Articles / Programming Languages / Python2.7

Using SQLite, Leverage the create_aggregate(), and SQL's Between Operator to Create a Normal Probability Distribution Histogram, or What is More Commonly Referred to as a Bell Curve

Rate me:
Please Sign up or sign in to vote.
5.00/5 (4 votes)
8 Dec 2016CPOL3 min read 30.8K   8   10
Using SQLite, leverage the create_aggregate(), and SQL's Between Operator to create a Normal Probability Distribution Histogram, or what is more commonly referred to as a Bell Curve.

One of the hats I wear at work is that of the performance analysis guy. I work at a multiplatform shop, so naturally, I have a vested interest in retaining technologies that can be reliably used on different platforms without too much fussing around, and Python was the logical choice. With minimal care taken, a Python script can be designed to run on any of the supported platforms from Andrioid to Windows. Further, dealing with different Python versions can also be dealt with dynamically at runtime as well.

Python, is SQLite3 ready right out of the box, and it’s desirable to use this combination to form a tool set for statistical processing common in performance analysis.

Unlike SQLite’s lustrous big brothers and sisters in the Enterprise Level Client-Server market, like Oracle, IBM DB2, Microsoft SQL Server, and MySQL, that offer a rich assortment of built in aggregates, SQLite offers just the basic aggregate functions like Average, Max, Min, Count, group_concat, Sum and Total. The more complex Statistical aggregate functions are sadly missing.

However, this is not an oversite on the part of the SQLite folks, but purposely designed into it. SQLite has been designed to have a small footprint, and in many applications, it is common place to be embedded in hardware, or resided in memory.

In my case, I need these statistical functions, especially Standard Deviation, as it’s the basis of a myriad computations.

Fortunately, SQLite allows the importing of custom aggregate functions, to the database at run time by means of the create_aggregate() method in what could be described as a kind of an à la carte fashion.

This is very advantageous as it allows for adding of custom adhoc functionality to the database without the need of the ugly paradigm of pulling, processing and pushing data back and forth, CPU intensive loops and iterations outside the data layer. Data functions are preformed intra-tier vs inter-tier The database can be thought of as a container object that processes data internally, that can be as smart as a whip, or as dumb as a post. The choice is yours based on your needs at runtime.

This script demonstrates the use of SQLite’s create_aggregate() as the base function to produce one of the most common Statistical Analysis tools, namely the distribution histogram, or what is more commonly referred to as a Bell Curve.

HTML
 
Python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sqlite3
import math
import random
import os
import sys
import traceback
import random

class StdevFunc:
    def __init__(self):
        self.M = 0.0    #Mean
        self.V = 0.0    #Used to Calculate Variance
        self.S = 0.0    #Standard Deviation
        self.k = 1      #Population or Small 

    def step(self, value):
        try:
            if value is None:
                return None

            tM = self.M
            self.M += (value - tM) / self.k
            self.V += (value - tM) * (value - self.M)
            self.k += 1
        except Exception as EXStep:
            pass
            return None    

    def finalize(self):
        try:
            if ((self.k - 1) < 3):
                return None
            
            #Now with our range Calculated, and Multiplied finish the Variance Calculation
            self.V = (self.V / (self.k-2))

            #Standard Deviation is the Square Root of Variance
            self.S = math.sqrt(self.V)

            return self.S
        except Exception as EXFinal:
            pass
            return None 

def Histogram(Population):
    try:
        BinCount = 6 
        More = 0

        #a = 1          #For testing Trapping
        #b = 0          #and Trace Back
        #c = (a / b)    #with Detailed Info
        
        #If you want to store the Database
        #uncDatabase = os.path.join(os.getcwd(),"BellCurve.db3")
        #con = sqlite3.connect(uncDatabase)
        
        #If you want the database in Memory
        con = sqlite3.connect(':memory:')    

        #row_factory allows accessing fields by Row and Col Name
        con.row_factory = sqlite3.Row

        #Add our Non Persistent, Runtime Standard Deviation Function to the Database
        con.create_aggregate("Stdev", 1, StdevFunc)

        #Lets Grab a Cursor
        cur = con.cursor()

        #Lets Initialize some tables, so each run with be clear of previous run
        cur.executescript('drop table 
        if exists MyData;') #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS MyData('Val' FLOAT)")
        cur.executescript('drop table 
        if exists Bins;')   #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS Bins('Bin' 
        UNSIGNED INTEGER, 'Val' FLOAT, 'Frequency' UNSIGNED BIG INT)")

        #Lets generate some random data, and insert in to the Database
        for n in range(0,(Population)):
            sql = "insert into MyData(Val) values ({0})".format(random.uniform(-1,1))
            #If Whole Number Integer greater that value of 2, Range Greater that 1.5
            #sql = "insert into MyData(Val) values ({0})".format(random.randint(-1,1))
            cur.execute(sql)
            pass

        #Now let’s calculate some built in Aggregates, that SQLite comes with
        cur.execute("select Avg(Val) from MyData")
        Average = cur.fetchone()[0]
        cur.execute("select Max(Val) from MyData")
        Max = cur.fetchone()[0]
        cur.execute("select Min(Val) from MyData")
        Min = cur.fetchone()[0]
        cur.execute("select Count(Val) from MyData")
        Records = cur.fetchone()[0]

        #Now let’s get Standard Deviation using our function that we added
        cur.execute("select Stdev(Val) from MyData")
        Stdev = cur.fetchone()[0]

        #And Calculate Range
        Range = float(abs(float(Max)-float(Min)))

        if (Stdev == None):
            print("================================   Data Error ===================================")
            print("                 Insufficient Population Size, Or Bad Data.")   
            print("**********************************************************************************")
        elif (abs(Max-Min) == 0):
            print("================================   Data Error ===================================")
            print(" The entire Population Contains Identical values, Distribution Incalculable.")
            print("**********************************************************************************")            
        else:  
            Bin = []        #Holds the Bin Values
            Frequency = []  #Holds the Bin Frequency for each Bin

            #Establish the 1st Bin, which is based on (Standard Deviation * 3) 
             being subtracted from the Mean
            Bin.append(float((Average - ((3 * Stdev)))))
            Frequency.append(0)
            
            #Establish the remaining Bins, which is basically adding 1 Standard Deviation
            #for each interation, -3, -2, -1, 1, 2, 3             
            for b in range(0,(BinCount) + 1):
                Bin.append((float(Bin[(b)]) + Stdev))
                Frequency.append(0)

            for b in range(0,(BinCount / 1) + 1):
                #Lets exploit the Database and have it do the hard work calculating distribution
                #of all the Bins, with SQL's between operator, but making it left inclusive, 
                 right exclusive.
                sqlBinFreq = "select count(*) as Frequency from MyData 
                where val between {0} and {1} and Val < {2}". \
                             format(float((Bin[b])), float(Bin[(b + 1)]), float(Bin[(b + 1)]))

                #If the Database Reports Values that fall between the Current Bin, 
                 Store the Frequency to a Bins Table. 
                for rowBinFreq in cur.execute(sqlBinFreq):
                    Frequency[(b + 1)] = rowBinFreq['Frequency']
                    sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                                       format(b, float(Bin[b]), Frequency[(b)])
                    cur.execute(sqlBinFreqInsert)

                #Allthough this Demo is not likley produce values that
                #fall outside of Standard Distribution
                #if this demo was to Calculate with real data, we want to know
                #how many non-Standard data points we have. 
                More = (More + Frequency[b])

            More = abs((Records - More))
            
            sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                                format((BinCount + 1), float(0), More)
            cur.execute(sqlBinFreqInsert)
               
            #Now Report the Analysis
            print("================================ The Population ==================================")
            print("             {0} {1} {2} {3} {4} {5}". \
                  format("Size".rjust(10, ' '), \
                         "Max".rjust(10, ' '), \
                         "Min".rjust(10, ' '), \
                         "Mean".rjust(10, ' '), \
                         "Range".rjust(10, ' '), \
                         "Stdev".rjust(10, ' ')))
            print("Aggregates:  {0:10d} {1:10.4f} {2:10.4f} {3:10.4f} {4:10.4f} {5:10.4f}". \
                  format(Population, Max, Min, Average, Range, Stdev))
            print("================================= The Bell Curve =================================")  

            LabelString = "{0} {1}  {2}  {3}". \
                          format("Bin".ljust(8, ' '), \
                                 "Ranges".rjust(8, ' '), \
                                 "Frequency".rjust(8, ' '), \
                                 "Histogram".rjust(6, ' '))

            print(LabelString)
            print("----------------------------------------------------------------------------------")
            
            #Let's Paint a Histogram
            sqlChart = "select * from Bins order by Bin asc"
            for rowChart in cur.execute(sqlChart):
                if (rowChart['Bin'] == 7):
                    #Bin 7 is not really a bin, but where we place the values that did not fit into the
                    #Normal Distribution. This script was tested against Excel's Bell Curve Example
                    #https://support.microsoft.com/en-us/kb/213930
                    #and produces the same results. Feel free to test it.
                    BinName = "More"
                    ChartString = "{0:<6} {1:<10} {2:10.0f}". \
                                format(BinName, \
                                        "", \
                                        More)
                else:
                    BinName = (rowChart['Bin'] + 1)
                    #Scale the Chart
                    fPercent = ((float(rowChart['Frequency']) / float(Records) * 100))
                    iPrecent = int(math.ceil(fPercent))
                
                    ChartString = "{0:<6} {1:10.4f} {2:10.0f} {3}". \
                                  format(BinName, \
                                         rowChart['Val'], \
                                         rowChart['Frequency'], \
                                         "".rjust(iPrecent, '#'))
                print(ChartString)
                
            print("**********************************************************************************")

            #Commit to Database
            con.commit()

            #Clean Up
            cur.close()
            con.close()

    except Exception as EXBellCurve:
        pass
        TraceInfo = traceback.format_exc()       
        raise Exception(TraceInfo)  

            
print("**********************************************************************************")
print("Using SQLite, leverage the create_aggregate(), and SQL's Between Operator to")
print("calculate distribution, then Demonstrate and Store the results in a Database.\n")
print("This demo creates a Population of values between -1 and 1 generated with a")
print("Pseudo Random Number Generator. The Mean value should be very close to zero,")
print("with a Range value of about 2, contained within a Standard Distribution.")

PythonVersion = sys.version_info[0] 
  
#Population = 2             #To Test Insufficient Population Size 
Population = (16 ** 2)      #Population of 256
#Population = (64 ** 2)     #Population of 4096
#Population = (128 ** 2)    #Population of 16384
#Population = (256 ** 2)    #Population of 65536
#Population = (1024 ** 2)   #Population of 1048576

Histogram(Population)       #Do a Distribution Analysis

if (PythonVersion == 3):
    kb = input("Press Enter to continue...")
else:
    kb = raw_input("Press Enter to continue...")    

A few words about the script.

The script will demonstrate adding a Standard Deviation function to a SQLite3 database. For demonstration purposes, the script will 1st run a Pseudo Random Number Generator to build a population of data to be analyzed.

This script has been tested on Windows, and Linux platforms, as well as Python 2.6, - 3.4.

A few words about the Bell Curve.

Since, distribution analysis is at the core what a Bell Curve speaks to, we will produce a population, of adequate size, with random seeded numbers ranging from -1 to 1. This will assure a Standard Distribution, and where results are easy to interpret.

Basically, if you produce a long series of numbers from -1 to 1, then the expectation would be for a population to produce a mean of close to 0, a range very close to 2. The result in essence simply displays just how random the Python Random Number Generator actually is.

There are no official Bin allocation rules for histograms. This demo utilizes Bin allocation rules based Excel’s Data Analysis Add-on, Histogram feature.
 


Download Script


https://support.microsoft.com/en-us/kb/213930

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionHistogram.7z do not exist Pin
eslipak24-Oct-16 16:22
professionaleslipak24-Oct-16 16:22 
AnswerRe: Histogram.7z do not exist Pin
rerhart5855-Dec-16 1:01
rerhart5855-Dec-16 1:01 
GeneralRe: Histogram.7z do not exist Pin
eslipak5-Dec-16 2:15
professionaleslipak5-Dec-16 2:15 
AnswerRe: Histogram.7z do not exist Pin
rerhart5856-Dec-16 1:54
rerhart5856-Dec-16 1:54 
GeneralRe: Histogram.7z do not exist Pin
eslipak6-Dec-16 12:09
professionaleslipak6-Dec-16 12:09 
Dear Sir. Thanks very much. But i do know how to save a page to disk (since 1996 or so). Also tried with a download manager without luck.
I do believe you can really download both files, but i cant. Windows 8.1, Firefox 50.0.2. Tried also with Opera 41 and Avast SafeZone Browser. I tried to post a screenshot, but i don't know how to do here in CP.
If, instead of downloading i try to open the page, it says (Sorry, only can copy-paste - IIS 8.0 response):

Quote:
HTTP Error 404.3 - Not Found
The page you are requesting cannot be served because of the extension configuration. If the page is a script, add a handler. If the file should be downloaded, add a MIME map.
Most likely causes:
It is possible that a handler mapping is missing. By default, the static file handler processes all content.
The feature you are trying to use may not be installed.
The appropriate MIME map is not enabled for the Web site or application. (Warning: Do not create a MIME map for content that users should not download, such as .ASPX pages or .config files.)
If ASP.NET is not installed.

Things you can try:
In system.webServer/handlers:
    Ensure that the expected handler for the current page is mapped.
    Pay extra attention to preconditions (for example, runtimeVersion, pipelineMode, bitness) and compare them to the settings for your application pool.
    Pay extra attention to typographical errors in the expected handler line.
Please verify that the feature you are trying to use is installed.
Verify that the MIME map is enabled or add the MIME map for the Web site using the command-line tool appcmd.exe.
    To set a MIME type, use the following syntax: %SystemRoot%\windows\system32\inetsrv\appcmd set config /section:staticContent /+[fileExtension='string',mimeType='string']
    The variable fileExtension string is the file name extension and the variable mimeType string is the file type description.
    For example, to add a MIME map for a file which has the extension ".xyz": appcmd set config /section:staticContent /+[fileExtension='.xyz',mimeType='text/plain']
Warning: Ensure that this MIME mapping is needed for your Web server before adding it to the list. Configuration files such as .CONFIG or dynamic scripting pages such as .ASP or .ASPX, should not be downloaded directly and should always be processed through a handler. Other files such as database files or those used to store configuration, like .XML or .MDF, are sometimes used to store configuration information. Determine if clients can download these file types before enabling them.
Install ASP.NET.
Create a tracing rule to track failed requests for this HTTP status code. For more information about creating a tracing rule for failed requests, click here.

Detailed Error Information:
Module StaticFileModule
Notification ExecuteRequestHandler
Handler StaticFile
Error Code 0x80070032
Requested URL http://boberhar.w15.wh-2.com:80/RGE/Hist/Histogram.7z
Physical Path E:\web\boberhar\RGE\Hist\Histogram.7z
Logon Method Anonymous
Logon User Anonymous
More Information:
This error occurs when the file extension of the requested URL is for a MIME type that is not configured on the server. You can add a MIME type for the file extension for files that are not dynamic scripting pages, database, or configuration files. Process those file types using a handler. You should not allows direct downloads of dynamic scripting pages, database or configuration files.

View more information »[^]


If you pay attention to "More Information", it is a server issue. The link to Microsoft also explain this too.

Believe me, dont want to bother you. Just want to help.
Maybe i'm just out of luck

Be well.
GeneralRe: Histogram.7z do not exist Pin
rerhart5857-Dec-16 7:11
rerhart5857-Dec-16 7:11 
GeneralRe: Histogram.7z do not exist Pin
eslipak7-Dec-16 9:54
professionaleslipak7-Dec-16 9:54 
Praise:) Pin
GTillett9718-Oct-16 2:06
GTillett9718-Oct-16 2:06 
QuestionSnippet Pin
Nelek18-Oct-16 1:55
protectorNelek18-Oct-16 1:55 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.