Posted 12 May 2019

# The Math behind Neural Networks: Part 1 - The Rosenblatt Perceptron

A try it yourself guide to the basic math behind perceptrons

## Introduction

A lot of articles introduce the perceptron showing the basic mathematical formulas that define it, but without offering much of an explanation on what exactly makes it work.

And surely, it is possible to use the perceptron without really understanding the basic math involved, but is it not also fun to see how all this math you learned in school can help you understand the perceptron and, by extension, neural networks?

I also got inspired for this article by a series of articles on Support Vector Machines, explaining the basic mathematical concepts involved and slowly building up to the more complex mathematics. That is my intention with this article and the accompanying code: show you the math involved in the perceptron. And, if time permits, I will write articles all the way up to convolutional neural networks.

Of course, when explaining the math, the question is: where do you start and where do you stop explaining? Some of the math involved is rather basic: what is a vector? What is a cosine? I will assume some basic knowledge of mathematics: you have some idea of what a vector is, you know the basics of geometry, and so on. My assumptions will be somewhat arbitrary, so if you think I’m going too fast in some explanations, just leave a comment and I will try to expand on the subject.

So, let us get started.

### The Series

1. The Math behind Neural Networks: Part 1 - The Rosenblatt Perceptron
2. The Math behind Neural Networks: Part 2 - The ADALINE Perceptron
3. The Math behind Neural Networks: Part 3 - Neural Networks
4. The Math behind Neural Networks: Part 4 - Convolutional Neural Networks

### Setting Some Bounds

A perceptron basically takes some input values, called “features” and represented by the values $x_1, x_2, ... x_n$ in the following formula, multiplies them by some factors called “weights”, represented by $w_1, w_2, ... w_n$, takes the sum of these multiplications and depending on the value of this sum, outputs another value $o$:

$o = f(w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n)$

There are a few types of perceptrons, differing in the way the sum is mapped to the output, that is, in the function $f()$ in the above formula.

In this article, we will build on the Rosenblatt Perceptron. It was one of the first perceptrons, if not the first. Throughout this article, I will simply use the name “Perceptron” when referring to the Rosenblatt Perceptron.

We will investigate the math involved and discuss its limitations, thereby setting the ground for future articles.

## The Basic Math Formula for the Rosenblatt Perceptron

$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ 0 & \text{otherwise} \end{cases}$

So, what the perceptron basically does is take some linear combination of input values or features, compare it to a threshold value $b$, and return 1 if the threshold is exceeded and zero if not.
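To make this concrete, here is a minimal Python sketch of that output rule. The weights, features, and threshold are made-up example values, not anything prescribed by the perceptron itself:

```python
# A minimal sketch of the perceptron's output rule; the weights,
# features, and threshold below are made-up example values.

def perceptron_output(weights, features, b):
    """Return 1 if the weighted sum of the features exceeds the threshold b, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return 1 if weighted_sum > b else 0

print(perceptron_output([0.5, 0.5], [1, 1], 0.7))  # 1.0 > 0.7 -> 1
print(perceptron_output([0.5, 0.5], [1, 0], 0.7))  # 0.5 > 0.7 is false -> 0
```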

The feature vector is a group of characteristics describing the objects we want to classify.

In other words, we classify our objects into two classes: a set of objects with characteristics (and thus a feature vector) resulting in output of 1, and a set of objects with characteristics resulting in an output of 0.

If you search the internet on information about the perceptron, you will find alternative definitions which define the formula as follows:

$f(x) = \begin{cases} +1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ -1 & \text{otherwise} \end{cases}$

We will see later that this does not affect the workings of the perceptron.

Let's dig a little deeper:

## Take a Linear Combination of Input Values

Remember the introduction: the perceptron takes some input values $[x_1, x_2, ..., x_i, ..., x_n]$, also called features, and some weights $[w_1, w_2, ..., w_i, ..., w_n]$, multiplies them pairwise, and takes the sum of these products:

$w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n$

This is the definition of a Linear Combination: it is the sum of some terms multiplied by constant values. In our case, the terms are the features and the constants are the weights.

If we write the subscripts using an index variable $i$, then we can write the sum as:

$\sum_{i=1}^{n} w_ix_i$

This is called capital-sigma notation: the $\sum$ represents the summation, the subscript $_{i=1}$ and the superscript $^{n}$ give the range over which we take the sum, and $w_ix_i$ is the term we take the sum of.
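The sigma notation translates almost literally into code. A small Python sketch, with arbitrary example values for the weights and features:

```python
# The capital-sigma sum written out with plain Python lists;
# the example weights and features are arbitrary.

w = [0.2, 0.4, 0.6]
x = [1.0, 2.0, 3.0]
s = sum(w[i] * x[i] for i in range(len(w)))  # sum over i of w_i * x_i
print(s)  # 0.2*1 + 0.4*2 + 0.6*3 = 2.8 (up to floating-point rounding)
```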

Also, we can see all $x_i$ and all $w_i$ as so-called vectors:

\begin{aligned} \mathbf{x}&=[x_1, x_2, ..., x_i, ..., x_n]\\ \mathbf{w}&=[w_1, w_2, ..., w_i, ..., w_n] \end{aligned}

In this, $n$ represents the dimension of the vector: it is the number of scalar elements in the vector. For our discussion, it is the number of characteristics used to describe the objects we want to classify.
In this case, the summation is the so-called dot-product of the vectors:

$\mathbf{w} \cdot \mathbf{x}$

About the notation: we write simple scalars (plain numbers) as lowercase letters, and vectors as bold letters. So in the above, $\mathbf{x}$ and $\mathbf{w}$ are vectors, and $x_i$ and $w_i$ are scalars: they are simple numbers representing the components of the vectors.

## Oooh, Hold Your Horses! You Say What? A ‘Vector’ ?

Ok, I may have gone a little too fast there by introducing vectors and not explaining them.

### Definition of a Vector

To make things more visual (which can help but isn’t always a good thing), I will start with a graphical representation of a 2-dimensional vector:

A point in the coordinate space $\mathbb{R}^2$ can be represented by a vector going from the origin to that point:

$\mathbf{a} = (a_1, a_2), \text{ in }\mathbb{R}^2$

We can further extend this to 3-dimensional coordinate space and generalize it to n-dimensional space:

$\mathbf{a} = (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n$

In text (from Wikipedia):

A (Euclidean) Vector is a geometric object that has a magnitude and a direction.

#### The Magnitude of a Vector

The magnitude of a vector, also called its norm, is defined as the root of the sum of the squares of its components and is written as $\lvert\lvert{\mathbf{a}}\lvert\lvert$.
In 2 dimensions, the definition comes from Pythagoras’ Theorem:

$\lvert\lvert{\mathbf{a}}\lvert\lvert = \sqrt{(a_1)^2 + (a_2)^2}$

Extended to n-dimensional space, we talk about the Euclidean norm:

$\lvert\lvert{\mathbf{a}}\lvert\lvert = \sqrt{a_1^2 + a_2^2 + ... + a_i^2 + ... + a_n^2} = \sqrt{\sum_{i=1}^{n} a_i^2}$
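You can try the formula in a few lines of Python; the example vectors are chosen so the result is a round number:

```python
import math

# The Euclidean norm, following the formula above; the example
# vectors are chosen so the result is a round number.

def norm(a):
    return math.sqrt(sum(component ** 2 for component in a))

print(norm([3, 4]))     # Pythagoras: sqrt(9 + 16) = 5.0
print(norm([1, 2, 2]))  # sqrt(1 + 4 + 4) = 3.0
```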


#### The Direction of a Vector

The direction of a 2-dimensional vector is defined by its angle to the positive horizontal axis:

$\theta =\tan^{-1}(\frac{a_2}{a_1})$

This works well in 2 dimensions, but it doesn't scale to higher dimensions: in 3 dimensions, for example, in which plane do we measure the angle? This is why direction cosines were introduced: they form a new vector whose components are the cosines of the angles between the original vector and each axis of the space.

$(\cos(\alpha_1), \cos(\alpha_2), ..., \cos(\alpha_i), ..., \cos(\alpha_n))$

We know from geometry that the cosine of an angle is defined by:

$\cos(\alpha) = \frac{\text{adjacent}}{\text{hypotenuse}}$

So, the definition of the direction cosine becomes:

$(\frac{a_1}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, \frac{a_2}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, ..., \frac{a_i}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, ..., \frac{a_n}{\lvert\lvert{\mathbf{a}}\lvert\lvert})$

This direction cosine is a vector $\mathbf{v}$ of length 1 pointing in the same direction as the original vector. This follows directly from the definition of the magnitude of a vector:

\begin{aligned} \lvert\lvert{\mathbf{v}}\lvert\lvert&=\sqrt{(\frac{a_1}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + (\frac{a_2}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + ... + (\frac{a_i}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + ... + (\frac{a_n}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2}\\ &=\sqrt{\frac{(a_1)^2+(a_2)^2+...+(a_i)^2+...+(a_n)^2}{\lvert\lvert{\mathbf{a}}\lvert\lvert^2}}\\ &=\frac{\sqrt{(a_1)^2+(a_2)^2+...+(a_i)^2+...+(a_n)^2}}{\lvert\lvert{\mathbf{a}}\lvert\lvert}\\ &=\frac{\lvert\lvert{\mathbf{a}}\lvert\lvert}{\lvert\lvert{\mathbf{a}}\lvert\lvert}\\ &=1\\ \end{aligned}

This vector with length 1 is also called the *unit vector*.
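A small Python sketch of the direction cosines, confirming that the result is a unit vector; the example vector is arbitrary:

```python
import math

# Direction cosines: each component of the original vector divided by
# its magnitude gives a unit vector. The example vector is arbitrary.

def norm(a):
    return math.sqrt(sum(c ** 2 for c in a))

def unit_vector(a):
    magnitude = norm(a)
    return [c / magnitude for c in a]

v = unit_vector([3, 4])
print(v)        # [0.6, 0.8]
print(norm(v))  # the unit vector has length (approximately) 1
```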


### Operations with Vectors

#### Sum and Difference of Two Vectors

Say we have two vectors:

\begin{aligned} \mathbf{a} &= (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n\\ \mathbf{b} &= (b_1, b_2, ..., b_n), \text{ in }\mathbb{R}^n \end{aligned}

The sum of two vectors is the vector resulting from the addition of the components of the original vectors.

\begin{aligned} \mathbf{c} &= \mathbf{a} + \mathbf{b}\\ &= (a_1 + b_1, a_2 + b_2, ..., a_n + b_n) \end{aligned}


The difference of two vectors is the vector resulting from the differences of the components of the original vectors:

\begin{aligned} \mathbf{c} &= \mathbf{a} - \mathbf{b}\\ &= (a_1 - b_1, a_2 - b_2, ..., a_n - b_n) \end{aligned}
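Both operations are easy to try in a few lines of Python; the example vectors are arbitrary:

```python
# Component-wise sum and difference of two vectors;
# the example vectors are arbitrary.

def vector_sum(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def vector_difference(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

a = [1, 2, 3]
b = [4, 0, -1]
print(vector_sum(a, b))         # [5, 2, 2]
print(vector_difference(a, b))  # [-3, 2, 4]
```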


#### Scalar Multiplication

Say we have a vector $\mathbf{a}$ and a scalar $\lambda$ (a number):

$\mathbf{a} = (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n, \qquad \lambda \in \mathbb{R}$

A vector multiplied by a scalar is the vector resulting from the multiplication of each component of the original vector by the scalar:

\begin{aligned} \mathbf{c} &= \lambda \mathbf{a}\\ &= (\lambda a_1, \lambda a_2, ..., \lambda a_n) \end{aligned}
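In Python, with an arbitrary example scalar and vector:

```python
# Scalar multiplication: each component multiplied by the scalar.
# The example scalar and vector are arbitrary.

def scalar_multiply(lam, a):
    return [lam * c for c in a]

print(scalar_multiply(2, [1, -2, 3]))  # [2, -4, 6]
```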


#### Dot Product

The dot-product is the scalar (a real number) resulting from taking the sum of the products of the corresponding components of two vectors:

$\mathbf{a} \cdot \mathbf{b} = a_1b_1 + a_2b_2 + ... + a_nb_n = \sum_{i=1}^{n} a_ib_i$

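A Python sketch of the dot product, with arbitrary example vectors:

```python
# The dot product: the sum of the products of corresponding components.
# The example vectors are arbitrary.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```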
### Behaviour of the Rosenblatt Perceptron

Because the formula of the perceptron basically describes a hyperplane, we can only classify things into two classes which are linearly separable: a first class with things above the hyperplane and a second class with things below the hyperplane.
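As an illustration, the logical AND function is linearly separable, and a perceptron with hand-picked weights and bias classifies it; the values below are my example choices, not from the article:

```python
# The logical AND function is linearly separable: the line x1 + x2 = 1.5
# puts (1, 1) on one side and the other three points on the other.
# The weights and bias are hand-picked example values.

def perceptron(weights, features, b):
    s = sum(w * x for w, x in zip(weights, features))
    return 1 if s > b else 0

w, b = [1.0, 1.0], 1.5
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(w, x, b) for x in points]
print(outputs)  # [0, 0, 0, 1]: only (1, 1) lies above the separating line
```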

### Formalizing Some Things: A Few Definitions

We’ve covered a lot of ground here, but without using a lot of the lingo surrounding perceptrons, neural networks and machine learning in general. There was already enough lingo with the mathematics that I didn’t want to bother you with even more definitions.

However, once we start diving deeper, we’ll start uncovering some pattern / structure in the way we work. At that point, it will be interesting to have some definitions which allow us to define steps in this pattern.

So, here are some definitions:

#### Feed Forward Single Layer Neural Network

What we have now is a feed forward single layer neural network:

##### Neural Network

A neural network is a group of nodes which are connected to each other. Thus, the output of certain nodes serves as input for other nodes: we have a network of nodes. The nodes in this network are modelled on the working of neurons in our brain, thus we speak of a neural network. In this article, our neural network had one node: the perceptron.

##### Single Layer

In a neural network, we can define multiple layers simply by using the output of perceptrons as the input for other perceptrons. If we make a diagram of this, we can view the perceptrons as being organized in layers in which the output of a layer serves as the input for the next layer.

##### Feed Forward

This stacking of layers on top of each other and the output of previous layers serving as the input for next layers results in feed forward networks. There is no feedback of upper layers to lower layers. There are no loops. For our single perceptron, we also have no loops and thus we have a feed forward network.

#### Integration Function

The calculation we make with the weight vector $\mathbf{w}$ and the feature vector $\mathbf{x}$ is called the integration function. In the Rosenblatt perceptron, the integration function is the dot-product.

#### Bias

The offset $b$ with which we compare the result of the integration function is called the bias.

#### Activation Function (Transfer Function)

The output we receive from the perceptron based on the calculation of the integration function is determined by the activation function. The activation function for the Rosenblatt perceptron is the Heaviside step function.
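Expressed in Python, the Heaviside step activation and the output it produces look roughly like this; it is a sketch with the bias folded into the activation's argument, and the example weights and bias are made up:

```python
# A sketch of the Heaviside step activation and how it yields the
# perceptron's output; the example weights and bias are made up.

def heaviside(z):
    """1 when the input is positive, 0 otherwise."""
    return 1 if z > 0 else 0

def perceptron(w, x, b):
    # activation applied to (integration function - bias): f(w . x - b)
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) - b)

print(perceptron([1.0, 1.0], [1, 1], 1.5))  # 2.0 - 1.5 > 0 -> 1
print(perceptron([1.0, 1.0], [0, 1], 1.5))  # 1.0 - 1.5 < 0 -> 0
```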

#### Supervised Learning

Supervised learning is a type of learning in which we feed samples into our algorithm and tell it the result we expect. By doing this, the neural network learns how to classify the examples. After giving it enough samples, we expect to be able to give it new data which it will automatically classify correctly.

The opposite of this is unsupervised learning, in which we give the algorithm samples but without the expected results. The algorithm then groups the samples based on some common properties.

There are other types of learning, like reinforcement learning, which we will not cover here.

#### Online Learning

The learning algorithm of the Rosenblatt perceptron is an example of an online learning algorithm: with each new sample given, the weight vector is updated.

The opposite of this is batch learning in which we only update the weight vector after having fed all samples to the learning algorithm. This may be a bit abstract here but we’ll clarify this in later articles.
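The online update can be sketched with the classic perceptron learning rule, which adjusts the weights after every sample by the error times the features. The training set (the logical AND function), learning rate, and number of passes below are my example choices, not part of the article:

```python
# A sketch of online learning with the classic perceptron update rule:
# after each sample, nudge the weights by (target - output) * feature.
# The training set (logical AND), learning rate, and epoch count are
# example choices.

def predict(w, x, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > b else 0

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, rate = [0, 0], 0, 1

for _ in range(10):  # several passes over the training set
    for x, target in samples:
        error = target - predict(w, x, b)
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        b = b - rate * error  # the threshold moves opposite to the weights

print([predict(w, x, b) for x, _ in samples])  # [0, 0, 0, 1]
```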

## What Is Wrong with the Rosenblatt Perceptron?

The main problem with the Rosenblatt perceptron is its learning algorithm. Although it works, it only works for linearly separable data. If the data we want to classify is not linearly separable, then we have no real idea of when to stop learning, nor do we know whether the resulting hyperplane somehow minimizes the number of wrongly classified samples.

Also, let’s say we have some data which is linearly separable. There are several lines which can separate this data:

We would like to find the hyperplane which fits the samples best. That is, we would like to find a line similar to the following:

There are, of course, mathematical tools which allow us to find this hyperplane. They basically all define some kind of error function and then try to minimize this error. The error function is typically defined as a function of the desired output and the effective output. The minimization is done using the derivative of this error function. And herein lies the problem for the Rosenblatt perceptron: because the output is defined by the Heaviside step function, which is discontinuous at the threshold and thus has no derivative there, we cannot have a mathematically rigorous learning method.

If the above is going a little too fast, don’t panic. In the next article, about the ADALINE perceptron, we’ll dig deeper into error functions and derivatives.

## References

### JavaScript Libraries used in the Try it yourself Pages

For the SVG illustrations, I use the well known D3.js library.
For databinding, Knockout.js is used.
Mathematical formulas are displayed using MathJax.

### Vector Math

The inspiration for writing this article and a good introduction to vector math:

Some Wikipedia articles on the basics of vectors and vector math:

An understandable proof of why the dot-product is also equal to the product of the length of the vectors with the cosine of the angle between the vectors:

### Hyperplanes and Linear Separability

Two math stackexchange Q&As on the equation of a hyperplane:

## History

• 12th May, 2019: Version 1.0: Initial release
• 22nd May, 2019: Version 1.1: Added source code for the try-it-yourself links

Written By
Software Developer (Senior)
Belgium
