Posted 12 May 2019

# The Math behind Neural Networks: Part 1 - The Rosenblatt Perceptron

A try it yourself guide to the basic math behind perceptrons

## Introduction

A lot of articles introduce the perceptron showing the basic mathematical formulas that define it, but without offering much of an explanation on what exactly makes it work.

It is certainly possible to use the perceptron without really understanding the basic math behind it, but isn't it also fun to see how all the math you learned in school can help you understand the perceptron and, by extension, neural networks?

I also got inspired for this article by a series of articles on Support Vector Machines, which explain the basic mathematical concepts involved and slowly build up to the more complex mathematics. So that is my intention with this article and the accompanying code: to show you the math involved in the perceptron. And, if time permits, I will write articles all the way up to convolutional neural networks.

Of course, when explaining the math, the question is: where do you start and when do you stop explaining? Some of the math involved is rather basic, for example: what is a vector? what is a cosine? I will assume some basic knowledge of mathematics: you have some idea of what a vector is, you know the basics of geometry, and so on. My assumptions will be arbitrary, so if you think I’m going too fast in some explanations, just leave a comment and I will try to expand on the subject.

So, let us get started.

### The Series

1. The Math behind Neural Networks: Part 1 - The Rosenblatt Perceptron
2. The Math behind Neural Networks: Part 2 - The ADALINE Perceptron
3. The Math behind Neural Networks: Part 3 - Neural Networks
4. The Math behind Neural Networks: Part 4 - Convolutional Neural Networks

### Setting some bounds

A perceptron basically takes some input values, called “features” and represented by the values $x_1, x_2, ... x_n$ in the following formula, multiplies them by some factors called “weights”, represented by $w_1, w_2, ... w_n$, takes the sum of these multiplications and depending on the value of this sum outputs another value $o$:

$o = f(w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n)$

There are a few types of perceptrons, differing in the way the sum results in the output, thus the function $f()$ in the above formula.

In this article we will build on the Rosenblatt Perceptron. It was one of the first perceptrons, if not the first. Throughout this article I will simply use the name “Perceptron” when referring to the Rosenblatt Perceptron.

We will investigate the math involved and discuss its limitations, thereby setting the ground for the future articles.

## The basic math formula for the Rosenblatt Perceptron

$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ 0 & \text{otherwise} \end{cases}$

So, what the perceptron basically does is take some linear combination of input values or features, compare it to a threshold value $b$, and return 1 if the threshold is exceeded and zero if not.
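As a minimal sketch of the formula above, the following Python function computes the weighted sum and compares it to the threshold. The function name `perceptron_output` and the numeric values are assumptions chosen only for illustration; they do not come from the article's accompanying code.

```python
from typing import Sequence


def perceptron_output(weights: Sequence[float],
                      features: Sequence[float],
                      threshold: float) -> int:
    """Return 1 if the weighted sum of the features exceeds the threshold, else 0."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    # Linear combination: w_1*x_1 + w_2*x_2 + ... + w_n*x_n
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return 1 if weighted_sum > threshold else 0


# Hypothetical weights and threshold, chosen only for illustration
weights = [0.5, -1.0, 2.0]
threshold = 1.0

print(perceptron_output(weights, [2.0, 1.0, 1.0], threshold))  # weighted sum 2.0 exceeds 1.0
print(perceptron_output(weights, [1.0, 1.0, 0.5], threshold))  # weighted sum 0.5 does not
```

Note that a weighted sum exactly equal to the threshold falls in the "otherwise" branch, because the comparison is strict.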

The feature vector is a group of characteristics describing the objects we want to classify.

In other words, we classify our objects into two classes: a set of objects with characteristics (and thus a feature vector) resulting in an output of 1, and a set of objects with characteristics resulting in an output of 0.

If you search the internet on information about the perceptron you will find alternative definitions which define the formula as follows:

$f(x) = \begin{cases} +1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ -1 & \text{otherwise} \end{cases}$

We will see later that this does not affect the workings of the perceptron.

Let's dig a little deeper:

## Take a linear combination of input values

Remember the introduction. In it we said the perceptron takes some input values $[x_1, x_2, ..., x_i, ..., x_n]$, also called features, and some weights $[w_1, w_2, ..., w_i, ..., w_n]$, multiplies them with each other and takes the sum of these multiplications:

$w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n$

This is the definition of a Linear Combination: it is the sum of some terms multiplied by constant values. In our case the terms are the features and the constants are the weights.

If we substitute the subscript by a variable $i$, then we can write the sum as

$\sum_{i=1}^{n} w_ix_i$

This is called the capital-sigma notation: the $\sum$ represents the summation, the subscript $_{i=1}$ and the superscript $^{n}$ represent the range over which we take the sum, and finally $w_ix_i$ represents the “things” we take the sum of.
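As a small worked example, with numbers chosen purely for illustration, take $n = 3$, weights $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$ and features $x_1 = 1$, $x_2 = 3$, $x_3 = 4$. Writing out the sum term by term gives:

\begin{aligned} \sum_{i=1}^{3} w_ix_i &= w_1x_1 + w_2x_2 + w_3x_3\\ &= 2 \cdot 1 + (-1) \cdot 3 + 0.5 \cdot 4\\ &= 2 - 3 + 2\\ &= 1 \end{aligned}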

Also, we can see all $x_i$ and all $w_i$ as so-called vectors:

\begin{aligned} \mathbf{x}&=[x_1, x_2, ..., x_i, ..., x_n]\\ \mathbf{w}&=[w_1, w_2, ..., w_i, ..., w_n] \end{aligned}

In this, $n$ represents the dimension of the vector: it is the number of scalar elements in the vector. For our discussion, it is the number of characteristics used to describe the objects we want to classify.
In this case, the summation is the so-called dot-product of the vectors:

$\mathbf{w} \cdot \mathbf{x}$

About the notation: we write simple scalars (thus simple numbers) as small letters, and vectors as bold letters. So in the above, $\mathbf{x}$ and $\mathbf{w}$ are vectors and $x_i$ and $w_i$ are scalars: they are simple numbers representing the components of the vectors.

## Oooh, hold your horses! You say what? A ‘Vector’ ?

Ok, I may have gone a little too fast there by introducing vectors and not explaining them.

### Definition of a Vector

To make things more visual (which can help but isn’t always a good thing), I will start with a graphical representation of a 2-dimensional vector:

The above point in the coordinate space $\mathbb{R}^2$ can be represented by a vector going from the origin to that point:

$\mathbf{a} = (a_1, a_2), \text{ in }\mathbb{R}^2$

We can further extend this to 3-dimensional coordinate space and generalize it to n-dimensional space:

$\mathbf{a} = (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n$

In text (from Wikipedia):

A (Euclidean) Vector is a geometric object that has a magnitude and a direction

#### The Magnitude of a Vector

The magnitude of a vector, also called its norm, is defined as the square root of the sum of the squares of its components and is written as $\lvert\lvert{\mathbf{a}}\lvert\lvert$.
In 2 dimensions, the definition comes from Pythagoras’ Theorem:

$\lvert\lvert{\mathbf{a}}\lvert\lvert = \sqrt{(a_1)^2 + (a_2)^2}$

Extended to n-dimensional space, we talk about the Euclidean norm:

$\lvert\lvert{\mathbf{a}}\lvert\lvert = \sqrt{a_1^2 + a_2^2 + ... + a_i^2 + ... + a_n^2} = \sqrt{\sum_{i=1}^{n} a_i^2}$
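The Euclidean norm translates almost literally into code. The sketch below is one possible plain-Python rendering; the helper name `magnitude` is an assumption made for this example.

```python
import math
from typing import Sequence


def magnitude(a: Sequence[float]) -> float:
    """Euclidean norm: the square root of the sum of the squared components."""
    return math.sqrt(sum(component ** 2 for component in a))


print(magnitude([3.0, 4.0]))       # 5.0, the classic 3-4-5 right triangle
print(magnitude([1.0, 2.0, 2.0]))  # 3.0, the same formula in 3 dimensions
```

The same function works for any number of dimensions, because the sum simply runs over however many components the vector has.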

Try it yourself:
Vector Magnitude interactive

#### The Direction of a Vector

The direction of a 2-dimensional vector is defined by its angle to the positive horizontal axis:

$\theta =\tan^{-1}(\frac{a_2}{a_1})$

This works well in 2 dimensions, but it doesn't scale to more dimensions: in 3 dimensions, for example, in which plane do we measure the angle? This is why the direction cosines were invented: they form a new vector made up of the cosines of the angles between the original vector and each axis of the space.

$(\cos(\alpha_1), \cos(\alpha_2), ..., \cos(\alpha_i), ..., \cos(\alpha_n))$

We know from geometry that the cosine of an angle is defined by:

$\cos(\alpha) = \frac{\text{adjacent}}{\text{hypotenuse}}$

So, the definition of the direction cosine becomes

$(\frac{a_1}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, \frac{a_2}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, ..., \frac{a_i}{\lvert\lvert{\mathbf{a}}\lvert\lvert}, ..., \frac{a_n}{\lvert\lvert{\mathbf{a}}\lvert\lvert})$

This direction cosine is a vector $\mathbf{v}$ with length 1 in the same direction as the original vector. This can be simply determined from the definition of the magnitude of a vector:

\begin{aligned} \lvert\lvert{\mathbf{v}}\lvert\lvert&=\sqrt{(\frac{a_1}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + (\frac{a_2}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + ... + (\frac{a_i}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2 + ... + (\frac{a_n}{\lvert\lvert{\mathbf{a}}\lvert\lvert})^2}\\ &=\sqrt{\frac{(a_1)^2+(a_2)^2+...+(a_i)^2+...+(a_n)^2}{\lvert\lvert{\mathbf{a}}\lvert\lvert^2}}\\ &=\frac{\sqrt{(a_1)^2+(a_2)^2+...+(a_i)^2+...+(a_n)^2}}{\lvert\lvert{\mathbf{a}}\lvert\lvert}\\ &=\frac{\lvert\lvert{\mathbf{a}}\lvert\lvert}{\lvert\lvert{\mathbf{a}}\lvert\lvert}\\ &=1 \end{aligned}

This vector with length 1 is also called the *unit vector*.
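To see the unit-length property numerically, here is a minimal sketch that computes the direction cosines of a vector and then measures the length of the result. The helper name `direction_cosines` is an assumption for this example; the zero vector is rejected because it has no direction (its norm is 0, so the division is undefined).

```python
import math
from typing import List, Sequence


def direction_cosines(a: Sequence[float]) -> List[float]:
    """Divide every component by the norm of the vector."""
    norm = math.sqrt(sum(component ** 2 for component in a))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    return [component / norm for component in a]


v = direction_cosines([3.0, 4.0])
print(v)  # [0.6, 0.8]

# The length of the result should be 1, up to floating-point rounding
length = math.sqrt(sum(component ** 2 for component in v))
print(math.isclose(length, 1.0))  # True
```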

Try it yourself:
Vector Direction interactive

### Operations with Vectors

#### Sum and difference of two Vectors

Say we have two vectors:

\begin{aligned} \mathbf{a} &= (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n\\ \mathbf{b} &= (b_1, b_2, ..., b_n), \text{ in }\mathbb{R}^n \end{aligned}

The sum of two vectors is the vector resulting from the addition of the corresponding components of the original vectors:

\begin{aligned} \mathbf{c} &= \mathbf{a} + \mathbf{b}\\ &= (a_1 + b_1, a_2 + b_2, ..., a_n + b_n) \end{aligned}

Try it yourself:
Sum of vectors interactive

The difference of two vectors is the vector resulting from the differences of the components of the original vectors:

\begin{aligned} \mathbf{c} &= \mathbf{a} - \mathbf{b}\\ &= (a_1 - b_1, a_2 - b_2, ..., a_n - b_n) \end{aligned}
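Both operations work component by component, which makes them short to write down in code. The following is a sketch in plain Python; the function names `add` and `subtract` and the example vectors are assumptions made for illustration.

```python
from typing import List, Sequence


def add(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Component-wise sum of two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [ai + bi for ai, bi in zip(a, b)]


def subtract(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Component-wise difference of two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [ai - bi for ai, bi in zip(a, b)]


a = [1, 2, 3]
b = [4, 0, -1]
print(add(a, b))       # [5, 2, 2]
print(subtract(a, b))  # [-3, 2, 4]
```

Both functions insist on equal dimensions, mirroring the requirement that $\mathbf{a}$ and $\mathbf{b}$ live in the same space $\mathbb{R}^n$.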

Try it yourself:
Difference of vectors interactive

#### Scalar multiplication

Say we have a vector $\mathbf{a}$ and a scalar $\lambda$ (a number):

\begin{aligned} \mathbf{a} &= (a_1, a_2, ..., a_n), \text{ in }\mathbb{R}^n\\ \lambda &\in \mathbb{R} \end{aligned}

A vector multiplied by a scalar is the vector resulting from the multiplication of each component of the original vector by the scalar:

\begin{aligned} \mathbf{c} &= \lambda \mathbf{a}\\ &= (\lambda a_1, \lambda a_2, ..., \lambda a_n) \end{aligned}
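In code, scalar multiplication is a single pass over the components. This is a small sketch; the name `scale` and the example values are assumptions for illustration only.

```python
from typing import List, Sequence


def scale(factor: float, a: Sequence[float]) -> List[float]:
    """Multiply every component of the vector by the scalar."""
    return [factor * component for component in a]


a = [1, -2, 3]
print(scale(2, a))   # [2, -4, 6]: same direction, twice as long
print(scale(-1, a))  # [-1, 2, -3]: same length, opposite direction
```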

Try it yourself:
Scalar Multiplication for vectors interactive

#### Dot product

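The dot-product is the scalar (a real number) resulting from taking the sum of the products of the corresponding components of two vectors:

$\mathbf{a} \cdot \mathbf{b} = a_1b_1 + a_2b_2 + ... + a_ib_i + ... + a_nb_n = \sum_{i=1}^{n} a_ib_i$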
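The dot-product is the same summation $\sum_{i=1}^{n} w_ix_i$ the perceptron uses to combine weights and features. A minimal Python sketch, with the function name `dot` and the example vectors chosen only for illustration:

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the products of the corresponding components."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(ai * bi for ai, bi in zip(a, b))


print(dot([1, 2, 3], [4, -5, 6]))  # 1*4 + 2*(-5) + 3*6 = 12
print(dot([3, 4], [3, 4]))         # 25, the squared magnitude of (3, 4)
```

The second call shows a useful link to the magnitude: the dot-product of a vector with itself is the square of its norm.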

### Behaviour of the Rosenblatt Perceptron

Because the formula of the perceptron is basically a hyperplane, we can only classify things into two classes which are linearly separable: a first class with things above the hyperplane and a second class with things below the hyperplane.
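To make "above" and "below" concrete, here is a sketch in 2 dimensions, where the hyperplane is simply a line. The weights, threshold and points are hypothetical values chosen for illustration: with $\mathbf{w} = (1, 1)$ and $b = 1$ the separating line is $x_1 + x_2 = 1$.

```python
from typing import Sequence


def side_of_hyperplane(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return 1 for points strictly above the hyperplane w . x = b, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > b else 0


# Hypothetical line x1 + x2 = 1
w = [1.0, 1.0]
b = 1.0

points = [[2.0, 2.0], [0.0, 0.0], [1.0, 0.5], [0.5, 0.5]]
for point in points:
    print(point, side_of_hyperplane(w, point, b))
# [2.0, 2.0] and [1.0, 0.5] land in class 1;
# [0.0, 0.0] and [0.5, 0.5] land in class 0 (the last one lies exactly on the line)
```

Whatever weights and threshold we pick, the boundary between the two classes stays a straight line (or a flat hyperplane in more dimensions), which is exactly the limitation described above.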

### Formalising some things: a few definitions

We’ve covered a lot of ground here, but without using a lot of the lingo surrounding perceptrons, neural networks and machine learning in general. There was already enough lingo with the mathematics that I didn’t want to bother you with even more definitions.

However, once we start diving deeper we’ll start uncovering some pattern / structure in the way we work. At that point, it will be interesting to have some definitions which allow us to define steps in this pattern.

So, here are some definitions:

Feed forward single layer neural network

What we have now is a feed forward single layer neural network:

Neural Network
A neural network is a group of nodes which are connected to each other. Thus, the output of certain nodes serves as input for other nodes: we have a network of nodes. The nodes in this network are modelled on the working of neurons in our brain, thus we speak of a neural network. In this article our neural network had one node: the perceptron.

Single Layer
In a neural network, we can define multiple layers simply by using the output of perceptrons as the input for other perceptrons. If we make a diagram of this, we can view the perceptrons as being organised in layers in which the output of a layer serves as the input for the next layer.

Feed Forward
This stacking of layers on top of each other and the output of previous layers serving as the input for next layers results in feed forward networks. There is no feedback of upper layers to lower layers. There are no loops. For our single perceptron we also have no loops and thus we have a feed forward network.

Integration function
The calculation we make with the weight vector $\mathbf{w}$ and the feature vector $\mathbf{x}$ is called the integration function. In the Rosenblatt perceptron the integration function is the dot-product.

Bias
The offset $b$ with which we compare the result of the integration function is called the bias.

Activation function (transfer function)
The output we receive from the perceptron based on the calculation of the integration function is determined by the activation function. The activation function for the Rosenblatt perceptron is the Heaviside step function.

Supervised learning
Supervised learning is a type of learning in which we feed samples into our algorithm and tell it the result we expect. By doing this the neural network learns how to classify the examples. After giving it enough samples we expect to be able to give it new data which it will automatically classify correctly.

The opposite of this is unsupervised learning, in which we give some samples but without the expected result. The algorithm is then expected to group these samples based on some common properties of the samples.

There are other types of learning, like reinforcement learning, which we will not cover here.

Online learning
The learning algorithm of the Rosenblatt perceptron is an example of an online learning algorithm: with each new sample given, the weight vector is updated.

The opposite of this is batch learning in which we only update the weight vector after having fed all samples to the learning algorithm. This may be a bit abstract here but we’ll clarify this in later articles.
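The definitions above can be tied together in a short sketch. It assumes the commonly used perceptron update rule $\mathbf{w} \leftarrow \mathbf{w} + \eta (d - o)\mathbf{x}$, with the bias adjusted in the opposite direction; that rule, the function names, the learning rate and the AND-gate data set are all assumptions made for this example and are not taken from this section.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, desired output)


def integrate(w: Sequence[float], x: Sequence[float]) -> float:
    """Integration function: the dot-product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def heaviside(z: float) -> int:
    """Activation function: the Heaviside step function."""
    return 1 if z > 0 else 0


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Output 1 when the integration result exceeds the bias b."""
    return heaviside(integrate(w, x) - b)


def train_online(samples: Sequence[Sample], n_features: int,
                 learning_rate: float = 1.0,
                 max_epochs: int = 100) -> Tuple[List[float], float]:
    """Supervised, online training: update right after every misclassified sample."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, desired in samples:
            error = desired - predict(w, b, x)
            if error != 0:
                mistakes += 1
                # Assumed standard perceptron rule: move w towards (or away from) x
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                # Raising the threshold makes the perceptron fire less often
                b -= learning_rate * error
        if mistakes == 0:
            break  # every sample is classified correctly
    return w, b


# Logical AND: a small, linearly separable data set with expected results
and_samples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
w, b = train_online(and_samples, n_features=2)
print([predict(w, b, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```

A batch variant would accumulate the corrections over all samples inside the inner loop and apply them once per epoch instead of immediately.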

## What is wrong with the Rosenblatt perceptron?

The main problem of the Rosenblatt perceptron is its learning algorithm. Although it works, it only works for linearly separable data. If the data we want to classify is not linearly separable, then we do not really have any idea when to stop the learning, nor do we know whether the hyperplane found somehow minimizes the wrongly classified data.

Also, let’s say we have some data which is linearly separable. There are several lines which can separate this data:

We would like to find the hyperplane which fits the samples best. That is, we would like to find a line similar to the following:

There are of course mathematical tools which allow us to find this hyperplane. They basically all define some kind of error function and then try to minimize this error. The error function is typically defined as a function of the desired output and the actual output, just like we did above. The minimization is done by calculating the derivative of this error function. And herein lies the problem for the Rosenblatt perceptron. Because the output is defined by the Heaviside step function, which is not continuous at zero and therefore has no derivative there (and has a derivative of zero everywhere else), we cannot have a mathematically rigorous learning method.

If the above is going a little too fast, don’t panic. In the next article about the ADALINE perceptron we’ll dig deeper into error functions and derivation.
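The trouble with the derivative can be illustrated numerically. The sketch below approximates the slope of the Heaviside step function with a central finite difference; the helper names and the sample points are assumptions chosen for this illustration, and the approximation is only meant to give a feel for the problem, not a proof.

```python
from typing import Callable


def heaviside(z: float) -> int:
    """Heaviside step function: 1 for positive input, 0 otherwise."""
    return 1 if z > 0 else 0


def central_difference(f: Callable[[float], float], z: float, h: float) -> float:
    """Approximate the slope of f at z using the points z - h and z + h."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# Away from zero the function is flat, so the slope is 0 and carries no information
for z in (-1.0, 0.5, 2.0):
    print(z, central_difference(heaviside, z, 1e-3))

# At zero the estimate equals 1 / (2h) and keeps growing as h shrinks
for h in (1e-1, 1e-3, 1e-6):
    print(h, central_difference(heaviside, 0.0, h))
```

A slope of zero gives a minimization method no hint about which way to adjust the weights, and at the jump itself the estimate does not settle on any value.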

## References

### Javascript libraries used in the Try it yourself pages

For the SVG illustrations I use the well known D3.js library
For databinding Knockout.js is used
Mathematical formulas are displayed using MathJax

### Vector Math

The inspiration for writing this article and a good introduction to vector math: SVM - Understanding the math - Part 2

Some wikipedia articles on the basics of vectors and vector math:
Euclidean vector
Magnitude
Direction cosine

An understandable proof of why the dot-product is also equal to the product of the lengths of the vectors with the cosine of the angle between the vectors:
Proof of dot-product

### Hyperplanes and Linear Separability

Two math stackexchange Q&A’s on the equation of a hyperplane:
Hyperplane equation intuition / geometric interpretation
Why is the product of a normal vector and a vector on the plane equal to the equation of the plane?

### Convexity

Definition of convexity: Convex set
Discussing convexity, we also discussed Line segments: Line segment

Proving a half-plane is convex: How do I prove that half a plane is convex?

A more in depth discussion of convexity: Lecture 1 Convex Sets

### Perceptron

Wikipedia on the perceptron: Perceptron
Another explanation of the perceptron: The Simple Perceptron
A Perceptron is a special kind of linear classifier
The following article has an interesting view on what they call the duality of input and weight-space: 3. Weighted Networks – The Perceptron

### Perceptron Learning

The following article gives another intuitive explanation of why the learning algorithm works: Perceptron Learning Algorithm: A Graphical Explanation Of Why It Works

An animated gif of the perceptron learning rule: Perceptron training without bias

### Convergence of the learning algorithm

This YouTube video presents a very understandable proof: Lec-16 Perceptron Convergence Theorem

A written version of the same proof can be found in this pdf: CHAPTER 1 Rosenblatt’s Perceptron. By the way, there is much more inside that pdf than just the proof.

## History

• Version 1.0: initial release (12 May 2019)
• Version 1.1: added source code for the try-it-yourself links (22 May 2019)
