## Introduction

There is a lot of buzz these days around Artificial Intelligence, Machine Learning, Neural Networks and many other cognitive computing topics. New ideas and technologies appear so quickly that it is close to impossible to keep track of them all. The progress made in these areas over the last decade has created many new applications and new ways of solving known problems, and of course it has generated great interest in learning more about the field and in finding new things to apply it to.

The topic of Artificial Neural Networks (ANNs, or just neural networks to keep it simple) has been interesting to me for a long time. I started playing with them more than 15 years ago, applied them to some work back at university and contributed some neural network code to the open source community. Interest in neural networks was growing rapidly back in those days, but there was still not as much noise around them as there is now.

A lot has changed since that time: new neural network architectures have emerged, many great applications have been developed and amazing ideas generated. So I felt I needed to spend some time refreshing my knowledge of the topic. And, as someone mentioned in one of the ANN-related blog posts: "The best way to understand internals of neural networks is to implement them". I decided to do it that way. As a result, I implemented a small C++ library for some common neural network architectures.

There are many great ANN libraries around, for sure. Many of them are oriented to Python developers; Python might be powerful indeed, but it is not the programming language of my choice. Other libraries have quite a complicated code base, which may not be easy to learn side by side with the theory. And there is a big variety of small libraries targeting particular neural network architectures. Anyway, since I wanted to learn all the guts, I implemented it my way. Why C++? Well, I wanted to get closer to the metal: vectorization with SIMD instructions, parallelism, and possibly GPU support in the future.

This article is the first in a series about the ANNT library, which provides implementations of some common neural network architectures and applies them to different tasks. This first article covers the well-known basics: feed forward fully connected networks and the back propagation learning algorithm. It will provide the foundation for future articles about convolutional and recurrent networks. Each article will be accompanied by the source code of the library available so far and some working examples.

## Theoretical background

As the topic is not new, there are many resources available on the theory of artificial neural networks, their different architectures and their training. Here we'll not go too deep into theoretical details and will describe the theory only briefly, providing links to other materials covering the topic more thoroughly.

### Biological inspiration

Many ideas of modern artificial neural networks are inspired by their biological counterpart. A neuron, or nerve cell, is the core component of the nervous system in general and of the brain in particular. It is an electrically excitable cell that receives, processes and transmits information through electrical and chemical signals. These signals pass between neurons via specialized connections called synapses. Neurons can connect to each other to form neural circuits. The average human brain has about 100 billion neurons, each of which may be connected to up to 10,000 other neurons, forming about 1,000 trillion synaptic connections.

A typical neuron consists of a cell body (soma), dendrites and an axon. Dendrites are thin structures that arise from the cell body, often extending for hundreds of micrometers and branching multiple times. An axon is a special cellular extension that arises from the cell body and travels for a distance as far as one meter in humans or even more in other species. Most neurons receive signals via the dendrites and send out signals down the axon. As such, dendrites can be thought of as a neuron's inputs and the axon as its output.

### Artificial neuron

An artificial neuron is a mathematical function representing a model of a biological neuron. The artificial neuron receives one or more inputs (representing potential at neural dendrites) and sums them to produce an output (or activation, representing neuron's action potential transmitted along its axon). Usually each input is separately weighted, and the sum is passed through a non-linear function known as an activation function or transfer function.

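Expressed as an equation, a simple artificial neuron is described by the following formula:

$$u = \sum_{j=1}^{m} w_j x_j + b, \qquad y = f(u)$$

where **x**_{j} values are the neuron's inputs, **w**_{j} values are the inputs' weights, **b** is a bias value, **m** is the number of inputs, **u** is the weighted sum of inputs, **f** is the activation function and **y** is the neuron's output. To make things more compact, the formula can be re-written in vector notation (here **x** and **w** are inputs and weights represented as column vectors):

$$y = f\left(\mathbf{w}^{T}\mathbf{x} + b\right)$$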

The first artificial neuron was the Threshold Logic Unit (TLU) proposed by Warren McCulloch and Walter Pitts in 1943. As a transfer function it employed a threshold function. Initially, only a simple model was considered, with binary inputs/outputs and some restrictions on the possible weights. From the beginning it was noticed that any boolean function could be implemented by networks of such devices, which follows from the fact that a single unit can implement functions such as **AND**, **OR** and **NAND**.

In the late 1980s, when research on neural networks regained strength, neurons with continuous, smooth activation functions started to be considered. Being able to differentiate the activation function allows the direct use of gradient descent and other optimization algorithms for adjusting the weights and bias values.

#### AND/OR examples

As mentioned above, a single neuron can implement a function like **OR**, **AND** or **NAND**, for example. To implement these functions, the neuron's bias and weights can be set to the values below:

| | **b** | **w**_{1} | **w**_{2} |
|---|---|---|---|
| **OR** | -0.5 | 1 | 1 |
| **AND** | -1.5 | 1 | 1 |
| **NAND** | 1.5 | -1 | -1 |

Putting these weights and bias values into the neuron's equation and assuming it uses the threshold activation function (1 for **u** >= 0, 0 otherwise), we can check that the neuron really does its job.

| **x**_{1} | **x**_{2} | **u**_{or} | **y**_{or} | **u**_{and} | **y**_{and} | **u**_{nand} | **y**_{nand} |
|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.5 | 0 | -1.5 | 0 | 1.5 | 1 |
| 1 | 0 | 0.5 | 1 | -0.5 | 0 | 0.5 | 1 |
| 0 | 1 | 0.5 | 1 | -0.5 | 0 | 0.5 | 1 |
| 1 | 1 | 1.5 | 1 | 0.5 | 1 | -0.5 | 0 |
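The truth table above can also be checked in code. Below is a minimal sketch of a two-input threshold neuron; the `ThresholdNeuron` type and the `OrNeuron`, `AndNeuron` and `NandNeuron` constants are illustrative names introduced here and are not part of the ANNT library.

```cpp
#include <array>

// Threshold (step) activation: 1 for u >= 0, 0 otherwise.
inline int Threshold( double u )
{
    return ( u >= 0.0 ) ? 1 : 0;
}

// A single two-input neuron with a threshold activation function.
// Illustrative type only, not an ANNT class.
struct ThresholdNeuron
{
    double                bias;
    std::array<double, 2> weights;

    // u = w1 * x1 + w2 * x2 + b
    double WeightedSum( int x1, int x2 ) const
    {
        return weights[0] * x1 + weights[1] * x2 + bias;
    }

    // y = threshold( u )
    int Output( int x1, int x2 ) const
    {
        return Threshold( WeightedSum( x1, x2 ) );
    }
};

// Bias and weight values taken from the table of weights above.
const ThresholdNeuron OrNeuron   { -0.5, {  1.0,  1.0 } };
const ThresholdNeuron AndNeuron  { -1.5, {  1.0,  1.0 } };
const ThresholdNeuron NandNeuron {  1.5, { -1.0, -1.0 } };
```

Calling `AndNeuron.WeightedSum( 1, 1 )` gives 0.5, which the threshold turns into 1, while any other binary input pair gives a negative sum and so an output of 0.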

Can we do something more complex with a single neuron? The **XOR** function, for example? No. The reason is that when a single neuron is used in a classification problem, it can only separate data points with a straight line. However, **XOR** inputs are not linearly separable. The picture below shows the data points of three functions: **OR**, **AND** and **XOR**. For both **OR** and **AND** data points it is possible to draw a straight line separating them into classes, while this cannot be done for **XOR** data points.

The separating lines above are, in fact, obtained from the weights and bias values. For the **OR** function we've used **b**=-0.5, **w**_{1}=1 and **w**_{2}=1, which gives the following sum: 1 * **x**_{1} + 1 * **x**_{2} - 0.5. Setting the sum to zero and turning it into a linear equation, we get **x**_{2} = 0.5 - **x**_{1}, the line that separates the **OR** data points.

Is it possible to implement the **XOR** function with more than a single neuron? Sure. Remember that **XOR** can be implemented using the **OR**, **AND** and **NAND** functions: XOR(*x*_{1}, *x*_{2}) = AND(OR(*x*_{1}, *x*_{2}), NAND(*x*_{1}, *x*_{2})). This means that 3 neurons joined into a 2-layer network will do the job.
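As a rough illustration of that 2-layer arrangement, the sketch below wires three threshold neurons together using the handcrafted weights from the table earlier. The `Neuron` and `Xor` helpers are hypothetical names used only for this example.

```cpp
// Threshold neuron with two inputs: returns 1 when w1 * x1 + w2 * x2 + b >= 0, otherwise 0.
inline int Neuron( double b, double w1, double w2, int x1, int x2 )
{
    return ( w1 * x1 + w2 * x2 + b >= 0.0 ) ? 1 : 0;
}

// Two-layer network: the OR and NAND neurons form the first layer,
// and a single AND neuron forms the second (output) layer.
inline int Xor( int x1, int x2 )
{
    int orOut   = Neuron( -0.5,  1.0,  1.0, x1, x2 );
    int nandOut = Neuron(  1.5, -1.0, -1.0, x1, x2 );

    return Neuron( -1.5, 1.0, 1.0, orOut, nandOut );
}
```

For the input (1, 1), the OR neuron outputs 1 and the NAND neuron outputs 0, so the AND neuron receives (1, 0) and outputs 0, as XOR requires.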

### Artificial neural network

Since not much can be done with a single neuron, neurons are joined into networks: artificial neural networks. Each network contains a number of layers, which in turn contain a number of neurons. There are many different architectures of artificial neural networks, which differ in how neurons are connected between layers and how the input signal travels through the network. In this article we'll start with the simplest architecture, the feed forward fully connected network.

In this type of artificial neural network, each neuron of the next layer is connected to all neurons of the previous layer (and to no other neurons), while each neuron in the first layer is connected to all inputs. The signal travels in one direction only in these networks, from inputs to outputs. Networks of this type can do well in different classification and regression tasks.

**Note**: It is very common to refer to the network's inputs as the input layer, to the last layer as the output layer and to all other layers as hidden layers. Since the input layer is more of a naming convention and does not really represent an entity in the network itself, it will not be counted as a layer throughout the article when we speak about the number of layers in a network. So, if we say we have a 3-layer network, it is assumed we have a network with 2 hidden layers and an output layer.

To provide a mathematical model of feed forward fully connected networks, let's agree on some variable naming and structure:

- *l*: number of layers in the network;
- *n*^{(k)}: number of neurons in the *k*^{th} layer;
- *n*^{(0)}: number of inputs into the network;
- *m*^{(k)}: number of inputs into the *k*^{th} layer (note: *m*^{(k)} = *n*^{(k-1)});
- *y*^{(k)}: column vector of outputs of the *k*^{th} layer, of length *n*^{(k)};
- *y*^{(0)}: column vector of inputs into the network (vector *x*);
- *b*^{(k)}: column vector of bias values for the *k*^{th} layer, of length *n*^{(k)};
- *W*^{(k)}: matrix of weights for the *k*^{th} layer. The *i*^{th} row of the matrix contains the weights of the *i*^{th} neuron of the layer, which means the size of the matrix is *n*^{(k)} by *m*^{(k)}.

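With all the definitions above, the output of a feed forward fully connected network can be computed using the simple formula below (assuming the computation order goes from the first layer to the last one), where *f* is the activation function and *w*_{i,j}^{(k)} is the element of *W*^{(k)} in row *i* and column *j*:

$$y_i^{(k)} = f\left(\sum_{j=1}^{m^{(k)}} w_{i,j}^{(k)} \, y_j^{(k-1)} + b_i^{(k)}\right), \qquad i = 1, \dots, n^{(k)}, \quad k = 1, \dots, l$$

Or, to make it compact, here is the same in vector notation, with *f* applied element-wise:

$$y^{(k)} = f\left(W^{(k)} y^{(k-1)} + b^{(k)}\right)$$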

That is basically all the math of a feed forward fully connected network! Or very close to it. The question is: what can be done with these formulas alone? Little. Unless the weights and bias values are correctly initialized for the problem we want to solve, an artificial neural network implemented using the above formulas is useless. For the simple OR/AND functions above, we handcrafted weights/biases which do the job. But for anything more complex than that, finding those values is not a trivial process. This is where the learning algorithm comes into play.
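The layer-by-layer computation can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Layer` structure and the `LayerForward` and `NetworkForward` functions are hypothetical names (ANNT's own classes are organized differently), and the sigmoid function discussed in the next section is assumed as the activation for every layer.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;   // row i holds the weights of neuron i

// One fully connected layer (illustrative type, not an ANNT class).
struct Layer
{
    Matrix weights;   // n rows (neurons) by m columns (inputs)
    Vector biases;    // n values
};

// Assumed activation function for all layers.
inline double Sigmoid( double u )
{
    return 1.0 / ( 1.0 + std::exp( -u ) );
}

// Computes y = f( W * x + b ) for a single layer.
inline Vector LayerForward( const Layer& layer, const Vector& x )
{
    Vector y( layer.biases.size( ) );

    for ( std::size_t i = 0; i < y.size( ); i++ )
    {
        double u = layer.biases[i];

        for ( std::size_t j = 0; j < x.size( ); j++ )
        {
            u += layer.weights[i][j] * x[j];
        }
        y[i] = Sigmoid( u );
    }
    return y;
}

// Feeds the input through all layers, from the first one to the last.
inline Vector NetworkForward( const std::vector<Layer>& layers, Vector y )
{
    for ( const Layer& layer : layers )
    {
        y = LayerForward( layer, y );
    }
    return y;
}
```

The output of each layer simply becomes the input of the next one, which is why `NetworkForward` can reuse a single vector variable.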

### Activation function

To complete the math required for neural network inference, we need to say more about activation functions. As mentioned before, the very first models of artificial neurons used a threshold function to compute their output from the weighted sum of inputs. Although simple, the threshold function has a number of disadvantages. The primary one is its derivative, which is not defined at *x*=0 and equals zero everywhere else. As we'll see further, the gradient descent algorithm used for neural network training requires the activation function to be differentiable and to have a non-zero gradient over a wide range of input values.

One of the most popular activation functions is the sigmoid function, which is defined as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid function's shape resembles the shape of the step (threshold) function, but it is not as sharp. It is smooth, differentiable, non-binary and produces output in the (0, 1) range, so it seems like a good alternative. It is not perfect though, and has its own issues as well. However, it has proved to work well for different classification tasks done with feed forward fully connected networks, so we'll stick to it for now to keep things simple.

*Figure: plots of the sigmoid function and the hyperbolic tangent (tanh) function.*

A few other popular activation functions worth mentioning are:

- Hyperbolic tangent, which has a shape similar to the sigmoid function's, but provides output in the (-1, 1) range.
- Softmax function, which "squashes" a vector of arbitrary real values into a vector of the same dimension, where each entry is in the (0, 1) range and all the entries add up to 1. It is good for classification tasks, where the neural network's output can be treated as probabilities of belonging to certain classes.
- Rectifier (ReLU), which is a popular activation function in deep neural network architectures, as it allows better gradient propagation and so suffers less from vanishing gradient problems.
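The functions listed above are short enough to sketch directly. The snippet below is a plain illustration using the standard library only; the function names are chosen for this example and do not mirror ANNT's activation layer classes. The softmax version subtracts the largest input before exponentiation, a common trick to avoid overflow that does not change the result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sigmoid: output in the (0, 1) range.
inline double Sigmoid( double x )
{
    return 1.0 / ( 1.0 + std::exp( -x ) );
}

// Hyperbolic tangent: output in the (-1, 1) range.
inline double Tanh( double x )
{
    return std::tanh( x );
}

// Rectifier (ReLU): 0 for negative inputs, identity otherwise.
inline double Relu( double x )
{
    return std::max( 0.0, x );
}

// Softmax over a non-empty vector: entries are in (0, 1) and add up to 1.
inline std::vector<double> Softmax( const std::vector<double>& x )
{
    double              maxValue = *std::max_element( x.begin( ), x.end( ) );
    std::vector<double> y( x.size( ) );
    double              sum = 0.0;

    for ( std::size_t i = 0; i < x.size( ); i++ )
    {
        y[i] = std::exp( x[i] - maxValue );
        sum += y[i];
    }
    for ( double& value : y )
    {
        value /= sum;
    }
    return y;
}
```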

Why do we need an activation function at all? Can we do without it? We can remove it from the network's output layer if we are doing a regression task and so need unbounded output. But we cannot remove it from hidden layers. The activation function in hidden layers adds nonlinearity, and so the network can learn non-linear features. This gives us the ability to solve tasks like the XOR problem, where classes are not linearly separable. Removing the activation function from hidden layers destroys the ability to learn non-linear features and in fact turns any multilayer network into a single layer one. Yes, multiple layers without an activation function can be replaced with just one layer, which will do the same job. Or, better to say, will not do it, since there is no point in adding any extra layer then.
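To see why layers without an activation function collapse into one, take two consecutive layers in the vector notation used earlier and drop *f*. The short derivation below uses only the symbols already defined (weights *W*, biases *b*, input *x*):

```latex
\begin{aligned}
y^{(2)} &= W^{(2)} y^{(1)} + b^{(2)} \\
        &= W^{(2)} \left( W^{(1)} x + b^{(1)} \right) + b^{(2)} \\
        &= \left( W^{(2)} W^{(1)} \right) x + \left( W^{(2)} b^{(1)} + b^{(2)} \right)
\end{aligned}
```

The result is again a single linear layer with weight matrix *W*^{(2)}*W*^{(1)} and bias vector *W*^{(2)}*b*^{(1)} + *b*^{(2)}, so stacking more such layers adds no expressive power.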

So now the math looks complete for neural network inference, that is, calculating the network's outputs for new data after the training phase is complete and the network's weights/biases have been tuned. However, we don't have tuned weights yet. We need to find a way of training the neural network, so that it does something useful.

### Training artificial neural network

For training a feed forward fully connected artificial neural network we are going to use a supervised learning algorithm. This means we'll have a training dataset, which provides samples of possible inputs and target outputs. The idea of the learning algorithm, very briefly, is that an untrained (randomly initialized) neural network is given sample inputs from the training dataset and computes the corresponding outputs. The outputs produced by the network are then compared with the target outputs it needs to produce, and an error value is calculated. Based on that error value, the network's parameters (weights and biases) are updated so as to decrease the error, i.e. to make the difference between produced and target outputs smaller. One pass through the training dataset, in which outputs are calculated, the error value is evaluated and the network's parameters are updated, is called a training epoch. Usually training is repeated either for a specified number of epochs or until the error value becomes small enough.

#### Cost function

The first thing we need to do is define the error function or, as it is very often called, the cost function. There are a number of popular functions to choose from, each fitting some tasks better than others. However, to keep things simple, we'll start with the Mean Square Error (MSE) function, which is a common choice for regression tasks. Suppose we have a training set with *m* samples, which are represented by *x*^{(j)} vectors of inputs and *t*^{(j)} vectors of target outputs (even though most regression tasks assume a single output, we'll think of it as a vector to keep the math general). For every input the network computes the corresponding *y*^{(j)} vector of outputs. If we drop the superscripts, we can also use *y* and *t* to denote any arbitrary network output and its corresponding target. Assuming the network has *n* neurons in its output layer, and so the same number of elements in the output vector, the MSE cost function *C* for a single training example can be defined like this:

$$C(y, t) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - t_i \right)^2$$

If we want to calculate the cost function's value for the entire training dataset, we can average it across all available samples:

$$C = \frac{1}{m} \sum_{j=1}^{m} C\left(y^{(j)}, t^{(j)}\right) = \frac{1}{2mn} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( y_i^{(j)} - t_i^{(j)} \right)^2$$

**Note**: as the name of the cost function suggests, it should be the mean value of the square error, which logically suggests the sum of square errors should be divided by *n*. However, dividing it by 2*n* does not change the idea much, and it simplifies the math when it comes to derivatives.

Now that we have the cost function defined, we can get a single numeric value, which can be used to judge how well an artificial neural network performs on the training dataset. When training a neural network, it is useful to monitor this value to see if it improves over time and, if so, how quickly.
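For illustration, the cost described above (sum of squared errors divided by 2*n*, then averaged over samples) could be computed as in the sketch below. The `MseCost` and `MseCostAverage` functions are hypothetical helpers written for this example; they are not the library's `XMSECost` class.

```cpp
#include <cstddef>
#include <vector>

// MSE cost of a single sample: sum of squared errors divided by 2n.
// Both vectors are assumed to be non-empty and of the same length n.
inline double MseCost( const std::vector<double>& y, const std::vector<double>& t )
{
    double sum = 0.0;

    for ( std::size_t i = 0; i < y.size( ); i++ )
    {
        double diff = y[i] - t[i];
        sum += diff * diff;
    }
    return sum / ( 2.0 * y.size( ) );
}

// Cost averaged across all samples of a (non-empty) data set.
inline double MseCostAverage( const std::vector<std::vector<double>>& outputs,
                              const std::vector<std::vector<double>>& targets )
{
    double total = 0.0;

    for ( std::size_t j = 0; j < outputs.size( ); j++ )
    {
        total += MseCost( outputs[j], targets[j] );
    }
    return total / outputs.size( );
}
```

For example, outputs (1, 3) against targets (0, 1) give squared errors 1 and 4, so the single-sample cost is 5 / 4 = 1.25.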

#### Stochastic gradient descent

With the cost function defined, we can now move on to training the neural network and updating its weights/biases so that it performs better. What we have is a classical optimization problem: we need to find network parameters such that the cost function approaches its minimum value (a local minimum). For that we can employ the Gradient Descent optimization algorithm. The algorithm is based on the observation that if a multi-variable function *F(x)* is defined and differentiable in a neighbourhood of point *a*, then *F(x)* decreases fastest if one goes from *a* in the direction of the negative gradient of the function at that point, i.e. -∇*F(a)*. And so, the parameter update rule for the Gradient Descent algorithm is defined in the following way:

$$a_{n+1} = a_n - \lambda \nabla F(a_n)$$

For a small enough value of the parameter *λ*, *F*(*a*_{n+1}) <= *F*(*a*_{n}). With certain assumptions on the function *F*, convergence to a local minimum can be guaranteed.

When training an artificial neural network, we need to minimize the cost function for the training set we have. Taking into account that the training set is fixed, the input samples and target outputs can be treated as constants. And so, the cost function becomes just a function of the network's weights (to keep it simple for now, bias values are treated as a special kind of weight), which we need to optimize in order to minimize the cost. Starting with randomly initialized weights, training a neural network with the Gradient Descent algorithm is done by iteratively updating the weights using the following formula, where *C* is the cost function:

$$w_{(t+1)} = w_{(t)} - \lambda \nabla C\left(w_{(t)}\right)$$

The *λ* parameter is known as the learning rate and affects the speed of training a neural network (the speed of approaching a local minimum of the cost function). The optimal value of the parameter varies depending on the neural network's architecture, the training setup, etc., so it is chosen based on experience and experiments. If it is set too low, convergence to a local minimum may be too slow and the network will take a very long time to train. On the other hand, if it is set too high, the cost function may oscillate and even diverge.

Before moving further into the weights update and the calculation of the cost function's gradient, let's see what the problem with the Gradient Descent algorithm is. Very often training sets are very large: tens to hundreds of thousands of samples or even millions. Calculating the cost function over the entire set may get quite expensive, both in CPU/GPU time and in memory. An alternative solution is to use the Stochastic Gradient Descent (SGD) algorithm, which randomly picks a training sample (or shuffles the training set at the start of each training epoch), calculates the cost function only for that sample and then updates the parameters based on this single sample. It repeats such update iterations for all samples in the training set, but in random order. In practice, Stochastic Gradient Descent very often leads to faster training, since the model gets small improvements many times during an epoch, as opposed to a single parameter update per epoch with true Gradient Descent. This works because training sets very often contain many similar samples, which vary little from one another, so an update made for some samples very often improves the result for future samples as well.

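So, according to the SGD algorithm, our neural network's weights update rule becomes based on a single random example *j* only, where *C*_{j} denotes the cost function computed for that sample alone:

$$w_{(t+1)} = w_{(t)} - \lambda \nabla C_j\left(w_{(t)}\right)$$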

The convergence of Stochastic Gradient Descent has been analysed and it was observed that when the learning rates *λ* decrease with an appropriate rate, SGD converges almost surely to a global minimum when the objective function is convex, and otherwise converges almost surely to a local minimum.

Mini-Batch Gradient Descent (or just Batch Gradient Descent) is yet another alternative algorithm – something in between the two above. It is similar to the Gradient Descent, but instead of calculating parameters' update over the entire training set, it does it over a batch of the specified size. And similar to the Stochastic Gradient Descent, samples are chosen randomly into each batch (or shuffled upfront).

Although Mini-Batch Gradient Descent is the preferred setup for most applications these days, we'll stick to Stochastic Gradient Descent for now to simplify the rest of the training algorithm.
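The three variants differ only in how many samples contribute to one parameter update. One way to picture this is a helper that shuffles sample indices and splits them into batches: a batch size of 1 corresponds to Stochastic Gradient Descent, a batch size equal to the number of samples corresponds to Gradient Descent over the entire set, and anything in between is a mini-batch. The `MakeBatches` function below is a hypothetical sketch, not part of ANNT.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns shuffled sample indices split into batches of up to batchSize elements.
// batchSize must be greater than zero; the last batch may be smaller than the rest.
inline std::vector<std::vector<std::size_t>> MakeBatches( std::size_t   samplesCount,
                                                          std::size_t   batchSize,
                                                          std::mt19937& rng )
{
    // indices 0, 1, ..., samplesCount - 1 in random order
    std::vector<std::size_t> indices( samplesCount );
    std::iota( indices.begin( ), indices.end( ), std::size_t{ 0 } );
    std::shuffle( indices.begin( ), indices.end( ), rng );

    std::vector<std::vector<std::size_t>> batches;

    for ( std::size_t start = 0; start < samplesCount; start += batchSize )
    {
        std::size_t end = std::min( start + batchSize, samplesCount );

        batches.emplace_back( indices.begin( ) + start, indices.begin( ) + end );
    }
    return batches;
}
```

A training epoch would then walk over the returned batches, computing one parameter update per batch.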

#### Chain rule and the gradient

Now it is time to elaborate on the neural network's weights update rule. For now, let's look at the weights update procedure for the last layer of a feed forward fully connected neural network. We'll assume that the last layer has *n* neurons/outputs, each having *m* inputs; *y*_{i} is the output of the *i*^{th} neuron and *u*_{i} is its weighted sum of inputs (the input to the activation function); *t*_{i} is the target output of the *i*^{th} neuron; *x*_{j} is the *j*^{th} input (coming from the corresponding neuron of the previous layer); *w*_{i,j} is the weight of the *i*^{th} neuron for the *j*^{th} input; *b*_{i} is the bias value of the *i*^{th} neuron. According to the SGD algorithm, the update for each weight *w*_{i,j} is based on the partial derivative of the cost function *C* with respect to that weight, which can be written this way:

$$w_{i,j} \leftarrow w_{i,j} - \lambda \frac{\partial C}{\partial w_{i,j}}$$

To calculate the partial derivative of the cost function we'll need to use the so-called chain rule. The reason is that the cost function is not a simple function of the network's weights. Instead, it is a function of the network's output and the target output, where the network's output is a function of the weighted sum of inputs, and finally the weighted sum is a function of the network's weights. For example, suppose we have a function *f(x)*, where *x* is another function, *x(t)*, and finally *t* is a function as well, *t(a, b)*. In other words, it can be written as *f(x(t(a, b)))*. Suppose we need to find the partial derivative of *f* with respect to *a*. Using the chain rule it can be done this way:

$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial t} \cdot \frac{\partial t}{\partial a}$$

Applying the same idea to the partial derivative of the cost function, we get the following formula:

$$\frac{\partial C}{\partial w_{i,j}} = \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial w_{i,j}}$$

Let's find every partial derivative of the chain now. Although the MSE cost function we are using for now assumes the **mean** of square errors, it is more common to use the total sum when it comes to calculating its derivative. With this in mind, the partial derivative of the cost function with respect to the output of the *i*^{th} neuron is written this way:

$$\frac{\partial C}{\partial y_i} = \frac{\partial}{\partial y_i} \left( \frac{1}{2} \sum_{k=1}^{n} \left( y_k - t_k \right)^2 \right) = y_i - t_i$$

And so the partial derivative of the MSE cost function with respect to the network's output is just the difference between the actual output and the target output, which can be treated as the prediction error. If we have more than one output neuron, it is better to calculate this error for each individual neuron independently of the number of neurons in the output layer, which is why dividing by *n* is usually omitted.

The next step is to calculate the derivative of the activation function with respect to its input. Since we are using the sigmoid activation function, we get the following derivative:

$$\frac{\partial y_i}{\partial u_i} = \frac{e^{-u_i}}{\left( 1 + e^{-u_i} \right)^2} = y_i \left( 1 - y_i \right)$$

Note that the derivative of the sigmoid function can be written in two ways. The first one is based on the function's argument, i.e. *u*_{i}. However, no one really does it this way when it comes to artificial neural networks. It is much faster to calculate the sigmoid's derivative using the value of the function itself, considering it is computed anyway during the calculation of the network's output.
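Both ways of computing the sigmoid's derivative are shown in the sketch below, so the equivalence can be checked numerically. The function names are illustrative only.

```cpp
#include <cmath>

inline double Sigmoid( double u )
{
    return 1.0 / ( 1.0 + std::exp( -u ) );
}

// Derivative computed from the function's input u; needs an extra exp() call.
inline double SigmoidDerivativeFromInput( double u )
{
    double e = std::exp( -u );

    return e / ( ( 1.0 + e ) * ( 1.0 + e ) );
}

// Derivative computed from the already known output y = Sigmoid( u ).
inline double SigmoidDerivativeFromOutput( double y )
{
    return y * ( 1.0 - y );
}
```

Since the output *y* is already available after the forward pass, the second form costs one subtraction and one multiplication, with no further exponentiation.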

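Finally, we can define the partial derivatives of a neuron's weighted sum, *u*_{i}, with respect to its weights, *w*_{i,j}, and bias value, *b*_{i}:

$$\frac{\partial u_i}{\partial w_{i,j}} = x_j, \qquad \frac{\partial u_i}{\partial b_i} = 1$$

Putting this all together, we get the following update rules for the weights and bias values of neurons in the last layer:

$$w_{i,j} \leftarrow w_{i,j} - \lambda \left( y_i - t_i \right) y_i \left( 1 - y_i \right) x_j$$

$$b_i \leftarrow b_i - \lambda \left( y_i - t_i \right) y_i \left( 1 - y_i \right)$$

The formulas above can be used for training a feed forward fully connected artificial neural network with a single layer only. However, most applications require multi-layer networks. This is where the error backpropagation algorithm comes into play.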
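To make the single-layer case concrete, here is a minimal sketch that trains one sigmoid neuron on the OR function using the three derivative terms described above (error, sigmoid derivative, input). It assumes zero-initialized weights and a fixed sample order so that the run is reproducible, which differs from the random initialization and shuffling discussed earlier. `SigmoidNeuron` and `TrainOr` are hypothetical names, not ANNT classes.

```cpp
#include <cmath>

// A single sigmoid neuron with two inputs (illustrative type, not part of ANNT).
struct SigmoidNeuron
{
    double w1 = 0.0;
    double w2 = 0.0;
    double b  = 0.0;

    double Output( double x1, double x2 ) const
    {
        return 1.0 / ( 1.0 + std::exp( -( w1 * x1 + w2 * x2 + b ) ) );
    }

    // One SGD step on a single sample, assuming MSE cost and sigmoid activation.
    void Train( double x1, double x2, double target, double learningRate )
    {
        double y     = Output( x1, x2 );
        double delta = ( y - target ) * y * ( 1.0 - y );   // dC/dy * dy/du

        w1 -= learningRate * delta * x1;                   // du/dw1 = x1
        w2 -= learningRate * delta * x2;                   // du/dw2 = x2
        b  -= learningRate * delta;                        // du/db  = 1
    }
};

// Trains the neuron on the OR truth table, visiting samples in a fixed order.
inline SigmoidNeuron TrainOr( int epochs, double learningRate )
{
    const double inputs[4][2] = { { 0, 0 }, { 1, 0 }, { 0, 1 }, { 1, 1 } };
    const double targets[4]   = { 0, 1, 1, 1 };

    SigmoidNeuron neuron;

    for ( int epoch = 0; epoch < epochs; epoch++ )
    {
        for ( int s = 0; s < 4; s++ )
        {
            neuron.Train( inputs[s][0], inputs[s][1], targets[s], learningRate );
        }
    }
    return neuron;
}
```

After enough epochs the neuron's output should fall below 0.5 for the input (0, 0) and rise above 0.5 for the other three inputs, which is the learned counterpart of the handcrafted OR weights shown earlier.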

#### Error backpropagation

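To get the weights' update rules for hidden layers, we can use the same chain rule technique as before. We already saw how to find the partial derivative of the cost function *C* with respect to neurons' outputs in the output layer. Let's denote that as *E*_{i}, the error term of the *i*^{th} neuron in the output layer:

$$E_i = \frac{\partial C}{\partial y_i} = y_i - t_i$$

And now let's define the formula for *E'*_{j}, the partial derivative of the cost function with respect to the output of the *j*^{th} neuron in the previous layer (the layer before the output layer). We'll use the chain rule again, but we need to keep one important thing in mind. Since we have a fully connected artificial neural network, every output of the previous layer is connected to every neuron in the following layer. This is reflected in the error term calculation as a sum over all neurons of the following layer:

$$E'_j = \frac{\partial C}{\partial y'_j} = \sum_{i=1}^{n} \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial y'_j}$$

Now let's make some substitutions. First, let's pull the *E*_{i} term into the formula. Then let's recall that the *j*^{th} output of the previous layer, *y'*_{j}, is the same as the input into the current layer, *x*_{j}, so the last term of the chain equals the weight *w*_{i,j}. We can then rewrite the above formula in a more generic way:

$$E'_j = \sum_{i=1}^{n} E_i \cdot \frac{\partial y_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial x_j} = \sum_{i=1}^{n} E_i \, y_i \left( 1 - y_i \right) w_{i,j}$$

The *E*_{i} term in the above formula was left on purpose. If we applied the chain rule further to find the error term for another hidden layer, we would come to the same formula again. This means that once the error term is calculated for the output layer using the partial derivative of the cost with respect to the network's output, the error terms for all previous layers can be calculated from the error term of the following layer using the formula above.

With the above generalization, we can now write down the weights' update rules for all layers of a feed forward fully connected artificial neural network, using the layer superscripts introduced earlier:

$$E_i^{(l)} = y_i^{(l)} - t_i, \qquad E_j^{(k-1)} = \sum_{i=1}^{n^{(k)}} E_i^{(k)} \, y_i^{(k)} \left( 1 - y_i^{(k)} \right) w_{i,j}^{(k)}$$

$$w_{i,j}^{(k)} \leftarrow w_{i,j}^{(k)} - \lambda \, E_i^{(k)} \, y_i^{(k)} \left( 1 - y_i^{(k)} \right) y_j^{(k-1)}$$

$$b_i^{(k)} \leftarrow b_i^{(k)} - \lambda \, E_i^{(k)} \, y_i^{(k)} \left( 1 - y_i^{(k)} \right)$$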

The algorithm described above is called error backpropagation. Once the error is calculated for the output layer, it is propagated backwards through the neural network using the partial derivatives mechanism. And so, when it comes to artificial neural networks, it is very common to speak of forward and backward passes. The forward pass is the calculation of the network's output: signals flow from inputs to outputs. The backward pass is the calculation of the network parameters' update: error values flow from outputs to inputs.
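A small sketch can show the backward pass at work on a hidden layer. The example below assumes a tiny network with 2 inputs, 2 hidden sigmoid neurons and 1 sigmoid output neuron, with the cost taken as half of the squared error and arbitrarily chosen initial weights. The `TinyNet` type and its functions are hypothetical and only illustrate the error propagation step; they do not reflect how ANNT structures its layers.

```cpp
#include <cmath>

// Tiny 2-2-1 network with arbitrary fixed parameters (illustrative only).
struct TinyNet
{
    double w1[2][2] = { { 0.3, -0.2 }, { 0.5, 0.1 } };   // hidden layer weights
    double b1[2]    = { 0.1, -0.1 };                     // hidden layer biases
    double w2[2]    = { 0.7, -0.4 };                     // output neuron weights
    double b2       = 0.2;                               // output neuron bias
};

inline double Sigmoid( double u )
{
    return 1.0 / ( 1.0 + std::exp( -u ) );
}

// Forward pass: stores hidden outputs in h and returns the network's output.
inline double Forward( const TinyNet& net, const double x[2], double h[2] )
{
    double u = net.b2;

    for ( int j = 0; j < 2; j++ )
    {
        h[j] = Sigmoid( net.w1[j][0] * x[0] + net.w1[j][1] * x[1] + net.b1[j] );
        u   += net.w2[j] * h[j];
    }
    return Sigmoid( u );
}

// Cost for a single sample: half of the squared error.
inline double Cost( const TinyNet& net, const double x[2], double t )
{
    double h[2];
    double y = Forward( net, x, h );

    return 0.5 * ( y - t ) * ( y - t );
}

// Backward pass: gradW1[j][k] receives dC/dw1[j][k] for the hidden layer.
inline void BackwardHidden( const TinyNet& net, const double x[2], double t, double gradW1[2][2] )
{
    double h[2];
    double y     = Forward( net, x, h );
    double E     = y - t;                   // error term of the output neuron
    double delta = E * y * ( 1.0 - y );     // multiplied by the sigmoid derivative

    for ( int j = 0; j < 2; j++ )
    {
        double Eh     = delta * net.w2[j];            // error propagated to hidden neuron j
        double deltaH = Eh * h[j] * ( 1.0 - h[j] );   // times that neuron's sigmoid derivative

        for ( int k = 0; k < 2; k++ )
        {
            gradW1[j][k] = deltaH * x[k];             // du/dw = input
        }
    }
}
```

A common way to validate such code is a numerical gradient check: nudge one weight by a small amount in both directions, measure the change in `Cost`, and compare the resulting slope with the value produced by `BackwardHidden`.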

Keep in mind that all the above is valid if we use MSE as cost function and sigmoid as activation function. If another cost or activation function is used, the above formulas will change. But not a lot – only the corresponding partial derivative term will be different.

Well, that is it for the theory for now. Obviously, there is much more to say about feed forward fully connected artificial neural networks and their training. But this should be enough for an introduction, while the provided links serve as an extra source of information.

## The ANNT library

While implementing the code for the ANNT library, the goal was to make it flexible and easy to extend and use. And so, the object oriented paradigm was adopted from the very first steps. When designing the class hierarchy for artificial neural networks, it was decided to take the network's layers as the minimum modelling entity. This way it is possible to achieve better performance (as opposed to modelling down to individual neurons, as some implementations do) and to get the flexibility of building different neural network architectures from layers of different types.

Although the theoretical part suggests that activation functions are part of neurons, their implementation is separated into special activation layer classes. Different cost functions are also implemented as separate classes to make it easy to choose one depending on the task being solved. As a result of such granularity, the weights update rule as it was shown in the theory part will not be found in the code. Instead, each class implements its own part of the back propagation algorithm by calculating the required term of the error's gradient.

For example, the `XMSECost` class calculates only the *y*_{i} - *t*_{i} part. Then the `XSigmoidActivation` class adds the *y*_{i}(1 - *y*_{i}) part on top. And finally, the `XFullyConnectedLayer` class takes care of computing partial derivatives with respect to weights and also the error gradients to pass to the previous layer. This way it is possible to plug different activation and cost functions into the neural network's model without needing to hard code the entire weights' update algorithm.

The Gradient Descent update rule is also moved to a separate class. As mentioned before, the formula to update weights looks this way for the algorithm: *w*_{(t+1)} = *w*_{(t)} - λ * Δ*w*_{(t)}, where Δ*w*_{(t)} is the gradient of the cost function with respect to the weights. However, it is not the only possible algorithm and very often is not the one that gives the fastest training. For example, another popular algorithm is Gradient Descent with Momentum, which has an update rule like this: *v*_{(t)} = μ * *v*_{(t-1)} + λ * Δ*w*_{(t)}; *w*_{(t+1)} = *w*_{(t)} - *v*_{(t)}. Since there are many different varieties of gradient descent algorithms, it was logical to implement those as individual classes.

The `XNeuralNetwork` class represents an actual neural network. The architecture of the network depends on the types of layers put into it. In this article we'll see examples of feed forward fully connected ANNs only. However, in the next articles we'll explore convolutional and recurrent neural networks as well.

Finally, there are two additional classes. The `XNetworkInference` class is used to calculate the network's output only, which is what we need when a neural network is already trained. The `XNetworkTraining` class provides the necessary infrastructure to do the actual training of a neural network. Notice that the cost function and the parameters' update algorithm (optimizer) are needed only in the training phase.

Another thing to note is that the ANNT library makes use of SIMD instructions (SSE2 and AVX) to vectorize computations, as well as OpenMP to parallelize them. Support for SIMD is checked at runtime and the available instruction set is then used. However, if any of that needs to be disabled for whatever reason, the `Config.hpp` file can be edited.

### Building the code

The code comes with MSVC (2015 version) solution files and GCC make files. Using the MSVC solutions is very easy: every example's solution file includes the projects of the example itself and of the library, so it is a matter of opening the solution file of the required example and hitting the build button. If using GCC, the library needs to be built first and then the required sample application, in both cases by running **make**.

## Usage examples

To demonstrate how the ANNT library can be used in different applications of feed forward fully connected artificial neural networks, we are going to explore 5 examples provided with the code. **Note**: none of these examples claim that the demonstrated neural network architecture is the best for its task. In fact, none of these examples even claim that artificial neural networks are the way to go. Instead, their only purpose is to demonstrate how to use the library.

**Note**: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to the source code package provided with the article.

### Function approximation

The first example to demonstrate is function approximation (regression). For this task we are given a data set, which contains **X**/**Y** values of some function with noise added to the **Y** values. The task is then to train a single input, single output neural network, which outputs an approximation of the function, **Y**, for the given input **X**. For example, below are two sample data sets for this application. The blue line shows the base function, while the orange dots represent data points with noise added to the **Y** values. The neural network is given the noisy **X**/**Y** pairs during training. When the training is done, the network is used to calculate **Y** values from **X** values only, so that we can see how close the approximation is.

**Line data set**

**Parabola data set**

For the line data set, the network can be as simple as a single neuron without an activation function. This is known as linear regression. For the parabola data set, however, we need an extra hidden layer to cope with the non-linearity. A simple 2-layer neural network can be created with the code below.

    // 1 input -> 10 hidden neurons (sigmoid) -> 1 output
    shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

    net->AddLayer( make_shared<XFullyConnectedLayer>( 1, 10 ) );
    net->AddLayer( make_shared<XSigmoidActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 1 ) );
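
For the line data set, the single neuron case mentioned above is small enough to write out by hand. The sketch below does not use ANNT: `LinearNeuron` and `TrainLinearNeuron` are hypothetical names, and it assumes plain batch gradient descent on the mean squared error rather than any of the library's optimizers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A single neuron with one input and no activation function: y = w * x + b.
struct LinearNeuron
{
    float w = 0.0f;
    float b = 0.0f;

    float Compute( float x ) const { return w * x + b; }
};

// Trains the neuron with plain batch gradient descent on the MSE cost.
// Returns the cost measured at the start of the final epoch.
float TrainLinearNeuron( LinearNeuron& neuron,
                         const std::vector<float>& xs,
                         const std::vector<float>& ys,
                         float learningRate, std::size_t epochs )
{
    assert( !xs.empty( ) && xs.size( ) == ys.size( ) );

    const float n = static_cast<float>( xs.size( ) );
    float cost = 0.0f;

    for ( std::size_t epoch = 0; epoch < epochs; epoch++ )
    {
        float gradW = 0.0f, gradB = 0.0f;
        cost = 0.0f;

        for ( std::size_t i = 0; i < xs.size( ); i++ )
        {
            float error = neuron.Compute( xs[i] ) - ys[i];

            cost  += error * error;
            gradW += error * xs[i];
            gradB += error;
        }
        cost /= n;

        // d(cost)/dw = 2/n * sum( error * x ), d(cost)/db = 2/n * sum( error )
        neuron.w -= learningRate * 2.0f * gradW / n;
        neuron.b -= learningRate * 2.0f * gradB / n;
    }
    return cost;
}
```

Fed with points taken from a straight line, the neuron's `w` and `b` should settle close to the slope and intercept of that line. No choice of `w` and `b` can bend the output into a parabola, which is why the second data set needs the hidden layer.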

Then a training object is created for the network, which is given the cost function of our choice and the variation of the gradient descent algorithm to use.

    // Nesterov momentum as the optimizer, mean squared error as the cost
    XNetworkTraining netTraining( net,
                                  make_shared<XNesterovMomentumOptimizer>( ),
                                  make_shared<XMSECost>( ) );
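
To give an idea of what a "variation of gradient descent" does with the gradients, here is a minimal standalone sketch of a Nesterov momentum update in one of its commonly used formulations. It is not the code of `XNesterovMomentumOptimizer`; the class name, its parameters and the exact update form are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nesterov momentum in a commonly used form which needs no explicit look-ahead:
//   previous = v;  v = momentum * v - learningRate * gradient;
//   w += -momentum * previous + ( 1 + momentum ) * v;
class NesterovMomentum
{
public:
    NesterovMomentum( float learningRate, float momentum, std::size_t parametersCount )
        : mLearningRate( learningRate ), mMomentum( momentum ),
          mVelocity( parametersCount, 0.0f )
    {
    }

    // Updates parameters in place from the gradients of the cost function.
    void Step( std::vector<float>& parameters, const std::vector<float>& gradients )
    {
        assert( parameters.size( ) == mVelocity.size( ) );
        assert( gradients.size( ) == mVelocity.size( ) );

        for ( std::size_t i = 0; i < parameters.size( ); i++ )
        {
            float previous = mVelocity[i];

            mVelocity[i]   = mMomentum * previous - mLearningRate * gradients[i];
            parameters[i] += -mMomentum * previous + ( 1.0f + mMomentum ) * mVelocity[i];
        }
    }

private:
    float mLearningRate;
    float mMomentum;
    std::vector<float> mVelocity;   // one velocity value per parameter
};
```

Calling `Step( )` repeatedly with the gradient of a simple function such as w² (that is, 2w) drives the parameter towards zero, the minimum of that function.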

Finally, a training loop is defined, which runs a certain number of epochs. At the start of each epoch, the training data set is shuffled to make sure samples are taken in random order.

    for ( size_t epoch = 1; epoch <= trainingParams.EpochsCount; epoch++ )
    {
        // shuffle data by swapping random pairs, keeping inputs and targets aligned
        for ( size_t i = 0; i < samplesCount / 2; i++ )
        {
            int swapIndex1 = rand( ) % samplesCount;
            int swapIndex2 = rand( ) % samplesCount;

            std::swap( ptrInputs[swapIndex1], ptrInputs[swapIndex2] );
            std::swap( ptrTargetOutputs[swapIndex1], ptrTargetOutputs[swapIndex2] );
        }

        // run one pass over the training set in mini-batches
        auto cost = netTraining.TrainEpoch( ptrInputs, ptrTargetOutputs, trainingParams.BatchSize );
    }
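
The loop above shuffles by swapping random pairs of samples. As an alternative sketch (not what the sample application does), a Fisher-Yates shuffle driven by `<random>` produces a uniform random permutation while still keeping each input aligned with its target output. `ShufflePairs` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Shuffles two parallel containers with the same random permutation
// (Fisher-Yates), so that inputs[i] still belongs to targets[i] afterwards.
template <typename TInput, typename TTarget>
void ShufflePairs( std::vector<TInput>& inputs,
                   std::vector<TTarget>& targets,
                   std::mt19937& rng )
{
    assert( inputs.size( ) == targets.size( ) );

    for ( std::size_t i = inputs.size( ); i > 1; i-- )
    {
        // pick j uniformly from [0, i - 1] and move that sample into slot i - 1
        std::uniform_int_distribution<std::size_t> pick( 0, i - 1 );
        std::size_t j = pick( rng );

        std::swap( inputs[i - 1], inputs[j] );
        std::swap( targets[i - 1], targets[j] );
    }
}
```

Passing the random number generator in by reference lets the caller seed it once, which makes training runs reproducible when needed.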

Once the training is done, the sample application uses the trained neural network to calculate the function's outputs for the given inputs. The result is then saved into a CSV file, so that it can be analysed further. Below are a few examples of the approximation results. As before, the blue line is the base function (for reference) and the orange dots are the noisy data set used for training the neural network. The green line is what we are interested in: the approximation of the function obtained from the noisy inputs.

**Line approximation**

**Parabola approximation**

**Sine approximation**

**Increasing sine approximation**

### Time series prediction

The second example demonstrates time series prediction. Here our data sets have only **F(t)** values of some function, while the **t** values are missing. The function's values are ordered by **t**, so the data set represents a time series: values are ordered as they were generated in time. Our task is to train a neural network to predict future values of the function based on its past values.

Below is an example of a time series used in the sample. No noise added, no time values, only the function's value, **F(t)**.

This example can also be treated as function approximation. However, we are not approximating **F(t)**, which would mean finding the function's value for the specified value of **t**. Instead, we need to find the function's value based on a number of its past values. Let's suppose we are going to use 5 past values of the function to predict the next value. In this case we are going to approximate the following function: **F(F(t-1), F(t-2), F(t-3), F(t-4), F(t-5))**, i.e. finding the function's value based on its last 5 values.

The first thing the sample application does is prepare a training set. Remember that unlike the approximation example demonstrated above, here we have only the function's values. And so, we need to create a training set, which contains sample inputs for the neural network and target outputs. Suppose the original data file contains 100 values of some function. We are going to reserve some of the last values, let's say 5 values, so that we can check the prediction quality of the trained neural network. Out of the other 95 values we can generate 90 input/output training pairs, since we are using 5 past values to predict the next one.
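
The arithmetic above (100 values, 5 held back, windows of 5, hence 90 pairs) can be sketched as a sliding window over the series. This is an illustration rather than the sample application's code: `WindowedSet` and `MakeTrainingSet` are hypothetical names, and the window is assumed to be stored oldest value first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct WindowedSet
{
    std::vector<std::vector<float>> inputs;   // windowSize past values each
    std::vector<float>              targets;  // the value following each window
};

// Builds input/target pairs from a series, leaving the last holdOut values
// untouched so they can be used to check prediction quality later.
WindowedSet MakeTrainingSet( const std::vector<float>& series,
                             std::size_t windowSize, std::size_t holdOut )
{
    assert( windowSize > 0 && series.size( ) > holdOut + windowSize );

    WindowedSet set;
    const std::size_t usable = series.size( ) - holdOut;

    for ( std::size_t i = 0; i + windowSize < usable; i++ )
    {
        // window covers series[i] .. series[i + windowSize - 1]
        set.inputs.emplace_back( series.begin( ) + i, series.begin( ) + i + windowSize );
        // target is the value right after the window
        set.targets.push_back( series[i + windowSize] );
    }
    return set;
}
```

With a series of 100 values, `MakeTrainingSet( series, 5, 5 )` yields 90 pairs: the first window covers values 0 to 4 with value 5 as its target, and the last one targets value 94.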

Once the training set is generated, the rest of the code for creating and training the neural network is the same as we've seen before. The only difference is that now we have a neural network with 5 inputs.

    // 5 inputs (past values) -> 10 hidden neurons (tanh) -> 1 output (next value)
    shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

    net->AddLayer( make_shared<XFullyConnectedLayer>( 5, 10 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 1 ) );

    XNetworkTraining netTraining( net,
                                  make_shared<XNesterovMomentumOptimizer>( ),
                                  make_shared<XMSECost>( ) );

    for ( size_t epoch = 1; epoch <= trainingParams.EpochsCount; epoch++ )
    {
        // shuffle data by swapping random pairs, keeping inputs and targets aligned
        for ( size_t i = 0; i < samplesCount / 2; i++ )
        {
            int swapIndex1 = rand( ) % samplesCount;
            int swapIndex2 = rand( ) % samplesCount;

            std::swap( ptrInputs[swapIndex1], ptrInputs[swapIndex2] );
            std::swap( ptrTargetOutputs[swapIndex1], ptrTargetOutputs[swapIndex2] );
        }

        auto cost = netTraining.TrainEpoch( ptrInputs, ptrTargetOutputs, trainingParams.BatchSize );
    }

This sample application also writes its result into a CSV file, so that it can be analysed further. Again, here are a few examples of the result. The blue line is the original data we've been given. The orange line is the output of the trained network for the inputs taken from the training set. It is no surprise that the orange line follows the blue one very well, since this is the data the network was trained on. The green line, however, represents the prediction of the network. The network is given data which was not included in the training set, and its output is recorded. The output just produced is then used to make the next prediction, and so on.
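
Feeding each prediction back in as an input can be sketched as below. The trained network is stood in for by a generic callable; `Predictor` and `PredictAhead` are hypothetical names, and the window is again assumed to hold the oldest value first.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Any callable mapping a window of past values to the predicted next value.
using Predictor = std::function<float( const std::vector<float>& )>;

// Predicts `steps` future values. Each prediction is appended to the window
// (and the oldest value dropped), so later predictions rely on earlier ones.
std::vector<float> PredictAhead( const Predictor& predict,
                                 std::vector<float> window, std::size_t steps )
{
    assert( !window.empty( ) );

    std::vector<float> predictions;

    for ( std::size_t i = 0; i < steps; i++ )
    {
        float next = predict( window );

        predictions.push_back( next );
        window.erase( window.begin( ) );   // drop the oldest value
        window.push_back( next );          // feed the prediction back in
    }
    return predictions;
}
```

Since every step builds on earlier predictions, errors can accumulate the further ahead the network is asked to predict.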

**Time series example #1**

**Time series example #2**

**Time series example #3**

### Binary classification of XOR function

This example is a sort of "Hello World" application for artificial neural networks. A very simple 2-layer neural network (3 neurons in total) is trained to classify the XOR function's input. As we have now moved to classification, we use a new cost function in this example, Binary Cross Entropy, which is a common choice when dealing with two classes only.

    // the four XOR input combinations and their target outputs
    vector<fvector_t> inputs;
    vector<fvector_t> targetOutputs;

    inputs.push_back( { -1.0f, -1.0f } ); targetOutputs.push_back( { 0.0f } );
    inputs.push_back( {  1.0f, -1.0f } ); targetOutputs.push_back( { 1.0f } );
    inputs.push_back( { -1.0f,  1.0f } ); targetOutputs.push_back( { 1.0f } );
    inputs.push_back( {  1.0f,  1.0f } ); targetOutputs.push_back( { 0.0f } );

    // 2 inputs -> 2 hidden neurons (tanh) -> 1 output (sigmoid)
    shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

    net->AddLayer( make_shared<XFullyConnectedLayer>( 2, 2 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 2, 1 ) );
    net->AddLayer( make_shared<XSigmoidActivation>( ) );

    XNetworkTraining netTraining( net,
                                  make_shared<XMomentumOptimizer>( 0.1f ),
                                  make_shared<XBinaryCrossEntropyCost>( ) );

    printf( "Cost of each sample: \n" );

    // train on one randomly chosen sample at a time
    for ( size_t i = 0; i < 80 * 2; i++ )
    {
        size_t sample = rand( ) % inputs.size( );
        auto cost = netTraining.TrainSample( inputs[sample], targetOutputs[sample] );
    }

Although very simple, the example allows experimenting with a few ideas. For example, you can comment out the first hidden layer and notice that the neural network fails to learn to classify the XOR function. The same happens if you comment out not the hidden layer itself, but its activation function. In this case, even though we still have "two layers", we destroy the non-linearity component, which turns our network into a single layer only.
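
Why two layers without an activation function act as a single layer can be checked numerically: W2 * ( W1 * x + b1 ) + b2 equals ( W2 * W1 ) * x + ( W2 * b1 + b2 ). The sketch below is standalone code, not part of ANNT; `Affine` and `MergeLayers` are hypothetical helpers and weights are assumed to be stored as one row per neuron.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;
using Matrix = std::vector<Vector>;   // row-major, one row per neuron

// Output of a fully connected layer without activation: y = W * x + b.
Vector Affine( const Matrix& W, const Vector& b, const Vector& x )
{
    assert( W.size( ) == b.size( ) );

    Vector y( b );
    for ( std::size_t r = 0; r < W.size( ); r++ )
    {
        assert( W[r].size( ) == x.size( ) );
        for ( std::size_t c = 0; c < x.size( ); c++ )
            y[r] += W[r][c] * x[c];
    }
    return y;
}

// Merges two stacked linear layers into one equivalent layer:
// W2 * ( W1 * x + b1 ) + b2 = ( W2 * W1 ) * x + ( W2 * b1 + b2 ).
void MergeLayers( const Matrix& W1, const Vector& b1,
                  const Matrix& W2, const Vector& b2,
                  Matrix& W, Vector& b )
{
    assert( !W1.empty( ) );
    const std::size_t inputs = W1[0].size( );

    W.assign( W2.size( ), Vector( inputs, 0.0f ) );
    for ( std::size_t r = 0; r < W2.size( ); r++ )
        for ( std::size_t k = 0; k < W1.size( ); k++ )
            for ( std::size_t c = 0; c < inputs; c++ )
                W[r][c] += W2[r][k] * W1[k][c];

    b = Affine( W2, b2, b1 );   // W2 * b1 + b2
}
```

For any weights, the merged layer gives the same output as the two stacked layers, so the stack can only draw a single straight decision line, and XOR is not linearly separable.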

Below is the sample output of this application, which shows the classification result before and after training, as well as the cost function's value decreasing over time.

    XOR example with Fully Connected ANN

    Network output before training:
    { -1.00 -1.00 } -> { 0.54 }
    { 1.00 -1.00 } -> { 0.47 }
    { -1.00 1.00 } -> { 0.53 }
    { 1.00 1.00 } -> { 0.46 }

    Cost of each sample:
    0.6262 0.5716 0.4806 1.0270 0.8960 0.8489 0.7270 0.9774
    ...
    0.0260 0.0164 0.0251 0.0161 0.0198 0.0199 0.0191 0.0152

    Network output after training:
    { -1.00 -1.00 } -> { 0.02 }
    { 1.00 -1.00 } -> { 0.98 }
    { -1.00 1.00 } -> { 0.98 }
    { 1.00 1.00 } -> { 0.01 }
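
To relate the cost values in the log to the network outputs, here is a minimal sketch of the textbook binary cross entropy for a single sample. Whether `XBinaryCrossEntropyCost` uses exactly this formula, including the clamping, is an assumption; `BinaryCrossEntropy` is a hypothetical standalone function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Binary cross entropy of one sample: -( t * ln( y ) + ( 1 - t ) * ln( 1 - y ) ).
// The output is clamped away from 0 and 1 to avoid taking the log of zero.
float BinaryCrossEntropy( float output, float target )
{
    assert( target >= 0.0f && target <= 1.0f );

    const float epsilon = 1e-7f;
    float y = std::min( std::max( output, epsilon ), 1.0f - epsilon );

    return -( target * std::log( y ) + ( 1.0f - target ) * std::log( 1.0f - y ) );
}
```

Under this formula an undecided output of 0.5 costs ln 2 ≈ 0.69 whatever the target is, while an output of 0.98 for a target of 1 costs about 0.02. Both are of the same order as the values at the start and at the end of the log above.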

### Iris flower multiclass classification

Another example application performs classification of Iris flowers, which is a very common data set for testing the performance of different classification algorithms. The data set contains 150 samples belonging to 3 classes (50 samples per class). Each Iris flower is described with 4 features: the length and the width of its sepals and petals. As a result, the neural network has 4 inputs and 3 outputs, one per class. As we saw above, the XOR example used only a single output, since we had only two classes, and so it was possible to encode those classes as 0 and 1. But with 3 classes or more we need to use so-called One Hot Encoding, where each class is encoded as a vector of zeros with only a single element set to **1**, at the index corresponding to the class number. So, for the Iris flower classification, target outputs of the neural network will look like this: {1, 0, 0}, {0, 1, 0} and {0, 0, 1}. Once training is complete and a new sample is provided to the network, its class is determined by the index of the output neuron which produced the largest value.
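
Both directions of the encoding fit in a few lines. The sketch below is standalone code with hypothetical helper names (`OneHotEncode`, `DecodeLabel`), not the helpers the library itself provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Encodes a class label as a vector of zeros with a single 1 at index `label`.
std::vector<float> OneHotEncode( std::size_t label, std::size_t classesCount )
{
    assert( label < classesCount );

    std::vector<float> encoded( classesCount, 0.0f );
    encoded[label] = 1.0f;
    return encoded;
}

// Decodes network outputs into a class label: the index of the largest output.
std::size_t DecodeLabel( const std::vector<float>& outputs )
{
    assert( !outputs.empty( ) );

    return static_cast<std::size_t>(
        std::distance( outputs.begin( ),
                       std::max_element( outputs.begin( ), outputs.end( ) ) ) );
}
```

For example, `OneHotEncode( 1, 3 )` gives {0, 1, 0}, and network outputs of {0.10, 0.25, 0.90} decode to class 2.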

This example uses a special helper class, which encapsulates the entire training loop, making the neural network training code even shorter.

    // 4 inputs (features) -> 10 (tanh) -> 10 (tanh) -> 3 outputs (sigmoid), one per class
    shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

    net->AddLayer( make_shared<XFullyConnectedLayer>( 4, 10 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 10 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 10, 3 ) );
    net->AddLayer( make_shared<XSigmoidActivation>( ) );

    shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
                                               make_shared<XNesterovMomentumOptimizer>( 0.01f ),
                                               make_shared<XCrossEntropyCost>( ) );

    // the helper runs the training loop: 40 epochs with batch size of 10
    XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );

    trainingHelper.SetTestSamples( testAttributes, encodedTestLabels, testLabels );
    trainingHelper.RunTraining( 40, 10, trainAttributes, encodedTrainLabels, trainLabels );

The nice thing about the helper class is that it runs not only the training phase, but also the validation and test phases if the corresponding data sets are provided. It also provides a useful progress log showing the current training and validation accuracy, time taken, etc.

    Iris classification example with Fully Connected ANN

    Loaded 150 data samples

    Using 120 samples for training and 30 samples for test

    Learning rate: 0.0100, Epochs: 40, Batch Size: 10

    Before training: accuracy = 33.33% (40/120), cost = 0.5627, 0.000s

    Epoch 1 : [==================================================] 0.005s
    Training accuracy = 33.33% (40/120), cost = 0.3154, 0.000s
    Epoch 2 : [==================================================] 0.003s
    Training accuracy = 86.67% (104/120), cost = 0.1649, 0.000s
    ...
    Epoch 40 : [==================================================] 0.006s
    Training accuracy = 93.33% (112/120), cost = 0.0064, 0.000s

    Test accuracy = 96.67% (29/30), cost = 0.0064, 0.000s

    Total time taken : 0s (0.00min)

### MNIST handwritten digits classification

Finally, the last example of a feed forward fully connected artificial neural network is classification of MNIST handwritten digits (the data set needs to be downloaded separately). This example is not much different from the Iris flower classification example above: just a bigger neural network and a much larger training set, and as a result it takes more time to train.

    // image pixels -> 300 (tanh) -> 100 (tanh) -> 10 outputs (softmax), one per digit
    shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

    net->AddLayer( make_shared<XFullyConnectedLayer>( trainImages[0].size( ), 300 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 300, 100 ) );
    net->AddLayer( make_shared<XTanhActivation>( ) );
    net->AddLayer( make_shared<XFullyConnectedLayer>( 100, 10 ) );
    net->AddLayer( make_shared<XSoftMaxActivation>( ) );

    shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
                                               make_shared<XAdamOptimizer>( 0.001f ),
                                               make_shared<XCrossEntropyCost>( ) );

    // the helper runs the training loop: 20 epochs with batch size of 50
    XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );

    trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
    trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );
    trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );

For this example, we've used a 3-layer neural network: 300 neurons in the first hidden layer, 100 neurons in the second and 10 neurons in the output layer. Although the neural network has quite a simple architecture, it manages to achieve more than 96% accuracy on the test data set (the one not used for training). In the upcoming article about convolutional networks we'll get that number to around the 99% level.
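
The output layer in the snippet above is followed by a softmax activation, which turns the 10 raw outputs into values that can be read as class probabilities. Below is a minimal standalone sketch of the standard softmax function. It is not the code of `XSoftMaxActivation`, and subtracting the maximum is a common numerical safeguard assumed here, not something confirmed for the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax: turns raw layer outputs into positive values which sum to 1.
// The largest input is subtracted first to keep exp() from overflowing.
std::vector<float> SoftMax( const std::vector<float>& inputs )
{
    assert( !inputs.empty( ) );

    const float maxValue = *std::max_element( inputs.begin( ), inputs.end( ) );
    std::vector<float> outputs( inputs.size( ) );
    float sum = 0.0f;

    for ( std::size_t i = 0; i < inputs.size( ); i++ )
    {
        outputs[i] = std::exp( inputs[i] - maxValue );
        sum += outputs[i];
    }
    for ( float& value : outputs )
        value /= sum;

    return outputs;
}
```

Softmax preserves the ordering of its inputs, so the predicted digit is still the index of the largest output.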

    MNIST handwritten digits classification example with Fully Connected ANN

    Loaded 60000 training data samples
    Loaded 10000 test data samples

    Samples usage: training = 50000, validation = 10000, test = 10000

    Learning rate: 0.0010, Epochs: 20, Batch Size: 50

    Before training: accuracy = 10.17% (5087/50000), cost = 2.4892, 2.377s

    Epoch 1 : [==================================================] 59.215s
    Training accuracy = 92.83% (46414/50000), cost = 0.2349, 3.654s
    Validation accuracy = 93.15% (9315/10000), cost = 0.2283, 0.636s
    Epoch 2 : [==================================================] 61.675s
    Training accuracy = 94.92% (47459/50000), cost = 0.1619, 2.685s
    Validation accuracy = 94.91% (9491/10000), cost = 0.1693, 0.622s
    ...
    Epoch 19 : [==================================================] 59.822s
    Training accuracy = 96.81% (48404/50000), cost = 0.0978, 2.976s
    Validation accuracy = 95.88% (9588/10000), cost = 0.1491, 0.527s
    Epoch 20 : [==================================================] 87.108s
    Training accuracy = 97.77% (48883/50000), cost = 0.0688, 2.823s
    Validation accuracy = 96.60% (9660/10000), cost = 0.1242, 0.658s

    Test accuracy = 96.55% (9655/10000), cost = 0.1146, 0.762s

    Total time taken : 1067s (17.78min)

## Conclusion

That is all for now about feed forward fully connected artificial neural networks and their implementation in the ANNT library. As already mentioned, the library is going to evolve further. New articles will follow, describing convolutional and recurrent artificial neural networks. New samples will be provided for each of the architectures. Some will be completely new, while others will solve exactly the same task as before, MNIST digits classification for example, so that the performance of different neural networks can be compared.

At this point the library uses the CPU only; there is no GPU support. However, it does exploit SIMD instructions for vectorization and OpenMP for parallelism. GPU support, along with many other things, is on the list of features to develop, which will hopefully get implemented at some point.

If someone wants to keep an eye on the progress of the ANNT library or dig through more code than is provided with the article, the project can be found on GitHub, where it has already evolved beyond feed forward fully connected ANNs.

## Links

- Biological neuron
- Neuron and synapses
- Artificial neuron
- Artificial neural network
- XOR Problem in Neural Networks
- Linear separability
- Activation functions
- Understanding Activation Functions in Neural Networks
- Mean squared error
- Gradient descent
- Stochastic gradient descent
- A Gentle Introduction to Mini-Batch Gradient Descent
- Multivariable chain rule, simple version
- Backpropagation
- An overview of gradient descent optimization algorithms
- One Hot Encoding
- Iris flower data set
- MNIST database of handwritten digits