
Machine / Deep Learning - How do artificial neural networks learn?

An important characteristic of systems with artificial intelligence is the ability to learn independently. Unlike traditional software, which processes problems and questions on the basis of previously defined rules, machine learning algorithms can discover the best rules for solving certain tasks on their own. This article uses the example of artificial neural networks to explain what “learning” means in this context and how it works.

An artificial neural network consists of many individual neurons, which are usually arranged in several interconnected layers. The number of layers determines, among other things, the degree of complexity that an artificial neural network can map. Many layers make a neural network “deep” - this is why we also speak of deep learning as a subcategory of machine learning.

At a glance: This is how artificial neural networks learn

The learning then works as follows: After the network structure has been set up, each neuron receives a random (!) initial weight. Then the input data is fed into the network, and each neuron weights the input signals with its weight and forwards the result to the neurons of the next layer. The total result is then calculated at the output layer - and it will usually have little to do with the known actual result, since all neurons started with random weights. However, one can calculate the size of the error, and how much each neuron contributed to it, and then change the weight of each neuron a tiny bit in the direction that minimizes the error. Then the next run takes place: a new measurement of the error, another adjustment of the weights, and so on. In this way, a neural network “learns” increasingly well to infer the known output data from the input data.
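The loop described above can be sketched in a few lines of Python. This is a minimal illustration with NumPy rather than any particular framework; the tiny XOR data set, the network size and the learning rate are chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = x1 XOR x2, a classic non-linear example
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Every weight starts as a random (!) initial value
W1 = rng.uniform(-1, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.uniform(-1, 1, (4, 1)); b2 = np.zeros(1)

lr = 0.5          # learning rate
errors = []
for epoch in range(2000):
    # forward pass: weight, sum, activate, layer by layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # measure the error against the known results
    errors.append(((out - y) ** 2).mean())
    # backward pass: gradients via the chain rule
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    # nudge every weight a tiny bit against its gradient
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(errors[0], errors[-1])  # the error shrinks over the epochs
```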

The following passages describe the individual phases and elements of this learning process in detail.

The Forward Pass

The input data is fed in on one side of the neural network. Each input signal is distributed to every neuron of the first layer. Each neuron weights the incoming signals with input-specific weights (which were assigned randomly at the beginning!), adds a neuron-specific bias term, and sums up all input data weighted in this way to form the output of this one neuron.

Often the output is additionally passed through a non-linear activation function, e.g. to force a certain value range of the output (e.g. sigmoid, tanh, etc.). The output of each neuron is then passed on as input to all neurons of the following layer. This process continues until the output layer is reached, which provides the result of all calculations.
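The forward pass through a single layer can be sketched as follows (a minimal NumPy illustration; the input values and layer size are made up for the example):

```python
import numpy as np

def layer_forward(x, W, b):
    """One layer: each neuron weights its inputs, adds its bias term,
    and squashes the sum with a non-linear activation (here: tanh)."""
    z = x @ W + b        # weighted sums plus bias, one per neuron
    return np.tanh(z)    # tanh confines each output to (-1, 1)

rng = np.random.default_rng(42)
x = np.array([0.5, -1.0, 2.0])      # three input signals
W = rng.uniform(-1, 1, (3, 4))      # random initial weights, 4 neurons
b = np.zeros(4)                     # neuron-specific bias terms
out = layer_forward(x, W, b)
print(out.shape)  # (4,) - one output per neuron, fed to the next layer
```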

The measurement of the error

So far, the artificial neural network has not learned anything. Since all weights are chosen randomly within a given range of values during the initialization of a neural network (e.g. between -1 and 1), the result will be a purely random value. The currently most popular variant of training a network is so-called supervised learning, which means learning from examples.

An example in this case means a combination of real input-output data pairs. These examples are used during the training of artificial neural networks to set all weights and bias terms optimally, so that at the end of the training the network calculates the correct result for all input data - and also for new input data it has not yet seen.

And this is how it works:

For a set of input data (also called features), the untrained neural network calculates a result in each case. This result is then compared with the known results of the example data set (also called targets or labels), and the size of the deviation or error is calculated. In order to treat positive and negative deviations equally, the mean of the squared deviations (mean squared error, MSE) or another error function is used, for example.
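The mean squared error described above is only a few lines of code (the prediction and target values here are invented for the illustration):

```python
import numpy as np

def mean_squared_error(predictions, targets):
    # Squaring makes positive and negative deviations count equally
    return ((predictions - targets) ** 2).mean()

preds   = np.array([0.8, 0.1, 0.6])   # what the network currently outputs
targets = np.array([1.0, 0.0, 1.0])   # the known labels
print(mean_squared_error(preds, targets))  # (0.04 + 0.01 + 0.16) / 3 = 0.07
```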

The Backward Pass

Then the actual “learning” begins: The measured error is fed backwards through the artificial neural network (the so-called backward pass or backpropagation), and each weight and bias term is adjusted a little bit in the direction that makes the error smaller. The size of this adjustment is calculated firstly via the share that a particular neuron had in the result (i.e. via its current weight), and secondly via the so-called learning rate, which is one of the most important setting variables (hyperparameters) of neural networks.

Common learning rates are e.g. 0.001 or 0.01, i.e. only one hundredth to one thousandth of the calculated error is corrected per run. If the adjustment per run is too large, it can happen that the minimum of the error curve is missed and the deviations become larger and larger instead of smaller (overshooting). Sometimes the learning rate is therefore gradually reduced during the training (learning rate decay) to pin down the minimum of the error function more precisely.
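The effect of the learning rate, including overshooting, can be seen on the simplest possible error function, f(w) = w² (a toy example, not a real network):

```python
def gradient_step(w, grad, lr):
    # one update: move a small step against the gradient, scaled by lr
    return w - lr * grad

# Minimise f(w) = w**2 (gradient: 2*w), starting from w = 1.0
results = {}
for lr in (0.01, 1.1):       # a cautious rate vs. one that overshoots
    w = 1.0
    for _ in range(50):
        w = gradient_step(w, 2 * w, lr)
    results[lr] = w

print(results)  # lr=0.01 creeps toward the minimum at 0; lr=1.1 diverges
```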

Another possible problem are error functions with local minima, in which the neural network “gets stuck” and therefore cannot find the actual minimum. The direction of the correction is calculated by taking the derivative of the respective functions; the negative of this gradient points in the direction in which the error function decreases. The aim of training or learning is to minimize the selected error function.

Epochs and batches

After all the weights have been adjusted, all input data is run through again and the error is measured again, followed by the backward propagation of this error in order to adjust the weights once more. A complete run through all input data is called an epoch. The number of training epochs is another important hyperparameter for training neural networks. Depending on the size of the data set, the input data can also be divided into groups of equal size (batches) and the training carried out per batch.

This can be useful, for example, to let an artificial neural network learn more quickly or to stay within the limits of the computing capacity of the executing machine. When dividing into batches, it is important that the distribution of values within each batch is representative of the entire data set. When all batches have run through the neural network once, an epoch is complete.
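Dividing the data into batches is usually combined with shuffling, so that each batch is a representative sample of the whole data set. A minimal sketch (the data here is just a numbered toy set):

```python
import numpy as np

def make_batches(X, y, batch_size, rng):
    # Shuffle first, so every batch mirrors the full data distribution
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2).astype(float)  # 10 toy samples
y = np.arange(10)                               # 10 toy labels
batches = list(make_batches(X, y, batch_size=4, rng=rng))
print(len(batches))  # 3 batches (sizes 4, 4 and 2) -> one full epoch
```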

Convergence, overfitting and hyperparameter tuning

The more examples an artificial neural network receives for training, and the more often it has seen them, the smaller the error in the results becomes. The flattening of the error curve as the accuracy approaches the 100% line is called convergence. During the training, the course of the error curve is observed so that the training can be stopped if necessary and the hyperparameters adjusted. However, a small error does not always mean good general performance of the neural network.

Because if it has seen all known data very often during the training, it can happen that the artificial neural network learns this data by heart rather than learning an abstract concept. This problem is known as overfitting. Since neural networks can map highly complex functions, there is a risk that at some point they will have found the perfect function for every known data point - but this function will not work well for new data.

In order to ensure that a neural network can abstract from the known sample data and also provide correct results for input data it has never seen, the sample data is divided into training data, test data and blind-test data before training, e.g. in the ratio 70/20/10.
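The 70/20/10 split can be done in a few lines (a simple sketch; the data set here is just 100 numbered samples, and the split is preceded by a shuffle):

```python
import numpy as np

def train_test_blind_split(X, rng, ratios=(0.7, 0.2, 0.1)):
    idx = rng.permutation(len(X))          # shuffle before splitting
    n_train = int(len(X) * ratios[0])
    n_test  = int(len(X) * ratios[1])
    return (X[idx[:n_train]],                       # training data
            X[idx[n_train:n_train + n_test]],       # test data
            X[idx[n_train + n_test:]])              # blind-test data

rng = np.random.default_rng(1)
X = np.arange(100)
train, test, blind = train_test_blind_split(X, rng)
print(len(train), len(test), len(blind))  # 70 20 10
```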

During the training, only the training data is used, and the error rate is measured for both the training data and the test data. However, the measured error on the test data is not fed back into the artificial neural network. The neural network is then improved by adapting all variables in such a way that it achieves maximum performance on training and test data (hyperparameter tuning). This “tuning” of neural networks is one of the core activities of artificial intelligence engineers.

The blind-test data is only used when the network is supposedly fully trained. If the artificial neural network also does well in the blind test, the likelihood is good that it has learned an abstract concept.

Dead neurons and dropout layers

Another method to avoid overfitting, i.e. memorizing data, and also to avoid inactive or "dead" neurons, whose weights remain permanently at zero during training, are the so-called dropout layers.

The concept of dropout layers seems radical at first glance: During training, 50% or more of the neurons of a layer are simply switched off. The shutdown happens at random in each run. All neurons of the layer are thus forced to learn less specialized, more abstract concepts in order to perform well despite the reduced number.

After the training is complete, the dropout mechanism is deactivated for regular operation, i.e. all neurons trained via dropout are then available for the calculation. As a result, the results are usually significantly better, and the neural network is often more robust with regard to new, unknown data.
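A minimal sketch of the dropout mechanism, using the common "inverted dropout" variant in which the surviving activations are scaled up so their expected value stays the same (the activations here are dummy values):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    if not training:
        return activations   # regular operation: all neurons participate
    # Randomly switch off a `rate` fraction of the neurons in this run;
    # scale the survivors up so the expected activation is unchanged.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(10)                                   # dummy layer activations
dropped = dropout(h, rate=0.5, rng=rng)           # training: some zeroed
kept = dropout(h, rate=0.5, rng=rng, training=False)  # inference: unchanged
print(dropped)
print(kept)
```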

Save training results in checkpoints and fine-tune them

When the artificial neural network is fully trained, all weights and bias terms are saved as a so-called checkpoint. The neural network can then be restarted at any time with these optimized weights and bias values. While the training is often very computationally intensive and usually requires a GPU or a GPU cluster, especially for images as input data, the actual operation of a trained neural network is significantly leaner and faster and can run almost in real time, e.g. on mobile devices or normal laptops/PCs.
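In its simplest form, a checkpoint is just the collection of all weights and bias terms written to disk. A hypothetical minimal sketch using NumPy's archive format (real frameworks use their own checkpoint formats):

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
weights = {"W1": rng.uniform(-1, 1, (2, 4)), "b1": np.zeros(4)}

# Save all weights and bias terms to a single "checkpoint" file ...
path = os.path.join(tempfile.mkdtemp(), "checkpoint.npz")
np.savez(path, **weights)

# ... and load them again later to resume training or run inference
ckpt = np.load(path)
restored = {name: ckpt[name] for name in ckpt.files}
print(sorted(restored))  # ['W1', 'b1']
```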

With the help of the checkpoint, a fully trained artificial neural network can also be retrained with new data at any time. To do this, the existing values from the checkpoint are first loaded into the network, and the new data is then used for training. It is also possible to use a pre-trained neural network as the basis for training with your own data, e.g. by replacing the last layer of a network trained on a large number of images with a new layer of your own, e.g. for classifying your own objects.

With this so-called fine-tuning, artificial neural networks can reuse general structures that have already been learned, and the network only has to learn the new classes. This is a great advantage especially when processing images, which is very computationally intensive, but also when working with speech and text. Fine-tuning requires significantly less time and computing power than training from scratch. Appropriately pre-trained artificial neural networks are offered by Google, for example, and almost all individually adaptable API services from IBM, Microsoft, Amazon, Google and Co. are based on this method.
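The core idea of fine-tuning can be sketched in miniature: the pre-trained layers are frozen and only a freshly added last layer is trained. Everything here is hypothetical toy data; `W_pre` simply stands in for weights loaded from a checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend W_pre comes from the checkpoint of a pre-trained network;
# it stays frozen, and only the new last layer W_new is trained.
W_pre = rng.uniform(-1, 1, (3, 8))   # "general structures already learned"
W_new = rng.uniform(-1, 1, (8, 1))   # fresh layer for our own classes

# A made-up toy task standing in for "your own data"
X = rng.random((16, 3))
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)

lr, errors = 0.5, []
for _ in range(300):
    h = np.tanh(X @ W_pre)                  # frozen feature extractor
    out = 1 / (1 + np.exp(-(h @ W_new)))    # trainable new head
    errors.append(((out - y) ** 2).mean())
    grad = h.T @ ((out - y) * out * (1 - out)) / len(X)
    W_new -= lr * grad                      # only the new layer is updated

print(errors[0], errors[-1])  # error shrinks although W_pre never changes
```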

Training data for supervised learning

A large amount of sample data is accordingly required for the supervised learning described here. A large amount here means, for example, a million examples. Artificial neural networks can in some cases achieve remarkable results even with smaller data sets, but the more data is available, the better. For the classification of images, for example, usable results are achieved from approx. 1,000 sample images per class. A whole line of research in artificial intelligence is also concerned with methods for so-called one-shot learning, i.e. learning from very few or just one example.

Supervised learning itself can be further subdivided into different methods of data use and transfer within artificial neural networks: In so-called recurrent neural networks, for example, the result of the previous input flows into the calculation of the current output, so that time series and texts can be analyzed and processed, e.g. with long short-term memory networks (LSTM) and sequence-to-sequence networks, which are used for speech recognition and for the translation of texts. For image processing, so-called convolutional neural networks (CNN) are used, in which images are scanned with a grid and abstracted layer by layer from lower-level features (points, lines etc.) to higher-level concepts (a face, a house etc.).
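The grid scanning in a convolutional layer can be illustrated with a single hand-written filter (a simplified sketch: real CNNs learn their filters, and the tiny image and edge kernel here are made up):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image like a grid and record, at each
    position, how strongly the image patch matches the kernel."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.zeros((5, 5)); image[:, 2] = 1.0   # a vertical line
edge = np.array([[-1., 0., 1.]] * 3)          # responds to vertical edges
print(conv2d(image, edge))  # strong responses on both sides of the line
```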

Unsupervised and reinforcement learning

Other methods by which artificial neural networks can learn are unsupervised learning, in which the systems only receive input data and try to classify it sensibly, and reinforcement learning, in which a neural network can control the input data itself (e.g. the keys of a gaming controller) and receives dynamic output data back, together with a task related to this output data (e.g. to maximize a score).


In summary, it can be said that the learning from examples described here (supervised learning), in combination with deep artificial neural networks (deep learning), is currently the basis for the majority of productively used artificial intelligence applications.

In order to remedy the partial lack of available sample data, large companies sometimes offer services free of charge or very cheaply, simply to gain access to the usage data those services generate. Another current trend is the generation of synthetic data and the use of simulations in training: robot arms, for example, can be trained in a virtual environment, which drastically reduces the time and cost involved in training. The systems trained in this way can then apply their experience from the artificial world and move real robot arms precisely.

by Roland Becker