Why do LSTM networks converge

Long Short-Term Memory Networks in Return Forecasting. To what extent can the results of Fischer / Krauss (2017) be replicated and understood?

Table of Contents

List of figures

List of tables


Neural networks in general


Activation of units

Training the neural network

Recurrent networks

Back propagation through time

Long Short-Term Memory Networks (LSTM)

Adam and RMSProp

Analysis of the paper to be replicated

Structure of the empirical work



Empirical evaluation



Summary and Outlook


List of figures

Figure 1 Idealized representation of a neuron (source: Füser (1995), p.26)

Figure 2 Illustration of backpropagation (Source: Cf. G. Ray, F. Beck (2018) www.neuronalesnetz.de, 2018, [April 20, 2020])

Figure 3 Illustration of the gradient descent method (Source: Cf. G. Ray, F. Beck (2018))

Figure 4 Illustration of the RNN (source: Olah (2015))

Figure 5 Unfolding of an RNN over time (source: Haselhuhn (2018), p. 7)

Figure 6 Description of the LSTM cell - input (source: Olah (2015))

Figure 7 Representation of the LSTM cell output (Source: Olah (2015))

Figure 8 Results of the paper to be analyzed (Source: Fischer / Krauss (2017), page 14)

Figure 9 Overview of the results of the paper to be analyzed - annual returns (source: Fischer / Krauss (2017), page 19)

Figure 10 DM test, comparison of the forecasts of the LSTM network with the forecasts of the logistic regression

List of tables

Table 1 Comparison of the results from the LSTM network and the logistic regression (source: own illustration)

Table 2 Statistical comparison of the results from the LSTM model and the logistic regression (source: own illustration)

Table 3 DM test, comparison of the forecasts of the LSTM network with the forecasts of the logistic regression (source: own illustration)

Table 4 Maximum average return achieved by a Monkey Manager portfolio (source: own illustration)


Machine learning has been a widely discussed topic in finance and computer science since the 1990s. With the new wave of Big Data, this now mature field is returning to the focus of research. The reasons are obvious: greater computing capacity and larger amounts of publicly available capital market data have put machine learning to the test again. This has led to renewed prominence of one research area in particular, "deep learning". The paper by Fischer / Krauss (2017) on the topic of stock market forecasting using neural networks caused a sensation with its results. In their work, the authors use long short-term memory networks (LSTM networks) in particular to forecast stock market returns. These are among the most advanced methods in the field of machine learning. In the paper, the authors manage to beat the S&P 500 benchmark index by a clear margin using only past returns. These results conflict with the “efficient market hypothesis” developed by Eugene Fama, according to which no useful information about the future can be derived from past price data.1

In addition to replicating the paper, this work explains the most important development steps from simple neural networks to LSTM networks and works out their advantages and disadvantages. In the second part of the work, the results of the paper by Fischer / Krauss (2017) are critically assessed. The results of this study differ, in part significantly, from those of the Fischer / Krauss paper. In this work, the periods 1994, 2001, 2008 and 2015 are examined selectively. Only in 1994 and 2001 did the LSTM network achieve a significant outperformance of the S&P 500, with average daily returns of 0.022 and 0.0074, respectively. A generally valid predictive capability of the model, or a superiority over arbitrary benchmark models, cannot be shown.

The selected benchmark model, the logistic regression, delivers results in 1994 that are as good as those of the LSTM. Since the data and methods follow those of Fischer / Krauss (2017), it is important to find an explanation for the different results. The authors of this work see a plausible explanation in the concept of the “efficient market hypothesis” and the exploitation of market inefficiencies.

Neural networks in general

The principle of artificial neural networks is inspired by the neural connections in the human brain, which serve as an analogy for today's computing applications. However, this analogy has little in common with today's implementations. The first authors to deal with such applications were Warren McCulloch and Walter Pitts, who developed the first formal model of a neuron in 1943.2 Since then, neural networks have been widely discussed in science and have found acceptance in a wide variety of disciplines outside biology. Two of the largest areas of application are, on the one hand, the modeling of artificial neural networks in order to better understand human behavior and the functioning of the human brain and, on the other hand, the solution of specific application problems in statistics, economics, engineering and other fields.3 The following sections first explain the basics and the structure of a neural network in detail in order to create a basic understanding of LSTM networks and of the work of Fischer / Krauss (2017).


Neural networks consist of several so-called neurons or units. These neurons receive information from the environment or from other neurons as numerical values and forward it, in modified form, to connected neurons or to the environment. There are basically three types of neurons: input units, which receive signals from the outside world; hidden units, which are located between the input and output units and process the received information; and output units, which return the processed information to the outside world.4

Figure not included in this excerpt

Figure 1 Representation of a simple neural network. Dark gray: input unit, orange: hidden unit, light gray: output unit (source: own illustration)

Neurons arranged one above the other are called layers. For example, neurons can be arranged into a hidden layer or an output layer. A neural network usually has only one input layer and one output layer, but it can have any number of hidden layers. It should be noted, however, that in principle any problem that can be solved with several hidden layers can also be solved with just one hidden layer, provided this layer has a sufficiently large number of neurons.5

As LeCun et al. (2015) explain, neurons are connected to one another. The strength of a connection is described by a weight. The larger the absolute value of this weight, the greater the influence of one neuron on the other. A distinction must also be made between positive, negative and neutral (zero) weights, which model strengthening, weakening and absent influences respectively. The “knowledge” of a neural network is typically stored in its weights. The weights are "trained", that is, continuously adjusted, through the so-called learning process. How the learning process of a neural network works will be explained in more detail later.

Activation of units

As described earlier, the input a neuron receives from another neuron depends on two factors: the transmitted information (the output of the sending neuron) and the weight that connects the two neurons. The greater the output of the sending neuron and the higher the weight, the greater the influence on the receiving unit. If one of the two terms is zero, the influence is also zero. Formally, the input can be presented as follows:6

Figure not included in this excerpt
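
Because the formula itself is not reproduced in this excerpt, the following is a common textbook formulation that matches the description above. The symbols $a_j$ (output of the sending unit $j$), $w_{ij}$ (weight from unit $j$ to unit $i$) and $\mathrm{net}_i$ (net input of unit $i$) are notational assumptions and need not be those of the original figure:

```latex
\mathrm{input}_{ij} = a_j \, w_{ij},
\qquad
\mathrm{net}_i = \sum_{j} a_j \, w_{ij}
```

If either $a_j$ or $w_{ij}$ is zero, the product vanishes, which corresponds to the statement that the influence is then also zero.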

In addition to the two factors that are responsible for the formation of the input, an activation function is required in order to assign a certain activity level to the neuron. The activation function can have a scaling effect but also a limiting effect. In this way, different activations of the neuron can be modeled. This also makes it possible to replicate different models such as linear regression models, logistic regression models or other non-linear relationships. A wide variety of activation functions can be distinguished:

- Linear activation function: Here the relationship between network input and output is linear.
- Binary threshold function: There are only two states of the activity level (for example, 0 and 1).
- Sigmoid activation function: This type of activation function is used in most models that simulate cognitive processes. A distinction is made between the logistic function and the hyperbolic tangent (tanh) function. These functions behave as follows: if the net input is large and negative, the activity level is close to 0 (logistic function) or -1 (hyperbolic tangent function). When the net input increases, the activity level initially rises slowly (a kind of threshold). After that the slope becomes steeper and resembles a linear function. With a high net input, the value finally approaches 1 asymptotically.7

The bounded range of the sigmoid functions is a major benefit, because activity is limited to a specific interval. On the one hand, this ensures a higher biological plausibility and, on the other hand, it prevents ever-increasing activity, i.e. an over-activation of a neuron. Unbounded growth of this kind is related to the so-called "exploding gradient problem", which is particularly important for recurrent networks such as the LSTM. The other advantage, especially compared to a binary activation function, is that the function is differentiable. This is important for the learning process.8
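
The activation functions described above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names, the `slope` and `threshold` parameters and the sample net inputs are assumptions chosen for this example and do not come from the original work.

```python
import math

def linear(net, slope=1.0):
    """Linear activation: the output is proportional to the net input."""
    return slope * net

def binary_threshold(net, threshold=0.0):
    """Binary threshold: only two activity levels, 0 or 1."""
    return 1.0 if net >= threshold else 0.0

def logistic(net):
    """Logistic sigmoid: bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):
    """Hyperbolic tangent: bounded between -1 and 1."""
    return math.tanh(net)

def logistic_derivative(net):
    """Derivative of the logistic function, used by gradient-based learning."""
    y = logistic(net)
    return y * (1.0 - y)

# Compare the activity levels for a few illustrative net inputs.
for net in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"net={net:6.1f}  linear={linear(net):6.1f}  "
          f"binary={binary_threshold(net):.0f}  "
          f"logistic={logistic(net):.4f}  tanh={tanh(net):+.4f}")
```

The linear function grows without bound, whereas the two sigmoid functions stay inside their intervals and, unlike the binary threshold, have a usable derivative everywhere.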

Training the neural network

The constructed neural network can be thought of as a system of equations with several variables that together describe an output. These variables can be optimized against a given output so that they describe it as well as possible. In practice, it is usually the weights between the individual neurons that are modified. This adjustment of the weights constitutes the training process. A distinction can be made between supervised and unsupervised training. In supervised training, output values are specified; in the case of time series, these are mostly past, already available data. In unsupervised training, no output values are specified, and the weight changes depend on the similarity between the weights and the input stimuli.9

This work relies on supervised training by means of backpropagation. Here, the neural network is adapted to the desired output by adjusting the weights backwards through the network, so that the network learns step by step. The weights are changed until the input information produces the desired output. This continual adaptation over an ever-increasing amount of training is the so-called learning process.

Figure not included in this excerpt

Figure 3 Illustration of back propagation (source: Cf. Ray, Beck (2018) www.neuronalesnetz.de, [accessed on: April 20, 2020])

The error between the specified target output and the network output is referred to as "delta". Backpropagation reduces this error step by step, so that ideally it converges towards zero in the course of the training. A trained network should then also process new, previously unknown data correctly and produce a correct output. In this way, repetitive processes, work steps or calculations are automated.10
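
The idea of reducing the "delta" between target and network output step by step can be illustrated with a deliberately small sketch. It trains a single logistic unit rather than a multi-layer network, so only the output-layer step of backpropagation (often called the delta rule) appears. The toy data set (the logical OR function), the learning rate, the number of epochs and all function names are assumptions made for this illustration, not the setup used in this work.

```python
import math
import random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_single_unit(samples, learning_rate=0.5, epochs=2000, seed=0):
    """Train one logistic unit; return weights, bias and the error per epoch."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for inputs, target in samples:
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = logistic(net)
            delta = target - output          # error between target and network output
            total_error += 0.5 * delta ** 2
            # Negative gradient of the squared error with respect to the
            # net input (chain rule through the logistic function).
            step = delta * output * (1.0 - output)
            weights = [w + learning_rate * step * x for w, x in zip(weights, inputs)]
            bias += learning_rate * step
        history.append(total_error)
    return weights, bias, history

def predict(weights, bias, inputs):
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(net)

# Hypothetical toy data: the logical OR function.
OR_SAMPLES = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
              ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

weights, bias, history = train_single_unit(OR_SAMPLES)
print(f"error in first epoch: {history[0]:.4f}, in last epoch: {history[-1]:.4f}")
for inputs, target in OR_SAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

In a network with hidden layers, the same error signal would additionally be passed backwards through the hidden units, which is where the name backpropagation comes from.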

The time-consuming calculations of backpropagation are carried out using the gradient descent method.11 This approach was proposed by Rumelhart et al. (1986) and starts with a randomly chosen combination of weights. For this combination, the gradient is determined and the weights are moved downhill by a specified step length (the learning rate). The gradient of a scalar field is a vector field that indicates, at every point, the direction of the greatest change and its rate.

Figure not included in this excerpt

Figure 4 Illustration of the gradient descent method (Source: Cf. G. Ray, F. Beck (2018))

Simply put, the gradient describes, for each location, how strongly a quantity changes and in which direction the change is greatest.12 The gradient is then determined again for the newly obtained weight combination, and the weights are modified again. This process is repeated until a local (or global) minimum is reached (see Figure 2).13 However, the way the gradient descent method works gives rise to some challenges. They stem from the fact that, in order to keep the computational effort low, only the local neighborhood is considered rather than the entire space. This leads to the following problems:14

- Local minima: The gradient descent method provides no knowledge of whether the minimum it has found is a local or the absolute (global) minimum. This problem becomes more severe with a higher dimension of the network (= number of connections between the neurons), because a higher dimension leads to a higher number of local minima. The figure shows a two-dimensional problem that needs to be solved. In a three-dimensional problem, the error function resembles a mountain range in which each valley represents a local minimum. The algorithm tends to get stuck in any valley without knowing whether it is the global or a local minimum.15
- Flat plateaus: Basically, the problem here is exactly the opposite. Instead of a strongly fissured surface, there are hardly any "mountains and valleys", but rather a relatively flat "plateau". This makes the gradient very small. The next "valley" is no longer reached because the algorithm cannot see in which direction it should move. As a result, the process stagnates.
- Leaving good minima: This problem can also be seen as a counterpart to the problem of local minima. Instead of never reaching a global minimum at all, the global minimum is "skipped" here. This mainly happens when there is a "deep valley" with a relatively small extent in the error surface. As a result, the gradient descent method only finds a local minimum.
- Oscillation: In the case of direct oscillation, the gradient descent method finds neither a global nor a local minimum. This happens when the method jumps from one "slope" of a valley to the opposite "slope" and from there back to the same place. In this case, the absolute values of the gradients are the same and only the signs change back and forth. The gradient descent method does not succeed in "pushing on into the depths" of the error surface.
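
The gradient descent procedure and two of the problems listed above can be reproduced with a minimal one-dimensional sketch. The error function `error(w) = w**4 - 3*w**2 + w`, the start values and the learning rates are hypothetical choices made only to produce two valleys and an oscillation; a real network has many weights instead of a single `w`.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient; return every visited position."""
    path = [start]
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        path.append(w)
    return path

# Hypothetical one-dimensional error function with two valleys.
def error(w):
    return w**4 - 3 * w**2 + w

def error_gradient(w):
    return 4 * w**3 - 6 * w + 1

# Local minima: which valley is reached depends on the starting weight.
left = gradient_descent(error_gradient, start=-2.0, learning_rate=0.01, steps=1000)[-1]
right = gradient_descent(error_gradient, start=2.0, learning_rate=0.01, steps=1000)[-1]
print(f"start -2.0 -> w={left:.4f}, error={error(left):.4f}")
print(f"start +2.0 -> w={right:.4f}, error={error(right):.4f}")

# Oscillation: for error(w) = w**2 the gradient is 2*w. With a learning
# rate of 1.0 every step lands on the opposite slope at the same height,
# so only the sign of the position (and of the gradient) changes.
oscillating = gradient_descent(lambda w: 2 * w, start=1.0, learning_rate=1.0, steps=4)
print(oscillating)
```

Starting on the right-hand side ends in the shallower valley, although a deeper one exists on the left, and the method itself gives no indication of this. In the second example the position alternates between 1.0 and -1.0 and never approaches the minimum at 0.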

Recurrent networks

Recurrent networks also offer the possibility of modeling non-linear relationships. They differ from classic neural networks in that feedback from neurons in one layer to neurons in another layer, the same layer or a previous layer is possible. This makes it possible to capture and represent temporal information in the data. This ability can be likened to a memory, which allows information from the past to flow into the present. Precisely this advantage allows a better forecasting ability for stock market data than classic neural networks, which treat all information independently of one another.

Figure not included in this excerpt

Figure 5 Unfolding of an RNN over time (source: Haselhuhn (2018), p. 7)

Figure 4 shows a simplified representation of an RNN neuron, which is unfolded over time on the right-hand side. Here O_{t-1} is the first output, which depends on X_{t-1} and S_{t-1}. In the following time step, O_t is determined, which now depends on X_t, S_t and the feedback from O_{t-1}. This allows time dependencies to be integrated.
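
The unfolding over time can be sketched with a recurrent network that consists of a single unit. The sketch assumes the common formulation in which the feedback is carried by the state S; the figure referred to above may route the feedback differently. The weights `w_x`, `w_s` and `w_o`, the tanh activation and the two input sequences are hypothetical values chosen for illustration.

```python
import math

def rnn_forward(inputs, w_x=0.8, w_s=0.5, w_o=1.0, s_init=0.0):
    """Unfold a one-unit recurrent network over a sequence of inputs X_t.

    Returns the list of outputs O_t; the state S_t carries information forward.
    """
    state = s_init
    outputs = []
    for x_t in inputs:
        # The new state S_t mixes the current input X_t with the previous state S_{t-1}.
        state = math.tanh(w_x * x_t + w_s * state)
        outputs.append(w_o * state)
    return outputs

# Two hypothetical return sequences that differ only in their first value.
sequence_a = [0.5, -0.2, 0.1]
sequence_b = [-0.5, -0.2, 0.1]

print(rnn_forward(sequence_a))
print(rnn_forward(sequence_b))

# Without the recurrent weight the last output ignores the history,
# which corresponds to a classic network treating each input independently.
print(rnn_forward(sequence_a, w_s=0.0)[-1] == rnn_forward(sequence_b, w_s=0.0)[-1])
```

Although both sequences end with the same two inputs, the final outputs differ as long as `w_s` is non-zero, because the first value is still present in the state.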


1 See Fama (1970), pp. 383-417

2 See: McCulloch, Pitts (1943), pp. 115-133

3 See Rey / Wender (2018), p.16

4 See Rattinghaus-Meyer (1993), p. 52

5 See Hornik et al. (1989), p. 363

6 See G. Ray, F. Beck (2018) www.neuronalesnetz.de, [accessed on: April 20, 2020]

7 See Hinton, G. E. (1992), p. 136

8 See chapter: Long Short Term Memory Networks (LSTM)

9 See Rumelhart et al. (1994), pp. 89ff.

10 See Zimmermann (1994), p. 37ff.

11 See Rojas (1993), pp. 200f.

12 See Papula (2014), p. 61

13 See Rumelhart et al. (1986), p. 318ff.

14 Cf. Ray, Beck (2018), Neural Networks, www.neuronalesnetz.de, [accessed on: April 20, 2020]

15 See Rojas (1993), p. 152f.
