Using Different Reservoir Drivers

While the original Echo State Network used the equations of Recurrent Neural Networks to obtain non-linearity in the reservoir, other variations have been proposed in recent years. More specifically, the additional drivers implemented in ReservoirComputing.jl are the multiple activation function RNN MRNN() and the Gated Recurrent Unit GRU(). To switch between them, it suffices to pass the chosen method to the ESN keyword argument reservoir_driver. This section gives a brief introduction to their equations and some examples of their usage.

Multiple Activation Function RNN

Based on the double activation function ESN (DAFESN) proposed in (Lun et al., 2015), the Multiple Activation Function ESN expands the idea and allows a custom number of activation functions to be used in the reservoir dynamics. This can be thought of as a linear combination of multiple activation functions with corresponding parameters.

\[\mathbf{x}(t+1) = (1-\alpha)\mathbf{x}(t) + \lambda_1 f_1(\mathbf{W}\mathbf{x}(t)+\mathbf{W}_{in}\mathbf{u}(t)) + \dots + \lambda_D f_D(\mathbf{W}\mathbf{x}(t)+\mathbf{W}_{in}\mathbf{u}(t))\]

where $D$ is the number of activation functions and respective parameters chosen.
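
As an illustration only, the update above can be written as a plain-Julia sketch (this is not the package's internal implementation):

#illustrative sketch of the MRNN state update
function mrnn_step(x, u, W, W_in, fs, lambdas, alpha)
    pre = W * x .+ W_in * u              #shared pre-activation term
    new_x = (1 - alpha) .* x             #leaky contribution
    for (f, lambda) in zip(fs, lambdas)  #linear combination of the D activations
        new_x = new_x .+ lambda .* f.(pre)
    end
    return new_x
end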

The method to call to use the multiple activation function ESN is MRNN(activation_function, leaky_coefficient, scaling_factor). The arguments can be used as both args and kwargs. activation_function and scaling_factor have to be vectors (or tuples) containing the chosen activation functions and respective scaling factors ($f_1,...,f_D$ and $\lambda_1,...,\lambda_D$ following the nomenclature introduced above). The leaky_coefficient represents $\alpha$ and it is a single value.
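
For instance, a driver combining two activation functions can be constructed in either form (an illustrative sketch; the same pattern appears in the test cases below):

#equivalent constructions of the same MRNN driver (illustrative)
driver = MRNN([tanh, identity], 0.85, [0.5, 0.3])      #positional arguments
driver = MRNN(; activation_function=[tanh, identity],
    leaky_coefficient=0.85,
    scaling_factor=[0.5, 0.3])                          #keyword arguments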

Starting with the example, the data used is generated from the following function, taken from the DAFESN paper (Lun et al., 2015).

u(t) = sin(t) + sin(0.51 * t) + sin(0.22 * t) + sin(0.1002 * t) + sin(0.05343 * t)
u (generic function with 1 method)

For this example, the prediction will be one step ahead. The metric used to assess the quality of the prediction will be the normalized root-mean-square deviation rmsd from StatsBase. Like in the other examples, the first step is to gather the data:

train_len = 3000
predict_len = 2000
shift = 1

data = u.(collect(0.0:0.01:500))
training_input = reduce(hcat, data[shift:(shift + train_len - 1)])
training_target = reduce(hcat, data[(shift + 1):(shift + train_len)])
testing_input = reduce(hcat,
    data[(shift + train_len):(shift + train_len + predict_len - 1)])
testing_target = reduce(hcat,
    data[(shift + train_len + 1):(shift + train_len + predict_len)])
1×2000 Matrix{Float64}:
 0.852897  0.850969  0.849127  0.847371  …  -1.4348  -1.4206  -1.40638

To follow the paper more closely, it is necessary to define a couple of activation functions. Their numbering follows the one in the paper. Of course, one can also use any custom-defined function, any function available in the base language, or any activation function from NNlib.

f2(x) = (1 - exp(-x)) / (2 * (1 + exp(-x)))
f3(x) = (2 / pi) * atan((pi / 2) * x)
f4(x) = x / sqrt(1 + x * x)
f4 (generic function with 1 method)

It is now possible to build the different drivers, using the parameters suggested by the paper. In this instance, the numbering follows the paper's test cases. Finally, a simple for loop compares the different drivers and activation functions.

using ReservoirComputing, Random, StatsBase

#fix seed for reproducibility
Random.seed!(42)

#baseline case with RNN() driver. Parameter given as args
base_case = RNN(tanh, 0.85)

#MRNN() test cases
#Parameter given as kwargs
case3 = MRNN(; activation_function=[tanh, f2],
    leaky_coefficient=0.85,
    scaling_factor=[0.5, 0.3])

#Parameter given as kwargs
case4 = MRNN(; activation_function=[tanh, f3],
    leaky_coefficient=0.9,
    scaling_factor=[0.45, 0.35])

#Parameter given as args
case5 = MRNN([tanh, f4], 0.9, [0.43, 0.13])

#tests
test_cases = [base_case, case3, case4, case5]
for case in test_cases
    esn = ESN(training_input, 1, 100;
        input_layer=weighted_init(; scaling=0.3),
        reservoir=rand_sparse(; radius=0.4),
        reservoir_driver=case,
        states_type=ExtendedStates())
    wout = train(esn, training_target, StandardRidge(10e-6))
    output = esn(Predictive(testing_input), wout)
    println(rmsd(testing_target, output; normalize=true))
end
1.1442590780710705e-5
9.128307322964925e-5
0.0001348192607379247
0.0001559992855139628

In this example, it is also possible to observe how parameters are passed to the methods RNN() and MRNN(), both as positional arguments and as keyword arguments.

Gated Recurrent Unit

Gated Recurrent Units (GRUs) (Cho et al., 2014) have been proposed in more recent years with the intent of limiting notable problems of RNNs, like the vanishing gradient. This change in the underlying equations can be easily transported into the Reservoir Computing paradigm by replacing the RNN equations in the reservoir with the GRU equations. This approach has been explored in (Wang et al., Jul 2020) and (Sarli et al., Aug 2020). Different variations of the GRU have also been proposed (Dey and Salem, Aug 2017); the subsections below go into detail about their governing equations and their implementation in ReservoirComputing.jl.

Like before, to access the GRU reservoir driver, it suffices to set the reservoir_driver keyword argument of ESN to GRU(). All the variations presented here can be selected through the variant keyword argument of GRU(), choosing between FullyGated() and Minimal(). Other variations are possible by modifying the inner layers and reservoirs. The default is the standard version FullyGated(). The first subsection goes into more detail about the defaults of the GRU() method, and the following ones refer back to it to minimize repetition.

Standard GRU

The equations for the standard GRU are as follows:

\[\mathbf{r}(t) = \sigma (\mathbf{W}^r_{\text{in}}\mathbf{u}(t)+\mathbf{W}^r\mathbf{x}(t-1)+\mathbf{b}_r) \\ \mathbf{z}(t) = \sigma (\mathbf{W}^z_{\text{in}}\mathbf{u}(t)+\mathbf{W}^z\mathbf{x}(t-1)+\mathbf{b}_z) \\ \tilde{\mathbf{x}}(t) = \text{tanh}(\mathbf{W}_{in}\mathbf{u}(t)+\mathbf{W}(\mathbf{r}(t) \odot \mathbf{x}(t-1))+\mathbf{b}) \\ \mathbf{x}(t) = \mathbf{z}(t) \odot \mathbf{x}(t-1)+(1-\mathbf{z}(t)) \odot \tilde{\mathbf{x}}(t)\]
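
For intuition, a single update step following these equations can be sketched in plain Julia (again, this is not the code used internally by the package):

#illustrative sketch of one standard GRU reservoir update
sigmoid(v) = 1 / (1 + exp(-v))
function gru_step(x, u, Wr_in, Wr, br, Wz_in, Wz, bz, W_in, W, b)
    r = sigmoid.(Wr_in * u .+ Wr * x .+ br)         #reset gate
    z = sigmoid.(Wz_in * u .+ Wz * x .+ bz)         #update gate
    x_tilde = tanh.(W_in * u .+ W * (r .* x) .+ b)  #candidate state
    return z .* x .+ (1 .- z) .* x_tilde            #new reservoir state
end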

Going over the GRU() keyword arguments, the following list explains how to feed the desired input to the model.

  • activation_function is a vector with default values [NNlib.sigmoid, NNlib.sigmoid, tanh]. This argument controls the activation functions of the GRU, going from top to bottom. Changing the first element corresponds to changing the activation function for $\mathbf{r}(t)$ and so on.
  • inner_layer is a vector with default value fill(DenseLayer(), 2). This keyword argument controls the $\mathbf{W}_{\text{in}}$s going from top to bottom like before.
  • reservoir is a vector with default value fill(RandSparseReservoir(), 2). Similarly to inner_layer, this keyword argument controls the reservoir matrix construction in a top to bottom order.
  • bias is again a vector with default value fill(DenseLayer(), 2). It is meant to control the $\mathbf{b}$s, going as usual from top to bottom.
  • variant controls the GRU variant. The default value is set to FullyGated().

It is important to notice that inner_layer and reservoir control every layer except $\mathbf{W}_{\text{in}}$, $\mathbf{W}$, and $\mathbf{b}$. These matrices are instead controlled by the input_layer, reservoir, and bias arguments of the ESN() call.
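
Putting these keyword arguments together, an explicit construction of a GRU driver might look like the following sketch (keyword names as listed above, initializers as used in the Examples section below; sigmoid is taken from NNlib):

using NNlib: sigmoid

#sketch: spelling out the GRU keyword arguments described above
gru_driver = GRU(; activation_function=[sigmoid, sigmoid, tanh],
    inner_layer=[scaled_rand, scaled_rand],
    reservoir=[rand_sparse, rand_sparse],
    variant=FullyGated())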

The following sections illustrate the variations of the GRU architecture and how to obtain them in ReservoirComputing.jl.

Type 1

The first variation of the GRU is dependent only on the previous hidden state and the bias:

\[\mathbf{r}(t) = \sigma (\mathbf{W}^r\mathbf{x}(t-1)+\mathbf{b}_r) \\ \mathbf{z}(t) = \sigma (\mathbf{W}^z\mathbf{x}(t-1)+\mathbf{b}_z) \\\]

To obtain this variation, it suffices to set inner_layer = fill(NullLayer(), 2) and leave variant = FullyGated().
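
As a sketch, using the settings just described:

#sketch: Type 1 GRU driver, gates depend only on the previous state and the bias
gru_type1 = GRU(; inner_layer=fill(NullLayer(), 2),
    variant=FullyGated())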

Type 2

The second variation only depends on the previous hidden state:

\[\mathbf{r}(t) = \sigma (\mathbf{W}^r\mathbf{x}(t-1)) \\ \mathbf{z}(t) = \sigma (\mathbf{W}^z\mathbf{x}(t-1)) \\\]

Similarly to before, to obtain this variation, it is only required to set inner_layer = fill(NullLayer(), 2) and bias = fill(NullLayer(), 2) while keeping variant = FullyGated().
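
Sketched out, this corresponds to:

#sketch: Type 2 GRU driver, gates depend only on the previous state
gru_type2 = GRU(; inner_layer=fill(NullLayer(), 2),
    bias=fill(NullLayer(), 2),
    variant=FullyGated())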

Type 3

The final variation, before the minimal one, depends only on the biases:

\[\mathbf{r}(t) = \sigma (\mathbf{b}_r) \\ \mathbf{z}(t) = \sigma (\mathbf{b}_z) \\\]

This means that it is only necessary to set inner_layer = fill(NullLayer(), 2) and reservoir = fill(NullReservoir(), 2) while keeping variant = FullyGated().
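
In code, this variant would be constructed as in the following sketch:

#sketch: Type 3 GRU driver, gates depend only on the biases
gru_type3 = GRU(; inner_layer=fill(NullLayer(), 2),
    reservoir=fill(NullReservoir(), 2),
    variant=FullyGated())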

Minimal

The minimal GRU variation merges two gates into one:

\[\mathbf{f}(t) = \sigma (\mathbf{W}^f_{\text{in}}\mathbf{u}(t)+\mathbf{W}^f\mathbf{x}(t-1)+\mathbf{b}_f) \\ \tilde{\mathbf{x}}(t) = \text{tanh}(\mathbf{W}_{in}\mathbf{u}(t)+\mathbf{W}(\mathbf{f}(t) \odot \mathbf{x}(t-1))+\mathbf{b}) \\ \mathbf{x}(t) = (1-\mathbf{f}(t)) \odot \mathbf{x}(t-1) + \mathbf{f}(t) \odot \tilde{\mathbf{x}}(t)\]

This variation can be obtained by setting variant=Minimal(). The inner_layer, reservoir, and bias kwargs this time are not vectors but single values, for example inner_layer = DenseLayer() or reservoir = SparseDenseReservoir().
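
A possible construction, using the layer types mentioned above as stand-ins (a sketch, not a prescribed configuration):

#sketch: minimal GRU driver, single (non-vector) arguments for the merged gate
gru_minimal = GRU(; inner_layer=DenseLayer(),
    reservoir=RandSparseReservoir(),
    bias=DenseLayer(),
    variant=Minimal())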

Examples

To showcase the use of the GRU() method, this section will only illustrate the standard FullyGated() version. The full script for this example with the data can be found here.

The data used for this example is the Santa Fe laser dataset (Hübner et al., 1989) retrieved from here. The data is split to account for a next step prediction.

using DelimitedFiles

data = reduce(hcat, readdlm("./data/santafe_laser.txt"))

train_len = 5000
predict_len = 2000

training_input = data[:, 1:train_len]
training_target = data[:, 2:(train_len + 1)]
testing_input = data[:, (train_len + 1):(train_len + predict_len)]
testing_target = data[:, (train_len + 2):(train_len + predict_len + 1)]
1×2000 Matrix{Float64}:
 49.0  107.0  126.0  67.0  31.0  22.0  …  9.0  7.0  6.0  6.0  17.0  95.0

The construction of the ESN proceeds as usual.

using ReservoirComputing, Random

res_size = 300
res_radius = 1.4

Random.seed!(42)
esn = ESN(training_input, 1, res_size;
    reservoir=rand_sparse(; radius=res_radius),
    reservoir_driver=GRU())
ESN(1 => 300)

The default inner reservoir and input layer for the GRU are the same as the defaults for the reservoir and input_layer of the ESN. The call can also be made explicit:

gru = GRU(; reservoir=[rand_sparse,
        rand_sparse],
    inner_layer=[scaled_rand, scaled_rand])
esn = ESN(training_input, 1, res_size;
    reservoir=rand_sparse(; radius=res_radius),
    reservoir_driver=gru)
ESN(1 => 300)

The training and prediction can proceed as usual:

training_method = StandardRidge(0.0)
output_layer = train(esn, training_target, training_method)
output = esn(Predictive(testing_input), output_layer)
1×2000 Matrix{Float64}:
 50.5302  108.952  127.601  66.6522  …  5.12525  7.94983  15.3569  101.566

The results can be plotted using Plots.jl:

using Plots

plot([testing_target' output']; label=["actual" "predicted"],
    plot_title="Santa Fe Laser",
    titlefontsize=20,
    legendfontsize=12,
    linewidth=2.5,
    xtickfontsize=12,
    ytickfontsize=12,
    size=(1080, 720))
[Plot: Santa Fe Laser, actual vs. predicted output]

It is interesting to compare the GRU-driven ESN with the standard RNN-driven ESN. Using the same parameters defined before, this can be done as follows:

using StatsBase

esn_rnn = ESN(training_input, 1, res_size;
    reservoir=rand_sparse(; radius=res_radius),
    reservoir_driver=RNN())

output_layer = train(esn_rnn, training_target, training_method)
output_rnn = esn_rnn(Predictive(testing_input), output_layer)

println(msd(testing_target, output))
println(msd(testing_target, output_rnn))
11.971200007092337
6.816207887932469

References