🌌 Vanishing Gradients with RNNs

  • An RNN that processes a sequence of 10,000 time steps behaves, once unrolled, like a network with 10,000 layers, which is very hard to optimize 🙄

  • As in deep feedforward networks, the deeper the network, the more it suffers from the vanishing gradient problem.

  • The same happens in RNNs that process long sequences.
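To make the effect concrete, here is a minimal sketch using a hypothetical scalar RNN $a_t = \tanh(w \cdot a_{t-1})$. The weight `w`, the starting value `a0`, and the function name are illustrative assumptions, not part of these notes. It multiplies the per-step derivatives together, as backpropagation through time does, and shows how the product shrinks as the sequence gets longer.

```python
import math

def gradient_through_time(w, a0, steps):
    """Return |d a_T / d a_0| for the toy scalar RNN a_t = tanh(w * a_{t-1})."""
    a, grad = a0, 1.0
    for _ in range(steps):
        a = math.tanh(w * a)
        # Chain rule: d tanh(w * a_prev) / d a_prev = w * (1 - tanh(w * a_prev)^2)
        grad *= w * (1.0 - a * a)
    return abs(grad)

short_grad = gradient_through_time(w=0.9, a0=0.5, steps=10)
long_grad = gradient_through_time(w=0.9, a0=0.5, steps=1000)
print(short_grad, long_grad)  # the second value is many orders of magnitude smaller
```

Every factor in the product is below 0.9 here, so the gradient with respect to the early time steps decays at least geometrically with the sequence length.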

🧙‍♀️ Solutions

  • GRU: Gated Recurrent Unit

  • LSTM: Long Short-Term Memory

🚪 Gated Recurrent Unit (GRU)

GRUs are an improved version of the standard recurrent neural network ✨. A GRU uses an update gate and a reset gate.

  • Basically, these are two vectors which decide what information should be passed to the output.

  • The special thing about them is that they can be trained to keep information from long ago, without washing it out over time or removing information that is relevant to the prediction.

| Gate | Description |
| --- | --- |
| Update Gate | Helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future |
| 0️⃣ Reset Gate | Helps the model to decide how much of the past information to forget |

Update Gate

With this gate, the vanishing gradient problem is largely mitigated, since the model learns on its own how much of the past information to pass to the future.

In short: How much past should matter now? 🙄

0️⃣ Reset Gate

This gate does the opposite of the update gate: the model uses it to decide how much of the past information to forget.

In short: Drop previous information? 🙄

💬 Current Memory Content

A candidate memory vector that uses the reset gate to keep only the relevant information from the past.

🎈 Final Memory at Current Time Step

A vector that holds the information for the current unit and passes it further down the network.
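The four pieces described above (update gate, reset gate, current memory content, final memory) can be sketched as one GRU step. These notes do not spell out the GRU equations, so the code assumes one common formulation: $z = \sigma(W_z[h^{<t-1>}, x^{<t>}] + b_z)$, $r = \sigma(W_r[h^{<t-1>}, x^{<t>}] + b_r)$, $\tilde{h} = \tanh(W_h[r * h^{<t-1>}, x^{<t>}] + b_h)$ and $h^{<t>} = z * h^{<t-1>} + (1 - z) * \tilde{h}$. Some references swap the roles of $z$ and $1 - z$. The parameter names and the plain-list representation are illustrative choices to keep the example dependency-free.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(params, h_prev, x_t):
    """One GRU time step under the assumed formulation (z close to 1 keeps the past)."""
    concat = h_prev + x_t  # list concatenation stands in for [h_prev, x_t]
    z = [sigmoid(v) for v in affine(params["W_z"], params["b_z"], concat)]  # update gate
    r = [sigmoid(v) for v in affine(params["W_r"], params["b_r"], concat)]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    # Current memory content: candidate built from the reset-scaled past and the input.
    h_tilde = [math.tanh(v) for v in affine(params["W_h"], params["b_h"], reset_h + x_t)]
    # Final memory: blend of the previous memory and the candidate.
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

# Hypothetical parameters: 2 hidden units, 1 input feature, all weights zero.
n_h, n_x = 2, 1
zeros = [[0.0] * (n_h + n_x) for _ in range(n_h)]
params = {
    "W_z": zeros, "b_z": [20.0, 20.0],  # update gate ~1: keep the past
    "W_r": zeros, "b_r": [0.0, 0.0],
    "W_h": zeros, "b_h": [0.0, 0.0],
}
h = [0.7, -0.3]
for _ in range(100):
    h = gru_step(params, h, x_t=[1.0])
print(h)  # still very close to the starting memory
```

With the update gate saturated near 1, the memory survives 100 steps almost unchanged, which is the "keep information from long ago" behaviour. Setting `b_z` to a large negative value instead makes each step overwrite the memory with the candidate.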

👀 Visualization

🎉 GRU Conclusion

  • A solution that mitigates the vanishing gradient problem

  • The model does not wash out the new input at every time step; instead it keeps the relevant information and passes it down to the next time steps of the network.

🤸‍♀️ Long Short-Term Memory

0️⃣ Forget Gate

  • Let's assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural.

  • If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state.

  • In an LSTM, the forget gate lets us do this:

$$\Gamma_f^{<t>} = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$$

  • Here, $W_f$ are weights that govern the forget gate's behavior. We concatenate $[a^{<t-1>}, x^{<t>}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{<t>}$ with values between 0 and 1.

  • This forget gate vector will be multiplied element-wise by the previous cell state $c^{<t-1>}$.

  • So if one of the values of $\Gamma_f^{<t>}$ is 0 (or close to 0), it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{<t-1>}$.

  • If one of the values is 1, then it will keep the information.
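As a small illustration of the bullets above, the sketch below computes a forget gate and applies it element-wise to a previous cell state. All numbers are hypothetical and chosen so that one gate value saturates near 1 and the other near 0. The helper names are illustrative, and plain Python lists are used to keep the example dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, a_prev, x_t):
    """Gamma_f = sigmoid(W_f [a_prev, x_t] + b_f), computed row by row."""
    concat = a_prev + x_t  # list concatenation plays the role of [a^{<t-1>}, x^{<t>}]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Hypothetical values: 2 hidden units, 1 input feature.
a_prev, x_t = [0.5, -0.5], [1.0]
W_f = [[0.0, 0.0, 10.0],    # large positive pre-activation -> gate near 1 (keep)
       [0.0, 0.0, -10.0]]   # large negative pre-activation -> gate near 0 (forget)
b_f = [0.0, 0.0]
c_prev = [2.0, 3.0]

gamma_f = forget_gate(W_f, b_f, a_prev, x_t)
kept = [g * c for g, c in zip(gamma_f, c_prev)]  # element-wise Gamma_f * c^{<t-1>}
print(gamma_f, kept)
```

The first component of the cell state passes through almost untouched, while the second is scaled down to nearly zero, which is how the stored singular/plural flag would be dropped.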

🔄 Update Gate

Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:

$$\Gamma_u^{<t>} = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$$

Similar to the forget gate, here $\Gamma_u^{<t>}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{<t>}$, in order to compute $c^{<t>}$.

👩‍🔧 Updating the Cell

To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:

$$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$$

Finally, the new cell state is:

$$c^{<t>} = \Gamma_f^{<t>} * c^{<t-1>} + \Gamma_u^{<t>} * \tilde{c}^{<t>}$$

🚪 Output Gate

To decide which outputs we will use, we will use the following two formulas:

$$\Gamma_o^{<t>} = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$$

$$a^{<t>} = \Gamma_o^{<t>} * \tanh(c^{<t>})$$

In the first equation we decide what to output using a sigmoid function, and in the second equation we multiply that by the tanh of the current cell state $c^{<t>}$.
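Putting the forget gate, update gate, cell update, and output gate equations together, a single LSTM time step can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `params` dictionary layout, the helper names, and the plain-list vectors are hypothetical choices made to keep the code dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(params, a_prev, c_prev, x_t):
    """One LSTM time step following the gate equations in this section."""
    concat = a_prev + x_t  # [a^{<t-1>}, x^{<t>}]
    gamma_f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]
    gamma_u = [sigmoid(z) for z in affine(params["W_u"], params["b_u"], concat)]
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]
    gamma_o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]
    # c^{<t>} = Gamma_f * c^{<t-1>} + Gamma_u * c~^{<t>}  (element-wise)
    c_t = [f * c + u * ct for f, c, u, ct in zip(gamma_f, c_prev, gamma_u, c_tilde)]
    # a^{<t>} = Gamma_o * tanh(c^{<t>})
    a_t = [o * math.tanh(c) for o, c in zip(gamma_o, c_t)]
    return a_t, c_t

# Hypothetical parameters: 2 hidden units, 1 input feature, all weights zero.
n_a, n_x = 2, 1
zeros = [[0.0] * (n_a + n_x) for _ in range(n_a)]
params = {
    "W_f": zeros, "b_f": [20.0, 20.0],    # forget gate ~1: keep the old cell state
    "W_u": zeros, "b_u": [-20.0, -20.0],  # update gate ~0: ignore the candidate
    "W_c": zeros, "b_c": [0.0, 0.0],
    "W_o": zeros, "b_o": [0.0, 0.0],      # output gate = 0.5
}
a_t, c_t = lstm_step(params, a_prev=[0.0, 0.0], c_prev=[1.0, -2.0], x_t=[0.3])
print(c_t)  # very close to the previous cell state
```

With these saturated gates the cell state is carried forward almost unchanged, and the activation is half of `tanh` of that cell state, since the output gate sits at 0.5.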

GRU is newer than LSTM. LSTM is often considered more powerful, but GRU is simpler and easier to implement 🚧

🧐 Read More