A class of neural networks that allow outputs from previous time steps to be used as inputs to the following time steps

They remember information from earlier inputs through their hidden state

Basic RNN cell. Takes as input $x^{<t>}$ (current input) and $a^{<t-1>}$ (previous hidden state containing information from the past), and outputs $a^{<t>}$, which is given to the next RNN cell and also used to predict $\hat{y}^{<t>}$

**To find $a^{<t>}$:**

$a^{<t>}=g(W_{aa}a^{<t-1>}+W_{ax}x^{<t>}+b_a)$

**To find $\hat{y}^{<t>}$:**

$\hat{y}^{<t>} = g(W_{ya}a^{<t>}+b_y)$
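The two equations above can be sketched as a single forward step in NumPy. This is a minimal illustration, not a full implementation: the activation choices ($\tanh$ for the hidden state, sigmoid for a binary output) and the function name are assumptions, since the notes only say "$g$".

```python
import numpy as np

def rnn_cell_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step of a basic RNN cell (names follow the equations above).

    Assumes g = tanh for the hidden state and g = sigmoid for the output.
    """
    # a<t> = g(W_aa a<t-1> + W_ax x<t> + b_a)
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # y_hat<t> = g(W_ya a<t> + b_y)
    y_hat_t = 1.0 / (1.0 + np.exp(-(W_ya @ a_t + b_y)))
    return a_t, y_hat_t
```

Calling this in a loop over $t = 1, \dots, T_x$, feeding each returned $a^{<t>}$ back in as `a_prev`, unrolls the full network.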

**The loss function is defined as follows:**

$L^{<t>}(\hat{y}^{<t>}, y^{<t>})=-y^{<t>}\log(\hat{y}^{<t>})-(1-y^{<t>})\log(1-\hat{y}^{<t>})$

$L(\hat{y},y)=\sum_{t=1}^{T_y}L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
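As a sanity check, the per-step binary cross-entropy and its sum over the $T_y$ time steps translate directly into a few lines of NumPy (a sketch; the function name is illustrative):

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Binary cross-entropy summed over time, as in the equations above.

    y_hat, y: arrays of shape (T_y,); predictions in (0, 1), labels in {0, 1}.
    """
    # L<t> = -y<t> log(y_hat<t>) - (1 - y<t>) log(1 - y_hat<t>)
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    # L = sum of L<t> for t = 1 .. T_y
    return per_step.sum()
```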

**One-to-One** (Traditional ANN): 1 → 1

**One-to-Many** (Music Generation): 1 → many

**Many-to-One** (Sentiment Analysis): many → 1

**Many-to-Many**, $T_x = T_y$ (Speech Recognition): many → many

**Many-to-Many**, $T_x \neq T_y$ (Machine Translation): many → many
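To make one of these concrete, a Many-to-One model (e.g. sentiment analysis) reads the whole input sequence but emits a single prediction from the last hidden state. A minimal sketch, with assumed $\tanh$/sigmoid activations and illustrative parameter names:

```python
import numpy as np

def many_to_one(xs, W_aa, W_ax, b_a, W_ya, b_y):
    """Many-to-One: consume T_x inputs, predict once from the final hidden state."""
    a = np.zeros(W_aa.shape[0])
    for x in xs:                                  # one step per input x<t>
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
    # single output, e.g. P(positive sentiment)
    return 1.0 / (1.0 + np.exp(-(W_ya @ a + b_y)))
```

One-to-Many and Many-to-Many variants differ only in when inputs are consumed and outputs emitted inside that loop.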

In many applications we want to output a prediction of $y^{<t>}$ that may depend on the whole input sequence

Bidirectional RNNs combine an RNN that moves **forward** through time, beginning from the start of the sequence, with another RNN that moves **backward** through time, beginning from the end of the sequence

Bidirectional recurrent neural networks (BRNNs) are really just two independent RNNs put together.

The input sequence is fed in normal time order for one network, and in reverse time order for another.

The outputs of the two networks are usually concatenated at each time step.

This structure allows the network to have both backward and forward information about the sequence at every time step.
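The three steps above (forward pass, reversed pass, per-step concatenation) can be sketched as follows. This is an illustrative NumPy toy, assuming $\tanh$ activations and made-up parameter tuples, not a production layer:

```python
import numpy as np

def run_rnn(xs, W_aa, W_ax, b_a):
    """Run a basic tanh RNN over a list of input vectors; return all hidden states."""
    a = np.zeros(W_aa.shape[0])
    states = []
    for x in xs:
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        states.append(a)
    return states

def birnn_states(xs, params_fwd, params_bwd):
    """Concatenate forward-in-time and backward-in-time hidden states per step."""
    fwd = run_rnn(xs, *params_fwd)                # processes t = 1 .. T
    bwd = run_rnn(xs[::-1], *params_bwd)[::-1]    # processes t = T .. 1, then realigned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Note that `birnn_states` needs the full `xs` before it can return anything, which is exactly the limitation discussed next.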

We need the entire input sequence before we can make a prediction anywhere.

e.g.: not suitable for real-time speech recognition

The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. From the input to the hidden state: $x^{<t>}$ → $a^{<t>}$
2. From the previous hidden state to the next hidden state: $a^{<t-1>}$ → $a^{<t>}$
3. From the hidden state to the output: $a^{<t>}$ → $y^{<t>}$

We can use multiple layers for each of the above transformations, which results in deep recurrent networks
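Stacking is the simplest of these deepenings: each layer's hidden-state sequence becomes the input sequence of the layer above. A minimal sketch, assuming $\tanh$ cells and an illustrative `(W_aa, W_ax, b_a)` tuple per layer:

```python
import numpy as np

def deep_rnn(xs, layer_params):
    """Stacked RNN: layer l's hidden states are layer l+1's inputs.

    layer_params: list of (W_aa, W_ax, b_a) tuples, one per layer.
    """
    seq = xs
    for W_aa, W_ax, b_a in layer_params:
        a = np.zeros(W_aa.shape[0])
        out = []
        for x in seq:                             # unroll this layer over time
            a = np.tanh(W_aa @ a + W_ax @ x + b_a)
            out.append(a)
        seq = out                                 # feed states to the next layer
    return seq  # top layer's hidden state at every time step
```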

An RNN that processes a sequence of 10,000 time steps is, when unrolled, effectively a network 10,000 layers deep, which is very hard to optimize

As in deep feedforward networks, deeper networks run into the vanishing gradient problem

The same happens in RNNs processing long sequences

Read Part 2 for my notes on vanishing gradients with RNNs

All About RNNs