RNNs & LSTMs
Sequential Data & Time Series — Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- RNN: Neural network with memory — processes sequences by maintaining a hidden state that carries context from previous steps.
- Problem with vanilla RNN: Vanishing gradients — cannot learn long-range dependencies (forgets information from many steps ago).
- LSTM: Solves vanishing gradients with three gates (forget, input, output) and a cell state for long-term memory.
- GRU: Simplified LSTM with two gates — similar performance, faster training.
- Applications: Time series forecasting, language modelling, machine translation, speech recognition, sentiment analysis.
- Note: Transformers have largely replaced RNNs/LSTMs for NLP tasks, but RNNs remain relevant for certain time series and control applications.
1. The Sequential Data Problem
Standard neural networks process each input independently — they have no memory of previous inputs. This is fine for fixed-size, unordered data like tabular records or image pixels. But for sequential data — where order and context matter — this is a fundamental limitation.
Examples of sequential data: text (each word depends on previous words), time series (stock prices, sensor readings), speech (each sound depends on surrounding sounds), video (each frame follows the previous), DNA sequences.
To process sequential data, a model needs to maintain a memory of past inputs and use it to inform predictions about future inputs. Recurrent Neural Networks provide exactly this capability.
2. RNN Architecture
An RNN processes sequences one element at a time. At each time step t, it takes two inputs: the current input xₜ and the previous hidden state hₜ₋₁ (its “memory”). It produces a new hidden state hₜ and an optional output yₜ:
hₜ = tanh(Wₕ × hₜ₋₁ + Wₓ × xₜ + b)
yₜ = Wᵧ × hₜ + bᵧ
The same weights (Wₕ, Wₓ, Wᵧ) are used at every time step: parameter sharing across time, analogous to how CNNs share parameters across space. This means an RNN has far fewer parameters than a fully connected network that processes all time steps simultaneously.
Unrolling Through Time
An RNN can be visualised as a chain of identical networks, one per time step, each passing its hidden state to the next. Backpropagation through this unrolled network is called Backpropagation Through Time (BPTT).
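The unrolled computation can be sketched directly in NumPy. This is a minimal forward pass only (no training), with hypothetical toy dimensions; note that the same W_h, W_x, W_y are reused at every step of the loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 input features, 5 hidden units, 2 outputs
input_dim, hidden_dim, output_dim = 3, 5, 2

# One set of weights, shared across all time steps
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b   = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """Unrolled forward pass over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(hidden_dim)                    # initial hidden state
    outputs = []
    for x_t in xs:                              # one iteration per time step
        h = np.tanh(W_h @ h + W_x @ x_t + b)    # new hidden state
        outputs.append(W_y @ h + b_y)           # optional per-step output
    return np.array(outputs), h

ys, h_final = rnn_forward(rng.normal(size=(10, input_dim)))
print(ys.shape)   # (10, 2): one output per time step
```

Each loop iteration is one "copy" of the network in the unrolled view; BPTT differentiates back through this loop.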
3. The Vanishing Gradient Problem in RNNs
During BPTT, gradients must flow backward through every time step. With long sequences (say, 100 time steps), the gradient is multiplied by the recurrent weight matrix and the tanh derivative at each step. The tanh derivative is at most 1, and usually well below it, so after many steps the gradient shrinks exponentially: it vanishes. (With large recurrent weights the opposite can happen: exploding gradients, usually handled by gradient clipping.)
The result: the model cannot adjust weights based on information from many steps ago. In practice, vanilla RNNs can only effectively use context from the last ~10 steps. For tasks requiring longer memory (e.g., understanding the subject of a sentence at the end of a long paragraph), vanilla RNNs fail.
This is the fundamental motivation for LSTMs.
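The exponential decay is easy to see numerically. A simplified sketch (ignoring the weight matrix, with hypothetical random pre-activations) multiplies 100 tanh derivatives together:

```python
import numpy as np

# The backward gradient through T steps is (roughly) a product of T factors,
# each containing tanh'(a) = 1 - tanh(a)^2, which is at most 1.
rng = np.random.default_rng(0)
grad = 1.0
for t in range(100):
    a = rng.normal()                   # hypothetical pre-activation at step t
    grad *= (1 - np.tanh(a) ** 2)      # multiply by the tanh derivative
print(f"gradient factor after 100 steps: {grad:.3e}")
```

The product is vanishingly small, so weight updates driven by step-1 information are effectively zero by step 100.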
4. LSTM — Long Short-Term Memory
LSTM (introduced by Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem with two key innovations:
- Cell state (cₜ): A separate “conveyor belt” that runs through the entire sequence with only minor, controlled modifications at each step. Gradients can flow through the cell state without being multiplied by tanh derivatives at every step, which is what defeats the vanishing gradient problem.
- Three gates: Sigmoid-activated gates that selectively control what information flows through the cell state, protecting long-term memory from being overwritten by irrelevant short-term inputs.
5. The Three LSTM Gates
Forget Gate — What to erase from memory
fₜ = sigmoid(Wᶠ × [hₜ₋₁, xₜ] + bᶠ)
Output: values between 0 and 1, applied element-wise to the previous cell state cₜ₋₁. 0 = completely forget that component; 1 = completely keep it.
Input Gate — What new information to add
iₜ = sigmoid(Wᵢ × [hₜ₋₁, xₜ] + bᵢ)
c̃ₜ = tanh(Wᶜ × [hₜ₋₁, xₜ] + bᶜ)
cₜ = fₜ × cₜ₋₁ + iₜ × c̃ₜ
iₜ decides how much of the new candidate c̃ₜ to add to the cell state. The new cell state cₜ is the old state (selectively forgotten) plus new information (selectively added).
Output Gate — What to output
oₜ = sigmoid(Wᵒ × [hₜ₋₁, xₜ] + bᵒ)
hₜ = oₜ × tanh(cₜ)
The output gate decides what part of the cell state to expose as the hidden state hₜ, which is passed to the next step and used for predictions.
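The three gate equations combine into one cell update. A minimal NumPy sketch of a single LSTM step (hypothetical toy sizes, random untrained weights; each W acts on the concatenation of the previous hidden state and the current input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)             # input gate: what to add
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(W_o @ z + b_o)             # output gate: what to expose
    h = o * np.tanh(c)                     # new hidden state
    return h, c

# Sanity check with hypothetical sizes: 4 hidden units, 3 input features
rng = np.random.default_rng(0)
H, D = 4, 3
params = []
for _ in range(4):      # one (W, b) pair per gate + candidate
    params += [rng.normal(scale=0.1, size=(H, H + D)), np.zeros(H)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), params)
print(h.shape, c.shape)   # (4,) (4,)
```

Note that the cell state c is updated only by element-wise multiplication and addition; no tanh squashing sits on the gradient path through c itself.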
6. GRU — Gated Recurrent Unit
The GRU (Cho et al., 2014) simplifies LSTM by merging the cell state and hidden state, and using two gates instead of three:
- Update gate (zₜ): Controls how much of the previous hidden state to keep vs the new candidate. Combines the functions of LSTM’s forget and input gates.
- Reset gate (rₜ): Controls how much of the previous hidden state to use when computing the new candidate.
GRUs have fewer parameters than LSTMs, train faster, and often achieve similar or better performance on smaller datasets. They are a good default choice when LSTMs seem to overfit or train too slowly.
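A single GRU step can be sketched the same way. This follows one common formulation (conventions vary between papers; some swap the roles of z and 1 − z), with hypothetical toy sizes and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate z, reset gate r, merged cell/hidden state."""
    W_z, b_z, W_r, b_r, W_h, b_h = params
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ zx + b_z)            # update gate: keep old vs take new
    r = sigmoid(W_r @ zx + b_r)            # reset gate: how much history to use
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old state and candidate

# Hypothetical sizes: 4 hidden units, 3 input features
rng = np.random.default_rng(0)
H, D = 4, 3
params = []
for _ in range(3):      # one (W, b) pair per gate + candidate
    params += [rng.normal(scale=0.1, size=(H, H + D)), np.zeros(H)]
h = gru_step(rng.normal(size=D), np.zeros(H), params)
print(h.shape)   # (4,)
```

With three weight blocks instead of the LSTM's four, and no separate cell state, the parameter savings follow directly.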
7. RNN vs LSTM vs GRU
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | None | 3 (forget, input, output) | 2 (update, reset) |
| Memory | Short-term only | Long-term + short-term | Moderate long-term |
| Vanishing gradient | Severe | Solved | Largely solved |
| Parameters | Fewest | Most | ~25% fewer than LSTM |
| Training speed | Fastest | Slowest | Faster than LSTM |
| Best for | Very short sequences | Long sequences, complex tasks | Moderate sequences, faster training |
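The parameter-count row can be checked empirically. Assuming TensorFlow/Keras is installed (as in the code section below), build one layer of each type with the same hidden size and compare; the exact GRU count depends on the `reset_after` default, but the ordering holds:

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((1, 20, 8), dtype="float32")   # hypothetical: 8 input features

rnn  = layers.SimpleRNN(32); rnn(x)   # 1 weight block
lstm = layers.LSTM(32);      lstm(x)  # 4 weight blocks (3 gates + candidate)
gru  = layers.GRU(32);       gru(x)   # 3 weight blocks (2 gates + candidate)

for name, layer in [("SimpleRNN", rnn), ("LSTM", lstm), ("GRU", gru)]:
    print(f"{name:10s} {layer.count_params():6d} parameters")
```

Calling each layer once builds its weights so `count_params()` returns a nonzero value.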
8. Applications
| Application | Input | Output | Model Type |
|---|---|---|---|
| Sentiment analysis | Text sequence | Positive/Negative | Many-to-one |
| Stock price forecasting | Price time series | Next price | Many-to-one |
| Machine translation | Source sentence | Target sentence | Many-to-many (seq2seq) |
| Speech recognition | Audio frames | Text characters | Many-to-many |
| Text generation | Seed text | Generated text | Many-to-many |
| Anomaly detection | Sensor readings | Normal/Anomaly | Many-to-one |
9. Python Code
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# --- LSTM for Time Series Forecasting ---
# Generate a simple sine wave dataset
t = np.linspace(0, 100, 1000)
data = np.sin(t)
# Create sequences: use 20 time steps to predict the next value
SEQ_LEN = 20
X = np.array([data[i:i+SEQ_LEN] for i in range(len(data)-SEQ_LEN)])
y = np.array([data[i+SEQ_LEN] for i in range(len(data)-SEQ_LEN)])
X = X.reshape(-1, SEQ_LEN, 1) # Shape: (samples, timesteps, features)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build LSTM model
model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, 1)),         # (timesteps, features)
    layers.LSTM(64, return_sequences=True),  # full sequence to next LSTM layer
    layers.Dropout(0.2),
    layers.LSTM(32),                         # final hidden state only
    layers.Dense(1)                          # regression output
])
model.compile(optimizer='adam', loss='mse')
model.summary()
model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_data=(X_test, y_test), verbose=0)
loss = model.evaluate(X_test, y_test, verbose=0)
print(f"Test MSE: {loss:.6f}")
# --- GRU alternative ---
gru_model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, 1)),
    layers.GRU(64, return_sequences=True),
    layers.Dropout(0.2),
    layers.GRU(32),
    layers.Dense(1)
])
gru_model.compile(optimizer='adam', loss='mse')
# Train and evaluate exactly as above
gru_model.fit(X_train, y_train, epochs=20, batch_size=32,
              validation_data=(X_test, y_test), verbose=0)
print(f"GRU test MSE: {gru_model.evaluate(X_test, y_test, verbose=0):.6f}")
10. Frequently Asked Questions
Have Transformers replaced LSTMs completely?
For most NLP tasks (text classification, translation, question answering), yes — Transformers (BERT, GPT, T5) have largely replaced LSTMs due to better performance and parallelisability. However, LSTMs and GRUs are still competitive for: real-time streaming data (where the full sequence is not available upfront), certain time series tasks (especially with very long sequences and limited data), embedded systems (transformers are much larger), and reinforcement learning. LSTMs are not obsolete — they are just more specialised now.
What does return_sequences=True mean in Keras LSTM?
By default (return_sequences=False), an LSTM layer returns only the final hidden state — one output per sequence. With return_sequences=True, it returns the hidden state at every time step — one output per input time step. Use return_sequences=True when stacking LSTM layers (each layer needs the full sequence from the previous one) or when you need predictions at every time step (e.g., sequence labelling).
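The shape difference is easy to verify directly (hidden size 16 and the batch/sequence dimensions here are arbitrary illustration values):

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((8, 20, 1), dtype="float32")    # (batch, timesteps, features)

seq_layer  = layers.LSTM(16, return_sequences=True)
last_layer = layers.LSTM(16)                 # return_sequences=False (default)

print(seq_layer(x).shape)    # (8, 20, 16): one hidden state per time step
print(last_layer(x).shape)   # (8, 16): only the final hidden state
```

This is why the first LSTM in the stacked model above sets return_sequences=True: the second LSTM expects a 3-D (batch, timesteps, features) input.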