RNNs & LSTMs
Sequential Data & Time Series — Explained for Engineering Students
Last Updated: March 2026
📌 Key Takeaways
- RNN: Neural network with memory — processes sequences by maintaining a hidden state that carries context from previous steps.
- Problem with vanilla RNN: Vanishing gradients — cannot learn long-range dependencies (forgets information from many steps ago).
- LSTM: Solves vanishing gradients with three gates (forget, input, output) and a cell state for long-term memory.
- GRU: Simplified LSTM with two gates — similar performance, faster training.
- Applications: Time series forecasting, language modelling, machine translation, speech recognition, sentiment analysis.
- Note: Transformers have largely replaced RNNs/LSTMs for NLP tasks, but RNNs remain relevant for certain time series and control applications.
1. The Sequential Data Problem
Standard neural networks process each input independently — they have no memory of previous inputs. This is fine for fixed-size, unordered data like tabular records or image pixels. But for sequential data — where order and context matter — this is a fundamental limitation.
Examples of sequential data: text (each word depends on previous words), time series (stock prices, sensor readings), speech (each sound depends on surrounding sounds), video (each frame follows the previous), DNA sequences.
To process sequential data, a model needs to maintain a memory of past inputs and use it to inform predictions about future inputs. Recurrent Neural Networks provide exactly this capability.
2. RNN Architecture
An RNN processes sequences one element at a time. At each time step t, it takes two inputs: the current input xₜ and the previous hidden state hₜ₋₁ (its “memory”). It produces a new hidden state hₜ and an optional output yₜ:
hₜ = tanh(Wₕ × hₜ₋₁ + Wₓ × xₜ + b)
yₜ = Wᵧ × hₜ + bᵧ
The same weights (Wₕ, Wₓ, Wᵧ) are used at every time step: parameter sharing across time, analogous to how CNNs share parameters across space. This means an RNN has far fewer parameters than a fully connected network that processes all time steps simultaneously.
Unrolling Through Time
An RNN can be visualised as a chain of identical networks, one per time step, each passing its hidden state to the next. Backpropagation through this unrolled network is called Backpropagation Through Time (BPTT).
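The unrolled computation can be sketched directly in NumPy. This is a minimal forward pass only (no training), with hypothetical toy dimensions; note that the same W_h, W_x, W_y are reused at every step of the loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 input features, 5 hidden units, 2 outputs
input_dim, hidden_dim, output_dim = 3, 5, 2

# One set of weights, shared across all time steps
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b   = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """Unrolled forward pass over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(hidden_dim)                    # initial hidden state
    outputs = []
    for x_t in xs:                              # one iteration per time step
        h = np.tanh(W_h @ h + W_x @ x_t + b)    # new hidden state
        outputs.append(W_y @ h + b_y)           # optional per-step output
    return np.array(outputs), h

ys, h_final = rnn_forward(rng.normal(size=(10, input_dim)))
print(ys.shape)   # (10, 2): one output per time step
```

Each loop iteration is one "copy" of the network in the unrolled view; BPTT differentiates back through this loop.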
3. The Vanishing Gradient Problem in RNNs
During BPTT, gradients must flow backward through every time step. With long sequences (say, 100 time steps), the gradient is multiplied by the recurrent weight matrix and the tanh derivative at each step. The tanh derivative is at most 1, and usually well below it, so after many steps the gradient shrinks exponentially: it vanishes. (With large recurrent weights the opposite can happen: exploding gradients, usually handled by gradient clipping.)
The result: the model cannot adjust weights based on information from many steps ago. In practice, vanilla RNNs can only effectively use context from the last ~10 steps. For tasks requiring longer memory (e.g., understanding the subject of a sentence at the end of a long paragraph), vanilla RNNs fail.
This is the fundamental motivation for LSTMs.
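The exponential decay is easy to see numerically. A simplified sketch (ignoring the weight matrix, with hypothetical random pre-activations) multiplies 100 tanh derivatives together:

```python
import numpy as np

# The backward gradient through T steps is (roughly) a product of T factors,
# each containing tanh'(a) = 1 - tanh(a)^2, which is at most 1.
rng = np.random.default_rng(0)
grad = 1.0
for t in range(100):
    a = rng.normal()                   # hypothetical pre-activation at step t
    grad *= (1 - np.tanh(a) ** 2)      # multiply by the tanh derivative
print(f"gradient factor after 100 steps: {grad:.3e}")
```

The product is vanishingly small, so weight updates driven by step-1 information are effectively zero by step 100.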
4. LSTM — Long Short-Term Memory
LSTM (introduced by Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem with two key innovations:
- Cell state (cₜ): A separate “conveyor belt” that runs through the entire sequence with only minor, controlled modifications at each step. Gradients can flow through the cell state without being multiplied by tanh derivatives at every step, which is what defeats the vanishing gradient problem.
- Three gates: Sigmoid-activated gates that selectively control what information flows through the cell state, protecting long-term memory from being overwritten by irrelevant short-term inputs.
5. The Three LSTM Gates
Forget Gate — What to erase from memory
fₜ = sigmoid(Wᶠ × [hₜ₋₁, xₜ] + bᶠ)
Output: values between 0 and 1, applied element-wise to the previous cell state cₜ₋₁. 0 = completely forget that component; 1 = completely keep it.
Input Gate — What new information to add
iₜ = sigmoid(Wᵢ × [hₜ₋₁, xₜ] + bᵢ)
c̃ₜ = tanh(Wᶜ × [hₜ₋₁, xₜ] + bᶜ)
cₜ = fₜ × cₜ₋₁ + iₜ × c̃ₜ
iₜ decides how much of the new candidate c̃ₜ to add to the cell state. The new cell state cₜ is the old state (selectively forgotten) plus new information (selectively added).
Output Gate — What to output
oₜ = sigmoid(Wᵒ × [hₜ₋₁, xₜ] + bᵒ)
hₜ = oₜ × tanh(cₜ)
The output gate decides what part of the cell state to expose as the hidden state hₜ, which is passed to the next step and used for predictions.
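The three gate equations combine into one cell update. A minimal NumPy sketch of a single LSTM step (hypothetical toy sizes, random untrained weights; each W acts on the concatenation of the previous hidden state and the current input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)             # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)             # input gate: what to add
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(W_o @ z + b_o)             # output gate: what to expose
    h = o * np.tanh(c)                     # new hidden state
    return h, c

# Sanity check with hypothetical sizes: 4 hidden units, 3 input features
rng = np.random.default_rng(0)
H, D = 4, 3
params = []
for _ in range(4):      # one (W, b) pair per gate + candidate
    params += [rng.normal(scale=0.1, size=(H, H + D)), np.zeros(H)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), params)
print(h.shape, c.shape)   # (4,) (4,)
```

Note that the cell state c is updated only by element-wise multiplication and addition; no tanh squashing sits on the gradient path through c itself.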
6. GRU — Gated Recurrent Unit
The GRU (Cho et al., 2014) simplifies LSTM by merging the cell state and hidden state, and using two gates instead of three:
- Update gate (zₜ): Controls how much of the previous hidden state to keep vs the new candidate. Combines the functions of LSTM’s forget and input gates.
- Reset gate (rₜ): Controls how much of the previous hidden state to use when computing the new candidate.
GRUs have fewer parameters than LSTMs, train faster, and often achieve similar or better performance on smaller datasets. They are a good default choice when LSTMs seem to overfit or train too slowly.
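A single GRU step can be sketched the same way. This follows one common formulation (conventions vary between papers; some swap the roles of z and 1 − z), with hypothetical toy sizes and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate z, reset gate r, merged cell/hidden state."""
    W_z, b_z, W_r, b_r, W_h, b_h = params
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ zx + b_z)            # update gate: keep old vs take new
    r = sigmoid(W_r @ zx + b_r)            # reset gate: how much history to use
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old state and candidate

# Hypothetical sizes: 4 hidden units, 3 input features
rng = np.random.default_rng(0)
H, D = 4, 3
params = []
for _ in range(3):      # one (W, b) pair per gate + candidate
    params += [rng.normal(scale=0.1, size=(H, H + D)), np.zeros(H)]
h = gru_step(rng.normal(size=D), np.zeros(H), params)
print(h.shape)   # (4,)
```

With three weight blocks instead of the LSTM's four, and no separate cell state, the parameter savings follow directly.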
7. RNN vs LSTM vs GRU
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | None | 3 (forget, input, output) | 2 (update, reset) |
| Memory | Short-term only | Long-term + short-term | Moderate long-term |
| Vanishing gradient | Severe | Solved | Largely solved |
| Parameters | Fewest | Most | ~25% fewer than LSTM |
| Training speed | Fastest | Slowest | Faster than LSTM |
| Best for | Very short sequences | Long sequences, complex tasks | Moderate sequences, faster training |
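The parameter-count row can be checked empirically. Assuming TensorFlow/Keras is installed (as in the code section below), build one layer of each type with the same hidden size and compare; the exact GRU count depends on the `reset_after` default, but the ordering holds:

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((1, 20, 8), dtype="float32")   # hypothetical: 8 input features

rnn  = layers.SimpleRNN(32); rnn(x)   # 1 weight block
lstm = layers.LSTM(32);      lstm(x)  # 4 weight blocks (3 gates + candidate)
gru  = layers.GRU(32);       gru(x)   # 3 weight blocks (2 gates + candidate)

for name, layer in [("SimpleRNN", rnn), ("LSTM", lstm), ("GRU", gru)]:
    print(f"{name:10s} {layer.count_params():6d} parameters")
```

Calling each layer once builds its weights so `count_params()` returns a nonzero value.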
8. Applications
| Application | Input | Output | Model Type |
|---|---|---|---|
| Sentiment analysis | Text sequence | Positive/Negative | Many-to-one |
| Stock price forecasting | Price time series | Next price | Many-to-one |
| Machine translation | Source sentence | Target sentence | Many-to-many (seq2seq) |
| Speech recognition | Audio frames | Text characters | Many-to-many |
| Text generation | Seed text | Generated text | Many-to-many |
| Anomaly detection | Sensor readings | Normal/Anomaly | Many-to-one |
9. Python Code
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# --- LSTM for Time Series Forecasting ---
# Generate a simple sine wave dataset
t = np.linspace(0, 100, 1000)
data = np.sin(t)
# Create sequences: use 20 time steps to predict the next value
SEQ_LEN = 20
X = np.array([data[i:i+SEQ_LEN] for i in range(len(data)-SEQ_LEN)])
y = np.array([data[i+SEQ_LEN] for i in range(len(data)-SEQ_LEN)])
X = X.reshape(-1, SEQ_LEN, 1) # Shape: (samples, timesteps, features)
# Split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Build LSTM model
model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, 1)),         # (timesteps, features)
    layers.LSTM(64, return_sequences=True),  # full sequence to next LSTM layer
    layers.Dropout(0.2),
    layers.LSTM(32),                         # final hidden state only
    layers.Dense(1)                          # regression output
])
model.compile(optimizer='adam', loss='mse')
model.summary()
model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_data=(X_test, y_test), verbose=0)
loss = model.evaluate(X_test, y_test, verbose=0)
print(f"Test MSE: {loss:.6f}")
# --- GRU alternative ---
gru_model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, 1)),
    layers.GRU(64, return_sequences=True),
    layers.Dropout(0.2),
    layers.GRU(32),
    layers.Dense(1)
])
gru_model.compile(optimizer='adam', loss='mse')
# Train and evaluate exactly as above
gru_model.fit(X_train, y_train, epochs=20, batch_size=32,
              validation_data=(X_test, y_test), verbose=0)
print(f"GRU test MSE: {gru_model.evaluate(X_test, y_test, verbose=0):.6f}")
10. Frequently Asked Questions
Have Transformers replaced LSTMs completely?
For most NLP tasks (text classification, translation, question answering), yes — Transformers (BERT, GPT, T5) have largely replaced LSTMs due to better performance and parallelisability. However, LSTMs and GRUs are still competitive for: real-time streaming data (where the full sequence is not available upfront), certain time series tasks (especially with very long sequences and limited data), embedded systems (transformers are much larger), and reinforcement learning. LSTMs are not obsolete — they are just more specialised now.
What does return_sequences=True mean in Keras LSTM?
By default (return_sequences=False), an LSTM layer returns only the final hidden state — one output per sequence. With return_sequences=True, it returns the hidden state at every time step — one output per input time step. Use return_sequences=True when stacking LSTM layers (each layer needs the full sequence from the previous one) or when you need predictions at every time step (e.g., sequence labelling).
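The shape difference is easy to verify directly (hidden size 16 and the batch/sequence dimensions here are arbitrary illustration values):

```python
import numpy as np
from tensorflow.keras import layers

x = np.zeros((8, 20, 1), dtype="float32")    # (batch, timesteps, features)

seq_layer  = layers.LSTM(16, return_sequences=True)
last_layer = layers.LSTM(16)                 # return_sequences=False (default)

print(seq_layer(x).shape)    # (8, 20, 16): one hidden state per time step
print(last_layer(x).shape)   # (8, 16): only the final hidden state
```

This is why the first LSTM in the stacked model above sets return_sequences=True: the second LSTM expects a 3-D (batch, timesteps, features) input.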