Neural Networks

Perceptron, Layers & Activation Functions — Explained from Scratch

Last Updated: March 2026

📌 Key Takeaways

  • Definition: A neural network is a computational model made of layers of interconnected neurons that learns patterns from data by adjusting connection weights.
  • Structure: Input layer → one or more hidden layers → output layer.
  • Activation functions: ReLU (hidden layers), Sigmoid (binary output), Softmax (multi-class output).
  • Learning: Forward propagation (compute output) → Loss calculation → Backpropagation (compute gradients) → Gradient Descent (update weights).
  • Power: Can approximate any continuous function given enough neurons — the Universal Approximation Theorem.
  • Key hyperparameters: Number of layers, neurons per layer, activation function, learning rate, batch size, epochs.

1. Biological Inspiration

The human brain contains approximately 86 billion neurons, each connected to thousands of others. A neuron receives electrical signals through dendrites, processes them in the cell body, and — if the combined signal exceeds a threshold — fires an output signal through its axon to other neurons.

Artificial neural networks are a mathematical abstraction of this process. An artificial neuron receives numerical inputs, multiplies each by a weight (strength of connection), sums them up, adds a bias, and passes the result through an activation function to produce an output. Learning occurs by adjusting the weights — just as the brain strengthens synaptic connections through repeated use.

Important: Modern neural networks are engineering tools, not accurate models of the brain. The biological analogy is motivational, not literal.

2. The Perceptron — Single Neuron

The perceptron is the simplest neural network — a single artificial neuron. Given inputs x₁, x₂, …, xₙ:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b = w · x + b

output = activation(z)

| Symbol | Name | Meaning |
| --- | --- | --- |
| xᵢ | Input | Feature values fed into the neuron |
| wᵢ | Weight | Strength of each connection — learned during training |
| b | Bias | Shifts the activation threshold — also learned |
| z | Pre-activation (logit) | Weighted sum before activation |
| activation(z) | Output | Non-linear transformation of z |

A single perceptron can only learn linearly separable patterns — the same limitation as logistic regression. To model non-linear relationships, multiple layers are needed.
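The weighted-sum-and-activation computation above fits in a few lines of NumPy. This is a minimal sketch using a step activation; the weights and bias are hand-picked for illustration (implementing a logical AND), not learned:

```python
import numpy as np

def perceptron(x, w, b):
    """Single neuron: weighted sum plus bias, then a step activation."""
    z = np.dot(w, x) + b          # z = w · x + b
    return 1 if z > 0 else 0      # step activation: fire if z exceeds 0

# Illustrative weights implementing AND over two binary inputs
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1 (0.5 > 0)
print(perceptron(np.array([1, 0]), w, b))  # 0 (-0.5 <= 0)
```

Training would adjust `w` and `b` from data; here they are fixed to show the forward computation only.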

3. Multi-Layer Neural Network (MLP)

A Multi-Layer Perceptron (MLP) — also called a fully connected neural network or deep neural network — adds one or more hidden layers between the input and output:

  • Input Layer: One neuron per input feature. No computation — just passes values forward.
  • Hidden Layers: One or more layers of neurons that learn intermediate representations. Each neuron connects to every neuron in the previous layer (fully connected). The non-linear activation functions here are what give neural networks their power.
  • Output Layer: Produces the final prediction. Number of neurons = number of output classes (for classification) or 1 (for regression).

The Universal Approximation Theorem states that a neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a bounded domain to arbitrary accuracy. This is why neural networks are called universal function approximators. Note that the theorem only guarantees such a network exists — it says nothing about how many neurons are needed, or whether gradient descent will find the right weights.

4. Activation Functions

Activation functions introduce non-linearity into the network. Without them, stacking multiple layers would be equivalent to a single linear transformation — no matter how many layers, the network could only learn linear functions.

| Function | Formula | Output Range | Best Used In |
| --- | --- | --- | --- |
| ReLU | max(0, z) | [0, ∞) | Hidden layers — the default choice for most networks |
| Leaky ReLU | max(0.01z, z) | (−∞, ∞) | Hidden layers — fixes the dying ReLU problem |
| Sigmoid | 1 / (1 + e⁻ᶻ) | (0, 1) | Output layer for binary classification |
| Tanh | (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ) | (−1, 1) | Hidden layers in RNNs; zero-centred output |
| Softmax | e^zᵢ / Σⱼ e^zⱼ | (0, 1), sums to 1 | Output layer for multi-class classification |
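The table above translates directly into NumPy. This is a minimal sketch; subtracting the maximum inside softmax is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def relu(z):        return np.maximum(0, z)
def leaky_relu(z):  return np.maximum(0.01 * z, z)
def sigmoid(z):     return 1 / (1 + np.exp(-z))
def tanh(z):        return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(softmax(z))     # three positive values summing to 1
print(sigmoid(0.0))   # 0.5
```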

Why ReLU is the Default Choice

ReLU is preferred because:

  • It does not suffer from the vanishing gradient problem for positive inputs (gradient = 1 for z > 0).
  • It is computationally efficient (a simple max operation).
  • It creates sparse activations (many neurons output 0), reducing computation.
  • It converges faster than sigmoid/tanh in practice.

The Vanishing Gradient Problem

Sigmoid and tanh saturate at their extremes — their gradients approach zero for large or small input values. During backpropagation, these near-zero gradients get multiplied together across layers, causing gradients in early layers to become vanishingly small. This makes deep networks very slow to train with sigmoid/tanh. ReLU avoids this for positive values.
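The saturation effect can be checked numerically: the sigmoid gradient σ′(z) = σ(z)(1 − σ(z)) peaks at 0.25 and collapses for large |z|, and multiplying such factors across layers shrinks the signal exponentially. An illustrative sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)          # maximum value is 0.25, at z = 0

print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # ~4.5e-05: the neuron has saturated

# Even in the best case, chaining 10 sigmoid layers multiplies
# gradients by at most 0.25 per layer:
print(0.25 ** 10)           # ~9.5e-07: gradients vanish in early layers
```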

5. Forward Propagation

Forward propagation is the process of computing the network’s output from an input. For each layer l:

Z⁽ˡ⁾ = W⁽ˡ⁾ × A⁽ˡ⁻¹⁾ + b⁽ˡ⁾

A⁽ˡ⁾ = activation(Z⁽ˡ⁾)

Where W⁽ˡ⁾ is the weight matrix for layer l, A⁽ˡ⁻¹⁾ is the output of the previous layer (A⁽⁰⁾ = X, the input), b⁽ˡ⁾ is the bias vector, and A⁽ˡ⁾ is the activated output — which becomes the input to the next layer.

The final layer’s output A⁽ᴸ⁾ is the network’s prediction. For binary classification, A⁽ᴸ⁾ = sigmoid(Z⁽ᴸ⁾); for multi-class, A⁽ᴸ⁾ = softmax(Z⁽ᴸ⁾).
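The two equations above can be traced in NumPy for a small untrained network. The layer sizes (4 → 5 → 3) and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # numerically stable
    return e / e.sum(axis=0, keepdims=True)

# Random (untrained) parameters for a 4 -> 5 -> 3 network
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))
W2, b2 = rng.standard_normal((3, 5)), np.zeros((3, 1))

X = rng.standard_normal((4, 1))   # A(0) = X, one example as a column vector
Z1 = W1 @ X + b1                  # Z(1) = W(1) A(0) + b(1)
A1 = relu(Z1)                     # A(1) = relu(Z(1))
Z2 = W2 @ A1 + b2                 # Z(2) = W(2) A(1) + b(2)
A2 = softmax(Z2)                  # final prediction: 3 class probabilities

print(A2.ravel(), A2.sum())       # probabilities summing to 1
```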

6. Loss Functions

| Task | Loss Function | Formula |
| --- | --- | --- |
| Binary Classification | Binary Cross-Entropy | −[y log(ŷ) + (1−y) log(1−ŷ)] |
| Multi-class Classification | Categorical Cross-Entropy | −Σ yᵢ log(ŷᵢ) |
| Regression | Mean Squared Error | (1/m) Σ (yᵢ − ŷᵢ)² |
| Regression (robust) | Mean Absolute Error | (1/m) Σ \|yᵢ − ŷᵢ\| |
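Each formula above can be written directly in NumPy (a minimal sketch; the example values are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct -> low loss
print(binary_cross_entropy(1, 0.1))  # ~2.303: confident and wrong -> high loss
print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))  # 0.25
```

Note how cross-entropy punishes confident wrong predictions far more heavily than uncertain ones.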

7. Training — The Four-Step Loop

Neural network training repeats this loop for each batch of data, for many epochs:

  1. Forward Propagation: Pass the input batch through all layers to compute predictions.
  2. Compute Loss: Calculate the loss function value — how wrong the predictions are.
  3. Backpropagation: Compute the gradient of the loss with respect to every weight and bias in the network using the chain rule. This tells us which direction to adjust each parameter to reduce loss.
  4. Gradient Descent Update: Adjust all weights and biases in the direction that reduces loss: W := W − α × ∂L/∂W

One epoch = one complete pass through the entire training dataset. An iteration = one gradient update using one batch. If you have 1,000 examples and batch size 100, one epoch = 10 iterations.
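The four-step loop can be sketched for the simplest possible case: a single sigmoid neuron trained by full-batch gradient descent on toy linearly separable data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary data: label is 1 when the sum of the two features is positive
X = rng.standard_normal((200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w, b, alpha = np.zeros(2), 0.0, 0.5   # alpha is the learning rate

for epoch in range(200):
    # 1. Forward propagation
    y_hat = 1 / (1 + np.exp(-(X @ w + b)))
    # 2. Compute loss (binary cross-entropy, averaged over the batch)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # 3. Backpropagation (gradients via the chain rule)
    grad_w = X.T @ (y_hat - y) / len(y)
    grad_b = np.mean(y_hat - y)
    # 4. Gradient descent update: W := W - alpha * dL/dW
    w -= alpha * grad_w
    b -= alpha * grad_b

acc = np.mean((y_hat > 0.5) == y)
print(f"final loss {loss:.3f}, accuracy {acc:.2f}")
```

Here every epoch is a single iteration because the whole dataset is used as one batch; with mini-batches, steps 1–4 would run once per batch.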

8. Key Hyperparameters

| Hyperparameter | Effect | Typical Values |
| --- | --- | --- |
| Learning rate (α) | Step size for gradient descent — the most important hyperparameter | 0.001 (Adam), 0.01–0.1 (SGD) |
| Number of hidden layers | Model depth — more layers = more complex patterns | 1–5 for tabular; 10–100+ for images/NLP |
| Neurons per layer | Model width — more neurons = finer discrimination | 64, 128, 256, 512 |
| Batch size | Examples per gradient update — larger = more stable, slower | 32, 64, 128, 256 |
| Epochs | Passes over the training set — use early stopping to avoid overfitting | 10–1000 depending on dataset size |
| Optimizer | Algorithm for gradient descent | Adam (default), SGD with momentum, RMSprop |
| Dropout rate | Fraction of neurons to drop during training — regularisation | 0.2–0.5 |

9. Python Code


import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

# Load and prepare data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (essential for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# One-hot encode labels for multi-class
y_train_cat = to_categorical(y_train, num_classes=3)
y_test_cat  = to_categorical(y_test, num_classes=3)

# Build the neural network
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # Input layer: 4 features
    keras.layers.Dense(64, activation='relu'),     # Hidden layer 1
    keras.layers.Dropout(0.3),                     # Regularisation
    keras.layers.Dense(32, activation='relu'),     # Hidden layer 2
    keras.layers.Dense(3, activation='softmax')    # Output: 3 classes
])

# Compile
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Train with early stopping
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit(
    X_train, y_train_cat,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0
)

# Evaluate
loss, accuracy = model.evaluate(X_test, y_test_cat, verbose=0)
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Training stopped at epoch: {len(history.history['loss'])}")
    

10. Common Mistakes Students Make

  • Not scaling input features: Neural networks are extremely sensitive to input scale. Unscaled features cause very slow convergence or training failure. Always use StandardScaler or normalise to [0,1] before training.
  • Using sigmoid in hidden layers: Sigmoid saturates and causes vanishing gradients in deep networks. Use ReLU (or Leaky ReLU) in all hidden layers. Only use sigmoid in the output layer for binary classification.
  • Training for a fixed number of epochs without early stopping: Training too long leads to overfitting. Always use EarlyStopping with a validation set — stop training when validation loss stops improving.
  • Not using a validation set: Without a validation set, you cannot monitor overfitting during training. Always split data into train/validation/test, or use validation_split in Keras.
  • Starting with a very complex architecture: Start simple — one or two hidden layers with a moderate number of neurons. Add complexity only if underfitting. A simple model that works is always better than a complex one that does not.

11. Frequently Asked Questions

How many hidden layers should a neural network have?

For most structured/tabular data problems, 1–3 hidden layers is sufficient. Deep architectures (10+ layers) are needed for complex tasks like image recognition, speech, and natural language processing. Always start simple — one hidden layer — and add depth only if the model underfits.

What is the difference between deep learning and neural networks?

Neural networks are the broader category — any architecture of interconnected artificial neurons. Deep learning specifically refers to neural networks with many layers (deep architectures). A neural network with 2 hidden layers is technically deep learning, though the term is typically used for networks with many layers processing complex data like images, audio, or text.

What optimizer should I use for neural networks?

Adam (Adaptive Moment Estimation) is the best default choice — it adapts the learning rate for each parameter automatically and converges faster than vanilla SGD. Start with Adam and learning rate 0.001. For very large models or when fine-tuning, SGD with momentum and learning rate scheduling sometimes outperforms Adam in the long run.
