Python for Machine Learning

NumPy, Pandas & Matplotlib Cheatsheet — For Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Core ML stack: NumPy + Pandas + Matplotlib + Scikit-learn — master these four first.
  • NumPy: Fast numerical arrays, linear algebra, mathematical operations.
  • Pandas: Tabular data loading, cleaning, exploration, and transformation.
  • Matplotlib/Seaborn: Data visualisation — histograms, scatter plots, correlation heatmaps.
  • Scikit-learn: ML algorithms, preprocessing, model evaluation — the Swiss army knife of ML.
  • Deep learning: PyTorch (research, learning) or TensorFlow/Keras (production).

1. Environment Setup


# Install all core ML libraries
pip install numpy pandas matplotlib seaborn scikit-learn jupyter

# For deep learning (choose one or both)
pip install torch torchvision          # PyTorch
pip install tensorflow keras           # TensorFlow

# For NLP
pip install nltk spacy transformers gensim

# Start Jupyter Notebook
jupyter notebook

# Recommended: Use Anaconda for easy environment management
# Download from https://www.anaconda.com/
    

For students without a powerful local machine: use Google Colab (free GPU access, no setup needed) at colab.research.google.com. All major ML libraries are pre-installed.

2. NumPy — The Foundation

NumPy provides fast, vectorised operations on multi-dimensional arrays. Every major ML framework accepts NumPy arrays and interoperates with them, which is why it sits at the base of the stack.
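The speed advantage of vectorisation is easy to verify yourself. A minimal sketch (the array size and the use of `time.perf_counter` are arbitrary choices for illustration):

```python
import time
import numpy as np

n = 1_000_000
values = np.random.randn(n)

# Pure-Python loop: sums element by element in the interpreter
start = time.perf_counter()
total_loop = sum(float(v) for v in values)
loop_time = time.perf_counter() - start

# Vectorised NumPy: a single call into optimised compiled code
start = time.perf_counter()
total_np = values.sum()
np_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, numpy: {np_time:.4f}s")
```

On typical hardware the vectorised version is two to three orders of magnitude faster, while both compute the same sum (up to floating-point rounding).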


import numpy as np

# --- Array Creation ---
a = np.array([1, 2, 3, 4, 5])           # 1D array
b = np.array([[1,2,3],[4,5,6]])          # 2D array (matrix)
c = np.zeros((3, 4))                     # 3x4 matrix of zeros
d = np.ones((2, 3))                      # 2x3 matrix of ones
e = np.eye(4)                            # 4x4 identity matrix
f = np.random.randn(100, 5)             # 100x5 random normal
g = np.linspace(0, 1, 50)               # 50 evenly spaced values from 0 to 1
h = np.arange(0, 10, 0.5)               # Values from 0 up to (not including) 10, step 0.5

# --- Array Properties ---
print(b.shape)    # (2, 3)
print(b.dtype)    # int64
print(b.ndim)     # 2
print(b.size)     # 6

# --- Indexing & Slicing ---
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(arr[0, 1])     # 2 -- element at row 0, col 1
print(arr[:, 1])     # [2, 5, 8] -- entire column 1
print(arr[1:, :2])   # [[4,5],[7,8]] -- rows 1+, columns 0-1
print(arr[arr > 5])  # [6, 7, 8, 9] -- boolean indexing

# --- Mathematical Operations (vectorised -- no loops needed!) ---
x = np.array([1, 2, 3, 4, 5])
print(x * 2)           # [2, 4, 6, 8, 10]
print(x ** 2)          # [1, 4, 9, 16, 25]
print(np.sqrt(x))      # element-wise square root
print(np.exp(x))       # element-wise e^x
print(x.mean())        # 3.0
print(x.std())         # standard deviation
print(x.sum())         # 15

# --- Linear Algebra ---
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)            # Matrix multiplication
print(np.linalg.det(A)) # Determinant
print(np.linalg.inv(A)) # Inverse
vals, vecs = np.linalg.eig(A)  # Eigenvalues and eigenvectors
    

3. Pandas — Data Manipulation


import pandas as pd
import numpy as np

# --- Creating DataFrames ---
data = {
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'age': [25, 30, 28, 35],
    'score': [88.5, 92.0, 79.5, 95.0],
    'grade': ['B', 'A', 'C', 'A']
}
df = pd.DataFrame(data)

# --- Loading Data ---
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')

# --- Exploring Data ---
print(df.head(3))          # First 3 rows
print(df.tail(2))          # Last 2 rows
print(df.shape)            # (rows, columns)
print(df.dtypes)           # Data types
print(df.describe())       # Statistical summary (mean, std, min, max...)
df.info()                  # Overview with dtypes and non-null counts (prints directly)
print(df.isnull().sum())   # Count missing values per column

# --- Selecting Data ---
print(df['name'])                    # Single column (Series)
print(df[['name', 'score']])         # Multiple columns
print(df.loc[0])                     # Row by label
print(df.iloc[1:3])                  # Rows 1 and 2 by position
print(df[df['score'] > 85])          # Filter rows
print(df[(df['age'] > 25) & (df['grade'] == 'A')])  # Multiple conditions

# --- Aggregation (done before dropping 'grade' below) ---
print(df.groupby('grade')['score'].mean())   # Mean score by grade
print(df['grade'].value_counts())            # Frequency count per grade

# --- Modifying Data ---
df['score_scaled'] = df['score'] / 100      # Add new column
df['score'] = df['score'].fillna(df['score'].mean())  # Fill missing values
df.drop('grade', axis=1, inplace=True)       # Remove column
df.rename(columns={'name': 'student_name'}, inplace=True)

# --- Handling Missing Values (each returns a new DataFrame) ---
df.dropna()                              # Drop rows with any NaN
df.fillna(0)                             # Replace NaN with 0
df.fillna(df.mean(numeric_only=True))    # Replace with column means

# --- Sorting ---
df.sort_values('score', ascending=False)     # Returns a sorted copy
    

4. Matplotlib & Seaborn — Visualisation


import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# --- Basic Matplotlib Plots ---
x = np.linspace(0, 2*np.pi, 100)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Line plot
axes[0,0].plot(x, np.sin(x), 'b-', label='sin(x)')
axes[0,0].plot(x, np.cos(x), 'r--', label='cos(x)')
axes[0,0].set_title('Trigonometric Functions')
axes[0,0].legend()

# Scatter plot
np.random.seed(42)
axes[0,1].scatter(np.random.randn(100), np.random.randn(100),
                  c='blue', alpha=0.6)
axes[0,1].set_title('Scatter Plot')

# Histogram
data = np.random.randn(1000)
axes[1,0].hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
axes[1,0].set_title('Histogram')

# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [25, 40, 30, 55]
axes[1,1].bar(categories, values, color='orange')
axes[1,1].set_title('Bar Chart')

plt.tight_layout()
plt.savefig('plots.png', dpi=150)
plt.show()

# --- Seaborn (Statistical Visualisation) ---
df = sns.load_dataset('iris')

# Pairplot -- relationships between all numerical features
sns.pairplot(df, hue='species', diag_kind='hist')
plt.show()

# Heatmap -- correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df.drop('species', axis=1).corr(),
            annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

# Distribution plot
sns.histplot(df['sepal_length'], kde=True, bins=20)
plt.title('Sepal Length Distribution')
plt.show()
    

5. Scikit-learn — Complete ML Pipeline


import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                              confusion_matrix, classification_report)
import matplotlib.pyplot as plt

# --- 1. Load Data ---
X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names
print(f"Dataset shape: {X.shape}")      # (569, 30)
print(f"Class balance: {np.bincount(y)}")

# --- 2. Split Data ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- 3. Build Pipeline (preprocessing + model) ---
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# --- 4. Cross-Validation ---
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# --- 5. Hyperparameter Tuning ---
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")

# --- 6. Evaluate on Test Set ---
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Test F1:       {f1_score(y_test, y_pred):.3f}")
print(f"Test AUC-ROC:  {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=['Malignant', 'Benign']))

# --- 7. Feature Importance ---
importances = best_model.named_steps['model'].feature_importances_
top_features = sorted(zip(feature_names, importances), key=lambda x: -x[1])[:10]
print("\nTop 10 Most Important Features:")
for feat, imp in top_features:
    print(f"  {feat}: {imp:.4f}")
    

6. PyTorch Basics


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# --- Prepare Data ---
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_test_t  = torch.tensor(X_test, dtype=torch.float32)
y_test_t  = torch.tensor(y_test, dtype=torch.long)

# DataLoader for mini-batch training
dataset = TensorDataset(X_train_t, y_train_t)
loader  = DataLoader(dataset, batch_size=16, shuffle=True)

# --- Define Model ---
class IrisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(4, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 3)  # 3 classes
        )

    def forward(self, x):
        return self.network(x)

model = IrisNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# --- Training Loop ---
for epoch in range(100):
    model.train()
    for X_batch, y_batch in loader:
        optimizer.zero_grad()       # Clear gradients
        outputs = model(X_batch)    # Forward pass
        loss = criterion(outputs, y_batch)  # Compute loss
        loss.backward()             # Backpropagation
        optimizer.step()            # Update weights

# --- Evaluation ---
model.eval()
with torch.no_grad():
    outputs = model(X_test_t)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_test_t).float().mean()
    print(f"Test Accuracy: {accuracy.item():.3f}")
    

7. Typical ML Project Workflow

  1. Load data — pd.read_csv() or sklearn.datasets
  2. Explore — df.describe(), df.info(), df.isnull().sum(), sns.pairplot()
  3. Clean — handle missing values, remove duplicates, fix data types
  4. Visualise — distributions, correlations, class balance
  5. Preprocess — scale features, encode categoricals, split train/test
  6. Build pipeline — Pipeline([('scaler', …), ('model', …)])
  7. Cross-validate — cross_val_score() to estimate generalisation
  8. Tune hyperparameters — GridSearchCV or RandomizedSearchCV
  9. Final evaluation — evaluate best model on held-out test set
  10. Interpret — feature importance, confusion matrix, error analysis
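The steps above can be condensed into a minimal end-to-end sketch. This version uses the built-in iris dataset so it runs without an external CSV; the dataset, model, and hyperparameter grid are illustrative choices, not the only valid ones:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1-2. Load data (exploration skipped here for brevity)
X, y = load_iris(return_X_y=True)

# 5. Split; stratify keeps class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 6. Pipeline: scaling + model, so scaling is fit only on training folds
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])

# 7-8. Cross-validated hyperparameter search
grid = GridSearchCV(pipe, {'model__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# 9-10. Final evaluation on the held-out test set
y_pred = grid.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
```

Putting the scaler inside the pipeline matters: it prevents information from the validation folds leaking into preprocessing during cross-validation.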

8. Frequently Asked Questions

Do I need to be good at Python before learning ML?

You need basic Python — variables, loops, functions, lists, and dictionaries. You do not need advanced Python (decorators, metaclasses, concurrency). The ML libraries abstract away most complexity. A good benchmark: if you can write a function that reads a CSV file and computes the average of a column, you have enough Python to start ML. Learn more Python as you go.
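That benchmark can be written with nothing but the standard library. A sketch (the file path and column name are hypothetical placeholders):

```python
import csv

def column_average(path, column):
    """Read a CSV file and return the mean of one numeric column."""
    total, count = 0.0, 0
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else 0.0
```

If you can write and understand something like `column_average('grades.csv', 'score')`, you are ready to start with NumPy and Pandas, which do the same job in one line (`pd.read_csv(path)[column].mean()`).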

Is Google Colab good enough for ML projects?

Colab is excellent for learning and small-to-medium projects — it provides free GPU access (T4 GPU), pre-installed libraries, and easy sharing. Limitations: session disconnects after inactivity, limited RAM (12GB free tier), and the free GPU is not always available. For serious deep learning training (large models, long training runs), use Colab Pro or cloud platforms (AWS, GCP, Azure).
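A common idiom for notebooks that should run both locally and on Colab is to detect the environment at runtime. A sketch, relying on the fact that Colab pre-imports its own `google.colab` package:

```python
import sys

# True only when running inside a Google Colab session
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # e.g. mount Google Drive here so data survives session disconnects
    print("Running on Google Colab")
else:
    print("Running locally")
```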

Next Steps