Feature Engineering
Techniques Every ML Student Should Know
Last Updated: March 2026
📌 Key Takeaways
- Definition: Feature engineering is transforming raw data into better inputs for ML models — the step that often makes the biggest difference in model performance.
- Key techniques: Feature scaling, encoding categoricals, polynomial features, binning, feature selection, handling missing values, creating interaction features.
- “Garbage in, garbage out” — no algorithm can compensate for poorly prepared features.
- Always scale features for: SVM, KNN, Logistic Regression, PCA, Neural Networks, and K-Means.
- Tree-based models (Decision Tree, Random Forest, XGBoost) do NOT require scaling.
- Feature engineering is as much art as science — domain knowledge is invaluable.
1. What is Feature Engineering?
Feature engineering is the process of using domain knowledge and data analysis to transform raw data into features (input variables) that make machine learning algorithms work better.
Raw data is rarely in a form that ML algorithms can use directly. Dates need to be converted to useful quantities (day of week, month, days since an event). Text needs to be converted to numbers. Categories need to be encoded. Numerical values need to be scaled. New informative features can be created by combining existing ones.
In practice, feature engineering is often more impactful than algorithm selection. A well-engineered feature set with a simple algorithm typically outperforms a poorly prepared dataset with a sophisticated one. This is why experienced ML practitioners commonly report spending the majority of their project time on data preparation and feature engineering rather than on modelling.
2. Feature Scaling — Normalisation vs Standardisation
Many ML algorithms are sensitive to the scale of input features. If one feature has values in thousands (e.g., salary) and another in single digits (e.g., age in years), the large-scale feature will dominate distance calculations and gradient updates.
Normalisation (Min-Max Scaling)
x_norm = (x − x_min) / (x_max − x_min)
Scales features to the range [0, 1]. Sensitive to outliers — a single extreme value becomes the new min or max and compresses all other values into a narrow part of the range. Use when: features must be bounded (e.g., pixel values, neural networks with sigmoid/tanh activations).
Standardisation (Z-score Scaling)
x_std = (x − μ) / σ
Transforms features to have mean = 0 and standard deviation = 1. More robust to outliers than normalisation. Use when: algorithm assumes normally distributed features (SVM, Logistic Regression, PCA, Lasso/Ridge, Neural Networks).
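Both transforms are one-liners in scikit-learn. A minimal sketch (the salary/age values below are illustrative) showing that each method produces the promised ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature in tens of thousands (salary) and one in single digits-to-tens (age)
X = np.array([[30000.0, 25.0],
              [60000.0, 32.0],
              [45000.0, 45.0],
              [35000.0, 28.0]])

X_norm = MinMaxScaler().fit_transform(X)   # each column mapped into [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(X_norm.min(axis=0), X_norm.max(axis=0))   # columns span exactly [0, 1]
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After scaling, neither salary nor age dominates a distance calculation, which is the whole point for KNN, K-Means, and SVM.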
| Algorithm | Needs Scaling? | Preferred Method |
|---|---|---|
| Linear/Logistic Regression | Yes | Standardisation |
| SVM | Yes (critical) | Standardisation |
| K-Nearest Neighbours | Yes | Normalisation or Standardisation |
| K-Means | Yes | Standardisation |
| PCA | Yes | Standardisation |
| Neural Networks | Yes | Normalisation (0-1) or Standardisation |
| Decision Tree | No | Not needed |
| Random Forest | No | Not needed |
| XGBoost / LightGBM | No | Not needed |
| Naive Bayes | No | Not needed |
3. Encoding Categorical Variables
ML algorithms require numerical inputs. Categorical variables (e.g., “city”, “colour”, “grade”) must be converted to numbers. The right encoding method depends on the nature of the category.
Label Encoding
Assigns an integer to each category: Red=0, Green=1, Blue=2. Simple but implies an ordering — the algorithm may interpret Blue > Green > Red. Only use for ordinal categories (e.g., Low=0, Medium=1, High=2) where order is meaningful.
One-Hot Encoding
Creates a separate binary column for each category. {Red, Green, Blue} becomes three columns: is_Red, is_Green, is_Blue. No ordering implied. Use for nominal categories (no meaningful order). Limitation: creates many columns for high-cardinality features (e.g., city with 500 values).
Target Encoding (Mean Encoding)
Replaces each category with the mean of the target variable for that category. Example: replace “Mumbai” with the mean house price for Mumbai. Reduces dimensionality for high-cardinality features. Risk of data leakage — must be done carefully within cross-validation folds.
Frequency Encoding
Replaces each category with its frequency (count) in the dataset. Simple, avoids dimensionality explosion. Useful when rare vs common categories have different predictive power.
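Three of these encodings can be sketched directly in pandas (the city/grade columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
                   'grade': ['Low', 'High', 'Medium', 'Low', 'High']})

# Label encoding -- only for ordinal categories, with the order made explicit
order = {'Low': 0, 'Medium': 1, 'High': 2}
df['grade_enc'] = df['grade'].map(order)

# One-hot encoding -- for nominal categories; one binary column per city
one_hot = pd.get_dummies(df['city'], prefix='is')

# Frequency encoding -- replace each city with its count in the dataset
df['city_freq'] = df['city'].map(df['city'].value_counts())

print(df)
print(one_hot)
```

Target encoding is deliberately omitted here: done naively on the full dataset it leaks the target, so it belongs inside a cross-validation loop as the section notes.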
4. Handling Missing Values
Real-world datasets almost always contain missing values. The strategy depends on how much data is missing and why.
| Strategy | Method | When to Use | Risk |
|---|---|---|---|
| Remove rows | Drop examples with missing values | Very few missing rows (<1%) | Data loss |
| Remove columns | Drop features with >50% missing | Feature has too little usable data | Loss of potentially useful signal |
| Mean/Median imputation | Replace missing with mean/median (numerical) or mode (categorical) | Missing at random, not too many | Reduces variance, may distort distributions |
| KNN imputation | Fill using values from K nearest neighbours | When relationships between features are informative | Slow; risk of data leakage |
| Model-based imputation | Train a model to predict the missing feature | Many missing values, feature is important | Complex; must be done inside CV loop |
| Add “missing” indicator | Create a binary column: “was this feature missing?” | When missingness itself is informative | Adds a feature |
Important: Always fit imputation parameters (e.g., the mean) on the training set only, then apply them to the validation and test sets. Fitting on the full dataset causes data leakage.
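The train-only rule can be sketched like this (the values are illustrative): the imputer's statistic comes from the training split alone, and the same statistic is reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[25.0], [np.nan], [45.0], [30.0]])
X_test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)                   # median computed from training data only
X_train_f = imputer.transform(X_train)
X_test_f = imputer.transform(X_test)   # test gap filled with the TRAIN median

print(imputer.statistics_)  # median of [25, 45, 30] = 30.0
print(X_test_f)
```

Calling `fit` on the combined train+test data would let test-set information shape the imputed values, which is exactly the leakage the warning above describes.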
5. Polynomial Features
Polynomial features allow linear models to capture non-linear relationships by adding powers and interaction terms of existing features.
For a single feature x, degree-2 polynomial features add: x, x²
For two features x₁ and x₂, degree-2 polynomial features add: x₁, x₂, x₁², x₁x₂, x₂²
This allows logistic regression, for example, to learn circular or parabolic decision boundaries instead of just straight lines. The tradeoff: polynomial features increase model complexity and risk of overfitting — always combine with regularisation (Ridge/Lasso).
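A short sketch of that claim on synthetic data: with degree-2 features, the circular boundary x₁² + x₂² = 1 becomes a straight line in the expanded feature space, so plain logistic regression can learn it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)  # label: inside the unit circle?

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2

clf = LogisticRegression().fit(X_poly, y)
print(f"train accuracy: {clf.score(X_poly, y):.2f}")  # high -- circular boundary learned
```

On the raw two features the same model is stuck with a straight-line boundary and would do far worse on this data.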
6. Binning / Discretisation
Binning converts a continuous feature into discrete categories (bins). Example: Age → Young (18-30), Middle-aged (31-50), Senior (51+).
Benefits: Reduces sensitivity to outliers; can help linear models capture non-linear relationships; sometimes aligns with domain knowledge (clinical age groups, income brackets).
Drawbacks: Loss of information within bins; choice of bin boundaries is arbitrary; tree-based models typically do not benefit from binning (they learn their own splits).
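In pandas, `pd.cut` does the binning; the age bands below mirror the example above:

```python
import pandas as pd

ages = pd.Series([19, 24, 37, 45, 52, 68])
# Bin edges follow the Young (18-30) / Middle-aged (31-50) / Senior (51+) example
bands = pd.cut(ages, bins=[18, 30, 50, 120],
               labels=['Young', 'Middle-aged', 'Senior'])
print(bands.tolist())
```

`pd.qcut` is the equal-frequency alternative: it picks bin edges so each bin holds roughly the same number of examples, which avoids hand-choosing boundaries.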
7. Interaction Features
Interaction features are new features created by combining two or more existing features. They capture relationships that no single feature captures alone.
Examples from engineering applications:
- Power = Voltage × Current (physical interaction)
- Price per square metre = Price / Area (ratio feature)
- Revenue growth = (Revenue_this_year − Revenue_last_year) / Revenue_last_year (change feature)
- Time since last purchase (difference feature from date fields)
Domain knowledge is the key to creating meaningful interaction features. Random combinations of features rarely add value — purposeful combinations based on physical relationships or business logic are what make the difference.
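Two of the examples above as a pandas sketch (all column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [500000, 800000, 300000],
                   'area_sqm': [50, 100, 40],
                   'voltage': [230.0, 230.0, 110.0],
                   'current': [5.0, 10.0, 13.0]})

df['price_per_sqm'] = df['price'] / df['area_sqm']  # ratio feature
df['power_watts'] = df['voltage'] * df['current']   # physical interaction (P = V x I)
print(df[['price_per_sqm', 'power_watts']])
```

Each new column encodes a relationship (value density, electrical power) that a model would otherwise have to discover from the raw columns on its own.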
8. Feature Selection
Not all features are useful. Irrelevant or redundant features can hurt model performance by increasing complexity, training time, and overfitting risk. Feature selection identifies the most informative subset of features.
Filter Methods (fast, model-independent):
- Correlation coefficient — remove features with low correlation to the target
- Chi-squared test — for categorical features vs categorical targets
- Mutual information — measures statistical dependence between feature and target
Wrapper Methods (more accurate, slower):
- Forward selection — start with no features, add the most useful one at a time
- Backward elimination — start with all features, remove the least useful one at a time
- Recursive Feature Elimination (RFE) — trains a model, ranks features by importance, removes the weakest
Embedded Methods (built into training):
- L1 Regularisation (Lasso) — drives irrelevant feature weights to exactly zero
- Random Forest feature importance — ranks features by their contribution to impurity reduction
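As a sketch of the RFE wrapper method on the iris dataset (the choice of logistic regression as the base estimator is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model, drop the weakest feature, until 2 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the kept features
print(rfe.ranking_)  # 1 = selected; larger values were eliminated earlier
```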
9. Python Code
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Sample dataset (demo only: in a real project, fit every transformer on the
# training split inside a Pipeline -- see Section 10)
df = pd.DataFrame({
    'age': [25, 32, np.nan, 45, 28],
    'income': [30000, 60000, 45000, np.nan, 35000],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'approved': [0, 1, 1, 1, 0]
})

# --- 1. Handle Missing Values ---
imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
print("After imputation:\n", df)

# --- 2. Encode Categorical Variable ---
df = pd.get_dummies(df, columns=['city'], drop_first=True)
print("\nAfter one-hot encoding:\n", df)

# --- 3. Feature Scaling ---
X = df.drop('approved', axis=1)
y = df['approved']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 4. Polynomial Features ---
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)
print(f"\nOriginal features: {X.shape[1]} | Polynomial features: {X_poly.shape[1]}")

# --- 5. Feature Selection using Random Forest Importance ---
X_iris, y_iris = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris, y_iris)
importances = rf.feature_importances_
feature_names = load_iris().feature_names
for name, imp in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    print(f"{name}: {imp:.4f}")

# --- 6. SelectKBest with mutual information ---
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X_iris, y_iris)
print(f"\nSelected {X_selected.shape[1]} best features out of {X_iris.shape[1]}")
```
10. Common Mistakes Students Make
- Fitting preprocessing on the full dataset: The most dangerous mistake in ML. Computing the mean for imputation, or the min/max for scaling, using the test set contaminates your model with future information. Always fit preprocessing on training data only. Use Scikit-learn Pipelines — they handle this automatically and correctly.
- One-hot encoding high-cardinality features: A city feature with 500 unique values becomes 500 columns with one-hot encoding — this causes dimensionality explosion and is very sparse. Use target encoding or frequency encoding for high-cardinality categorical features instead.
- Ignoring feature scaling for distance-based algorithms: KNN, K-Means, and SVM are completely at the mercy of feature scales. A feature in thousands will dominate one in single digits. Always scale for these algorithms.
- Creating features without domain understanding: Random feature combinations rarely help. The best features come from understanding the problem — why would this combination be predictive? Domain knowledge drives the best feature engineering.
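The first mistake above is exactly what `Pipeline` plus `ColumnTransformer` prevents: every transformer is fitted on the training fold only, and its learned statistics are reused unchanged at prediction time. A minimal sketch (the toy dataset and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 32, np.nan, 45, 28, 51, 39, 23],
    'income': [30000, 60000, 45000, np.nan, 35000, 80000, 52000, 28000],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'approved': [0, 1, 1, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns='approved'), df['approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
pre = ColumnTransformer([
    ('num', numeric, ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
model = Pipeline([('pre', pre), ('clf', LogisticRegression())])

# fit() learns medians, means/stds, and categories from X_train only;
# score() applies those same statistics to X_test -- no leakage
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the whole chain is one estimator, it also drops straight into `cross_val_score` or `GridSearchCV`, where the preprocessing is refitted per fold automatically.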
11. Frequently Asked Questions
Does feature engineering matter for deep learning?
Less so than for classical ML — deep neural networks can learn complex feature representations automatically from raw data (especially for images, text, and audio). However, feature engineering still matters for tabular/structured data in deep learning. Good preprocessing (handling missing values, scaling, encoding) always helps, and domain-specific features still provide a meaningful boost even for deep models.
What is the difference between feature engineering and feature selection?
Feature engineering creates new features from existing ones (or transforms them). Feature selection chooses the most informative subset of features from what is available. They are complementary steps — first engineer good features, then select the best ones if there are too many. Feature selection is especially important when the number of features exceeds the number of training examples.
Should I always use polynomial features?
Only for linear models (Linear Regression, Logistic Regression) when you suspect non-linear relationships. Tree-based models (Decision Tree, Random Forest, XGBoost) learn their own splits and do not benefit from polynomial features. Neural networks learn non-linear transformations internally. Polynomial features increase computational cost and risk of overfitting — always combine with regularisation.