Feature Engineering
Techniques Every ML Student Should Know
Last Updated: March 2026
📌 Key Takeaways
- Definition: Feature engineering is transforming raw data into better inputs for ML models — the step that often makes the biggest difference in model performance.
- Key techniques: Feature scaling, encoding categoricals, polynomial features, binning, feature selection, handling missing values, creating interaction features.
- “Garbage in, garbage out” — no algorithm can compensate for poorly prepared features.
- Always scale features for: SVM, KNN, Logistic Regression, PCA, Neural Networks, and K-Means.
- Tree-based models (Decision Tree, Random Forest, XGBoost) do NOT require scaling.
- Feature engineering is as much art as science — domain knowledge is invaluable.
1. What is Feature Engineering?
Feature engineering is the process of using domain knowledge and data analysis to transform raw data into features (input variables) that make machine learning algorithms work better.
Raw data is rarely in a form that ML algorithms can use directly. Dates need to be converted to useful quantities (day of week, month, days since an event). Text needs to be converted to numbers. Categories need to be encoded. Numerical values need to be scaled. New informative features can be created by combining existing ones.
In practice, feature engineering is often more impactful than algorithm selection. A well-engineered feature set with a simple algorithm typically outperforms a poorly prepared dataset with a sophisticated one. This is why experienced ML practitioners commonly report spending the majority of their project time on data preparation and feature engineering rather than on modelling.
2. Feature Scaling — Normalisation vs Standardisation
Many ML algorithms are sensitive to the scale of input features. If one feature has values in thousands (e.g., salary) and another in single digits (e.g., age in years), the large-scale feature will dominate distance calculations and gradient updates.
Normalisation (Min-Max Scaling)
x_norm = (x − x_min) / (x_max − x_min)
Scales features to the range [0, 1]. Sensitive to outliers — a single extreme value becomes the new min or max and compresses all other values into a narrow part of the range. Use when: features must be bounded (e.g., pixel values, neural networks with sigmoid/tanh activations).
Standardisation (Z-score Scaling)
x_std = (x − μ) / σ
Transforms features to have mean = 0 and standard deviation = 1. More robust to outliers than normalisation. Use when: algorithm assumes normally distributed features (SVM, Logistic Regression, PCA, Lasso/Ridge, Neural Networks).
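Both transforms are one-liners in scikit-learn. A minimal sketch (the salary/age values below are illustrative) showing that each method produces the promised ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature in tens of thousands (salary) and one in single digits-to-tens (age)
X = np.array([[30000.0, 25.0],
              [60000.0, 32.0],
              [45000.0, 45.0],
              [35000.0, 28.0]])

X_norm = MinMaxScaler().fit_transform(X)   # each column mapped into [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(X_norm.min(axis=0), X_norm.max(axis=0))   # columns span exactly [0, 1]
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

After scaling, neither salary nor age dominates a distance calculation, which is the whole point for KNN, K-Means, and SVM.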
| Algorithm | Needs Scaling? | Preferred Method |
|---|---|---|
| Linear/Logistic Regression | Yes | Standardisation |
| SVM | Yes (critical) | Standardisation |
| K-Nearest Neighbours | Yes | Normalisation or Standardisation |
| K-Means | Yes | Standardisation |
| PCA | Yes | Standardisation |
| Neural Networks | Yes | Normalisation (0-1) or Standardisation |
| Decision Tree | No | Not needed |
| Random Forest | No | Not needed |
| XGBoost / LightGBM | No | Not needed |
| Naive Bayes | No | Not needed |
3. Encoding Categorical Variables
ML algorithms require numerical inputs. Categorical variables (e.g., “city”, “colour”, “grade”) must be converted to numbers. The right encoding method depends on the nature of the category.
Label Encoding
Assigns an integer to each category: Red=0, Green=1, Blue=2. Simple but implies an ordering — the algorithm may interpret Blue > Green > Red. Only use for ordinal categories (e.g., Low=0, Medium=1, High=2) where order is meaningful.
One-Hot Encoding
Creates a separate binary column for each category. {Red, Green, Blue} becomes three columns: is_Red, is_Green, is_Blue. No ordering implied. Use for nominal categories (no meaningful order). Limitation: creates many columns for high-cardinality features (e.g., city with 500 values).
Target Encoding (Mean Encoding)
Replaces each category with the mean of the target variable for that category. Example: replace “Mumbai” with the mean house price for Mumbai. Reduces dimensionality for high-cardinality features. Risk of data leakage — must be done carefully within cross-validation folds.
Frequency Encoding
Replaces each category with its frequency (count) in the dataset. Simple, avoids dimensionality explosion. Useful when rare vs common categories have different predictive power.
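Three of these encodings can be sketched directly in pandas (the city/grade columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
                   'grade': ['Low', 'High', 'Medium', 'Low', 'High']})

# Label encoding -- only for ordinal categories, with the order made explicit
order = {'Low': 0, 'Medium': 1, 'High': 2}
df['grade_enc'] = df['grade'].map(order)

# One-hot encoding -- for nominal categories; one binary column per city
one_hot = pd.get_dummies(df['city'], prefix='is')

# Frequency encoding -- replace each city with its count in the dataset
df['city_freq'] = df['city'].map(df['city'].value_counts())

print(df)
print(one_hot)
```

Target encoding is deliberately omitted here: done naively on the full dataset it leaks the target, so it belongs inside a cross-validation loop as the section notes.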
4. Handling Missing Values
Real-world datasets almost always contain missing values. The strategy depends on how much data is missing and why.
| Strategy | Method | When to Use | Risk |
|---|---|---|---|
| Remove rows | Drop examples with missing values | Very few missing rows (<1%) | Data loss |
| Remove columns | Drop features with >50% missing | Feature has too little usable data | Loss of potentially useful signal |
| Mean/Median imputation | Replace missing with mean/median (numerical) or mode (categorical) | Missing at random, not too many | Reduces variance, may distort distributions |
| KNN imputation | Fill using values from K nearest neighbours | When relationships between features are informative | Slow; risk of data leakage |
| Model-based imputation | Train a model to predict the missing feature | Many missing values, feature is important | Complex; must be done inside CV loop |
| Add “missing” indicator | Create a binary column: “was this feature missing?” | When missingness itself is informative | Adds a feature |
Important: Always fit imputation parameters (e.g., the mean) on the training set only, then apply them to the validation and test sets. Fitting on the full dataset causes data leakage.
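The train-only rule can be sketched like this (the values are illustrative): the imputer's statistic comes from the training split alone, and the same statistic is reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[25.0], [np.nan], [45.0], [30.0]])
X_test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)                   # median computed from training data only
X_train_f = imputer.transform(X_train)
X_test_f = imputer.transform(X_test)   # test gap filled with the TRAIN median

print(imputer.statistics_)  # median of [25, 45, 30] = 30.0
print(X_test_f)
```

Calling `fit` on the combined train+test data would let test-set information shape the imputed values, which is exactly the leakage the warning above describes.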
5. Polynomial Features
Polynomial features allow linear models to capture non-linear relationships by adding powers and interaction terms of existing features.
For a single feature x, degree-2 polynomial features add: x, x²
For two features x₁ and x₂, degree-2 polynomial features add: x₁, x₂, x₁², x₁x₂, x₂²
This allows logistic regression, for example, to learn circular or parabolic decision boundaries instead of just straight lines. The tradeoff: polynomial features increase model complexity and risk of overfitting — always combine with regularisation (Ridge/Lasso).
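A short sketch of that claim on synthetic data: with degree-2 features, the circular boundary x₁² + x₂² = 1 becomes a straight line in the expanded feature space, so plain logistic regression can learn it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)  # label: inside the unit circle?

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2

clf = LogisticRegression().fit(X_poly, y)
print(f"train accuracy: {clf.score(X_poly, y):.2f}")  # high -- circular boundary learned
```

On the raw two features the same model is stuck with a straight-line boundary and would do far worse on this data.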
6. Binning / Discretisation
Binning converts a continuous feature into discrete categories (bins). Example: Age → Young (18-30), Middle-aged (31-50), Senior (51+).
Benefits: Reduces sensitivity to outliers; can help linear models capture non-linear relationships; sometimes aligns with domain knowledge (clinical age groups, income brackets).
Drawbacks: Loss of information within bins; choice of bin boundaries is arbitrary; tree-based models typically do not benefit from binning (they learn their own splits).
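In pandas, `pd.cut` does the binning; the age bands below mirror the example above:

```python
import pandas as pd

ages = pd.Series([19, 24, 37, 45, 52, 68])
# Bin edges follow the Young (18-30) / Middle-aged (31-50) / Senior (51+) example
bands = pd.cut(ages, bins=[18, 30, 50, 120],
               labels=['Young', 'Middle-aged', 'Senior'])
print(bands.tolist())
```

`pd.qcut` is the equal-frequency alternative: it picks bin edges so each bin holds roughly the same number of examples, which avoids hand-choosing boundaries.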
7. Interaction Features
Interaction features are new features created by combining two or more existing features. They capture relationships that no single feature captures alone.
Examples from engineering applications:
- Power = Voltage × Current (physical interaction)
- Price per square metre = Price / Area (ratio feature)
- Revenue growth = (Revenue_this_year − Revenue_last_year) / Revenue_last_year (change feature)
- Time since last purchase (difference feature from date fields)
Domain knowledge is the key to creating meaningful interaction features. Random combinations of features rarely add value — purposeful combinations based on physical relationships or business logic are what make the difference.
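Two of the examples above as a pandas sketch (all column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [500000, 800000, 300000],
                   'area_sqm': [50, 100, 40],
                   'voltage': [230.0, 230.0, 110.0],
                   'current': [5.0, 10.0, 13.0]})

df['price_per_sqm'] = df['price'] / df['area_sqm']  # ratio feature
df['power_watts'] = df['voltage'] * df['current']   # physical interaction (P = V x I)
print(df[['price_per_sqm', 'power_watts']])
```

Each new column encodes a relationship (value density, electrical power) that a model would otherwise have to discover from the raw columns on its own.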
8. Feature Selection
Not all features are useful. Irrelevant or redundant features can hurt model performance by increasing complexity, training time, and overfitting risk. Feature selection identifies the most informative subset of features.
Filter Methods (fast, model-independent):
- Correlation coefficient — remove features with low correlation to the target
- Chi-squared test — for categorical features vs categorical targets
- Mutual information — measures statistical dependence between feature and target
Wrapper Methods (more accurate, slower):
- Forward selection — start with no features, add the most useful one at a time
- Backward elimination — start with all features, remove the least useful one at a time
- Recursive Feature Elimination (RFE) — trains a model, ranks features by importance, removes the weakest
Embedded Methods (built into training):
- L1 Regularisation (Lasso) — drives irrelevant feature weights to exactly zero
- Random Forest feature importance — ranks features by their contribution to impurity reduction
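As a sketch of the RFE wrapper method on the iris dataset (the choice of logistic regression as the base estimator is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model, drop the weakest feature, until 2 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the kept features
print(rfe.ranking_)  # 1 = selected; larger values were eliminated earlier
```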
9. Python Code
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Sample dataset (demo only: in a real project, fit every transformer on the
# training split inside a Pipeline -- see Section 10)
df = pd.DataFrame({
    'age': [25, 32, np.nan, 45, 28],
    'income': [30000, 60000, 45000, np.nan, 35000],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'approved': [0, 1, 1, 1, 0]
})

# --- 1. Handle Missing Values ---
imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
print("After imputation:\n", df)

# --- 2. Encode Categorical Variable ---
df = pd.get_dummies(df, columns=['city'], drop_first=True)
print("\nAfter one-hot encoding:\n", df)

# --- 3. Feature Scaling ---
X = df.drop('approved', axis=1)
y = df['approved']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 4. Polynomial Features ---
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)
print(f"\nOriginal features: {X.shape[1]} | Polynomial features: {X_poly.shape[1]}")

# --- 5. Feature Selection using Random Forest Importance ---
X_iris, y_iris = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris, y_iris)
importances = rf.feature_importances_
feature_names = load_iris().feature_names
for name, imp in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    print(f"{name}: {imp:.4f}")

# --- 6. SelectKBest with mutual information ---
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X_iris, y_iris)
print(f"\nSelected {X_selected.shape[1]} best features out of {X_iris.shape[1]}")
```
10. Common Mistakes Students Make
- Fitting preprocessing on the full dataset: The most dangerous mistake in ML. Computing the mean for imputation, or the min/max for scaling, using the test set contaminates your model with future information. Always fit preprocessing on training data only. Use Scikit-learn Pipelines — they handle this automatically and correctly.
- One-hot encoding high-cardinality features: A city feature with 500 unique values becomes 500 columns with one-hot encoding — this causes dimensionality explosion and is very sparse. Use target encoding or frequency encoding for high-cardinality categorical features instead.
- Ignoring feature scaling for distance-based algorithms: KNN, K-Means, and SVM are completely at the mercy of feature scales. A feature in thousands will dominate one in single digits. Always scale for these algorithms.
- Creating features without domain understanding: Random feature combinations rarely help. The best features come from understanding the problem — why would this combination be predictive? Domain knowledge drives the best feature engineering.
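The first mistake above is exactly what `Pipeline` plus `ColumnTransformer` prevents: every transformer is fitted on the training fold only, and its learned statistics are reused unchanged at prediction time. A minimal sketch (the toy dataset and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 32, np.nan, 45, 28, 51, 39, 23],
    'income': [30000, 60000, 45000, np.nan, 35000, 80000, 52000, 28000],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'approved': [0, 1, 1, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns='approved'), df['approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
pre = ColumnTransformer([
    ('num', numeric, ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
model = Pipeline([('pre', pre), ('clf', LogisticRegression())])

# fit() learns medians, means/stds, and categories from X_train only;
# score() applies those same statistics to X_test -- no leakage
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the whole chain is one estimator, it also drops straight into `cross_val_score` or `GridSearchCV`, where the preprocessing is refitted per fold automatically.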
11. Frequently Asked Questions
Does feature engineering matter for deep learning?
Less so than for classical ML — deep neural networks can learn complex feature representations automatically from raw data (especially for images, text, and audio). However, feature engineering still matters for tabular/structured data in deep learning. Good preprocessing (handling missing values, scaling, encoding) always helps, and domain-specific features still provide a meaningful boost even for deep models.
What is the difference between feature engineering and feature selection?
Feature engineering creates new features from existing ones (or transforms them). Feature selection chooses the most informative subset of features from what is available. They are complementary steps — first engineer good features, then select the best ones if there are too many. Feature selection is especially important when the number of features exceeds the number of training examples.
Should I always use polynomial features?
Only for linear models (Linear Regression, Logistic Regression) when you suspect non-linear relationships. Tree-based models (Decision Tree, Random Forest, XGBoost) learn their own splits and do not benefit from polynomial features. Neural networks learn non-linear transformations internally. Polynomial features increase computational cost and risk of overfitting — always combine with regularisation.