Data preprocessing is a crucial step in the machine learning pipeline, where raw data is transformed into a clean and structured format suitable for analysis. Proper preprocessing can significantly enhance the performance of machine learning models.
Common Preprocessing Techniques
Handling Missing Values: Missing data can lead to incorrect analyses and model inaccuracies. Common techniques to handle missing values include:
- Mean/Median Imputation: Replace missing values with the mean or median of the column.
- Mode Imputation: Replace missing values with the most frequent value in the column.
- Row/Column Deletion: Remove rows or columns with excessive missing values.
import pandas as pd
# Replace missing values with the column mean (reassignment avoids
# chained-assignment warnings from inplace fillna on a column slice)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# Replace missing values with the most frequent value (mode imputation)
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
# Remove rows with missing values
df.dropna(inplace=True)
Encoding Categorical Variables: Machine learning models often require numerical inputs. Encoding is used to convert categorical data into numerical format. Techniques include:
- One-Hot Encoding: Create binary columns for each category. (For example, Red, Green, Blue becomes [1, 0, 0], [0, 1, 0], [0, 0, 1]).
- Label Encoding: Assign a unique integer to each category (for example, Red, Green, Blue becomes 0, 1, 2). Note that this imposes an artificial ordering, so it is best suited to ordinal categories or tree-based models.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-Hot Encoding (fit_transform returns a sparse matrix by default)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['category_column']])
# Label Encoding
label_encoder = LabelEncoder()
df['category_column'] = label_encoder.fit_transform(df['category_column'])
Scaling: Scaling puts numerical features on a similar scale so that features with large ranges do not dominate distance-based or gradient-based models. Techniques include:
- Normalization: Scale values to a range of [0, 1].
- Standardization: Scale values to have a mean of 0 and standard deviation of 1.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (ravel flattens the (n, 1) output into a 1-D column)
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_name']]).ravel()
# Normalization
normalizer = MinMaxScaler()
df['normalized_column'] = normalizer.fit_transform(df[['column_name']]).ravel()
Outlier Detection and Removal: Outliers are data points that significantly differ from other observations. Outliers can distort data distributions and affect model performance. Techniques include:
- Z-Score Method: Remove data points that deviate significantly from the mean.
- IQR Method: Use the interquartile range to identify and remove outliers.
import numpy as np
# Z-Score Method: keep rows within 3 standard deviations of the mean
z_scores = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()
df = df[np.abs(z_scores) < 3]
# IQR Method: keep values within 1.5 * IQR of the quartiles
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR)))]
Data Transformation: Transform data into a more suitable format for modeling. Techniques include:
- Log Transformation: Reduce skewness in data.
- Box-Cox Transformation: Normalize data distributions.
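As a rough sketch of both transformations (the `data` array here is made-up sample values with a large outlier to simulate skew):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data; Box-Cox requires strictly positive values
data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])

# Log transformation: log1p computes log(1 + x), which also handles zeros safely
log_data = np.log1p(data)

# Box-Cox transformation: scipy estimates the optimal lambda automatically
boxcox_data, lam = stats.boxcox(data)
```

Both transformations compress large values more than small ones, pulling in the long right tail of the distribution.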
Feature Selection: Select the most relevant features to reduce dimensionality and improve model performance. Techniques include:
- Filter Methods: Use statistical tests (e.g., correlation) to select features.
- Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE).
- Embedded Methods: Use model-based methods like feature importance from tree-based models.
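A minimal sketch of all three approaches, using a synthetic dataset from scikit-learn's `make_classification` (the dataset sizes and the choice of 3 features are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: keep the 3 features with the highest ANOVA F-scores
X_filter = SelectKBest(f_classif, k=3).fit_transform(X, y)

# Wrapper method: Recursive Feature Elimination around a logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

# Embedded method: importances learned by a tree ensemble during training
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
```

After fitting, `rfe.support_` is a boolean mask over the original columns, and `importances` can be sorted to rank features by their contribution to the forest's splits.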