Data preprocessing is a crucial step in the machine learning pipeline, where raw data is transformed into a clean and structured format suitable for analysis. Proper preprocessing can significantly enhance the performance of machine learning models.
Common Preprocessing Techniques
Handling Missing Values: Missing data can lead to incorrect analyses and model inaccuracies. Common techniques to handle missing values include:
- Mean/Median Imputation: Replace missing values with the mean or median of the column.
- Mode Imputation: Replace missing values with the most frequent value in the column.
- Row/Column Deletion: Remove rows or columns with excessive missing values.
import pandas as pd
# Replace missing values with the column mean (reassignment avoids
# chained-assignment warnings from inplace fillna on a column slice)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# Replace missing values with the most frequent value (mode imputation)
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
# Remove rows with missing values
df.dropna(inplace=True)
Encoding Categorical Variables: Machine learning models often require numerical inputs. Encoding is used to convert categorical data into numerical format. Techniques include:
- One-Hot Encoding: Create binary columns for each category. (For example, Red, Green, Blue becomes [1, 0, 0], [0, 1, 0], [0, 0, 1]).
- Label Encoding: Assign a unique integer to each category (for example, Red, Green, Blue becomes 0, 1, 2). Note that this imposes an artificial ordering, so it is best suited to ordinal categories or tree-based models.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-Hot Encoding (fit_transform returns a sparse matrix by default)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['category_column']])
# Label Encoding
label_encoder = LabelEncoder()
df['category_column'] = label_encoder.fit_transform(df['category_column'])
Scaling: Scaling puts numerical features on a similar scale so that features with large ranges do not dominate distance-based or gradient-based models. Techniques include:
- Normalization: Scale values to a range of [0, 1].
- Standardization: Scale values to have a mean of 0 and standard deviation of 1.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (ravel flattens the (n, 1) output into a 1-D column)
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_name']]).ravel()
# Normalization
normalizer = MinMaxScaler()
df['normalized_column'] = normalizer.fit_transform(df[['column_name']]).ravel()
Outlier Detection and Removal: Outliers are data points that significantly differ from other observations. Outliers can distort data distributions and affect model performance. Techniques include:
- Z-Score Method: Remove data points that deviate significantly from the mean.
- IQR Method: Use the interquartile range to identify and remove outliers.
import numpy as np
# Z-Score Method: keep rows within 3 standard deviations of the mean
z_scores = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()
df = df[np.abs(z_scores) < 3]
# IQR Method: keep values within 1.5 * IQR of the quartiles
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR)))]
Data Transformation: Transform data into a more suitable format for modeling. Techniques include:
- Log Transformation: Reduce skewness in data.
- Box-Cox Transformation: Normalize data distributions.
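As a rough sketch of both transformations (the `data` array here is made-up sample values with a large outlier to simulate skew):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data; Box-Cox requires strictly positive values
data = np.array([1.0, 2.0, 2.5, 3.0, 50.0])

# Log transformation: log1p computes log(1 + x), which also handles zeros safely
log_data = np.log1p(data)

# Box-Cox transformation: scipy estimates the optimal lambda automatically
boxcox_data, lam = stats.boxcox(data)
```

Both transformations compress large values more than small ones, pulling in the long right tail of the distribution.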
Feature Selection: Select the most relevant features to reduce dimensionality and improve model performance. Techniques include:
- Filter Methods: Use statistical tests (e.g., correlation) to select features.
- Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE).
- Embedded Methods: Use model-based methods like feature importance from tree-based models.
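A minimal sketch of all three approaches, using a synthetic dataset from scikit-learn's `make_classification` (the dataset sizes and the choice of 3 features are arbitrary, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: keep the 3 features with the highest ANOVA F-scores
X_filter = SelectKBest(f_classif, k=3).fit_transform(X, y)

# Wrapper method: Recursive Feature Elimination around a logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

# Embedded method: importances learned by a tree ensemble during training
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
```

After fitting, `rfe.support_` is a boolean mask over the original columns, and `importances` can be sorted to rank features by their contribution to the forest's splits.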