Chapter 8. Data Preparation and Feature Engineering in Python
"Garbage in, garbage out."
This old adage is especially true in analytics. You can have the most sophisticated machine learning algorithm, but if your data is messy, incomplete, or poorly structured, your results will be worthless.
Data scientists often say they spend 80% of their time on data preparation and only 20% on modeling. This isn't a sign of inefficiency—it's the reality of working with real-world data. The quality of your data preparation directly determines the quality of your insights.
This chapter covers the practical skills you need to transform raw, messy data into clean, structured datasets ready for analysis. We'll focus on Python and pandas, the standard tools for data preparation in business analytics.
8.1 The Importance of Data Preparation in Analytics Projects
Why Data Preparation Matters
Raw data is rarely analysis-ready. Real-world datasets have:
- Missing values : Customers who didn't fill out optional fields, sensors that failed, incomplete records
- Outliers : Data entry errors, fraudulent transactions, legitimate but extreme values
- Inconsistencies : "New York", "NY", "new york", "New York City" all referring to the same place
- Wrong formats : Dates stored as text, numbers stored as strings, mixed units
- Duplicates : Same customer entered multiple times, repeated transactions
- Irrelevant information : Columns you don't need, noise that obscures signal
The consequences of poor data preparation:
- Wrong conclusions : Outliers skew averages, missing data creates bias
- Model failures : Algorithms break on missing values or inconsistent formats
- Wasted time : Debugging model problems that are actually data problems
- Lost trust : Stakeholders lose confidence when they spot obvious data errors
The benefits of good data preparation:
- Better models : Clean data → better predictions
- Faster iteration : Automated pipelines let you test ideas quickly
- Reproducibility : Documented processes can be repeated and audited
- Trust : Stakeholders trust analysis built on clean, well-understood data
The Data Preparation Workflow
RAW DATA
↓
1. UNDERSTAND
• What does each column mean?
• What's the grain of the data?
• What are the data types?
↓
2. CLEAN
• Handle missing values
• Remove/fix outliers
• Fix inconsistencies
• Remove duplicates
↓
3. TRANSFORM
• Scale/normalize
• Encode categories
• Parse dates
↓
4. ENGINEER FEATURES
• Create new variables
• Aggregate information
• Extract patterns
↓
5. VALIDATE
• Check distributions
• Verify logic
• Test for leakage
↓
ANALYSIS-READY DATA
Key principle : Understand before you clean. Don't blindly apply transformations. First understand what the data represents, then decide how to prepare it.
8.2 Data Cleaning
8.2.1 Handling Missing Data
Missing data is inevitable. The question is: how do you handle it?
Types of missing data:
-
Missing Completely at Random (MCAR)
: Missingness has no relationship to any variable
- Example: Survey responses lost due to server error
- Impact : Reduces sample size but doesn't bias results
-
Missing at Random (MAR)
: Missingness related to observed variables, not the missing value itself
- Example: Younger customers less likely to provide phone numbers
- Impact : Can bias results if not handled properly
-
Missing Not at Random (MNAR)
: Missingness related to the unobserved value
- Example: High earners refuse to disclose income
- Impact : Serious bias, difficult to correct
Why this matters : The type of missingness determines the best handling strategy.
Strategies for Handling Missing Data
1. Delete rows with missing values
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'age': [25, 30, np.nan, 45, 28],
'income': [50000, np.nan, 75000, 90000, 60000],
'purchases': [5, 3, 8, np.nan, 4]
})
# Drop any row with missing values
df_complete = df.dropna()
print(f"Original rows: {len(df)}, After dropna: {len(df_complete)}")
# Output: Original rows: 5, After dropna: 2
# Drop rows where specific columns are missing
df_age_complete = df.dropna(subset=['age'])
print(f"Rows with age: {len(df_age_complete)}")
# Output: Rows with age: 4
When to use:
- ✓ Missing data is MCAR
- ✓ You have plenty of data (losing rows doesn't hurt)
- ✓ Missing values are rare (<5%)
When NOT to use:
- ✗ Missing data is MAR or MNAR (creates bias)
- ✗ You have limited data
- ✗ Many rows have at least one missing value
2. Impute with simple statistics
# Mean imputation
df['age_imputed'] = df['age'].fillna(df['age'].mean())
# Median imputation (better for skewed data)
df['income_imputed'] = df['income'].fillna(df['income'].median())
# Mode imputation (for categorical data)
df['category'] = ['A', 'B', np.nan, 'A', 'B']
df['category_imputed'] = df['category'].fillna(df['category'].mode()[0])
print(df[['age', 'age_imputed', 'income', 'income_imputed']])
When to use:
- ✓ Missing data is MCAR or MAR
- ✓ Quick solution needed
- ✓ Missingness is moderate (5-20%)
Caution :
- Reduces variance (all missing values get the same number)
- Can distort relationships between variables
- Mean is sensitive to outliers (use median for skewed data)
3. Forward fill / Backward fill (for time series)
# Time series data
df_ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5),
'temperature': [20, np.nan, np.nan, 23, 24]
})
# Forward fill: use last known value
df_ts['temp_ffill'] = df_ts['temperature'].fillna(method='ffill')
# Backward fill: use next known value
df_ts['temp_bfill'] = df_ts['temperature'].fillna(method='bfill')
print(df_ts)
Output:
date temperature temp_ffill temp_bfill
0 2024-01-01 20.0 20.0 20.0
1 2024-01-02 NaN 20.0 23.0
2 2024-01-03 NaN 20.0 23.0
3 2024-01-04 23.0 23.0 23.0
4 2024-01-05 24.0 24.0 24.0
When to use:
- ✓ Time series data where values change slowly
- ✓ Sensor data with occasional missing readings
4. Impute with predictive models
from sklearn.impute import KNNImputer
# Use K-Nearest Neighbors to impute based on similar records
imputer = KNNImputer(n_neighbors=3)
df_numeric = df[['age', 'income', 'purchases']]
df_imputed = pd.DataFrame(
imputer.fit_transform(df_numeric),
columns=df_numeric.columns
)
print("Original:\n", df_numeric)
print("\nImputed:\n", df_imputed)
When to use:
- ✓ Missing data is MAR
- ✓ You have enough data to build reliable models
- ✓ Relationships between variables are important
Other sophisticated methods:
- Multiple imputation (creates several imputed datasets)
- Model-based imputation (regression, random forest)
8.2.2 Detecting and Treating Outliers
Outliers are data points that are significantly different from others.
Two types:
-
Errors
: Data entry mistakes, measurement errors, system glitches
- Example: Age = 250, Price = -$100
- Action : Remove or correct
-
Legitimate extremes
: Real but unusual values
- Example: CEO salary in employee dataset, bulk purchase in retail data
- Action : Keep but handle carefully (may need separate analysis)
Detecting Outliers
Method 1: Visual inspection
import matplotlib.pyplot as plt
# Sample data with outliers
data = pd.DataFrame({
'salary': [50000, 55000, 52000, 48000, 51000, 53000, 500000, 49000, 54000, 50000]
})
# Box plot
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.boxplot(data['salary'])
plt.title('Box Plot')
plt.ylabel('Salary')
plt.subplot(1, 2, 2)
plt.hist(data['salary'], bins=20, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Method 2: Statistical methods
IQR (Interquartile Range) method:
def detect_outliers_iqr(df, column):
"""Detect outliers using IQR method"""
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
print(f"Number of outliers: {len(outliers)}")
return outliers
outliers = detect_outliers_iqr(data, 'salary')
print("\nOutliers:")
print(outliers)
Z-score method:
from scipy import stats
def detect_outliers_zscore(df, column, threshold=3):
"""Detect outliers using Z-score method"""
z_scores = np.abs(stats.zscore(df[column]))
outliers = df[z_scores > threshold]
print(f"Number of outliers (|z| > {threshold}): {len(outliers)}")
return outliers
outliers_z = detect_outliers_zscore(data, 'salary')
print("\nOutliers (Z-score):")
print(outliers_z)
When to use each:
- IQR : Robust to outliers, works well for skewed data
- Z-score : Assumes normal distribution, sensitive to extreme outliers
Treating Outliers
Option 1: Remove
# Remove outliers using IQR
Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1
data_clean = data[
(data['salary'] >= Q1 - 1.5 * IQR) &
(data['salary'] <= Q3 + 1.5 * IQR)
]
print(f"Original: {len(data)} rows, After removal: {len(data_clean)} rows")
When to use:
- ✓ Outliers are clearly errors
- ✓ You have plenty of data
- ✗ Don't use if outliers are legitimate (you'll lose important information)
Option 2: Cap (Winsorize)
# Cap at 5th and 95th percentiles
lower = data['salary'].quantile(0.05)
upper = data['salary'].quantile(0.95)
data['salary_capped'] = data['salary'].clip(lower=lower, upper=upper)
print("Original vs Capped:")
print(data[['salary', 'salary_capped']])
When to use:
- ✓ Outliers are legitimate but you want to reduce their influence
- ✓ For modeling where extreme values can dominate
Option 3: Transform
# Log transformation (reduces impact of large values)
data['salary_log'] = np.log1p(data['salary']) # log1p = log(1 + x), handles zeros
# Compare distributions
print("Original - Mean: {:.0f}, Std: {:.0f}".format(
data['salary'].mean(), data['salary'].std()
))
print("Log transformed - Mean: {:.2f}, Std: {:.2f}".format(
data['salary_log'].mean(), data['salary_log'].std()
))
When to use:
- ✓ Data is right-skewed (common for income, prices, counts)
- ✓ You want to preserve all data but reduce outlier impact
Business Context Matters
Example: Retail transactions
transactions = pd.DataFrame({
'transaction_id': range(1, 11),
'amount': [25, 30, 28, 32, 27, 29, 5000, 31, 26, 30]
})
# $5,000 transaction is an outlier
# But is it an error or a bulk purchase?
# Check context
transactions['is_outlier'] = (
np.abs(stats.zscore(transactions['amount'])) > 3
)
print(transactions)
# Decision:
# - If it's an error → remove
# - If it's a bulk purchase → keep but analyze separately
# - For average transaction analysis → exclude
# - For total revenue analysis → include
Key insight : There's no universal rule. The right approach depends on:
- Why the outlier exists
- What question you're answering
- What impact it has on your analysis
8.2.3 Dealing with Inconsistent and Duplicate Records
Inconsistencies make it impossible to group, aggregate, or analyze data correctly.
Common inconsistencies:
- Text variations : "New York", "NY", "new york", "N.Y."
- Date formats : "01/02/2024", "2024-01-02", "Jan 2, 2024"
- Units : Some prices in dollars, others in cents
- Encoding : Special characters, extra spaces
- Categories : Typos, abbreviations, multiple naming conventions
Cleaning Text Data
# Sample data with inconsistencies
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'city': ['New York', 'new york', 'NY', 'NEW YORK', 'New York City'],
'product': [' iPhone 14 ', 'iphone 14', 'iPhone14', 'IPHONE 14', 'iPhone-14']
})
# Clean text
df['city_clean'] = (
df['city']
.str.strip() # Remove leading/trailing spaces
.str.lower() # Convert to lowercase
.str.replace('.', '') # Remove periods
)
df['product_clean'] = (
df['product']
.str.strip()
.str.lower()
.str.replace(' ', '') # Remove spaces
.str.replace('-', '') # Remove hyphens
)
print(df[['city', 'city_clean', 'product', 'product_clean']])
Output:
city city_clean product product_clean
0 New York new york iPhone 14 iphone14
1 new york new york iphone 14 iphone14
2 NY ny iPhone14 iphone14
3 NEW YORK new york IPHONE 14 iphone14
4 New York City new york city iPhone-14 iphone14
Standardizing with Mappings
# Create mapping for known variations
city_mapping = {
'new york': 'New York',
'ny': 'New York',
'new york city': 'New York',
'nyc': 'New York',
'los angeles': 'Los Angeles',
'la': 'Los Angeles',
'l.a.': 'Los Angeles'
}
df['city_standardized'] = df['city_clean'].map(city_mapping)
print(df[['city', 'city_standardized']])
For large datasets with many variations:
# Use fuzzy matching
from fuzzywuzzy import process
def standardize_city(city, valid_cities, threshold=80):
"""Match to closest valid city name"""
match, score = process.extractOne(city, valid_cities)
if score >= threshold:
return match
return city
valid_cities = ['New York', 'Los Angeles', 'Chicago', 'Houston']
df['city_fuzzy'] = df['city_clean'].apply(
lambda x: standardize_city(x, valid_cities)
)
Detecting and Removing Duplicates
Types of duplicates:
- Exact duplicates : All columns identical
- Partial duplicates : Key columns identical (e.g., same customer_id)
- Fuzzy duplicates : Nearly identical (e.g., typos in names)
# Sample data with duplicates
df = pd.DataFrame({
'customer_id': [1, 2, 2, 3, 4, 4],
'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],
'email': ['alice@email.com', 'bob@email.com', 'bob@email.com',
'charlie@email.com', 'david@email.com', 'david2@email.com'],
'purchase_date': ['2024-01-01', '2024-01-02', '2024-01-02',
'2024-01-03', '2024-01-04', '2024-01-05']
})
# Check for duplicates
print("Duplicate rows (all columns):")
print(df[df.duplicated()])
print("\nDuplicate customer_ids:")
print(df[df.duplicated(subset=['customer_id'], keep=False)])
# Remove exact duplicates
df_no_exact_dupes = df.drop_duplicates()
print(f"\nOriginal: {len(df)}, After removing exact duplicates: {len(df_no_exact_dupes)}")
# Remove duplicates based on customer_id (keep first occurrence)
df_no_customer_dupes = df.drop_duplicates(subset=['customer_id'], keep='first')
print(f"After removing customer_id duplicates: {len(df_no_customer_dupes)}")
# Keep last occurrence instead
df_keep_last = df.drop_duplicates(subset=['customer_id'], keep='last')
Which duplicates to keep?
# Keep the most recent record
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df_sorted = df.sort_values('purchase_date', ascending=False)
df_latest = df_sorted.drop_duplicates(subset=['customer_id'], keep='first')
print("Keeping most recent record per customer:")
print(df_latest)
Identifying Fuzzy Duplicates
# Customers with similar names (possible duplicates)
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['John Smith', 'Jon Smith', 'John Smyth', 'Jane Doe'],
'email': ['john@email.com', 'jon@email.com', 'jsmith@email.com', 'jane@email.com']
})
from fuzzywuzzy import fuzz
# Compare all pairs
for i in range(len(customers)):
for j in range(i+1, len(customers)):
name1 = customers.iloc[i]['name']
name2 = customers.iloc[j]['name']
similarity = fuzz.ratio(name1, name2)
if similarity > 80: # Threshold for potential duplicates
print(f"Potential duplicate: '{name1}' vs '{name2}' (similarity: {similarity})")
Output:
Potential duplicate: 'John Smith' vs 'Jon Smith' (similarity: 95)
Potential duplicate: 'John Smith' vs 'John Smyth' (similarity: 91)
Action : Manually review potential duplicates and merge if appropriate.
8.3 Data Transformation
8.3.1 Scaling and Normalization
Why scale data?
Many machine learning algorithms are sensitive to the scale of features:
- Features with larger ranges dominate the model
- Gradient descent converges faster with scaled data
- Distance-based algorithms (KNN, clustering) require similar scales
Example of the problem:
df = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'income': [50000, 60000, 70000, 80000, 90000],
'purchases': [5, 8, 12, 15, 20]
})
print("Standard deviations:")
print(df.std())
Output:
age 7.91
income 15811.39
purchases 5.70
Income has a much larger scale than age or purchases. In a distance-based algorithm, income would dominate.
Scaling Methods
1. Standardization (Z-score normalization)
Transform to mean=0, std=1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = pd.DataFrame(
scaler.fit_transform(df),
columns=df.columns
)
print("Standardized data:")
print(df_standardized)
print("\nMeans:", df_standardized.mean())
print("Stds:", df_standardized.std())
Output:
Standardized data:
age income purchases
0 -1.414214 -1.414214 -1.414214
1 -0.707107 -0.707107 -0.707107
2 0.000000 0.000000 0.000000
3 0.707107 0.707107 0.707107
4 1.414214 1.414214 1.414214
Means: age -7.401487e-17
income -7.401487e-17
purchases -7.401487e-17
Stds: age 1.0
income 1.0
purchases 1.0
When to use:
- ✓ Features have different units/scales
- ✓ Using algorithms sensitive to scale (SVM, neural networks, PCA)
- ✓ Data is approximately normally distributed
2. Min-Max Scaling
Transform to range [0, 1]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(
scaler.fit_transform(df),
columns=df.columns
)
print("Min-Max scaled data:")
print(df_minmax)
print("\nMins:", df_minmax.min())
print("Maxs:", df_minmax.max())
Output:
Min-Max scaled data:
age income purchases
0 0.0 0.0 0.0
1 0.25 0.25 0.2
2 0.5 0.5 0.467
3 0.75 0.75 0.667
4 1.0 1.0 1.0
Mins: age 0.0
income 0.0
purchases 0.0
Maxs: age 1.0
income 1.0
purchases 1.0
When to use:
- ✓ Need bounded range (e.g., for neural networks with sigmoid activation)
- ✓ Data doesn't have outliers (outliers compress the range)
3. Robust Scaling
Uses median and IQR instead of mean and std (robust to outliers):
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_robust = pd.DataFrame(
scaler.fit_transform(df),
columns=df.columns
)
print("Robust scaled data:")
print(df_robust)
When to use:
- ✓ Data has outliers
- ✓ Want scaling that's not affected by extreme values
Important: Fit on training data only
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)
# Fit scaler on training data only
scaler = StandardScaler()
scaler.fit(X_train)
# Transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# WRONG: Don't do this
# scaler.fit(X_test) # This would leak information from test set
Why? If you fit on test data, you're using information from the future, which creates data leakage.
8.3.2 Encoding Categorical Variables
Machine learning algorithms need numbers, not text. Encoding converts categorical variables to numerical format.
Types of categorical variables:
- Nominal : No inherent order (color, city, product type)
- Ordinal : Natural order (education level, satisfaction rating)
Encoding Methods
1. Label Encoding
Assign each category a number:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']
})
encoder = LabelEncoder()
df['city_encoded'] = encoder.fit_transform(df['city'])
print(df)
print("\nMapping:", dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
Output:
city city_encoded
0 New York 2
1 Los Angeles 1
2 Chicago 0
3 New York 2
4 Chicago 0
Mapping: {'Chicago': 0, 'Los Angeles': 1, 'New York': 2}
Problem : Implies ordering (Chicago < Los Angeles < New York), which doesn't make sense for nominal variables.
When to use:
- ✓ Ordinal variables (low < medium < high)
- ✓ Tree-based models (can handle arbitrary encodings)
- ✗ Linear models (will treat as numeric)
2. One-Hot Encoding
Create binary column for each category:
df = pd.DataFrame({
'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']
})
# Using pandas
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')
print(df_onehot)
Output:
city_Chicago city_Los Angeles city_New York
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
Using sklearn (better for pipelines):
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False, drop='first') # drop='first' avoids multicollinearity
city_encoded = encoder.fit_transform(df[['city']])
df_encoded = pd.DataFrame(
city_encoded,
columns=encoder.get_feature_names_out(['city'])
)
print(df_encoded)
When to use:
- ✓ Nominal variables
- ✓ Linear models, neural networks
- ✓ Small number of categories (<10-20)
Problem : High cardinality (many categories) creates too many columns.
3. Frequency Encoding
Replace category with its frequency:
df = pd.DataFrame({
'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago',
'New York', 'Chicago', 'Chicago']
})
# Calculate frequencies
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)
print(df)
Output:
city city_freq
0 New York 0.375
1 Los Angeles 0.125
2 Chicago 0.500
3 New York 0.375
4 Chicago 0.500
5 New York 0.375
6 Chicago 0.500
7 Chicago 0.500
When to use:
- ✓ High cardinality variables
- ✓ Frequency is informative (common categories behave differently)
4. Target Encoding
Replace category with mean of target variable for that category:
df = pd.DataFrame({
'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
'sales': [100, 150, 200, 120, 180]
})
# Calculate mean sales per city
target_means = df.groupby('city')['sales'].mean()
df['city_target_encoded'] = df['city'].map(target_means)
print(df)
Output:
city sales city_target_encoded
0 New York 100 110.0
1 Los Angeles 150 150.0
2 Chicago 200 190.0
3 New York 120 110.0
4 Chicago 180 190.0
When to use:
- ✓ High cardinality variables
- ✓ Strong relationship between category and target
Caution : Can cause overfitting. Use cross-validation or smoothing.
Handling Ordinal Variables
df = pd.DataFrame({
'education': ['High School', 'Bachelor', 'Master', 'PhD', 'High School']
})
# Define order
education_order = {
'High School': 1,
'Bachelor': 2,
'Master': 3,
'PhD': 4
}
df['education_encoded'] = df['education'].map(education_order)
print(df)
Output:
education education_encoded
0 High School 1
1 Bachelor 2
2 Master 3
3 PhD 4
4 High School 1
8.3.3 Date/Time Handling and Time-Based Features
Dates contain rich information, but need to be extracted properly.
Parsing Dates
df = pd.DataFrame({
'transaction_date': ['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05']
})
# Convert to datetime
df['date'] = pd.to_datetime(df['transaction_date'])
print(df.dtypes)
Extracting Time Components
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek # Monday=0, Sunday=6
df['day_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
print(df)
Output:
transaction_date date year month day day_of_week day_name quarter is_weekend
0 2024-01-15 2024-01-15 2024 1 15 0 Monday 1 0
1 2024-02-20 2024-02-20 2024 2 20 1 Tuesday 1 0
2 2024-03-10 2024-03-10 2024 3 10 6 Sunday 1 1
3 2024-04-05 2024-04-05 2024 4 5 4 Friday 2 0
Time-Based Features
# Days since reference date
reference_date = pd.to_datetime('2024-01-01')
df['days_since_start'] = (df['date'] - reference_date).dt.days
# Time between events
df = df.sort_values('date')
df['days_since_last'] = df['date'].diff().dt.days
# Cyclical encoding (for periodic features like month, day of week)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
print(df[['date', 'month', 'month_sin', 'month_cos']])
Why cyclical encoding?
December (month=12) and January (month=1) are adjacent, but numerically far apart. Cyclical encoding preserves this relationship.
Business-Relevant Date Features
# Is it a holiday?
holidays = pd.to_datetime(['2024-01-01', '2024-07-04', '2024-12-25'])
df['is_holiday'] = df['date'].isin(holidays).astype(int)
# Is it month-end? (important for B2B sales)
df['is_month_end'] = (df['date'].dt.is_month_end).astype(int)
# Days until month-end
df['days_to_month_end'] = df['date'].dt.days_in_month - df['date'].dt.day
# Fiscal quarter (if fiscal year starts in April)
df['fiscal_quarter'] = ((df['month'] - 4) % 12) // 3 + 1
print(df[['date', 'is_month_end', 'days_to_month_end', 'fiscal_quarter']])
8.4 Feature Engineering
Feature engineering is the process of creating new variables from existing data to improve model performance.
Why it matters:
"Applied machine learning is basically feature engineering." — Andrew Ng
Good features can make a simple model outperform a complex model with poor features.
8.4.1 Domain-Driven Feature Construction
The best features come from business understanding, not just data manipulation.
Example: Customer Churn Prediction
Raw data:
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'signup_date': pd.to_datetime(['2023-01-15', '2023-03-20', '2023-06-10', '2023-08-05', '2023-10-12']),
'last_login': pd.to_datetime(['2024-01-10', '2023-12-15', '2024-01-08', '2023-09-20', '2024-01-12']),
'total_purchases': [15, 3, 22, 1, 8],
'total_spent': [1500, 200, 3200, 50, 650],
'support_tickets': [2, 5, 1, 8, 3],
'email_opens': [45, 10, 78, 5, 32]
})
Domain-driven features:
# Recency: Days since last activity (recent activity = less likely to churn)
today = pd.to_datetime('2024-01-15')
customers['days_since_last_login'] = (today - customers['last_login']).dt.days
# Tenure: How long they've been a customer
customers['tenure_days'] = (today - customers['signup_date']).dt.days
# Engagement: Activity per unit time
customers['purchases_per_month'] = customers['total_purchases'] / (customers['tenure_days'] / 30)
customers['logins_per_month'] = 30 / customers['days_since_last_login'] # Approximate
# Value: Average order value
customers['avg_order_value'] = customers['total_spent'] / customers['total_purchases']
# Support intensity: Tickets per purchase (high = problems)
customers['tickets_per_purchase'] = customers['support_tickets'] / customers['total_purchases']
# Engagement score: Email engagement
customers['email_engagement'] = customers['email_opens'] / customers['tenure_days'] * 30
print(customers[[
'customer_id', 'days_since_last_login', 'tenure_days',
'purchases_per_month', 'avg_order_value', 'tickets_per_purchase'
]])
Business logic behind each feature:
- days_since_last_login : Inactive customers are more likely to churn
- tenure_days : New customers churn more than established ones
- purchases_per_month : Frequent buyers are engaged
- avg_order_value : High-value customers are worth retaining
- tickets_per_purchase : Many support issues indicate dissatisfaction
- email_engagement : Engaged customers open emails
8.4.2 Aggregations, Ratios, and Interaction Features
Aggregations
Summarize detailed data to customer/product level:
# Transaction-level data
transactions = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
'transaction_date': pd.to_datetime([
'2024-01-01', '2024-01-15', '2024-01-20',
'2024-01-05', '2024-01-25',
'2024-01-10', '2024-01-12', '2024-01-18', '2024-01-22'
]),
'amount': [100, 150, 200, 50, 75, 300, 100, 250, 180],
'category': ['Electronics', 'Clothing', 'Electronics',
'Clothing', 'Clothing',
'Electronics', 'Home', 'Electronics', 'Clothing']
})
# Aggregate to customer level
customer_features = transactions.groupby('customer_id').agg({
'amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],
'transaction_date': ['min', 'max']
}).reset_index()
customer_features.columns = ['customer_id', 'total_spent', 'avg_transaction',
'std_transaction', 'min_transaction', 'max_transaction',
'num_transactions', 'first_purchase', 'last_purchase']
# Time-based features
customer_features['days_active'] = (
customer_features['last_purchase'] - customer_features['first_purchase']
).dt.days
customer_features['purchase_frequency'] = (
customer_features['num_transactions'] / (customer_features['days_active'] + 1)
)
print(customer_features)
Ratios
Ratios often reveal more than raw numbers:
# E-commerce metrics
df = pd.DataFrame({
'sessions': [100, 200, 150, 300],
'page_views': [500, 800, 600, 1200],
'add_to_cart': [20, 35, 25, 50],
'purchases': [5, 10, 8, 15],
'revenue': [500, 1200, 900, 1800]
})
# Conversion funnel ratios
df['pages_per_session'] = df['page_views'] / df['sessions']
df['add_to_cart_rate'] = df['add_to_cart'] / df['sessions']
df['conversion_rate'] = df['purchases'] / df['sessions']
df['cart_to_purchase_rate'] = df['purchases'] / df['add_to_cart']
# Value metrics
df['avg_order_value'] = df['revenue'] / df['purchases']
df['revenue_per_session'] = df['revenue'] / df['sessions']
print(df[[
'conversion_rate', 'cart_to_purchase_rate',
'avg_order_value', 'revenue_per_session'
]])
Why ratios matter:
- 100 sessions with 10 purchases is very different from 1000 sessions with 10 purchases
- Ratios normalize for scale and reveal efficiency
Interaction Features
Capture relationships between variables:
# Marketing data
df = pd.DataFrame({
'ad_spend': [1000, 2000, 1500, 3000],
'email_sends': [5000, 10000, 7500, 15000],
'conversions': [50, 120, 80, 200]
})
# Interaction: Combined effect of ad spend and email
df['ad_email_interaction'] = df['ad_spend'] * df['email_sends']
# Polynomial features: Non-linear relationships
df['ad_spend_squared'] = df['ad_spend'] ** 2
# Efficiency ratios
df['cost_per_conversion'] = df['ad_spend'] / df['conversions']
df['emails_per_conversion'] = df['email_sends'] / df['conversions']
print(df)
When to use interactions:
- When you suspect two variables work together (e.g., price × quality)
- When relationships are non-linear
- For tree-based models: less important (they find interactions automatically)
- For linear models: very important (they can't find interactions on their own)
8.4.3 Handling Categorical Data
Categorical data is common in business analytics and often needs to be transformed into numerical formats for machine learning models. Here are three widely used techniques to handle categorical variables:
One-Hot Encoding
One-hot encoding converts each category value into a new binary column (feature) with 1s and 0s indicating the presence of that category. This method is simple and effective for nominal categories without intrinsic order.
Example in Python:
import pandas as pd
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']}) one_hot = pd.get_dummies(data['Color'], prefix='Color') print(one_hot)
Output:
|
Blue |
Green |
Red |
|
0 |
0 |
1 |
|
1 |
0 |
0 |
|
0 |
1 |
0 |
|
1 |
0 |
0 |
Use case: Best for categorical variables with a small number of unique values. Beware of high dimensionality if categories are many.
Advantages:
- Simple and intuitive representation.
- Does not impose any ordinal relationship between categories, preserving nominal data integrity.
- Compatible with most machine learning algorithms.
Disadvantages:
- Can lead to high dimensionality if the categorical variable has many unique values (high cardinality), causing the "curse of dimensionality."
- Sparse representation can increase memory usage and slow down training.
- May cause overfitting if many rare categories exist.
Handling Large Number of Categories:
- Limit one-hot encoding to categorical variables with relatively few unique values (e.g., less than 20-30).
- For high-cardinality variables, consider grouping rare categories into an "Other" category before encoding.
- Use dimensionality reduction techniques (e.g., PCA) on one-hot encoded features, though this may reduce interpretability.
Label Encoding
Label encoding assigns each unique category an integer value. This is useful for ordinal categories where the order matters but can mislead models if used on nominal data.
Example in Python:
from sklearn.preprocessing import LabelEncoder
data = ['Low', 'Medium', 'High', 'Medium']
le = LabelEncoder()
encoded = le.fit_transform(data)
print(encoded)
Output:
[1 2 0 2]
Advantages:
- Very memory efficient since it uses a single integer per category.
- Simple and fast to compute.
- Useful for ordinal data where the order matters.
Disadvantages:
- Imposes an arbitrary ordinal relationship on nominal categories, which can mislead many algorithms (e.g., linear models may interpret higher integers as "larger" or "better").
- Not suitable for nominal variables unless the model can handle categorical integers properly (e.g., tree-based models).
Handling Large Number of Categories:
- Label encoding itself scales well with many categories since it only assigns integers.
- However, if the model is sensitive to ordinal relationships, label encoding may cause bias.
- Consider combining with target encoding or embeddings for high-cardinality nominal variables.
Embeddings
Embeddings map categories to dense, low-dimensional vectors learned during model training, capturing semantic relationships between categories. Commonly used in deep learning models, embeddings can represent high-cardinality categorical variables efficiently.
Example concept (using TensorFlow/Keras):
from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.models import Model
input_cat = Input(shape=(1,))
embedding = Embedding(input_dim=100, output_dim=8)(input_cat)
flat = Flatten()(embedding)
model = Model(inputs=input_cat, outputs=flat)
Use case: Ideal for large cardinality categorical variables and when relationships between categories matter. Requires more complex models and training.
Advantages:
- Efficiently represent high-cardinality categorical variables in a low-dimensional continuous space.
- Capture semantic relationships and similarities between categories learned from data.
- Reduce dimensionality compared to one-hot encoding.
- Improve model performance, especially in deep learning and complex models.
Disadvantages:
- Require more complex models and training to learn meaningful embeddings.
- Less interpretable than one-hot or label encoding.
- Need sufficient data to learn good embeddings; sparse categories may have poor representations.
- Implementation complexity is higher.
Handling Large Number of Categories:
- Embeddings naturally handle large cardinality by mapping categories to dense vectors.
- Can be combined with hashing tricks to reduce embedding table size.
- Use regularization and embedding dropout to prevent overfitting on rare categories.
- For extremely large cardinality, consider hashing-based embeddings or dimensionality reduction before embedding.
Additional Techniques for Very Large Cardinality Categorical Variables
- Frequency or Count Encoding: Replace categories with their frequency or count in the dataset. Simple and scalable but loses category identity.
- Target Encoding: Replace categories with the mean of the target variable for that category. Powerful but prone to leakage; requires careful cross-validation.
- Hashing Trick: Map categories to a fixed number of buckets using a hash function, reducing dimensionality but introducing collisions. Useful for streaming or very large datasets.
- Clustering Categories: Group similar categories based on domain knowledge or data-driven similarity before encoding.
Summary Table
|
Encoding Method |
Advantages |
Disadvantages |
Handling Large Cardinality |
|
One-Hot Encoding |
Simple, no ordinal assumptions |
High dimensionality, sparse |
Group rare categories, dimensionality reduction |
|
Label Encoding |
Memory efficient, fast |
Imposes ordinal relation on nominal data |
Use only for ordinal data, combine with other methods |
|
Embeddings |
Low-dimensional, captures relations |
Complex, less interpretable |
Natural for large cardinality, use hashing or regularization |
|
Frequency Encoding |
Simple, scalable |
Loses category identity |
Good for very large cardinality |
|
Target Encoding |
Powerful, uses target info |
Risk of leakage |
Use with cross-validation |
|
Hashing Trick |
Fixed size, scalable |
Collisions possible |
Useful for streaming or huge datasets |
Choosing the right approach depends on the data, model, and business context. For very large cardinality categorical variables, embeddings or hashing-based methods are often preferred in modern analytics pipelines.
8.4.4 Text Features (Basic NLP Features)
Text data is everywhere: customer reviews, support tickets, product descriptions, emails.
Basic Text Features
reviews = pd.DataFrame({
'review_id': [1, 2, 3, 4],
'text': [
'Great product! Love it.',
'Terrible quality. Very disappointed.',
'Good value for money. Recommended.',
'Amazing! Best purchase ever!!!'
]
})
# Length features
reviews['char_count'] = reviews['text'].str.len()
reviews['word_count'] = reviews['text'].str.split().str.len()
reviews['avg_word_length'] = reviews['char_count'] / reviews['word_count']
# Punctuation (excitement, emphasis)
reviews['exclamation_count'] = reviews['text'].str.count('!')
reviews['question_count'] = reviews['text'].str.count('\?')
# Uppercase (shouting, emphasis)
reviews['uppercase_ratio'] = (
reviews['text'].str.count(r'[A-Z]') / reviews['char_count']
)
print(reviews)
Sentiment and Keywords
# Simple sentiment (count positive/negative words)
positive_words = ['great', 'love', 'good', 'amazing', 'best', 'excellent', 'recommended']
negative_words = ['terrible', 'bad', 'disappointed', 'worst', 'poor', 'awful']
reviews['text_lower'] = reviews['text'].str.lower()
reviews['positive_word_count'] = reviews['text_lower'].apply(
lambda x: sum(word in x for word in positive_words)
)
reviews['negative_word_count'] = reviews['text_lower'].apply(
lambda x: sum(word in x for word in negative_words)
)
reviews['sentiment_score'] = (
reviews['positive_word_count'] - reviews['negative_word_count']
)
print(reviews[['text', 'positive_word_count', 'negative_word_count', 'sentiment_score']])
For more sophisticated NLP:
# Using TextBlob for sentiment analysis
from textblob import TextBlob
reviews['polarity'] = reviews['text'].apply(
lambda x: TextBlob(x).sentiment.polarity
)
reviews['subjectivity'] = reviews['text'].apply(
lambda x: TextBlob(x).sentiment.subjectivity
)
print(reviews[['text', 'polarity', 'subjectivity']])
Polarity
: -1 (negative) to +1 (positive)
Subjectivity
: 0 (objective) to 1 (subjective)
TF-IDF for Document Similarity
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=10, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(reviews['text'])
# Convert to dataframe
tfidf_df = pd.DataFrame(
tfidf_matrix.toarray(),
columns=vectorizer.get_feature_names_out()
)
print("TF-IDF features:")
print(tfidf_df)
Use cases:
- Document classification (categorize support tickets)
- Similarity search (find similar products/reviews)
- Feature extraction for ML models
8.5 Train–Test Splits and Data Leakage Prevention
Data leakage is when information from the future "leaks" into your training data, making your model look better than it actually is.
The result : Great performance in development, terrible performance in production.
Types of Data Leakage
1. Target leakage
Using information that wouldn't be available at prediction time.
Example: Predicting loan default
# WRONG: Using payment_status to predict default
# payment_status is only known AFTER you know if they defaulted
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'income': [50000, 60000, 45000, 70000],
'payment_status': ['late', 'on_time', 'late', 'on_time'], # ← This is leakage!
'defaulted': [1, 0, 1, 0]
})
# payment_status is determined by whether they default
# You can't use it to predict default
How to avoid : Ask "Would I have this information at the time I need to make the prediction?"
2. Train-test contamination
Information from test set leaking into training.
Example: Scaling before splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'target': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
})
# WRONG: Scale before split
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['feature1']]) # Uses ALL data
X_train, X_test = train_test_split(df_scaled, test_size=0.2)
# RIGHT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(
df[['feature1']], df['target'], test_size=0.2, random_state=42
)
scaler = StandardScaler()
scaler.fit(X_train) # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Transform test using training parameters
Why it matters : If you scale using the full dataset, the test set's statistics influence the training data.
3. Temporal leakage
Using future information to predict the past.
Example: Time series forecasting
# Sales data
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10),
'sales': [100, 110, 105, 115, 120, 125, 130, 135, 140, 145]
})
# WRONG: Random split for time series
X_train, X_test = train_test_split(df, test_size=0.2) # Mixes past and future
# RIGHT: Time-based split
train_size = int(0.8 * len(df))
train = df[:train_size] # First 80%
test = df[train_size:] # Last 20%
print("Training period:", train['date'].min(), "to", train['date'].max())
print("Test period:", test['date'].min(), "to", test['date'].max())
Rule : Always predict future from past, never past from future.
4. Feature leakage from aggregations
Example: Customer churn
# WRONG: Including future behavior in features
df = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2],
'month': [1, 2, 3, 1, 2],
'purchases': [2, 1, 0, 3, 2],
'churned_month_3': [1, 1, 1, 0, 0] # Churn status at month 3
})
# Calculate total purchases (includes future purchases!)
df['total_purchases'] = df.groupby('customer_id')['purchases'].transform('sum')
# This is leakage: total_purchases includes purchases AFTER prediction time
RIGHT: Use only past information
# Calculate cumulative purchases up to current month
df = df.sort_values(['customer_id', 'month'])
df['cumulative_purchases'] = df.groupby('customer_id')['purchases'].cumsum()
# For month 3 prediction, use data from months 1-2 only
prediction_month = 3
train_data = df[df['month'] < prediction_month]
Proper Train-Test Split
from sklearn.model_selection import train_test_split
# Basic split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Reproducibility
stratify=y # Maintain class proportions (for classification)
)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Class distribution in train: {y_train.value_counts(normalize=True)}")
print(f"Class distribution in test: {y_test.value_counts(normalize=True)}")
Cross-Validation (Better than single split)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
Why cross-validation?
- Uses all data for both training and testing
- Gives more reliable estimate of performance
- Reduces variance from single train-test split
Time Series Cross-Validation
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Train and evaluate model
# ...
Ensures : Each test set is always in the future relative to its training set.
8.6 Documenting and Automating Data Preparation Pipelines
Why pipelines matter:
- Reproducibility : Same transformations every time
- Efficiency : Automate repetitive tasks
- Deployment : Easy to apply to new data
- Debugging : Clear what transformations were applied
Building a Pipeline with sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# Define pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Fit pipeline (applies all steps)
pipeline.fit(X_train, y_train)
# Predict (applies same transformations)
y_pred = pipeline.predict(X_test)
# Score
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")
Benefits:
- All transformations in one object
- No risk of forgetting a step
- Easy to save and load
Handling Different Column Types
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define which columns get which transformations
numeric_features = ['age', 'income', 'purchases']
categorical_features = ['city', 'product_category']
# Numeric pipeline
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', LogisticRegression())
])
# Fit and predict
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
Custom Transformers
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureEngineer(BaseEstimator, TransformerMixin):
"""Custom transformer for domain-specific features"""
def fit(self, X, y=None):
return self
def transform(self, X):
X = X.copy()
# Create new features
X['purchases_per_month'] = X['total_purchases'] / X['tenure_months']
X['avg_order_value'] = X['total_spent'] / X['total_purchases']
X['support_intensity'] = X['support_tickets'] / X['total_purchases']
return X
# Use in pipeline
pipeline = Pipeline([
('feature_engineer', FeatureEngineer()),
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression())
])
Saving and Loading Pipelines
import joblib
# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load pipeline
loaded_pipeline = joblib.load('model_pipeline.pkl')
# Use on new data
new_predictions = loaded_pipeline.predict(new_data)
Documentation Best Practices
Data Preparation Pipeline for Customer Churn Prediction
Author: [Your Name]
Date: 2024-01-15
Last Updated: 2024-01-15
DATA SOURCE:
- customers.csv: Customer demographic and behavioral data
- transactions.csv: Transaction history
PREPROCESSING STEPS:
1. Missing Value Handling:
- age: Impute with median
- income: Impute with median, create missing indicator
- purchases: Impute with mean
2. Outlier Treatment:
- income: Cap at 5th and 95th percentiles
- purchases: Remove values > 3 standard deviations
3. Feature Engineering:
- purchases_per_month: total_purchases / tenure_months
- avg_order_value: total_spent / total_purchases
- days_since_last_purchase: today - last_purchase_date
4. Encoding:
- city: One-hot encoding
- education: Ordinal encoding (HS=1, Bachelor=2, Master=3, PhD=4)
5. Scaling:
- All numeric features: StandardScaler
VALIDATION:
- Train-test split: 80/20
- Stratified by churn status
- Random state: 42
KNOWN ISSUES:
- Small sample size for some cities (< 10 customers)
- Missing data for income is not random (higher for high earners)
NEXT STEPS:
- Consider target encoding for high-cardinality categorical variables
- Experiment with polynomial features for non-linear relationships
Key Takeaways
-
Data preparation is 80% of the work
—and it determines the quality of your results.
-
Understand before you clean
: Don't blindly apply transformations. Understand what the data represents and why it's messy.
-
Missing data strategy depends on why it's missing
: MCAR, MAR, and MNAR require different approaches.
-
Outliers aren't always errors
: Understand whether they're mistakes or legitimate extremes before removing them.
-
Scaling matters for some algorithms
: Distance-based and gradient-based methods need scaled features.
-
Encoding depends on variable type
: Nominal vs. ordinal, low vs. high cardinality.
-
Feature engineering is where domain knowledge shines
: The best features come from understanding the business, not just manipulating data.
-
Data leakage will ruin your model
: Always ask "Would I have this information at prediction time?"
-
Pipelines ensure reproducibility
: Automate your preprocessing to avoid errors and enable deployment.
-
Document everything
: Future you (and your colleagues) will thank you.
Exercises
Exercise 1: Clean a Raw Dataset
You're given a customer dataset with various data quality issues. Clean it using pandas.
Dataset: customers_raw.csv
i
mport pandas as pd
import numpy as np
# Create sample messy data
np.random.seed(42)
df = pd.DataFrame({
'customer_id': [1, 2, 2, 3, 4, 5, 6, 7, 8, 9], # Duplicate
'name': ['Alice Smith', 'bob jones', 'Bob Jones', 'CHARLIE BROWN',
'Diana Prince', 'Eve Adams', None, 'Frank Miller', 'Grace Lee', 'Henry Ford'],
'age': [25, 30, 30, 250, 45, np.nan, 28, 35, 40, 50], # Outlier, missing
'city': ['New York', 'new york', 'NY', 'Los Angeles', 'Chicago',
'chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio'],
'income': [50000, 60000, 60000, 75000, np.nan, 55000, 62000, 58000, 70000, 80000], 'signup_date': ['2023-01-15', '2023-02-20', '2023-02-20', '01/03/2023',
'2023-04-10', '2023-05-15', '2023-06-20', '2023-07-25', '2023-08-30', '2023-09-15']
})
df.to_csv('customers_raw.csv', index=False)
Your tasks:
a) Identify and remove duplicate records
b) Standardize text fields (name, city):
- Remove extra spaces
- Consistent capitalization
- Map city variations to standard names
c) Handle missing values:
- Decide on strategy for each column
- Implement and justify your choice
d) Detect and treat outliers:
- Identify outliers in age and income
- Decide whether to remove, cap, or keep