Chapter 8. Data Preparation and Feature Engineering in Python

"Garbage in, garbage out."

This old adage is especially true in analytics. You can have the most sophisticated machine learning algorithm, but if your data is messy, incomplete, or poorly structured, your results will be worthless.

Data scientists often say they spend 80% of their time on data preparation and only 20% on modeling. This isn't a sign of inefficiency—it's the reality of working with real-world data. The quality of your data preparation directly determines the quality of your insights.

This chapter covers the practical skills you need to transform raw, messy data into clean, structured datasets ready for analysis. We'll focus on Python and pandas, the standard tools for data preparation in business analytics.

8.1 The Importance of Data Preparation in Analytics Projects

Why Data Preparation Matters

Raw data is rarely analysis-ready. Real-world datasets have:

Missing values : Customers who didn't fill out optional fields, sensors that failed, incomplete records
Outliers : Data entry errors, fraudulent transactions, legitimate but extreme values
Inconsistencies : "New York", "NY", "new york", "New York City" all referring to the same place
Wrong formats : Dates stored as text, numbers stored as strings, mixed units
Duplicates : Same customer entered multiple times, repeated transactions
Irrelevant information : Columns you don't need, noise that obscures signal

The consequences of poor data preparation:

Wrong conclusions : Outliers skew averages, missing data creates bias
Model failures : Algorithms break on missing values or inconsistent formats
Wasted time : Debugging model problems that are actually data problems
Lost trust : Stakeholders lose confidence when they spot obvious data errors

The benefits of good data preparation:

Better models : Clean data → better predictions
Faster iteration : Automated pipelines let you test ideas quickly
Reproducibility : Documented processes can be repeated and audited
Trust : Stakeholders trust analysis built on clean, well-understood data

The Data Preparation Workflow

RAW DATA

↓

1. UNDERSTAND

• What does each column mean?

• What's the grain of the data?

• What are the data types?

↓

2. CLEAN

• Handle missing values

• Remove/fix outliers

• Fix inconsistencies

• Remove duplicates

↓

3. TRANSFORM

• Scale/normalize

• Encode categories

• Parse dates

↓

4. ENGINEER FEATURES

• Create new variables

• Aggregate information

• Extract patterns

↓

5. VALIDATE

• Check distributions

• Verify logic

• Test for leakage

↓

ANALYSIS-READY DATA

Key principle : Understand before you clean. Don't blindly apply transformations. First understand what the data represents, then decide how to prepare it.

8.2 Data Cleaning

8.2.1 Handling Missing Data

Missing data is inevitable. The question is: how do you handle it?

Types of missing data:

Missing Completely at Random (MCAR) : Missingness has no relationship to any variable

Example: Survey responses lost due to server error
Impact : Reduces sample size but doesn't bias results

Missing at Random (MAR) : Missingness related to observed variables, not the missing value itself

Example: Younger customers less likely to provide phone numbers
Impact : Can bias results if not handled properly

Missing Not at Random (MNAR) : Missingness related to the unobserved value

Example: High earners refuse to disclose income
Impact : Serious bias, difficult to correct

Why this matters : The type of missingness determines the best handling strategy.

Strategies for Handling Missing Data

1. Delete rows with missing values

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'age': [25, 30, np.nan, 45, 28],
'income': [50000, np.nan, 75000, 90000, 60000],
'purchases': [5, 3, 8, np.nan, 4]
})

# Drop any row with missing values
df_complete = df.dropna()
print(f"Original rows: {len(df)}, After dropna: {len(df_complete)}")
# Output: Original rows: 5, After dropna: 2

# Drop rows where specific columns are missing
df_age_complete = df.dropna(subset=['age'])
print(f"Rows with age: {len(df_age_complete)}")

# Output: Rows with age: 4

When to use:

✓ Missing data is MCAR
✓ You have plenty of data (losing rows doesn't hurt)
✓ Missing values are rare (<5%)

When NOT to use:

✗ Missing data is MAR or MNAR (creates bias)
✗ You have limited data
✗ Many rows have at least one missing value

2. Impute with simple statistics

# Mean imputation
df['age_imputed'] = df['age'].fillna(df['age'].mean())

# Median imputation (better for skewed data)
df['income_imputed'] = df['income'].fillna(df['income'].median())

# Mode imputation (for categorical data)
df['category'] = ['A', 'B', np.nan, 'A', 'B']
df['category_imputed'] = df['category'].fillna(df['category'].mode()[0])
print(df[['age', 'age_imputed', 'income', 'income_imputed']])

When to use:

✓ Missing data is MCAR or MAR
✓ Quick solution needed
✓ Missingness is moderate (5-20%)

Caution :

Reduces variance (all missing values get the same number)
Can distort relationships between variables
Mean is sensitive to outliers (use median for skewed data)

3. Forward fill / Backward fill (for time series)

# Time series data
df_ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5),
'temperature': [20, np.nan, np.nan, 23, 24]
})

# Forward fill: use last known value
df_ts['temp_ffill'] = df_ts['temperature'].fillna(method='ffill')

# Backward fill: use next known value
df_ts['temp_bfill'] = df_ts['temperature'].fillna(method='bfill')
print(df_ts)

Output:

date temperature temp_ffill temp_bfill

0 2024-01-01 20.0 20.0 20.0

1 2024-01-02 NaN 20.0 23.0

2 2024-01-03 NaN 20.0 23.0

3 2024-01-04 23.0 23.0 23.0

4 2024-01-05 24.0 24.0 24.0

When to use:

✓ Time series data where values change slowly
✓ Sensor data with occasional missing readings

4. Impute with predictive models

from sklearn.impute import KNNImputer

# Use K-Nearest Neighbors to impute based on similar records
imputer = KNNImputer(n_neighbors=3)
df_numeric = df[['age', 'income', 'purchases']]
df_imputed = pd.DataFrame(
imputer.fit_transform(df_numeric),
columns=df_numeric.columns
)

print("Original:\n", df_numeric)
print("\nImputed:\n", df_imputed)

When to use:

✓ Missing data is MAR
✓ You have enough data to build reliable models
✓ Relationships between variables are important

Other sophisticated methods:

Multiple imputation (creates several imputed datasets)
Model-based imputation (regression, random forest)

8.2.2 Detecting and Treating Outliers

Outliers are data points that are significantly different from others.

Two types:

Errors : Data entry mistakes, measurement errors, system glitches

Example: Age = 250, Price = -$100
Action : Remove or correct

Legitimate extremes : Real but unusual values

Example: CEO salary in employee dataset, bulk purchase in retail data
Action : Keep but handle carefully (may need separate analysis)

Detecting Outliers

Method 1: Visual inspection

import matplotlib.pyplot as plt

# Sample data with outliers

data = pd.DataFrame({

'salary': [50000, 55000, 52000, 48000, 51000, 53000, 500000, 49000, 54000, 50000]

})

# Box plot

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)

plt.boxplot(data['salary'])

plt.title('Box Plot')

plt.ylabel('Salary')

plt.subplot(1, 2, 2)

plt.hist(data['salary'], bins=20, edgecolor='black')

plt.title('Histogram')

plt.xlabel('Salary')

plt.ylabel('Frequency')

plt.tight_layout()

plt.show()

Method 2: Statistical methods

IQR (Interquartile Range) method:

def detect_outliers_iqr(df, column):
"""Detect outliers using IQR method"""
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
print(f"Number of outliers: {len(outliers)}")
return outliers

outliers = detect_outliers_iqr(data, 'salary')
print("\nOutliers:")
print(outliers)

Z-score method:

from scipy import stats
def detect_outliers_zscore(df, column, threshold=3):
"""Detect outliers using Z-score method"""
z_scores = np.abs(stats.zscore(df[column]))
outliers = df[z_scores > threshold]

print(f"Number of outliers (|z| > {threshold}): {len(outliers)}")
return outliers

outliers_z = detect_outliers_zscore(data, 'salary')
print("\nOutliers (Z-score):")
print(outliers_z)

When to use each:

IQR : Robust to outliers, works well for skewed data
Z-score : Assumes normal distribution, sensitive to extreme outliers

Treating Outliers

Option 1: Remove

# Remove outliers using IQR

Q1 = data['salary'].quantile(0.25)

Q3 = data['salary'].quantile(0.75)

IQR = Q3 - Q1

data_clean = data[

(data['salary'] >= Q1 - 1.5 * IQR) &

(data['salary'] <= Q3 + 1.5 * IQR)

]

print(f"Original: {len(data)} rows, After removal: {len(data_clean)} rows")

When to use:

✓ Outliers are clearly errors
✓ You have plenty of data
✗ Don't use if outliers are legitimate (you'll lose important information)

Option 2: Cap (Winsorize)

# Cap at 5th and 95th percentiles

lower = data['salary'].quantile(0.05)

upper = data['salary'].quantile(0.95)

data['salary_capped'] = data['salary'].clip(lower=lower, upper=upper)

print("Original vs Capped:")

print(data[['salary', 'salary_capped']])

When to use:

✓ Outliers are legitimate but you want to reduce their influence
✓ For modeling where extreme values can dominate

Option 3: Transform

# Log transformation (reduces impact of large values)

data['salary_log'] = np.log1p(data['salary']) # log1p = log(1 + x), handles zeros

# Compare distributions

print("Original - Mean: {:.0f}, Std: {:.0f}".format(

data['salary'].mean(), data['salary'].std()

))

print("Log transformed - Mean: {:.2f}, Std: {:.2f}".format(

data['salary_log'].mean(), data['salary_log'].std()

))

When to use:

✓ Data is right-skewed (common for income, prices, counts)
✓ You want to preserve all data but reduce outlier impact

Business Context Matters

Example: Retail transactions

transactions = pd.DataFrame({

'transaction_id': range(1, 11),

'amount': [25, 30, 28, 32, 27, 29, 5000, 31, 26, 30]

})

# $5,000 transaction is an outlier

# But is it an error or a bulk purchase?

# Check context

transactions['is_outlier'] = (

np.abs(stats.zscore(transactions['amount'])) > 3

)

print(transactions)

# Decision:

# - If it's an error → remove

# - If it's a bulk purchase → keep but analyze separately

# - For average transaction analysis → exclude

# - For total revenue analysis → include

Key insight : There's no universal rule. The right approach depends on:

Why the outlier exists
What question you're answering
What impact it has on your analysis

8.2.3 Dealing with Inconsistent and Duplicate Records

Inconsistencies make it impossible to group, aggregate, or analyze data correctly.

Common inconsistencies:

Text variations : "New York", "NY", "new york", "N.Y."
Date formats : "01/02/2024", "2024-01-02", "Jan 2, 2024"
Units : Some prices in dollars, others in cents
Encoding : Special characters, extra spaces
Categories : Typos, abbreviations, multiple naming conventions

Cleaning Text Data

# Sample data with inconsistencies

df = pd.DataFrame({

'customer_id': [1, 2, 3, 4, 5],

'city': ['New York', 'new york', 'NY', 'NEW YORK', 'New York City'],

'product': [' iPhone 14 ', 'iphone 14', 'iPhone14', 'IPHONE 14', 'iPhone-14']

})

# Clean text

df['city_clean'] = (

df['city']

.str.strip() # Remove leading/trailing spaces

.str.lower() # Convert to lowercase

.str.replace('.', '') # Remove periods

)

df['product_clean'] = (

df['product']

.str.strip()

.str.lower()

.str.replace(' ', '') # Remove spaces

.str.replace('-', '') # Remove hyphens

)

print(df[['city', 'city_clean', 'product', 'product_clean']])

Output:

city city_clean product product_clean

0 New York new york iPhone 14 iphone14

1 new york new york iphone 14 iphone14

2 NY ny iPhone14 iphone14

3 NEW YORK new york IPHONE 14 iphone14

4 New York City new york city iPhone-14 iphone14

Standardizing with Mappings

# Create mapping for known variations

city_mapping = {

'new york': 'New York',

'ny': 'New York',

'new york city': 'New York',

'nyc': 'New York',

'los angeles': 'Los Angeles',

'la': 'Los Angeles',

'l.a.': 'Los Angeles'

}

df['city_standardized'] = df['city_clean'].map(city_mapping)

print(df[['city', 'city_standardized']])

For large datasets with many variations:

# Use fuzzy matching

from fuzzywuzzy import process

def standardize_city(city, valid_cities, threshold=80):

"""Match to closest valid city name"""

match, score = process.extractOne(city, valid_cities)

if score >= threshold:

return match

return city

valid_cities = ['New York', 'Los Angeles', 'Chicago', 'Houston']

df['city_fuzzy'] = df['city_clean'].apply(

lambda x: standardize_city(x, valid_cities)

)

Detecting and Removing Duplicates

Types of duplicates:

Exact duplicates : All columns identical
Partial duplicates : Key columns identical (e.g., same customer_id)
Fuzzy duplicates : Nearly identical (e.g., typos in names)

# Sample data with duplicates

df = pd.DataFrame({

'customer_id': [1, 2, 2, 3, 4, 4],

'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],

'email': ['alice@email.com', 'bob@email.com', 'bob@email.com',

'charlie@email.com', 'david@email.com', 'david2@email.com'],

'purchase_date': ['2024-01-01', '2024-01-02', '2024-01-02',

'2024-01-03', '2024-01-04', '2024-01-05']

})

# Check for duplicates

print("Duplicate rows (all columns):")

print(df[df.duplicated()])

print("\nDuplicate customer_ids:")

print(df[df.duplicated(subset=['customer_id'], keep=False)])

# Remove exact duplicates

df_no_exact_dupes = df.drop_duplicates()

print(f"\nOriginal: {len(df)}, After removing exact duplicates: {len(df_no_exact_dupes)}")

# Remove duplicates based on customer_id (keep first occurrence)

df_no_customer_dupes = df.drop_duplicates(subset=['customer_id'], keep='first')

print(f"After removing customer_id duplicates: {len(df_no_customer_dupes)}")

# Keep last occurrence instead

df_keep_last = df.drop_duplicates(subset=['customer_id'], keep='last')

Which duplicates to keep?

# Keep the most recent record

df['purchase_date'] = pd.to_datetime(df['purchase_date'])

df_sorted = df.sort_values('purchase_date', ascending=False)

df_latest = df_sorted.drop_duplicates(subset=['customer_id'], keep='first')

print("Keeping most recent record per customer:")

print(df_latest)

Identifying Fuzzy Duplicates

# Customers with similar names (possible duplicates)

customers = pd.DataFrame({

'customer_id': [1, 2, 3, 4],

'name': ['John Smith', 'Jon Smith', 'John Smyth', 'Jane Doe'],

'email': ['john@email.com', 'jon@email.com', 'jsmith@email.com', 'jane@email.com']

})

from fuzzywuzzy import fuzz

# Compare all pairs

for i in range(len(customers)):

for j in range(i+1, len(customers)):

name1 = customers.iloc[i]['name']

name2 = customers.iloc[j]['name']

similarity = fuzz.ratio(name1, name2)

if similarity > 80: # Threshold for potential duplicates

print(f"Potential duplicate: '{name1}' vs '{name2}' (similarity: {similarity})")

Output:

Potential duplicate: 'John Smith' vs 'Jon Smith' (similarity: 95)

Potential duplicate: 'John Smith' vs 'John Smyth' (similarity: 91)

Action : Manually review potential duplicates and merge if appropriate.

8.3 Data Transformation

8.3.1 Scaling and Normalization

Why scale data?

Many machine learning algorithms are sensitive to the scale of features:

Features with larger ranges dominate the model
Gradient descent converges faster with scaled data
Distance-based algorithms (KNN, clustering) require similar scales

Example of the problem:

df = pd.DataFrame({

'age': [25, 30, 35, 40, 45],

'income': [50000, 60000, 70000, 80000, 90000],

'purchases': [5, 8, 12, 15, 20]

})

print("Standard deviations:")

print(df.std())

Output:

age 7.91

income 15811.39

purchases 5.70

Income has a much larger scale than age or purchases. In a distance-based algorithm, income would dominate.

Scaling Methods

1. Standardization (Z-score normalization)

Transform to mean=0, std=1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df_standardized = pd.DataFrame(

scaler.fit_transform(df),

columns=df.columns

)

print("Standardized data:")

print(df_standardized)

print("\nMeans:", df_standardized.mean())

print("Stds:", df_standardized.std())

Output:

Standardized data:

age income purchases

0 -1.414214 -1.414214 -1.414214

1 -0.707107 -0.707107 -0.707107

2 0.000000 0.000000 0.000000

3 0.707107 0.707107 0.707107

4 1.414214 1.414214 1.414214

Means: age -7.401487e-17

income -7.401487e-17

purchases -7.401487e-17

Stds: age 1.0

income 1.0

purchases 1.0

When to use:

✓ Features have different units/scales
✓ Using algorithms sensitive to scale (SVM, neural networks, PCA)
✓ Data is approximately normally distributed

2. Min-Max Scaling

Transform to range [0, 1]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df_minmax = pd.DataFrame(

scaler.fit_transform(df),

columns=df.columns

)

print("Min-Max scaled data:")

print(df_minmax)

print("\nMins:", df_minmax.min())

print("Maxs:", df_minmax.max())

Output:

Min-Max scaled data:

age income purchases

0 0.0 0.0 0.0

1 0.25 0.25 0.2

2 0.5 0.5 0.467

3 0.75 0.75 0.667

4 1.0 1.0 1.0

Mins: age 0.0

income 0.0

purchases 0.0

Maxs: age 1.0

income 1.0

purchases 1.0

When to use:

✓ Need bounded range (e.g., for neural networks with sigmoid activation)
✓ Data doesn't have outliers (outliers compress the range)

3. Robust Scaling

Uses median and IQR instead of mean and std (robust to outliers):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

df_robust = pd.DataFrame(

scaler.fit_transform(df),

columns=df.columns

)

print("Robust scaled data:")

print(df_robust)

When to use:

✓ Data has outliers
✓ Want scaling that's not affected by extreme values

Important: Fit on training data only

from sklearn.model_selection import train_test_split

# Split data

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Fit scaler on training data only

scaler = StandardScaler()

scaler.fit(X_train)

# Transform both training and test data

X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)

# WRONG: Don't do this

# scaler.fit(X_test) # This would leak information from test set

Why? If you fit on test data, you're using information from the future, which creates data leakage.

8.3.2 Encoding Categorical Variables

Machine learning algorithms need numbers, not text. Encoding converts categorical variables to numerical format.

Types of categorical variables:

Nominal : No inherent order (color, city, product type)
Ordinal : Natural order (education level, satisfaction rating)

Encoding Methods

1. Label Encoding

Assign each category a number:

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({

'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']

})

encoder = LabelEncoder()

df['city_encoded'] = encoder.fit_transform(df['city'])

print(df)

print("\nMapping:", dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

Output:

city city_encoded

0 New York 2

1 Los Angeles 1

2 Chicago 0

3 New York 2

4 Chicago 0

Mapping: {'Chicago': 0, 'Los Angeles': 1, 'New York': 2}

Problem : Implies ordering (Chicago < Los Angeles < New York), which doesn't make sense for nominal variables.

When to use:

✓ Ordinal variables (low < medium < high)
✓ Tree-based models (can handle arbitrary encodings)
✗ Linear models (will treat as numeric)

2. One-Hot Encoding

Create binary column for each category:

df = pd.DataFrame({

'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']

})

# Using pandas

df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')

print(df_onehot)

Output:

city_Chicago city_Los Angeles city_New York

0 0 0 1

1 0 1 0

2 1 0 0

3 0 0 1

4 1 0 0

Using sklearn (better for pipelines):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False, drop='first') # drop='first' avoids multicollinearity

city_encoded = encoder.fit_transform(df[['city']])

df_encoded = pd.DataFrame(

city_encoded,

columns=encoder.get_feature_names_out(['city'])

)

print(df_encoded)

When to use:

✓ Nominal variables
✓ Linear models, neural networks
✓ Small number of categories (<10-20)

Problem : High cardinality (many categories) creates too many columns.

3. Frequency Encoding

Replace category with its frequency:

df = pd.DataFrame({

'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago',

'New York', 'Chicago', 'Chicago']

})

# Calculate frequencies

freq = df['city'].value_counts(normalize=True)

df['city_freq'] = df['city'].map(freq)

print(df)

Output:

city city_freq

0 New York 0.375

1 Los Angeles 0.125

2 Chicago 0.500

3 New York 0.375

4 Chicago 0.500

5 New York 0.375

6 Chicago 0.500

7 Chicago 0.500

When to use:

✓ High cardinality variables
✓ Frequency is informative (common categories behave differently)

4. Target Encoding

Replace category with mean of target variable for that category:

df = pd.DataFrame({

'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],

'sales': [100, 150, 200, 120, 180]

})

# Calculate mean sales per city

target_means = df.groupby('city')['sales'].mean()

df['city_target_encoded'] = df['city'].map(target_means)

print(df)

Output:

city sales city_target_encoded

0 New York 100 110.0

1 Los Angeles 150 150.0

2 Chicago 200 190.0

3 New York 120 110.0

4 Chicago 180 190.0

When to use:

✓ High cardinality variables
✓ Strong relationship between category and target

Caution : Can cause overfitting. Use cross-validation or smoothing.

Handling Ordinal Variables

df = pd.DataFrame({

'education': ['High School', 'Bachelor', 'Master', 'PhD', 'High School']

})

# Define order

education_order = {

'High School': 1,

'Bachelor': 2,

'Master': 3,

'PhD': 4

}

df['education_encoded'] = df['education'].map(education_order)

print(df)

Output:

education education_encoded

0 High School 1

1 Bachelor 2

2 Master 3

3 PhD 4

4 High School 1

8.3.3 Date/Time Handling and Time-Based Features

Dates contain rich information, but need to be extracted properly.

Parsing Dates

df = pd.DataFrame({

'transaction_date': ['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05']

})

# Convert to datetime

df['date'] = pd.to_datetime(df['transaction_date'])

print(df.dtypes)

Extracting Time Components

# Extract components

df['year'] = df['date'].dt.year

df['month'] = df['date'].dt.month

df['day'] = df['date'].dt.day

df['day_of_week'] = df['date'].dt.dayofweek # Monday=0, Sunday=6

df['day_name'] = df['date'].dt.day_name()

df['quarter'] = df['date'].dt.quarter

df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

print(df)

Output:

transaction_date date year month day day_of_week day_name quarter is_weekend

0 2024-01-15 2024-01-15 2024 1 15 0 Monday 1 0

1 2024-02-20 2024-02-20 2024 2 20 1 Tuesday 1 0

2 2024-03-10 2024-03-10 2024 3 10 6 Sunday 1 1

3 2024-04-05 2024-04-05 2024 4 5 4 Friday 2 0

Time-Based Features

# Days since reference date

reference_date = pd.to_datetime('2024-01-01')

df['days_since_start'] = (df['date'] - reference_date).dt.days

# Time between events

df = df.sort_values('date')

df['days_since_last'] = df['date'].diff().dt.days

# Cyclical encoding (for periodic features like month, day of week)

df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)

df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print(df[['date', 'month', 'month_sin', 'month_cos']])

Why cyclical encoding?

December (month=12) and January (month=1) are adjacent, but numerically far apart. Cyclical encoding preserves this relationship.

Business-Relevant Date Features

# Is it a holiday?

holidays = pd.to_datetime(['2024-01-01', '2024-07-04', '2024-12-25'])

df['is_holiday'] = df['date'].isin(holidays).astype(int)

# Is it month-end? (important for B2B sales)

df['is_month_end'] = (df['date'].dt.is_month_end).astype(int)

# Days until month-end

df['days_to_month_end'] = df['date'].dt.days_in_month - df['date'].dt.day

# Fiscal quarter (if fiscal year starts in April)

df['fiscal_quarter'] = ((df['month'] - 4) % 12) // 3 + 1

print(df[['date', 'is_month_end', 'days_to_month_end', 'fiscal_quarter']])

8.4 Feature Engineering

Feature engineering is the process of creating new variables from existing data to improve model performance.

Why it matters:

"Applied machine learning is basically feature engineering." — Andrew Ng

Good features can make a simple model outperform a complex model with poor features.

8.4.1 Domain-Driven Feature Construction

The best features come from business understanding, not just data manipulation.

Example: Customer Churn Prediction

Raw data:

customers = pd.DataFrame({

'customer_id': [1, 2, 3, 4, 5],

'signup_date': pd.to_datetime(['2023-01-15', '2023-03-20', '2023-06-10', '2023-08-05', '2023-10-12']),

'last_login': pd.to_datetime(['2024-01-10', '2023-12-15', '2024-01-08', '2023-09-20', '2024-01-12']),

'total_purchases': [15, 3, 22, 1, 8],

'total_spent': [1500, 200, 3200, 50, 650],

'support_tickets': [2, 5, 1, 8, 3],

'email_opens': [45, 10, 78, 5, 32]

})

Domain-driven features:

# Recency: Days since last activity (recent activity = less likely to churn)

today = pd.to_datetime('2024-01-15')

customers['days_since_last_login'] = (today - customers['last_login']).dt.days

# Tenure: How long they've been a customer

customers['tenure_days'] = (today - customers['signup_date']).dt.days

# Engagement: Activity per unit time

customers['purchases_per_month'] = customers['total_purchases'] / (customers['tenure_days'] / 30)

customers['logins_per_month'] = 30 / customers['days_since_last_login'] # Approximate

# Value: Average order value

customers['avg_order_value'] = customers['total_spent'] / customers['total_purchases']

# Support intensity: Tickets per purchase (high = problems)

customers['tickets_per_purchase'] = customers['support_tickets'] / customers['total_purchases']

# Engagement score: Email engagement

customers['email_engagement'] = customers['email_opens'] / customers['tenure_days'] * 30

print(customers[[

'customer_id', 'days_since_last_login', 'tenure_days',

'purchases_per_month', 'avg_order_value', 'tickets_per_purchase'

]])

Business logic behind each feature:

days_since_last_login : Inactive customers are more likely to churn
tenure_days : New customers churn more than established ones
purchases_per_month : Frequent buyers are engaged
avg_order_value : High-value customers are worth retaining
tickets_per_purchase : Many support issues indicate dissatisfaction
email_engagement : Engaged customers open emails

8.4.2 Aggregations, Ratios, and Interaction Features

Aggregations

Summarize detailed data to customer/product level:

# Transaction-level data

transactions = pd.DataFrame({

'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],

'transaction_date': pd.to_datetime([

'2024-01-01', '2024-01-15', '2024-01-20',

'2024-01-05', '2024-01-25',

'2024-01-10', '2024-01-12', '2024-01-18', '2024-01-22'

]),

'amount': [100, 150, 200, 50, 75, 300, 100, 250, 180],

'category': ['Electronics', 'Clothing', 'Electronics',

'Clothing', 'Clothing',

'Electronics', 'Home', 'Electronics', 'Clothing']

})

# Aggregate to customer level

customer_features = transactions.groupby('customer_id').agg({

'amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],

'transaction_date': ['min', 'max']

}).reset_index()

customer_features.columns = ['customer_id', 'total_spent', 'avg_transaction',

'std_transaction', 'min_transaction', 'max_transaction',

'num_transactions', 'first_purchase', 'last_purchase']

# Time-based features

customer_features['days_active'] = (

customer_features['last_purchase'] - customer_features['first_purchase']

).dt.days

customer_features['purchase_frequency'] = (

customer_features['num_transactions'] / (customer_features['days_active'] + 1)

)

print(customer_features)

Ratios

Ratios often reveal more than raw numbers:

# E-commerce metrics

df = pd.DataFrame({

'sessions': [100, 200, 150, 300],

'page_views': [500, 800, 600, 1200],

'add_to_cart': [20, 35, 25, 50],

'purchases': [5, 10, 8, 15],

'revenue': [500, 1200, 900, 1800]

})

# Conversion funnel ratios

df['pages_per_session'] = df['page_views'] / df['sessions']

df['add_to_cart_rate'] = df['add_to_cart'] / df['sessions']

df['conversion_rate'] = df['purchases'] / df['sessions']

df['cart_to_purchase_rate'] = df['purchases'] / df['add_to_cart']

# Value metrics

df['avg_order_value'] = df['revenue'] / df['purchases']

df['revenue_per_session'] = df['revenue'] / df['sessions']

print(df[[

'conversion_rate', 'cart_to_purchase_rate',

'avg_order_value', 'revenue_per_session'

]])

Why ratios matter:

100 sessions with 10 purchases is very different from 1000 sessions with 10 purchases
Ratios normalize for scale and reveal efficiency

Interaction Features

Capture relationships between variables:

# Marketing data

df = pd.DataFrame({

'ad_spend': [1000, 2000, 1500, 3000],

'email_sends': [5000, 10000, 7500, 15000],

'conversions': [50, 120, 80, 200]

})

# Interaction: Combined effect of ad spend and email

df['ad_email_interaction'] = df['ad_spend'] * df['email_sends']

# Polynomial features: Non-linear relationships

df['ad_spend_squared'] = df['ad_spend'] ** 2

# Efficiency ratios

df['cost_per_conversion'] = df['ad_spend'] / df['conversions']

df['emails_per_conversion'] = df['email_sends'] / df['conversions']

print(df)

When to use interactions:

When you suspect two variables work together (e.g., price × quality)
When relationships are non-linear
For tree-based models: less important (they find interactions automatically)
For linear models: very important (they can't find interactions on their own)

8.4.3 Handling Categorical Data

Categorical data is common in business analytics and often needs to be transformed into numerical formats for machine learning models. Here are three widely used techniques to handle categorical variables:

One-Hot Encoding

One-hot encoding converts each category value into a new binary column (feature) with 1s and 0s indicating the presence of that category. This method is simple and effective for nominal categories without intrinsic order.

Example in Python:

import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']}) one_hot = pd.get_dummies(data['Color'], prefix='Color') print(one_hot)

Output:

Blue	Green	Red
0	0	1
1	0	0
0	1	0
1	0	0

Use case: Best for categorical variables with a small number of unique values. Beware of high dimensionality if categories are many.

Advantages:

Simple and intuitive representation.
Does not impose any ordinal relationship between categories, preserving nominal data integrity.
Compatible with most machine learning algorithms.

Disadvantages:

Can lead to high dimensionality if the categorical variable has many unique values (high cardinality), causing the "curse of dimensionality."
Sparse representation can increase memory usage and slow down training.
May cause overfitting if many rare categories exist.

Handling Large Number of Categories:

Limit one-hot encoding to categorical variables with relatively few unique values (e.g., less than 20-30).
For high-cardinality variables, consider grouping rare categories into an "Other" category before encoding.
Use dimensionality reduction techniques (e.g., PCA) on one-hot encoded features, though this may reduce interpretability.

Label Encoding

Label encoding assigns each unique category an integer value. This is useful for ordinal categories where the order matters but can mislead models if used on nominal data.

Example in Python:

from sklearn.preprocessing import LabelEncoder
data = ['Low', 'Medium', 'High', 'Medium']
le = LabelEncoder()
encoded = le.fit_transform(data)
print(encoded)

Output:

[1 2 0 2]

Advantages:

Very memory efficient since it uses a single integer per category.
Simple and fast to compute.
Useful for ordinal data where the order matters.

Disadvantages:

Imposes an arbitrary ordinal relationship on nominal categories, which can mislead many algorithms (e.g., linear models may interpret higher integers as "larger" or "better").
Not suitable for nominal variables unless the model can handle categorical integers properly (e.g., tree-based models).

Handling Large Number of Categories:

Label encoding itself scales well with many categories since it only assigns integers.
However, if the model is sensitive to ordinal relationships, label encoding may cause bias.
Consider combining with target encoding or embeddings for high-cardinality nominal variables.

Embeddings

Embeddings map categories to dense, low-dimensional vectors learned during model training, capturing semantic relationships between categories. Commonly used in deep learning models, embeddings can represent high-cardinality categorical variables efficiently.

Example concept (using TensorFlow/Keras):

from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.models import Model

input_cat = Input(shape=(1,))
embedding = Embedding(input_dim=100, output_dim=8)(input_cat)
flat = Flatten()(embedding)

model = Model(inputs=input_cat, outputs=flat)

Use case: Ideal for large cardinality categorical variables and when relationships between categories matter. Requires more complex models and training.

Advantages:

Efficiently represent high-cardinality categorical variables in a low-dimensional continuous space.
Capture semantic relationships and similarities between categories learned from data.
Reduce dimensionality compared to one-hot encoding.
Improve model performance, especially in deep learning and complex models.

Disadvantages:

Require more complex models and training to learn meaningful embeddings.
Less interpretable than one-hot or label encoding.
Need sufficient data to learn good embeddings; sparse categories may have poor representations.
Implementation complexity is higher.

Handling Large Number of Categories:

Embeddings naturally handle large cardinality by mapping categories to dense vectors.
Can be combined with hashing tricks to reduce embedding table size.
Use regularization and embedding dropout to prevent overfitting on rare categories.
For extremely large cardinality, consider hashing-based embeddings or dimensionality reduction before embedding.

Additional Techniques for Very Large Cardinality Categorical Variables

Frequency or Count Encoding: Replace categories with their frequency or count in the dataset. Simple and scalable but loses category identity.
Target Encoding: Replace categories with the mean of the target variable for that category. Powerful but prone to leakage; requires careful cross-validation.
Hashing Trick: Map categories to a fixed number of buckets using a hash function, reducing dimensionality but introducing collisions. Useful for streaming or very large datasets.
Clustering Categories: Group similar categories based on domain knowledge or data-driven similarity before encoding.

Summary Table

Encoding Method	Advantages	Disadvantages	Handling Large Cardinality
One-Hot Encoding	Simple, no ordinal assumptions	High dimensionality, sparse	Group rare categories, dimensionality reduction
Label Encoding	Memory efficient, fast	Imposes ordinal relation on nominal data	Use only for ordinal data, combine with other methods
Embeddings	Low-dimensional, captures relations	Complex, less interpretable	Natural for large cardinality, use hashing or regularization
Frequency Encoding	Simple, scalable	Loses category identity	Good for very large cardinality
Target Encoding	Powerful, uses target info	Risk of leakage	Use with cross-validation
Hashing Trick	Fixed size, scalable	Collisions possible	Useful for streaming or huge datasets

Choosing the right approach depends on the data, model, and business context. For very large cardinality categorical variables, embeddings or hashing-based methods are often preferred in modern analytics pipelines.

8.4.4 Text Features (Basic NLP Features)

Text data is everywhere: customer reviews, support tickets, product descriptions, emails.

Basic Text Features

reviews = pd.DataFrame({

'review_id': [1, 2, 3, 4],

'text': [

'Great product! Love it.',

'Terrible quality. Very disappointed.',

'Good value for money. Recommended.',

'Amazing! Best purchase ever!!!'

]

})

# Length features

reviews['char_count'] = reviews['text'].str.len()

reviews['word_count'] = reviews['text'].str.split().str.len()

reviews['avg_word_length'] = reviews['char_count'] / reviews['word_count']

# Punctuation (excitement, emphasis)

reviews['exclamation_count'] = reviews['text'].str.count('!')

reviews['question_count'] = reviews['text'].str.count('\?')

# Uppercase (shouting, emphasis)

reviews['uppercase_ratio'] = (

reviews['text'].str.count(r'[A-Z]') / reviews['char_count']

)

print(reviews)

Sentiment and Keywords

# Simple sentiment (count positive/negative words)

positive_words = ['great', 'love', 'good', 'amazing', 'best', 'excellent', 'recommended']

negative_words = ['terrible', 'bad', 'disappointed', 'worst', 'poor', 'awful']

reviews['text_lower'] = reviews['text'].str.lower()

reviews['positive_word_count'] = reviews['text_lower'].apply(

lambda x: sum(word in x for word in positive_words)

)

reviews['negative_word_count'] = reviews['text_lower'].apply(

lambda x: sum(word in x for word in negative_words)

)

reviews['sentiment_score'] = (

reviews['positive_word_count'] - reviews['negative_word_count']

)

print(reviews[['text', 'positive_word_count', 'negative_word_count', 'sentiment_score']])

For more sophisticated NLP:

# Using TextBlob for sentiment analysis

from textblob import TextBlob

reviews['polarity'] = reviews['text'].apply(

lambda x: TextBlob(x).sentiment.polarity

)

reviews['subjectivity'] = reviews['text'].apply(

lambda x: TextBlob(x).sentiment.subjectivity

)

print(reviews[['text', 'polarity', 'subjectivity']])

Polarity : -1 (negative) to +1 (positive)
Subjectivity : 0 (objective) to 1 (subjective)

TF-IDF for Document Similarity

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF features

vectorizer = TfidfVectorizer(max_features=10, stop_words='english')

tfidf_matrix = vectorizer.fit_transform(reviews['text'])

# Convert to dataframe

tfidf_df = pd.DataFrame(

tfidf_matrix.toarray(),

columns=vectorizer.get_feature_names_out()

)

print("TF-IDF features:")

print(tfidf_df)

Use cases:

Document classification (categorize support tickets)
Similarity search (find similar products/reviews)
Feature extraction for ML models

8.5 Train–Test Splits and Data Leakage Prevention

Data leakage is when information from the future "leaks" into your training data, making your model look better than it actually is.

The result : Great performance in development, terrible performance in production.

Types of Data Leakage

1. Target leakage

Using information that wouldn't be available at prediction time.

Example: Predicting loan default

# WRONG: Using payment_status to predict default

# payment_status is only known AFTER you know if they defaulted

df = pd.DataFrame({

'customer_id': [1, 2, 3, 4],

'income': [50000, 60000, 45000, 70000],

'payment_status': ['late', 'on_time', 'late', 'on_time'], # ← This is leakage!

'defaulted': [1, 0, 1, 0]

})

# payment_status is determined by whether they default

# You can't use it to predict default

How to avoid : Ask "Would I have this information at the time I need to make the prediction?"

2. Train-test contamination

Information from test set leaking into training.

Example: Scaling before splitting

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

df = pd.DataFrame({

'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

'target': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

})

# WRONG: Scale before split

scaler = StandardScaler()

df_scaled = scaler.fit_transform(df[['feature1']]) # Uses ALL data

X_train, X_test = train_test_split(df_scaled, test_size=0.2)

# RIGHT: Split first, then scale

X_train, X_test, y_train, y_test = train_test_split(

df[['feature1']], df['target'], test_size=0.2, random_state=42

)

scaler = StandardScaler()

scaler.fit(X_train) # Fit only on training data

X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test) # Transform test using training parameters

Why it matters : If you scale using the full dataset, the test set's statistics influence the training data.

3. Temporal leakage

Using future information to predict the past.

Example: Time series forecasting

# Sales data

df = pd.DataFrame({

'date': pd.date_range('2024-01-01', periods=10),

'sales': [100, 110, 105, 115, 120, 125, 130, 135, 140, 145]

})

# WRONG: Random split for time series

X_train, X_test = train_test_split(df, test_size=0.2) # Mixes past and future

# RIGHT: Time-based split

train_size = int(0.8 * len(df))

train = df[:train_size] # First 80%

test = df[train_size:] # Last 20%

print("Training period:", train['date'].min(), "to", train['date'].max())

print("Test period:", test['date'].min(), "to", test['date'].max())

Rule : Always predict future from past, never past from future.

4. Feature leakage from aggregations

Example: Customer churn

# WRONG: Including future behavior in features

df = pd.DataFrame({

'customer_id': [1, 1, 1, 2, 2],

'month': [1, 2, 3, 1, 2],

'purchases': [2, 1, 0, 3, 2],

'churned_month_3': [1, 1, 1, 0, 0] # Churn status at month 3

})

# Calculate total purchases (includes future purchases!)

df['total_purchases'] = df.groupby('customer_id')['purchases'].transform('sum')

# This is leakage: total_purchases includes purchases AFTER prediction time

RIGHT: Use only past information

# Calculate cumulative purchases up to current month

df = df.sort_values(['customer_id', 'month'])

df['cumulative_purchases'] = df.groupby('customer_id')['purchases'].cumsum()

# For month 3 prediction, use data from months 1-2 only

prediction_month = 3

train_data = df[df['month'] < prediction_month]

Proper Train-Test Split

from sklearn.model_selection import train_test_split

# Basic split

X_train, X_test, y_train, y_test = train_test_split(

X, y,

test_size=0.2, # 20% for testing

random_state=42, # Reproducibility

stratify=y # Maintain class proportions (for classification)

)

print(f"Training set: {len(X_train)} samples")

print(f"Test set: {len(X_test)} samples")

print(f"Class distribution in train: {y_train.value_counts(normalize=True)}")

print(f"Class distribution in test: {y_test.value_counts(normalize=True)}")

Cross-Validation (Better than single split)

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"Cross-validation scores: {scores}")

print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Why cross-validation?

Uses all data for both training and testing
Gives more reliable estimate of performance
Reduces variance from single train-test split

Time Series Cross-Validation

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):

X_train, X_test = X[train_index], X[test_index]

y_train, y_test = y[train_index], y[test_index]

# Train and evaluate model

# ...

Ensures : Each test set is always in the future relative to its training set.

8.6 Documenting and Automating Data Preparation Pipelines

Why pipelines matter:

Reproducibility : Same transformations every time
Efficiency : Automate repetitive tasks
Deployment : Easy to apply to new data
Debugging : Clear what transformations were applied

Building a Pipeline with sklearn

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression())
])

# Fit pipeline (applies all steps)
pipeline.fit(X_train, y_train)

# Predict (applies same transformations)
y_pred = pipeline.predict(X_test)

# Score
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")

Benefits:

All transformations in one object
No risk of forgetting a step
Easy to save and load

Handling Different Column Types

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define which columns get which transformations
numeric_features = ['age', 'income', 'purchases']
categorical_features = ['city', 'product_category']

# Numeric pipeline
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])

# Full pipeline
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', LogisticRegression())
])

# Fit and predict
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

Custom Transformers

from sklearn.base import BaseEstimator, TransformerMixin
class FeatureEngineer(BaseEstimator, TransformerMixin):
"""Custom transformer for domain-specific features"""

def fit(self, X, y=None):
return self

def transform(self, X):
X = X.copy()
# Create new features
X['purchases_per_month'] = X['total_purchases'] / X['tenure_months']
X['avg_order_value'] = X['total_spent'] / X['total_purchases']
X['support_intensity'] = X['support_tickets'] / X['total_purchases']
return X

# Use in pipeline

pipeline = Pipeline([
('feature_engineer', FeatureEngineer()),
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', LogisticRegression())
])

Saving and Loading Pipelines

import joblib
# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load pipeline
loaded_pipeline = joblib.load('model_pipeline.pkl')

# Use on new data
new_predictions = loaded_pipeline.predict(new_data)

Documentation Best Practices

Data Preparation Pipeline for Customer Churn Prediction

Author: [Your Name]

Date: 2024-01-15

Last Updated: 2024-01-15

DATA SOURCE:

- customers.csv: Customer demographic and behavioral data

- transactions.csv: Transaction history

PREPROCESSING STEPS:

1. Missing Value Handling:

- age: Impute with median

- income: Impute with median, create missing indicator

- purchases: Impute with mean

2. Outlier Treatment:

- income: Cap at 5th and 95th percentiles

- purchases: Remove values > 3 standard deviations

3. Feature Engineering:

- purchases_per_month: total_purchases / tenure_months

- avg_order_value: total_spent / total_purchases

- days_since_last_purchase: today - last_purchase_date

4. Encoding:

- city: One-hot encoding

- education: Ordinal encoding (HS=1, Bachelor=2, Master=3, PhD=4)

5. Scaling:

- All numeric features: StandardScaler

VALIDATION:

- Train-test split: 80/20

- Stratified by churn status

- Random state: 42

KNOWN ISSUES:

- Small sample size for some cities (< 10 customers)

- Missing data for income is not random (higher for high earners)

NEXT STEPS:

- Consider target encoding for high-cardinality categorical variables

- Experiment with polynomial features for non-linear relationships

Key Takeaways

Data preparation is 80% of the work —and it determines the quality of your results.
Understand before you clean : Don't blindly apply transformations. Understand what the data represents and why it's messy.
Missing data strategy depends on why it's missing : MCAR, MAR, and MNAR require different approaches.
Outliers aren't always errors : Understand whether they're mistakes or legitimate extremes before removing them.
Scaling matters for some algorithms : Distance-based and gradient-based methods need scaled features.
Encoding depends on variable type : Nominal vs. ordinal, low vs. high cardinality.
Feature engineering is where domain knowledge shines : The best features come from understanding the business, not just manipulating data.
Data leakage will ruin your model : Always ask "Would I have this information at prediction time?"
Pipelines ensure reproducibility : Automate your preprocessing to avoid errors and enable deployment.
Document everything : Future you (and your colleagues) will thank you.

Exercises

Exercise 1: Clean a Raw Dataset

You're given a customer dataset with various data quality issues. Clean it using pandas.

Dataset: customers_raw.csv

i mport pandas as pd
import numpy as np

# Create sample messy data
np.random.seed(42)
df = pd.DataFrame({
'customer_id': [1, 2, 2, 3, 4, 5, 6, 7, 8, 9], # Duplicate
'name': ['Alice Smith', 'bob jones', 'Bob Jones', 'CHARLIE BROWN',
'Diana Prince', 'Eve Adams', None, 'Frank Miller', 'Grace Lee', 'Henry Ford'],
'age': [25, 30, 30, 250, 45, np.nan, 28, 35, 40, 50], # Outlier, missing
'city': ['New York', 'new york', 'NY', 'Los Angeles', 'Chicago',
'chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio'],
'income': [50000, 60000, 60000, 75000, np.nan, 55000, 62000, 58000, 70000, 80000], 'signup_date': ['2023-01-15', '2023-02-20', '2023-02-20', '01/03/2023',
'2023-04-10', '2023-05-15', '2023-06-20', '2023-07-25', '2023-08-30', '2023-09-15']
})

df.to_csv('customers_raw.csv', index=False)

Your tasks:

a) Identify and remove duplicate records

b) Standardize text fields (name, city):

Remove extra spaces
Consistent capitalization
Map city variations to standard names

c) Handle missing values:

Decide on strategy for each column
Implement and justify your choice

d) Detect and treat outliers:

Identify outliers in age and income
Decide whether to remove, cap, or keep