🛠 Workshop — Build a Customer Churn Prediction Model
From a 10,000-customer dataset → EDA → feature selection → train Logistic Regression + Decision Tree → evaluate with confusion matrix, accuracy, precision, recall → insights: who churns, and what should the business do? Output: Jupyter Notebook + Model Comparison + Recommendation Slide!
🎯 Workshop objectives
After completing this workshop, you will be able to:
- EDA — explore the churn dataset and discover patterns
- Feature selection — pick features with predictive power
- Train 2 models — Logistic Regression & Decision Tree
- Evaluate — confusion matrix, accuracy, precision, recall, F1
- Insight — who churns, why, and what to recommend to the business
🧰 Prerequisites
| Requirement | Details |
|---|---|
| Knowledge | Completed the Session 17 theory lesson (ML for DA) |
| Python | pandas, numpy, scikit-learn, matplotlib, seaborn (covered in Sessions 7-10) |
| Time | 90–120 minutes |
| Output | Jupyter Notebook + Model Comparison + Recommendation Slide |
💡 Naming convention
Name your file: HoTen_Buoi17_ChurnPrediction.ipynb
📦 Scenario: Telecom Churn Prediction
You are a DA at TelcoVN — a telecom operator with 500,000 subscribers. The retention team needs to know: "Which customers are likely to churn in the next 30 days?" so they can launch a retention campaign (phone calls, vouchers, plan upgrades).
Business context:
- Current churn rate: ~5.5%/month
- Average LTV (Lifetime Value): 3.2 million VND/customer
- Retention campaign cost: 150K VND/customer
- Goal: identify the top 500 high-risk customers → retention campaign → cut churn by 20%
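These campaign economics can be sanity-checked with a quick back-of-envelope calculation. A minimal sketch — the 35% retention-success rate is a planning assumption, not a measured figure:

```python
# Back-of-envelope campaign economics for TelcoVN
# (retention_rate = 0.35 is an ASSUMPTION, not a given)
ltv = 3_200_000          # VND — average lifetime value per customer
campaign_cost = 150_000  # VND — cost per customer contacted
contacted = 500          # top high-risk customers targeted per month
retention_rate = 0.35    # assumed share of contacted customers retained

retained = int(contacted * retention_rate)
revenue_saved = retained * ltv
total_cost = contacted * campaign_cost
roi = (revenue_saved - total_cost) / total_cost

print(f"Retained: {retained} customers")
print(f"Net saving: {revenue_saved - total_cost:,} VND/month")
print(f"ROI: {roi:.1f}x")
```

Even with a modest success rate, the campaign cost (75M VND) is small next to the LTV it protects, which is why the workshop targets only the top 500 riskiest customers.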
Part 1: Setup & Generate Dataset (15 minutes)
Step 1.1: Import Libraries
python
# ============================================
# PART 1: SETUP & DATASET
# ============================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report,
ConfusionMatrixDisplay)
from sklearn.preprocessing import StandardScaler
# Settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)
print("✅ Libraries imported successfully!")
Step 1.2: Generate a Churn Dataset (10,000 rows)
python
# ============================================
# GENERATE TELECOM CHURN DATASET (10,000 customers)
# ============================================
n = 10000
np.random.seed(42)
# Generate features
data = pd.DataFrame({
'customer_id': range(1, n + 1),
'tenure': np.random.randint(1, 72, n), # months as customer
'monthly_charges': np.random.uniform(20, 120, n).round(2), # USD/month
'total_charges': np.zeros(n), # will calculate
'contract_type': np.random.choice(
['Month-to-month', 'One year', 'Two year'],
n, p=[0.50, 0.30, 0.20]),
'payment_method': np.random.choice(
['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'],
n, p=[0.35, 0.20, 0.25, 0.20]),
'internet_service': np.random.choice(
['DSL', 'Fiber optic', 'No'], n, p=[0.35, 0.45, 0.20]),
'online_security': np.random.choice([0, 1], n, p=[0.55, 0.45]),
'tech_support': np.random.choice([0, 1], n, p=[0.60, 0.40]),
'streaming_tv': np.random.choice([0, 1], n, p=[0.50, 0.50]),
'num_support_tickets': np.random.randint(0, 12, n),
'senior_citizen': np.random.choice([0, 1], n, p=[0.85, 0.15]),
'partner': np.random.choice([0, 1], n, p=[0.48, 0.52]),
'dependents': np.random.choice([0, 1], n, p=[0.70, 0.30]),
})
# Calculate total_charges based on tenure × monthly_charges + noise
data['total_charges'] = (
data['tenure'] * data['monthly_charges'] * np.random.uniform(0.85, 1.15, n)
).round(2)
# Generate churn target (realistic logic)
churn_score = (
-0.04 * data['tenure'] + # longer tenure → less churn
0.015 * data['monthly_charges'] + # higher charges → more churn
1.2 * (data['contract_type'] == 'Month-to-month') + # monthly → much more churn
-0.6 * data['online_security'] + # security → less churn
-0.5 * data['tech_support'] + # support → less churn
0.08 * data['num_support_tickets'] + # more tickets → more churn
0.5 * (data['internet_service'] == 'Fiber optic') + # fiber → more churn (price)
0.3 * (data['payment_method'] == 'Electronic check') + # echeck → more churn
-0.3 * data['partner'] + # partner → less churn
-0.4 * data['dependents'] + # dependents → less churn
np.random.normal(0, 0.8, n) # noise
)
# Convert to binary: ~27% churn rate
threshold = np.percentile(churn_score, 73)
data['churn'] = (churn_score > threshold).astype(int)
print(f"📊 Dataset shape: {data.shape}")
print(f"📊 Churn rate: {data['churn'].mean():.1%}")
print(f"\n{data.head()}")
Step 1.3: Quick Overview
python
# Dataset overview
print("=" * 50)
print("📋 DATASET OVERVIEW")
print("=" * 50)
print(f"\nShape: {data.shape[0]:,} rows × {data.shape[1]} columns")
print(f"\nColumn types:")
print(data.dtypes.value_counts())
print(f"\nMissing values:")
print(data.isnull().sum().sum(), "total missing")
print(f"\nTarget distribution:")
print(data['churn'].value_counts())
print(f"\nChurn rate: {data['churn'].mean():.2%}")
Part 2: EDA — Exploratory Data Analysis (25 minutes)
Step 2.1: Churn vs Non-Churn Comparison
python
# ============================================
# PART 2: EDA
# ============================================
# Churn vs Non-Churn — Numerical features
churn_comparison = data.groupby('churn')[
['tenure', 'monthly_charges', 'total_charges', 'num_support_tickets']
].mean().round(2)
churn_comparison.index = ['Stay', 'Churn']
print("📊 CHURN vs NON-CHURN — Averages:")
print(churn_comparison)
Step 2.2: Visualize Key Distributions
python
# Distributions: Tenure & Monthly Charges by Churn
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Tenure
for label, group in data.groupby('churn'):
    axes[0].hist(group['tenure'], bins=30, alpha=0.6,
                 label=f"{'Churn' if label else 'Stay'}")
axes[0].set_xlabel('Tenure (months)')
axes[0].set_ylabel('Count')
axes[0].set_title('Tenure Distribution by Churn')
axes[0].legend()
# Monthly Charges
for label, group in data.groupby('churn'):
    axes[1].hist(group['monthly_charges'], bins=30, alpha=0.6,
                 label=f"{'Churn' if label else 'Stay'}")
axes[1].set_xlabel('Monthly Charges ($)')
axes[1].set_ylabel('Count')
axes[1].set_title('Monthly Charges Distribution by Churn')
axes[1].legend()
plt.tight_layout()
plt.show()
Step 2.3: Churn Rate by Categorical Features
python
# Churn rate by categorical features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
categoricals = ['contract_type', 'internet_service', 'payment_method', 'online_security']
titles = ['Contract Type', 'Internet Service', 'Payment Method', 'Online Security']
for ax, col, title in zip(axes.flatten(), categoricals, titles):
    churn_rate = data.groupby(col)['churn'].mean().sort_values(ascending=False)
    churn_rate.plot(kind='bar', ax=ax, color=['#e74c3c' if v > 0.3 else '#3498db'
                                              for v in churn_rate.values])
    ax.set_title(f'Churn Rate by {title}')
    ax.set_ylabel('Churn Rate')
    ax.set_ylim(0, 0.6)
    ax.tick_params(axis='x', rotation=45)
    for i, v in enumerate(churn_rate.values):
        ax.text(i, v + 0.01, f'{v:.1%}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
Step 2.4: Correlation Heatmap
python
# Correlation heatmap (numerical features + target)
numerical_cols = ['tenure', 'monthly_charges', 'total_charges',
'online_security', 'tech_support', 'streaming_tv',
'num_support_tickets', 'senior_citizen', 'partner',
'dependents', 'churn']
corr_matrix = data[numerical_cols].corr()
plt.figure(figsize=(12, 9))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix — Numerical Features vs Churn')
plt.tight_layout()
plt.show()
# Top correlations with churn
print("\n📊 TOP CORRELATIONS WITH CHURN:")
churn_corr = corr_matrix['churn'].drop('churn').sort_values(key=abs, ascending=False)
for feat, corr in churn_corr.items():
    direction = "↑ churn" if corr > 0 else "↓ churn"
    print(f"  {feat:>25s}: {corr:+.3f} ({direction})")
Step 2.5: EDA Key Findings
python
# ============================================
# EDA KEY FINDINGS
# ============================================
findings = """
📊 EDA KEY FINDINGS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. TENURE: Churners have shorter tenure (avg ~18 months vs ~38 months)
→ New customers churn more easily than long-standing ones
2. CONTRACT: Month-to-month churn rate >> One year >> Two year
→ Contract flexibility = churn risk
3. MONTHLY CHARGES: Churners pay more (avg ~$75 vs ~$60)
→ High price + low commitment = churn
4. ONLINE SECURITY + TECH SUPPORT: Having them → less churn
→ Value-added services keep customers around
5. SUPPORT TICKETS: Churners file more tickets
→ Bad experience → frustration → churn
6. ELECTRONIC CHECK: Higher churn rate than other payment methods
→ May reflect an underlying customer segment
HYPOTHESIS: Churn driven by (1) short tenure, (2) monthly contract,
(3) high charges, (4) lack of value-added services.
"""
print(findings)
Part 3: Feature Engineering & Preparation (15 minutes)
Step 3.1: Feature Engineering
python
# ============================================
# PART 3: FEATURE ENGINEERING
# ============================================
# Create copy for modeling
df = data.copy()
# One-Hot Encode categorical features
df = pd.get_dummies(df, columns=['contract_type', 'internet_service',
'payment_method'], drop_first=True)
print(f"Shape after encoding: {df.shape}")
print(f"\nNew columns:\n{[c for c in df.columns if '_' in c and c not in data.columns]}")
Step 3.2: Select Features & Split
python
# Define features and target
feature_cols = [c for c in df.columns if c not in ['customer_id', 'churn']]
X = df[feature_cols]
y = df['churn']
print(f"Features: {len(feature_cols)}")
print(f"Feature names: {feature_cols}")
print(f"\nTarget distribution:")
print(y.value_counts())
# Train/Test Split (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\n✅ Train: {X_train.shape[0]:,} rows ({y_train.mean():.1%} churn)")
print(f"✅ Test: {X_test.shape[0]:,} rows ({y_test.mean():.1%} churn)")
Step 3.3: Scale Features (for Logistic Regression)
python
# Standard scaling — needed for Logistic Regression, not for Decision Tree
scaler = StandardScaler()
# fit on TRAIN only, transform both
X_train_scaled = pd.DataFrame(
scaler.fit_transform(X_train),
columns=feature_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
scaler.transform(X_test), # transform only — do NOT fit on the test set!
columns=feature_cols, index=X_test.index
)
print("✅ Features scaled (Standard Scaler)")
print(f"Train mean: {X_train_scaled.mean().mean():.4f} (should be ~0)")
print(f"Train std: {X_train_scaled.std().mean():.4f} (should be ~1)")
Part 4: Train Models (20 minutes)
Step 4.1: Logistic Regression
python
# ============================================
# PART 4: MODEL TRAINING
# ============================================
# ---- MODEL 1: LOGISTIC REGRESSION ----
lr_model = LogisticRegression(
max_iter=1000,
class_weight='balanced', # handle imbalanced data
random_state=42,
C=1.0 # regularization (default)
)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
lr_prob = lr_model.predict_proba(X_test_scaled)[:, 1]
print("✅ Logistic Regression trained!")
print(f" Train accuracy: {lr_model.score(X_train_scaled, y_train):.4f}")
print(f" Test accuracy: {lr_model.score(X_test_scaled, y_test):.4f}")
Step 4.2: Decision Tree
python
# ---- MODEL 2: DECISION TREE ----
dt_model = DecisionTreeClassifier(
max_depth=5, # limit depth to prevent overfitting
min_samples_leaf=50, # minimum samples per leaf
class_weight='balanced',
random_state=42
)
dt_model.fit(X_train, y_train) # Decision Tree does NOT need scaling
dt_pred = dt_model.predict(X_test)
dt_prob = dt_model.predict_proba(X_test)[:, 1]
print("✅ Decision Tree trained!")
print(f" Train accuracy: {dt_model.score(X_train, y_train):.4f}")
print(f" Test accuracy: {dt_model.score(X_test, y_test):.4f}")
Step 4.3: Cross-Validation
python
# Cross-Validation (5-fold) — check for overfitting
print("=" * 50)
print("📊 CROSS-VALIDATION (5-fold)")
print("=" * 50)
from sklearn.pipeline import make_pipeline
# Wrap LR in a pipeline so the scaler is re-fit inside each CV fold
# (calling scaler.fit_transform(X) here would leak test data into scaling)
lr_cv = make_pipeline(StandardScaler(), LogisticRegression(
    max_iter=1000, class_weight='balanced', random_state=42))
for name, model, X_cv in [
    ("Logistic Regression", lr_cv, X),
    ("Decision Tree", dt_model, X)
]:
    cv_acc = cross_val_score(model, X_cv, y, cv=5, scoring='accuracy')
    cv_f1 = cross_val_score(model, X_cv, y, cv=5, scoring='f1')
    cv_recall = cross_val_score(model, X_cv, y, cv=5, scoring='recall')
    print(f"\n🔹 {name}:")
    print(f" Accuracy: {cv_acc.mean():.4f} ± {cv_acc.std():.4f}")
    print(f" F1 Score: {cv_f1.mean():.4f} ± {cv_f1.std():.4f}")
    print(f" Recall: {cv_recall.mean():.4f} ± {cv_recall.std():.4f}")
Part 5: Evaluate Models (20 minutes)
Step 5.1: Classification Reports
python
# ============================================
# PART 5: MODEL EVALUATION
# ============================================
print("=" * 60)
print("📊 CLASSIFICATION REPORTS")
print("=" * 60)
for name, pred in [("LOGISTIC REGRESSION", lr_pred),
                   ("DECISION TREE", dt_pred)]:
    print(f"\n{'='*40}")
    print(f"🔹 {name}")
    print(f"{'='*40}")
    print(classification_report(y_test, pred,
                                target_names=['Stay (0)', 'Churn (1)']))
Step 5.2: Confusion Matrices
python
# Confusion Matrices — Side by Side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, name, pred in zip(axes,
                          ["Logistic Regression", "Decision Tree"],
                          [lr_pred, dt_pred]):
    cm = confusion_matrix(y_test, pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=["Stay", "Churn"])
    disp.plot(ax=ax, cmap='Blues', values_format='d')
    ax.set_title(f"{name}\nConfusion Matrix", fontsize=13)
plt.suptitle("Model Comparison — Confusion Matrices", fontsize=15, y=1.02)
plt.tight_layout()
plt.show()
Step 5.3: Model Comparison Table
python
# Model Comparison Table
print("=" * 70)
print("📊 MODEL COMPARISON")
print("=" * 70)
results = []
for name, pred, prob in [
    ("Logistic Regression", lr_pred, lr_prob),
    ("Decision Tree", dt_pred, dt_prob)
]:
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred),
        'F1': f1_score(y_test, pred),
    })
comparison = pd.DataFrame(results).set_index('Model')
print(comparison.round(4).to_string())
# Visual comparison
comparison.plot(kind='bar', figsize=(12, 6), rot=0)
plt.title('Model Comparison — Classification Metrics')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
Step 5.4: Feature Importance
python
# Feature Importance — Logistic Regression (Coefficients)
lr_importance = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': lr_model.coef_[0]
}).sort_values('Coefficient', ascending=False)
print("\n🔍 LOGISTIC REGRESSION — Top 10 Features:")
print(lr_importance.head(10).to_string(index=False))
# Feature Importance — Decision Tree
dt_importance = pd.DataFrame({
'Feature': feature_cols,
'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\n🌳 DECISION TREE — Top 10 Features:")
print(dt_importance.head(10).to_string(index=False))
# Visualize Decision Tree Feature Importance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# LR Coefficients (top 10)
top_lr = lr_importance.head(10)
colors_lr = ['#e74c3c' if v > 0 else '#2ecc71' for v in top_lr['Coefficient']]
axes[0].barh(top_lr['Feature'], top_lr['Coefficient'], color=colors_lr)
axes[0].set_title('Logistic Regression — Top 10 Coefficients')
axes[0].set_xlabel('Coefficient (+ = churn risk)')
# DT Importance (top 10)
top_dt = dt_importance.head(10)
axes[1].barh(top_dt['Feature'], top_dt['Importance'], color='#3498db')
axes[1].set_title('Decision Tree — Top 10 Feature Importance')
axes[1].set_xlabel('Importance')
plt.tight_layout()
plt.show()
Part 6: Business Insights & Recommendations (15 minutes)
Step 6.1: Identify High-Risk Customers
python
# ============================================
# PART 6: BUSINESS INSIGHTS & RECOMMENDATIONS
# ============================================
# Identify top 500 high-risk customers (using better model)
# Use the model with higher Recall (we want to CATCH churners)
best_model_name = comparison['Recall'].idxmax()
print(f"🏆 Best model for Recall: {best_model_name}")
# Use Decision Tree predictions (on full X_test, unscaled)
# Add churn probability to test data
test_results = X_test.copy()
test_results['churn_actual'] = y_test
test_results['churn_prob'] = dt_prob
test_results['churn_pred'] = dt_pred
test_results['customer_id'] = data.loc[X_test.index, 'customer_id']
# Top 500 high-risk
top_500 = test_results.nlargest(500, 'churn_prob')
print(f"\n📊 TOP 500 HIGH-RISK CUSTOMERS:")
print(f" Avg churn probability: {top_500['churn_prob'].mean():.1%}")
print(f" Actual churn rate: {top_500['churn_actual'].mean():.1%}")
print(f" Min churn probability: {top_500['churn_prob'].min():.1%}")
Step 6.2: Churn Profile — Who Churns?
python
# Churn Profile Analysis
print("=" * 50)
print("📊 CHURN PROFILE — Who Churns?")
print("=" * 50)
churners = data[data['churn'] == 1]
stayers = data[data['churn'] == 0]
profile = pd.DataFrame({
'Metric': [
'Avg Tenure (months)', 'Avg Monthly Charges ($)',
'Month-to-month %', 'Online Security %',
'Tech Support %', 'Avg Support Tickets',
'Electronic Check %', 'Has Partner %'
],
'Churners': [
churners['tenure'].mean().round(1),
churners['monthly_charges'].mean().round(1),
(churners['contract_type'] == 'Month-to-month').mean().round(3) * 100,
churners['online_security'].mean().round(3) * 100,
churners['tech_support'].mean().round(3) * 100,
churners['num_support_tickets'].mean().round(1),
(churners['payment_method'] == 'Electronic check').mean().round(3) * 100,
churners['partner'].mean().round(3) * 100,
],
'Stayers': [
stayers['tenure'].mean().round(1),
stayers['monthly_charges'].mean().round(1),
(stayers['contract_type'] == 'Month-to-month').mean().round(3) * 100,
stayers['online_security'].mean().round(3) * 100,
stayers['tech_support'].mean().round(3) * 100,
stayers['num_support_tickets'].mean().round(1),
(stayers['payment_method'] == 'Electronic check').mean().round(3) * 100,
stayers['partner'].mean().round(3) * 100,
]
})
print(profile.to_string(index=False))
Step 6.3: Business Impact Estimation
python
# Business Impact Estimation
print("=" * 50)
print("💰 BUSINESS IMPACT ESTIMATION")
print("=" * 50)
# Assumptions
ltv = 3_200_000 # VND per customer
campaign_cost = 150_000 # VND per customer contacted
retention_rate = 0.35 # 35% of contacted customers retained
# Using model on full 500,000 customers (scale up from test)
total_customers = 500_000
churn_rate = data['churn'].mean()
expected_churners = int(total_customers * churn_rate)
# Model catches (Recall)
recall = recall_score(y_test, dt_pred)
predicted_churners = int(expected_churners * recall)
# Retention campaign on top 500 per month
contacted = 500
retained = int(contacted * retention_rate)
monthly_saving = retained * ltv
monthly_cost = contacted * campaign_cost
monthly_roi = (monthly_saving - monthly_cost) / monthly_cost
print(f"""
📊 BUSINESS IMPACT (Monthly Estimate):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total customers: {total_customers:>10,}
Expected churners: {expected_churners:>10,}/month
Model catches (Recall): {recall:>10.1%}
Predicted churners: {predicted_churners:>10,}
RETENTION CAMPAIGN:
Contacted (top 500): {contacted:>10,}
Expected retained (35%): {retained:>10,}
Revenue saved: {monthly_saving:>10,} VND/month
Campaign cost: {monthly_cost:>10,} VND/month
Net saving: {monthly_saving - monthly_cost:>10,} VND/month
ROI: {monthly_roi:>10.1f}x
ANNUAL PROJECTION:
Annual revenue saved: {monthly_saving * 12:>10,} VND
Annual campaign cost: {monthly_cost * 12:>10,} VND
Annual net impact: {(monthly_saving - monthly_cost) * 12:>10,} VND
""")
Step 6.4: Recommendations
python
# ============================================
# FINAL RECOMMENDATIONS
# ============================================
recommendations = """
📋 RECOMMENDATIONS FOR BUSINESS STAKEHOLDERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. 🎯 TARGET: Month-to-month contract customers
- Churn rate 2-3x higher than yearly contracts
- Action: Offer 10% discount for annual upgrade
- Expected impact: 15-20% churn reduction in this segment
2. 🛡️ INVEST: Online Security & Tech Support bundles
- Customers WITHOUT these services churn significantly more
- Action: 30-day free trial → auto-enroll
- Expected impact: 10-15% churn reduction
3. ⏰ FOCUS: First 12 months customers
- Highest churn risk period → onboarding critical
- Action: Enhanced onboarding (welcome call, setup guide, check-in)
- Expected impact: 20% reduction in early churn
4. 💰 REVIEW: High charge customers
- Higher charges + monthly contract = highest churn risk
- Action: Loyalty rewards, volume discounts
- Expected impact: Retain high-value customers
5. 📞 RETENTION CAMPAIGN: Monthly top 500 high-risk
- Model identifies high-risk customers with Recall ~75%+
- Call/voucher campaign → expected 35% retention rate
- ROI: ~6-8x return on retention investment
MODEL RECOMMENDATION:
━━━━━━━━━━━━━━━━━━━━
- Deploy Decision Tree for production (higher Recall, interpretable)
- Monitor monthly: retrain if accuracy drops >5%
- Phase 2: Try Random Forest/XGBoost for higher accuracy
"""
print(recommendations)
📦 Deliverables Checklist
📋 WORKSHOP DELIVERABLES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Jupyter Notebook (HoTen_Buoi17_ChurnPrediction.ipynb):
□ Part 1: Dataset generation + overview
□ Part 2: EDA (distributions, churn rates, correlation)
□ Part 3: Feature engineering + train/test split
□ Part 4: Train 2 models (LR + DT) + cross-validation
□ Part 5: Evaluation (confusion matrix, metrics, feature importance)
□ Part 6: Business insights + recommendations
✅ Model Comparison Summary:
□ Table: Accuracy, Precision, Recall, F1 for both models
□ Winner recommendation (with reasoning)
□ Feature importance analysis
✅ Recommendation Slide (1 page):
□ "Who churns?" — churn profile summary
□ "What to do?" — top 3 actionable recommendations
□ "Business impact?" — ROI estimation
□ "Next steps?" — model improvements + deployment plan
💡 Extension Challenges (Bonus)
If you finish early, try:
| Challenge | Description | Difficulty |
|---|---|---|
| Tune threshold | Change the decision threshold (0.5 → 0.3) → see how Recall changes | ⭐⭐ |
| Try Random Forest | RandomForestClassifier(n_estimators=100) → compare with LR + DT | ⭐⭐ |
| ROC Curve | Plot ROC curves for both models, compute AUC | ⭐⭐⭐ |
| Segment analysis | Model performance by contract_type segment | ⭐⭐⭐ |
| SHAP values | pip install shap → explain individual predictions | ⭐⭐⭐⭐ |
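The first challenge — threshold tuning — can be sketched as below. This is a self-contained toy example on synthetic data (the dataset and variable names here are illustrative, not from the workshop); in your notebook you would run the same loop over lr_prob or dt_prob instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: ~27% positive class
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.73],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more customers as churn:
# Recall can only go up (precision typically goes down)
for threshold in [0.5, 0.4, 0.3]:
    pred = (prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.3f}, "
          f"recall={recall_score(y_te, pred):.3f}")
```

For a churn retention campaign, trading some precision for recall is usually worthwhile: a missed churner costs the full LTV, while a false positive only costs one campaign contact.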