
🛠 Workshop — Build a Customer Churn Prediction Model

From a dataset of 10,000 customers → EDA → feature selection → train Logistic Regression + Decision Tree → evaluate with confusion matrix, accuracy, precision, recall → insights: who churns, and what should we recommend? Output: Jupyter Notebook + Model Comparison + Recommendation Slide!

🎯 Workshop Objectives

After completing this workshop, you will be able to:

  1. EDA — explore the churn dataset and uncover patterns
  2. Feature selection — choose features with predictive power
  3. Train 2 models — Logistic Regression & Decision Tree
  4. Evaluate — confusion matrix, accuracy, precision, recall, F1
  5. Insight — who churns? why? what should the business do about it?

🧰 Prerequisites

| Requirement | Details |
| --- | --- |
| Knowledge | Completed the Session 17 theory lesson (ML for DA) |
| Python | pandas, numpy, scikit-learn, matplotlib, seaborn (covered in Sessions 7–10) |
| Time | 90–120 minutes |
| Output | Jupyter Notebook + Model Comparison + Recommendation Slide |

💡 Naming convention

Name your file: HoTen_Buoi17_ChurnPrediction.ipynb


📦 Scenario: Telecom Churn Prediction

You are a DA at TelcoVN, a telecom carrier with 500,000 subscribers. The retention team needs to know: "Which customers are likely to churn in the next 30 days?" so they can launch a retention campaign (phone calls, vouchers, plan upgrades).

Business context:

  • Current churn rate: ~5.5% per month
  • Average LTV (Lifetime Value): 3.2 million VND per customer
  • Retention campaign cost: 150K VND per customer contacted
  • Goal: identify the top 500 high-risk customers → retention campaign → reduce churn by 20%
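
Before any modeling, it helps to sanity-check the campaign economics these numbers imply. A minimal back-of-envelope sketch, assuming a 35% campaign success rate (the hypothetical figure this workshop reuses in Part 6):

```python
# Back-of-envelope ROI check for the retention campaign.
# ASSUMPTION: 35% of contacted high-risk customers are actually retained —
# a hypothetical campaign success rate, not a measured one.
ltv = 3_200_000          # VND — average lifetime value per customer
campaign_cost = 150_000  # VND — cost per customer contacted
contacted = 500          # top high-risk customers targeted per month
retention_rate = 0.35

retained = int(contacted * retention_rate)      # customers saved
revenue_saved = retained * ltv                  # VND per month
total_cost = contacted * campaign_cost          # VND per month
roi = (revenue_saved - total_cost) / total_cost

print(f"Retained: {retained}, Net: {revenue_saved - total_cost:,} VND/month, ROI: {roi:.1f}x")
```

Even at modest success rates the campaign pays for itself many times over, which is why the goal targets only the top 500 highest-risk customers rather than the whole base.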

Part 1: Setup & Generate Dataset (15 minutes)

Step 1.1: Import Libraries

python
# ============================================
# PART 1: SETUP & DATASET
# ============================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             ConfusionMatrixDisplay)
from sklearn.preprocessing import StandardScaler

# Settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("✅ Libraries imported successfully!")

Step 1.2: Generate the Churn Dataset (10,000 rows)

python
# ============================================
# GENERATE TELECOM CHURN DATASET (10,000 customers)
# ============================================

n = 10000
np.random.seed(42)

# Generate features
data = pd.DataFrame({
    'customer_id': range(1, n + 1),
    'tenure': np.random.randint(1, 72, n),                    # months as customer
    'monthly_charges': np.random.uniform(20, 120, n).round(2), # USD/month
    'total_charges': np.zeros(n),                               # will calculate
    'contract_type': np.random.choice(
        ['Month-to-month', 'One year', 'Two year'],
        n, p=[0.50, 0.30, 0.20]),
    'payment_method': np.random.choice(
        ['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'],
        n, p=[0.35, 0.20, 0.25, 0.20]),
    'internet_service': np.random.choice(
        ['DSL', 'Fiber optic', 'No'], n, p=[0.35, 0.45, 0.20]),
    'online_security': np.random.choice([0, 1], n, p=[0.55, 0.45]),
    'tech_support': np.random.choice([0, 1], n, p=[0.60, 0.40]),
    'streaming_tv': np.random.choice([0, 1], n, p=[0.50, 0.50]),
    'num_support_tickets': np.random.randint(0, 12, n),
    'senior_citizen': np.random.choice([0, 1], n, p=[0.85, 0.15]),
    'partner': np.random.choice([0, 1], n, p=[0.48, 0.52]),
    'dependents': np.random.choice([0, 1], n, p=[0.70, 0.30]),
})

# Calculate total_charges based on tenure × monthly_charges + noise
data['total_charges'] = (
    data['tenure'] * data['monthly_charges'] * np.random.uniform(0.85, 1.15, n)
).round(2)

# Generate churn target (realistic logic)
churn_score = (
    -0.04 * data['tenure'] +                          # longer tenure → less churn
    0.015 * data['monthly_charges'] +                   # higher charges → more churn
    1.2 * (data['contract_type'] == 'Month-to-month') + # monthly → much more churn
    -0.6 * data['online_security'] +                    # security → less churn
    -0.5 * data['tech_support'] +                       # support → less churn
    0.08 * data['num_support_tickets'] +                # more tickets → more churn
    0.5 * (data['internet_service'] == 'Fiber optic') + # fiber → more churn (price)
    0.3 * (data['payment_method'] == 'Electronic check') + # echeck → more churn
    -0.3 * data['partner'] +                            # partner → less churn
    -0.4 * data['dependents'] +                         # dependents → less churn
    np.random.normal(0, 0.8, n)                         # noise
)

# Convert to binary: ~27% churn rate
threshold = np.percentile(churn_score, 73)
data['churn'] = (churn_score > threshold).astype(int)

print(f"📊 Dataset shape: {data.shape}")
print(f"📊 Churn rate: {data['churn'].mean():.1%}")
print(f"\n{data.head()}")

Step 1.3: Quick Overview

python
# Dataset overview
print("=" * 50)
print("📋 DATASET OVERVIEW")
print("=" * 50)
print(f"\nShape: {data.shape[0]:,} rows × {data.shape[1]} columns")
print(f"\nColumn types:")
print(data.dtypes.value_counts())
print(f"\nMissing values:")
print(data.isnull().sum().sum(), "total missing")
print(f"\nTarget distribution:")
print(data['churn'].value_counts())
print(f"\nChurn rate: {data['churn'].mean():.2%}")

Part 2: EDA — Exploratory Data Analysis (25 minutes)

Step 2.1: Churn vs Non-Churn Comparison

python
# ============================================
# PART 2: EDA
# ============================================

# Churn vs Non-Churn — Numerical features
churn_comparison = data.groupby('churn')[
    ['tenure', 'monthly_charges', 'total_charges', 'num_support_tickets']
].mean().round(2)
churn_comparison.index = ['Stay', 'Churn']
print("📊 CHURN vs NON-CHURN — Averages:")
print(churn_comparison)

Step 2.2: Visualize Key Distributions

python
# Distributions: Tenure & Monthly Charges by Churn
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Tenure
for label, group in data.groupby('churn'):
    axes[0].hist(group['tenure'], bins=30, alpha=0.6,
                 label=f"{'Churn' if label else 'Stay'}")
axes[0].set_xlabel('Tenure (months)')
axes[0].set_ylabel('Count')
axes[0].set_title('Tenure Distribution by Churn')
axes[0].legend()

# Monthly Charges
for label, group in data.groupby('churn'):
    axes[1].hist(group['monthly_charges'], bins=30, alpha=0.6,
                 label=f"{'Churn' if label else 'Stay'}")
axes[1].set_xlabel('Monthly Charges ($)')
axes[1].set_ylabel('Count')
axes[1].set_title('Monthly Charges Distribution by Churn')
axes[1].legend()

plt.tight_layout()
plt.show()

Step 2.3: Churn Rate by Categorical Features

python
# Churn rate by categorical features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

categoricals = ['contract_type', 'internet_service', 'payment_method', 'online_security']
titles = ['Contract Type', 'Internet Service', 'Payment Method', 'Online Security']

for ax, col, title in zip(axes.flatten(), categoricals, titles):
    churn_rate = data.groupby(col)['churn'].mean().sort_values(ascending=False)
    churn_rate.plot(kind='bar', ax=ax, color=['#e74c3c' if v > 0.3 else '#3498db'
                                               for v in churn_rate.values])
    ax.set_title(f'Churn Rate by {title}')
    ax.set_ylabel('Churn Rate')
    ax.set_ylim(0, 0.6)
    ax.tick_params(axis='x', rotation=45)
    for i, v in enumerate(churn_rate.values):
        ax.text(i, v + 0.01, f'{v:.1%}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

Step 2.4: Correlation Heatmap

python
# Correlation heatmap (numerical features + target)
numerical_cols = ['tenure', 'monthly_charges', 'total_charges',
                  'online_security', 'tech_support', 'streaming_tv',
                  'num_support_tickets', 'senior_citizen', 'partner',
                  'dependents', 'churn']

corr_matrix = data[numerical_cols].corr()

plt.figure(figsize=(12, 9))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix — Numerical Features vs Churn')
plt.tight_layout()
plt.show()

# Top correlations with churn
print("\n📊 TOP CORRELATIONS WITH CHURN:")
churn_corr = corr_matrix['churn'].drop('churn').sort_values(key=abs, ascending=False)
for feat, corr in churn_corr.items():
    direction = "↑ churn" if corr > 0 else "↓ churn"
    print(f"  {feat:>25s}: {corr:+.3f} ({direction})")

Step 2.5: EDA Key Findings

python
# ============================================
# EDA KEY FINDINGS
# ============================================

findings = """
📊 EDA KEY FINDINGS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. TENURE: Churners have shorter tenure (avg ~18 months vs ~38 months)
   → New customers churn more easily than long-tenured ones

2. CONTRACT: Month-to-month churn rate >> One year >> Two year
   → Contract flexibility = churn risk

3. MONTHLY CHARGES: Churners pay more (avg ~$75 vs ~$60)
   → High price + low commitment = churn

4. ONLINE SECURITY + TECH SUPPORT: Having them → less churn
   → Value-added services keep customers around

5. SUPPORT TICKETS: Churners file more tickets
   → Bad experience → frustration → churn

6. ELECTRONIC CHECK: Higher churn rate than other payment methods
   → Possibly related to customer segment

HYPOTHESIS: Churn driven by (1) short tenure, (2) monthly contract,
            (3) high charges, (4) lack of value-added services.
"""
print(findings)

Part 3: Feature Engineering & Preparation (15 minutes)

Step 3.1: Feature Engineering

python
# ============================================
# PART 3: FEATURE ENGINEERING
# ============================================

# Create copy for modeling
df = data.copy()

# One-Hot Encode categorical features
df = pd.get_dummies(df, columns=['contract_type', 'internet_service',
                                  'payment_method'], drop_first=True)

print(f"Shape after encoding: {df.shape}")
print(f"\nNew columns:\n{[c for c in df.columns if '_' in c and c not in data.columns]}")

Step 3.2: Select Features & Split

python
# Define features and target
feature_cols = [c for c in df.columns if c not in ['customer_id', 'churn']]
X = df[feature_cols]
y = df['churn']

print(f"Features: {len(feature_cols)}")
print(f"Feature names: {feature_cols}")
print(f"\nTarget distribution:")
print(y.value_counts())

# Train/Test Split (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n✅ Train: {X_train.shape[0]:,} rows ({y_train.mean():.1%} churn)")
print(f"✅ Test:  {X_test.shape[0]:,} rows ({y_test.mean():.1%} churn)")

Step 3.3: Scale Features (for Logistic Regression)

python
# Standard scaling — required for Logistic Regression, not for Decision Tree
scaler = StandardScaler()

# fit on TRAIN only, transform both
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=feature_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),   # NOT fit_transform — never refit on test data!
    columns=feature_cols, index=X_test.index
)

print("✅ Features scaled (Standard Scaler)")
print(f"Train mean: {X_train_scaled.mean().mean():.4f} (should be ~0)")
print(f"Train std:  {X_train_scaled.std().mean():.4f} (should be ~1)")

Part 4: Train Models (20 minutes)

Step 4.1: Logistic Regression

python
# ============================================
# PART 4: MODEL TRAINING
# ============================================

# ---- MODEL 1: LOGISTIC REGRESSION ----
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',   # handle imbalanced data
    random_state=42,
    C=1.0                      # regularization (default)
)

lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
lr_prob = lr_model.predict_proba(X_test_scaled)[:, 1]

print("✅ Logistic Regression trained!")
print(f"   Train accuracy: {lr_model.score(X_train_scaled, y_train):.4f}")
print(f"   Test accuracy:  {lr_model.score(X_test_scaled, y_test):.4f}")

Step 4.2: Decision Tree

python
# ---- MODEL 2: DECISION TREE ----
dt_model = DecisionTreeClassifier(
    max_depth=5,               # limit depth to prevent overfitting
    min_samples_leaf=50,       # minimum samples per leaf
    class_weight='balanced',
    random_state=42
)

dt_model.fit(X_train, y_train)  # Decision Tree does NOT need scaling
dt_pred = dt_model.predict(X_test)
dt_prob = dt_model.predict_proba(X_test)[:, 1]

print("✅ Decision Tree trained!")
print(f"   Train accuracy: {dt_model.score(X_train, y_train):.4f}")
print(f"   Test accuracy:  {dt_model.score(X_test, y_test):.4f}")

Step 4.3: Cross-Validation

python
# Cross-Validation (5-fold) — check for overfitting
# NOTE: don't pass scaler.fit_transform(X) here — that refits the scaler on the
# full dataset (leaking information across folds) and overwrites the train-only
# fit used on the test set. Wrap LR in a Pipeline so scaling is refit inside
# each fold instead.
from sklearn.pipeline import make_pipeline

lr_cv = make_pipeline(StandardScaler(), LogisticRegression(
    max_iter=1000, class_weight='balanced', random_state=42))

print("=" * 50)
print("📊 CROSS-VALIDATION (5-fold)")
print("=" * 50)

for name, model, X_cv in [
    ("Logistic Regression", lr_cv, X),
    ("Decision Tree", dt_model, X)
]:
    cv_acc = cross_val_score(model, X_cv, y, cv=5, scoring='accuracy')
    cv_f1 = cross_val_score(model, X_cv, y, cv=5, scoring='f1')
    cv_recall = cross_val_score(model, X_cv, y, cv=5, scoring='recall')

    print(f"\n🔹 {name}:")
    print(f"   Accuracy: {cv_acc.mean():.4f} ± {cv_acc.std():.4f}")
    print(f"   F1 Score: {cv_f1.mean():.4f} ± {cv_f1.std():.4f}")
    print(f"   Recall:   {cv_recall.mean():.4f} ± {cv_recall.std():.4f}")

Part 5: Evaluate Models (20 minutes)

Step 5.1: Classification Reports

python
# ============================================
# PART 5: MODEL EVALUATION
# ============================================

print("=" * 60)
print("📊 CLASSIFICATION REPORTS")
print("=" * 60)

for name, pred in [("LOGISTIC REGRESSION", lr_pred),
                    ("DECISION TREE", dt_pred)]:
    print(f"\n{'='*40}")
    print(f"🔹 {name}")
    print(f"{'='*40}")
    print(classification_report(y_test, pred,
                                target_names=['Stay (0)', 'Churn (1)']))

Step 5.2: Confusion Matrices

python
# Confusion Matrices — Side by Side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, name, pred in zip(axes,
                           ["Logistic Regression", "Decision Tree"],
                           [lr_pred, dt_pred]):
    cm = confusion_matrix(y_test, pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=["Stay", "Churn"])
    disp.plot(ax=ax, cmap='Blues', values_format='d')
    ax.set_title(f"{name}\nConfusion Matrix", fontsize=13)

plt.suptitle("Model Comparison — Confusion Matrices", fontsize=15, y=1.02)
plt.tight_layout()
plt.show()
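
To make the link between the matrix cells and the metrics explicit, here is a small self-contained sketch (toy labels, not the workshop data) that derives precision, recall, and accuracy by hand from the four confusion-matrix entries and checks them against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = churn, 0 = stay (hypothetical example data)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# ravel() flattens the 2x2 matrix in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of predicted churners, how many really churn?
recall = tp / (tp + fn)      # of actual churners, how many did we catch?
accuracy = (tp + tn) / (tp + tn + fp + fn)

assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
print(f"TP={tp} FP={fp} FN={fn} TN={tn} | P={precision:.2f} R={recall:.2f} A={accuracy:.2f}")
```

For churn prediction, the FN cell (churners we missed) is the expensive one, which is why the comparison in the next step pays special attention to Recall.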

Step 5.3: Model Comparison Table

python
# Model Comparison Table
print("=" * 70)
print("📊 MODEL COMPARISON")
print("=" * 70)

results = []
for name, pred, prob in [
    ("Logistic Regression", lr_pred, lr_prob),
    ("Decision Tree", dt_pred, dt_prob)
]:
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred),
        'F1': f1_score(y_test, pred),
    })

comparison = pd.DataFrame(results).set_index('Model')
print(comparison.round(4).to_string())

# Visual comparison
comparison.plot(kind='bar', figsize=(12, 6), rot=0)
plt.title('Model Comparison — Classification Metrics')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()

Step 5.4: Feature Importance

python
# Feature Importance — Logistic Regression (Coefficients)
lr_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': lr_model.coef_[0]
}).sort_values('Coefficient', ascending=False)

print("\n🔍 LOGISTIC REGRESSION — Top 10 Features:")
print(lr_importance.head(10).to_string(index=False))

# Feature Importance — Decision Tree
dt_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n🌳 DECISION TREE — Top 10 Features:")
print(dt_importance.head(10).to_string(index=False))

# Visualize feature importance for both models
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# LR Coefficients (top 10)
top_lr = lr_importance.head(10)
colors_lr = ['#e74c3c' if v > 0 else '#2ecc71' for v in top_lr['Coefficient']]
axes[0].barh(top_lr['Feature'], top_lr['Coefficient'], color=colors_lr)
axes[0].set_title('Logistic Regression — Top 10 Coefficients')
axes[0].set_xlabel('Coefficient (+ = churn risk)')

# DT Importance (top 10)
top_dt = dt_importance.head(10)
axes[1].barh(top_dt['Feature'], top_dt['Importance'], color='#3498db')
axes[1].set_title('Decision Tree — Top 10 Feature Importance')
axes[1].set_xlabel('Importance')

plt.tight_layout()
plt.show()

Part 6: Business Insights & Recommendations (15 minutes)

Step 6.1: Identify High-Risk Customers

python
# ============================================
# PART 6: BUSINESS INSIGHTS & RECOMMENDATIONS
# ============================================

# Identify the top 500 high-risk customers
# Use the model with the higher Recall (we want to CATCH churners)
best_model_name = comparison['Recall'].idxmax()
print(f"🏆 Best model for Recall: {best_model_name}")

# Proceed with the Decision Tree's probabilities (on unscaled X_test);
# swap in lr_prob/lr_pred below if Logistic Regression wins on Recall
test_results = X_test.copy()
test_results['churn_actual'] = y_test
test_results['churn_prob'] = dt_prob
test_results['churn_pred'] = dt_pred
test_results['customer_id'] = data.loc[X_test.index, 'customer_id']

# Top 500 high-risk
top_500 = test_results.nlargest(500, 'churn_prob')
print(f"\n📊 TOP 500 HIGH-RISK CUSTOMERS:")
print(f"   Avg churn probability: {top_500['churn_prob'].mean():.1%}")
print(f"   Actual churn rate:     {top_500['churn_actual'].mean():.1%}")
print(f"   Min churn probability: {top_500['churn_prob'].min():.1%}")

Step 6.2: Churn Profile — Who Churns?

python
# Churn Profile Analysis
print("=" * 50)
print("📊 CHURN PROFILE — Who Churns?")
print("=" * 50)

churners = data[data['churn'] == 1]
stayers = data[data['churn'] == 0]

profile = pd.DataFrame({
    'Metric': [
        'Avg Tenure (months)', 'Avg Monthly Charges ($)',
        'Month-to-month %', 'Online Security %',
        'Tech Support %', 'Avg Support Tickets',
        'Electronic Check %', 'Has Partner %'
    ],
    'Churners': [
        churners['tenure'].mean().round(1),
        churners['monthly_charges'].mean().round(1),
        (churners['contract_type'] == 'Month-to-month').mean().round(3) * 100,
        churners['online_security'].mean().round(3) * 100,
        churners['tech_support'].mean().round(3) * 100,
        churners['num_support_tickets'].mean().round(1),
        (churners['payment_method'] == 'Electronic check').mean().round(3) * 100,
        churners['partner'].mean().round(3) * 100,
    ],
    'Stayers': [
        stayers['tenure'].mean().round(1),
        stayers['monthly_charges'].mean().round(1),
        (stayers['contract_type'] == 'Month-to-month').mean().round(3) * 100,
        stayers['online_security'].mean().round(3) * 100,
        stayers['tech_support'].mean().round(3) * 100,
        stayers['num_support_tickets'].mean().round(1),
        (stayers['payment_method'] == 'Electronic check').mean().round(3) * 100,
        stayers['partner'].mean().round(3) * 100,
    ]
})
print(profile.to_string(index=False))

Step 6.3: Business Impact Estimation

python
# Business Impact Estimation
print("=" * 50)
print("💰 BUSINESS IMPACT ESTIMATION")
print("=" * 50)

# Assumptions
ltv = 3_200_000          # VND per customer
campaign_cost = 150_000  # VND per customer contacted
retention_rate = 0.35    # 35% of contacted customers retained

# Using model on full 500,000 customers (scale up from test)
total_customers = 500_000
churn_rate = data['churn'].mean()
expected_churners = int(total_customers * churn_rate)

# Model catches (Recall)
recall = recall_score(y_test, dt_pred)
predicted_churners = int(expected_churners * recall)

# Retention campaign on top 500 per month
contacted = 500
retained = int(contacted * retention_rate)
monthly_saving = retained * ltv
monthly_cost = contacted * campaign_cost
monthly_roi = (monthly_saving - monthly_cost) / monthly_cost

print(f"""
📊 BUSINESS IMPACT (Monthly Estimate):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total customers:          {total_customers:>10,}
Expected churners:        {expected_churners:>10,}/month
Model catches (Recall):   {recall:>10.1%}
Predicted churners:       {predicted_churners:>10,}

RETENTION CAMPAIGN:
Contacted (top 500):      {contacted:>10,}
Expected retained (35%):  {retained:>10,}
Revenue saved:            {monthly_saving:>10,} VND/month
Campaign cost:            {monthly_cost:>10,} VND/month
Net saving:               {monthly_saving - monthly_cost:>10,} VND/month
ROI:                      {monthly_roi:>10.1f}x

ANNUAL PROJECTION:
Annual revenue saved:     {monthly_saving * 12:>10,} VND
Annual campaign cost:     {monthly_cost * 12:>10,} VND
Annual net impact:        {(monthly_saving - monthly_cost) * 12:>10,} VND
""")

Step 6.4: Recommendations

python
# ============================================
# FINAL RECOMMENDATIONS
# ============================================

recommendations = """
📋 RECOMMENDATIONS FOR BUSINESS STAKEHOLDERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. 🎯 TARGET: Month-to-month contract customers
   - Churn rate 2-3x higher than yearly contracts
   - Action: Offer 10% discount for annual upgrade
   - Expected impact: 15-20% churn reduction in this segment

2. 🛡️ INVEST: Online Security & Tech Support bundles
   - Customers WITHOUT these services churn significantly more
   - Action: 30-day free trial → auto-enroll
   - Expected impact: 10-15% churn reduction

3. ⏰ FOCUS: First 12 months customers
   - Highest churn risk period → onboarding critical
   - Action: Enhanced onboarding (welcome call, setup guide, check-in)
   - Expected impact: 20% reduction in early churn

4. 💰 REVIEW: High charge customers
   - Higher charges + monthly contract = highest churn risk
   - Action: Loyalty rewards, volume discounts
   - Expected impact: Retain high-value customers

5. 📞 RETENTION CAMPAIGN: Monthly top 500 high-risk
   - Model identifies high-risk customers with Recall ~75%+
   - Call/voucher campaign → expected 35% retention rate
   - ROI: ~6-8x return on retention investment

MODEL RECOMMENDATION:
━━━━━━━━━━━━━━━━━━━━
- Deploy Decision Tree for production (higher Recall, interpretable)
- Monitor monthly: retrain if accuracy drops >5%
- Phase 2: Try Random Forest/XGBoost for higher accuracy
"""
print(recommendations)

📦 Deliverables Checklist

📋 WORKSHOP DELIVERABLES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Jupyter Notebook (HoTen_Buoi17_ChurnPrediction.ipynb):
   □ Part 1: Dataset generation + overview
   □ Part 2: EDA (distributions, churn rates, correlation)
   □ Part 3: Feature engineering + train/test split
   □ Part 4: Train 2 models (LR + DT) + cross-validation
   □ Part 5: Evaluation (confusion matrix, metrics, feature importance)
   □ Part 6: Business insights + recommendations

✅ Model Comparison Summary:
   □ Table: Accuracy, Precision, Recall, F1 for both models
   □ Winner recommendation (with reasoning)
   □ Feature importance analysis

✅ Recommendation Slide (1 page):
   □ "Who churns?" — churn profile summary
   □ "What to do?" — top 3 actionable recommendations
   □ "Business impact?" — ROI estimation
   □ "Next steps?" — model improvements + deployment plan

💡 Extension Challenges (Bonus)

If you finish early, try:

| Challenge | Description | Difficulty |
| --- | --- | --- |
| Tune threshold | Change the decision threshold (0.5 → 0.3) and watch how Recall changes | ⭐⭐ |
| Try Random Forest | `RandomForestClassifier(n_estimators=100)` — compare with LR + DT | ⭐⭐ |
| ROC Curve | Plot ROC curves for both models and compute AUC | ⭐⭐⭐ |
| Segment analysis | Model performance by `contract_type` segment | ⭐⭐⭐ |
| SHAP values | `pip install shap` — explain individual predictions | ⭐⭐⭐⭐ |
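
For the first challenge, the key idea is that `predict()` just applies a 0.5 cutoff to `predict_proba()`; applying your own cutoff trades precision for recall. A minimal sketch on a synthetic stand-in dataset (assumption: `make_classification` with a ~27% positive class mimics the churn setup; substitute your own `X_test`/probabilities from the workshop):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# Synthetic stand-in for the churn data (~27% positive class)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]   # P(churn) per customer

# Default predict() uses threshold 0.5; lowering it flags more churners,
# raising Recall at the cost of Precision
for threshold in (0.5, 0.3):
    pred = (prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_te, pred):.3f}, "
          f"precision={precision_score(y_te, pred):.3f}")
```

Since every customer flagged at threshold 0.5 is also flagged at 0.3, Recall can only go up (or stay flat) as the threshold drops; the business question is how much Precision you are willing to give up per extra churner caught.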