Appearance
🧠 Case Study — Capstone Analysis Sprint: 3 Projects thực tế
Trong buổi học này, chúng ta đã nắm sprint methodology, dashboard development, và executive summary writing. Bây giờ hãy xem 3 capstone projects ở các mức độ khác nhau — từ một project "trước và sau refactor", đến sprint process hoàn chỉnh, và customer segmentation end-to-end. Mỗi case study cho thấy process quan trọng hơn output — cách bạn đi từ raw data đến insight quyết định chất lượng cuối cùng.
Case Study 1: E-commerce Revenue Analysis — Before & After Dashboard
Bối cảnh
Dataset: Online retail — 25,000 transactions, 18 columns, 12 tháng (2025). Nguồn: Kaggle dataset, cleaned & augmented.
Business Context: ShopLocal — sàn e-commerce Việt Nam, GMV 50 tỷ VND/năm, 8,000 active sellers, 120,000 monthly buyers. CEO muốn biết: "Revenue đang đi đâu? Nên focus vào segment nào?"
5 Business Questions:
| # | Question | Type |
|---|---|---|
| Q1 | Revenue trend 12 tháng — tháng nào peak, tháng nào dip? | Descriptive |
| Q2 | Top 10 product categories contribute bao nhiêu % revenue? | Diagnostic |
| Q3 | Khách hàng mới vs returning khác nhau thế nào về AOV? | Comparative |
| Q4 | Nên focus region nào để tăng revenue? | Prescriptive |
| Q5 | Revenue Q1 2026 forecast? | Predictive |
❌ Version 1: Dashboard "trước khi peer review"
Lỗi phổ biến mà capstone projects thường mắc:
┌──────────────────────────────────────────────────────────────┐
│ ❌ DASHBOARD v1 — "Chart Dump" │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Pie │ │Bar │ │Line │ │Scatter│ │Table │ │Area │ │
│ │Chart │ │Chart │ │Chart │ │Plot │ │ │ │Chart │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Heatmap│ │Donut │ │Treemap│ │Gauge │ │
│ │ │ │ │ │ │ │ │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ Problems: │
│ - 10 charts, 1 page — overwhelm │
│ - No visual hierarchy — equal size │
│ - No titling — "Chart 1", "Chart 2" │
│ - Colors: rainbow — mỗi chart 1 palette │
│ - No KPI cards — no context │
│ - No filters — static view only │
└──────────────────────────────────────────────────────────────┘Peer Review Feedback cho v1:
| Reviewer | Feedback | Priority |
|---|---|---|
| Reviewer A | "Nhìn vào đâu trước? Không có hierarchy" | 🔴 High |
| Reviewer B | "10 charts nhưng không chart nào answer question" | 🔴 High |
| Reviewer C | "Pie chart 12 slices — unreadable. Dùng bar chart" | 🟡 Medium |
| Reviewer D | "Color inconsistent — 'Electronics' xanh ở chart 1, đỏ ở chart 3" | 🟡 Medium |
| Reviewer E | "Không có title, không biết dashboard này cho ai" | 🟡 Medium |
✅ Version 2: Dashboard "sau khi refactor"
┌──────────────────────────────────────────────────────────────┐
│ ✅ DASHBOARD v2 — PAGE 1: EXECUTIVE OVERVIEW │
│ "ShopLocal Revenue Dashboard — FY2025" │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Revenue │ │ Orders │ │ AOV │ │ Active │ │
│ │ 52.3B │ │ 25,041 │ │ 2.09M │ │ Buyers │ │
│ │ ▲ +12% │ │ ▲ +8% │ │ ▲ +3.5% │ │ 118,420 │ │
│ │ vs LY │ │ vs LY │ │ vs LY │ │ ▲ +15% │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌────────────────────────────┐ ┌────────────────┐ │
│ │ Revenue Trend (Line) │ │ Top Categories │ │
│ │ ────────────── │ │ (Horizontal Bar)│ │
│ │ /\ /\ 📌+22% │ │ Electronics 32%│ │
│ │ / \ / \ Campaign X│ │ Fashion 24%│ │
│ │ / \/ \ │ │ Home 18%│ │
│ │ ──────────────────── │ │ Beauty 12%│ │
│ │ Annotated: Q3 dip, │ │ Others 14%│ │
│ │ campaign impact │ │ │ │
│ └────────────────────────────┘ └────────────────┘ │
│ │
│ Filter: [Date Range ▼] [Region ▼] [Category ▼] │
└──────────────────────────────────────────────────────────────┘Những gì thay đổi:
| Aspect | Before (v1) | After (v2) |
|---|---|---|
| Layout | 10 charts, 1 page | 3 pages, 3-4 charts each |
| Hierarchy | All same size | KPI cards top → trend middle → details bottom |
| Titles | "Chart 1", "Chart 2" | Descriptive: "Revenue Trend by Month (FY2025)" |
| Colors | Rainbow | 3-color palette: blue (primary), gray (secondary), red (alert) |
| KPI Cards | None | 4 cards with YoY comparison |
| Annotations | None | Campaign impact, Q3 dip explanation |
| Filters | None | Date, Region, Category slicer |
| Questions | Answered 0 | Each page answers 1-2 questions |
Key Finding
"Pie chart 12 slices → bar chart horizontal đã tăng readability 300%. Peer review là bước tốn ít thời gian nhất nhưng tạo nhiều giá trị nhất."
Case Study 2: Healthcare Churn — Sprint Process hoàn chỉnh
Bối cảnh
Dataset: Subscription data từ một health-tech app — 15,000 users, 22 columns, 24 tháng. Bao gồm: user demographics, subscription plan, app usage, health metrics tracking, support tickets.
Sprint Timeline:
mermaid
gantt
title Capstone Sprint — Healthcare Churn
dateFormat HH:mm
axisFormat %H:%M
section Data Prep
Data Audit Review :a1, 08:00, 30m
Data Cleaning :a2, after a1, 90m
section Analysis
EDA :b1, after a2, 60m
Q1-Q3 Analysis :b2, after b1, 90m
Q4-Q5 Analysis :b3, after b2, 60m
section Dashboard
Mockup Sketch :c1, after b3, 20m
Dashboard Build :c2, after c1, 120m
Peer Review + Fix :c3, after c2, 40m
section Summary
Executive Summary :d1, after c3, 40mSprint Log
| Time | Task | Output | Issue / Note |
|---|---|---|---|
| 08:00 | Review data audit | Confirmed: 4.2% missing in age, 8% in income | OK — impute median |
| 08:30 | Data cleaning | Cleaned dataset 14,370 rows (dropped 630 invalid) | signup_date format inconsistent → fixed 3 formats |
| 10:00 | EDA | 8 key distributions, correlation matrix | login_frequency strongly correlated with churn (r = -0.62) |
| 11:00 | Q1: Churn trend | Monthly churn trending up: 3.1% → 5.4% over 12 months | Spike in month 8 — coincides with pricing change |
| 11:30 | Q2: Segment analysis | Free-to-Premium < 3 months: churn 12.3%. Annual: 1.8% | Monthly plan is the problem — not the product |
| 12:00 | Lunch break | — | — |
| 13:00 | Q3: Churn predictors | Top 3: login_frequency, feature_adoption, tenure | Users who log in < 2x/week → 5x churn risk |
| 14:00 | Q4: Cohort analysis | Month-6 cohort: 40% retention. Month-12: 28% | Steep drop at month 3 — onboarding issue |
| 14:30 | Q5: Recommendations | 3 actionable recommendations with impact estimates | — |
| 15:00 | Dashboard mockup | Sketched 3 pages on paper | Showed to partner for quick feedback |
| 15:20 | Dashboard build | Page 1: Overview (KPIs + trend) | Used template layout from Buổi 11 |
| 16:00 | Dashboard build | Page 2: Segment deep dive | Added cross-filters between plan type and churn |
| 16:40 | Dashboard build | Page 3: Recommendations + annotations | Annotated the pricing change impact on trend chart |
| 17:00 | Peer review | 3 feedback items: axis labels, tooltip detail, color | Fixed all 3 in 15 minutes |
| 17:20 | Executive summary | 1-page draft with 5 findings | Followed What → So What → Now What template |
| 18:00 | Done | Notebook (38 cells) + Dashboard (3 pages) + Summary | Sprint complete ✅ |
Analysis Deep Dive: Cohort Retention
Một trong những analyses mạnh nhất trong project này là cohort retention analysis:
python
# ============================================
# COHORT RETENTION ANALYSIS
# ============================================
# Create cohort groups by signup month
df['signup_month'] = df['signup_date'].dt.to_period('M')
df['activity_month'] = df['last_active_date'].dt.to_period('M')
df['cohort_age'] = (df['activity_month'] - df['signup_month']).apply(lambda x: x.n)
# Build cohort table
cohort = df.groupby(['signup_month', 'cohort_age'])['user_id'].nunique().reset_index()
cohort_pivot = cohort.pivot(index='signup_month', columns='cohort_age', values='user_id')
# Calculate retention rates
cohort_size = cohort_pivot[0]
retention = cohort_pivot.divide(cohort_size, axis=0) * 100
print("📊 Cohort Retention Table (%):")
print(retention.round(1))Output:
| Signup Month | M0 | M1 | M2 | M3 | M6 | M12 |
|---|---|---|---|---|---|---|
| 2024-01 | 100% | 72% | 58% | 44% | 31% | 22% |
| 2024-04 | 100% | 68% | 51% | 38% | 25% | — |
| 2024-07 | 100% | 65% | 48% | 35% | — | — |
| 2024-10 | 100% | 61% | 42% | — | — | — |
Finding: Retention đang xấu đi theo thời gian — cohort mới có M1 retention thấp hơn cohort cũ (61% vs 72%). Root cause: pricing change ở tháng 7 làm tăng expectation nhưng không cải thiện onboarding. Users trả nhiều hơn nhưng vẫn không biết dùng product.
Dashboard Highlight: Annotated Trend
Churn Rate Trend — 24 Months
6% ─ ╱── 5.4%
│ 📌 Price ╱
5% ─ increase ╱
│ ↓ ╱
4% ─ ─────────── ────────╱─────╱
│ ╱ ╱
3% ─ ────────╱ ╱
│ 3.1% Stable ╱
2% ─ period Post-price-change
│ acceleration
└────┬────┬────┬────┬────┬────┬────┬────→
M1 M4 M7 M10 M13 M16 M19 M22Lesson Learned
| What worked | What didn't | Next time |
|---|---|---|
| Sprint log — track every hour | Spent too long on formatting EDA charts | Time-box chart formatting: 5 min max per chart |
| Paper mockup before dashboard | First attempt: too many filters | Start with 2 filters, add more if needed |
| Peer review before summary | Wrote summary before final dashboard → had to rewrite | Dashboard first, summary last |
Case Study 3: FinTech Customer Segmentation — Data to Action
Bối cảnh
Dataset: Transaction data từ fintech app (MoMo-like) — 40,000 users, 28 columns, data 2024-2025. Features: transaction frequency, average amount, categories (F&B, bills, transfer, savings), demographics.
Business Context: FinPay — ví điện tử 2 triệu users. Marketing team cần biết: "Có mấy nhóm users? Mỗi nhóm cần chiến lược gì?"
5 Business Questions:
| # | Question | Method |
|---|---|---|
| Q1 | Users phân bổ thế nào theo behavior? | RFM Analysis |
| Q2 | Có bao nhiêu segments tự nhiên? | K-Means Clustering |
| Q3 | Mỗi segment có đặc điểm gì? | Segment profiling |
| Q4 | Segment nào có highest LTV? | LTV calculation per segment |
| Q5 | Marketing strategy cho mỗi segment? | Prescriptive recommendations |
Analysis Process
Step 1: RFM Analysis
python
# ============================================
# RFM ANALYSIS
# ============================================
import pandas as pd
import numpy as np
from datetime import datetime
# Calculate RFM
snapshot_date = datetime(2025, 12, 31)
rfm = df.groupby('user_id').agg({
'transaction_date': lambda x: (snapshot_date - x.max()).days, # Recency
'transaction_id': 'count', # Frequency
'amount': 'sum' # Monetary
}).rename(columns={
'transaction_date': 'Recency',
'transaction_id': 'Frequency',
'amount': 'Monetary'
})
# RFM Scoring (quintiles)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
print(f"RFM Table:\n{rfm.head(10)}")Step 2: K-Means Clustering
python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Standardize
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])
# Elbow method
inertias = []
for k in range(2, 10):
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(rfm_scaled)
inertias.append(km.inertia_)
# Optimal: k=4 (elbow at 4)
km_final = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Segment'] = km_final.fit_predict(rfm_scaled)Step 3: Segment Profiling
| Segment | Name | Users | Avg Recency | Avg Frequency | Avg Monetary | LTV | Strategy |
|---|---|---|---|---|---|---|---|
| 0 | 💎 Champions | 6,200 (15.5%) | 5 days | 48/month | 8.2M VND | 42M | Loyalty rewards, early access |
| 1 | 🌟 Promising | 12,800 (32%) | 18 days | 22/month | 3.1M VND | 15M | Upsell, cross-sell campaigns |
| 2 | 😴 At Risk | 14,000 (35%) | 45 days | 8/month | 1.2M VND | 4M | Re-engagement, push notifications |
| 3 | 💤 Hibernating | 7,000 (17.5%) | 90 days | 2/month | 0.3M VND | 0.8M | Win-back email, discount |
Dashboard: Segment View
┌──────────────────────────────────────────────────────────────┐
│ PAGE 1: CUSTOMER SEGMENTS OVERVIEW │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ 💎 Champs │ │ 🌟 Promise│ │ 😴 At Risk│ │ 💤 Hibern │ │
│ │ 6,200 │ │ 12,800 │ │ 14,000 │ │ 7,000 │ │
│ │ LTV: 42M │ │ LTV: 15M │ │ LTV: 4M │ │ LTV: 0.8M│ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ ┌────────────────────────────────┐ ┌──────────────────┐ │
│ │ Scatter: Frequency vs │ │ Revenue Contrib │ │
│ │ Monetary (color = segment) │ │ by Segment (Bar) │ │
│ │ 💎💎 │ │ Champs: 52% │ │
│ │ 💎 🌟🌟 │ │ Promise: 28% │ │
│ │ 🌟🌟🌟 😴😴 │ │ At Risk: 15% │ │
│ │ 😴😴 💤💤💤 │ │ Hibern: 5% │ │
│ └────────────────────────────────┘ └──────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Segment Migration: How users move between segments │ │
│ │ (Sankey or alluvial diagram) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘Executive Summary Excerpt
📊 KEY FINDING:
15.5% users (Champions) contribute 52% revenue.
35% users (At Risk) have not transacted in 45+ days.
💡 RECOMMENDATION:
1. Champions: Loyalty program → retain 95%+ → protect 52% revenue
Cost: 200M VND/year. Expected: retain 5,900 Champions → save 248B revenue
2. At Risk: Re-engagement push campaign → convert 20% back to Promising
Cost: 50M VND. Expected: reactivate 2,800 users → +3.36B revenue
3. Promising → Champions: Cross-sell savings products → increase frequency
Cost: 100M VND. Expected: 15% conversion → 1,920 new Champions → +19.2B LTVWhat Made This Project Stand Out
mermaid
flowchart LR
A["❌ Typical Project<br/>Descriptive only<br/>Charts without insight<br/>No actionable recs"] --> B["✅ This Project<br/>Descriptive → Diagnostic → Prescriptive<br/>Each chart answers a question<br/>Each finding has cost/revenue impact"]| Element | Average Capstone | This Capstone |
|---|---|---|
| Questions | Vague: "analyze customers" | Specific: "How many segments? What LTV?" |
| Analysis | Descriptive only | RFM + Clustering + LTV |
| Dashboard | Static charts | Interactive: click segment → see details |
| Findings | "Segment A is bigger" | "Segment A = 52% revenue, losing 5%/quarter" |
| Recommendations | "Should do marketing" | "Loyalty program: 200M cost → 248B revenue protection" |
🔑 So sánh 3 Case Studies
| Tiêu chí | E-commerce (CS1) | Healthcare (CS2) | FinTech (CS3) |
|---|---|---|---|
| Dataset size | 25,000 rows | 15,000 rows | 40,000 rows |
| Key lesson | Before/After dashboard refactor | Sprint process discipline | Segmentation → Action |
| Analysis depth | Descriptive + Diagnostic | Cohort + Trend + Diagnostic | RFM + ML (K-Means) + LTV |
| Dashboard pages | 3 pages | 3 pages | 3 pages |
| Strongest element | Visual improvement | Sprint log & time management | Business impact quantification |
| Peer review impact | Redesigned entire layout | Fixed 3 formatting issues | Added revenue impact to recs |
💡 Lessons for Your Capstone
1. BEFORE/AFTER = powerful proof of learning
→ Show v1 dashboard AND v2 → demonstrates growth
2. SPRINT LOG = professional discipline
→ Track every hour → accountable → efficient
3. QUANTIFY EVERYTHING
→ Don't say "should improve retention"
→ Say "retain 2,800 users → +3.36B revenue, ROI = 67x"
4. PEER REVIEW = free consulting
→ 5 minutes feedback saves 2 hours rework
5. 3 PAGES IS ENOUGH
→ Overview + Deep Dive + Recommendations
→ Quality > Quantity, always