🧠 Case Study — Capstone Analysis Sprint: 3 Projects thực tế

Trong buổi học này, chúng ta đã nắm sprint methodology, dashboard development, và executive summary writing. Bây giờ hãy xem 3 capstone projects ở các mức độ khác nhau — từ một project "trước và sau refactor", đến sprint process hoàn chỉnh, và customer segmentation end-to-end. Mỗi case study cho thấy process quan trọng hơn output — cách bạn đi từ raw data đến insight quyết định chất lượng cuối cùng.

Case Study 1: E-commerce Revenue Analysis — Before & After Dashboard

Bối cảnh

Dataset: Online retail — 25,000 transactions, 18 columns, 12 tháng (2025). Nguồn: Kaggle dataset, cleaned & augmented.

Business Context: ShopLocal — sàn e-commerce Việt Nam, GMV 50 tỷ VND/năm, 8,000 active sellers, 120,000 monthly buyers. CEO muốn biết: "Revenue đang đi đâu? Nên focus vào segment nào?"

5 Business Questions:

#	Question	Type
Q1	Revenue trend 12 tháng — tháng nào peak, tháng nào dip?	Descriptive
Q2	Top 10 product categories contribute bao nhiêu % revenue?	Diagnostic
Q3	Khách hàng mới vs returning khác nhau thế nào về AOV?	Comparative
Q4	Nên focus region nào để tăng revenue?	Prescriptive
Q5	Revenue Q1 2026 forecast?	Predictive

❌ Version 1: Dashboard "trước khi peer review"

Lỗi phổ biến mà capstone projects thường mắc:

┌──────────────────────────────────────────────────────────────┐
│  ❌ DASHBOARD v1 — "Chart Dump"                              │
│                                                               │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐     │
│  │Pie   │ │Bar   │ │Line  │ │Scatter│ │Table │ │Area  │     │
│  │Chart │ │Chart │ │Chart │ │Plot  │ │     │ │Chart │     │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘     │
│                                                               │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                        │
│  │Heatmap│ │Donut │ │Treemap│ │Gauge │                       │
│  │      │ │      │ │       │ │      │                        │
│  └──────┘ └──────┘ └──────┘ └──────┘                        │
│                                                               │
│  Problems:                                                    │
│  - 10 charts, 1 page — overwhelm                             │
│  - No visual hierarchy — equal size                           │
│  - No titling — "Chart 1", "Chart 2"                          │
│  - Colors: rainbow — mỗi chart 1 palette                     │
│  - No KPI cards — no context                                  │
│  - No filters — static view only                              │
└──────────────────────────────────────────────────────────────┘

Peer Review Feedback cho v1:

Reviewer	Feedback	Priority
Reviewer A	"Nhìn vào đâu trước? Không có hierarchy"	🔴 High
Reviewer B	"10 charts nhưng không chart nào answer question"	🔴 High
Reviewer C	"Pie chart 12 slices — unreadable. Dùng bar chart"	🟡 Medium
Reviewer D	"Color inconsistent — 'Electronics' xanh ở chart 1, đỏ ở chart 3"	🟡 Medium
Reviewer E	"Không có title, không biết dashboard này cho ai"	🟡 Medium

✅ Version 2: Dashboard "sau khi refactor"

┌──────────────────────────────────────────────────────────────┐
│  ✅ DASHBOARD v2 — PAGE 1: EXECUTIVE OVERVIEW                │
│  "ShopLocal Revenue Dashboard — FY2025"                      │
│                                                               │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│  │ Revenue  │ │ Orders   │ │ AOV      │ │ Active   │        │
│  │ 52.3B    │ │ 25,041   │ │ 2.09M   │ │ Buyers   │        │
│  │ ▲ +12%   │ │ ▲ +8%    │ │ ▲ +3.5%  │ │ 118,420  │        │
│  │ vs LY    │ │ vs LY    │ │ vs LY    │ │ ▲ +15%   │        │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘        │
│                                                               │
│  ┌────────────────────────────┐ ┌────────────────┐           │
│  │  Revenue Trend (Line)     │ │ Top Categories │           │
│  │  ──────────────           │ │ (Horizontal Bar)│           │
│  │     /\    /\   📌+22%    │ │ Electronics 32%│           │
│  │    /  \  /  \  Campaign X│ │ Fashion    24%│           │
│  │   /    \/    \           │ │ Home       18%│           │
│  │  ────────────────────    │ │ Beauty     12%│           │
│  │  Annotated: Q3 dip,     │ │ Others     14%│           │
│  │  campaign impact         │ │               │           │
│  └────────────────────────────┘ └────────────────┘           │
│                                                               │
│  Filter: [Date Range ▼] [Region ▼] [Category ▼]             │
└──────────────────────────────────────────────────────────────┘

Những gì thay đổi:

Aspect	Before (v1)	After (v2)
Layout	10 charts, 1 page	3 pages, 3-4 charts each
Hierarchy	All same size	KPI cards top → trend middle → details bottom
Titles	"Chart 1", "Chart 2"	Descriptive: "Revenue Trend by Month (FY2025)"
Colors	Rainbow	3-color palette: blue (primary), gray (secondary), red (alert)
KPI Cards	None	4 cards with YoY comparison
Annotations	None	Campaign impact, Q3 dip explanation
Filters	None	Date, Region, Category slicer
Questions	Answered 0	Each page answers 1-2 questions

Key Finding

"Pie chart 12 slices → bar chart horizontal đã tăng readability 300%. Peer review là bước tốn ít thời gian nhất nhưng tạo nhiều giá trị nhất."

Case Study 2: Healthcare Churn — Sprint Process hoàn chỉnh

Bối cảnh

Dataset: Subscription data từ một health-tech app — 15,000 users, 22 columns, 24 tháng. Bao gồm: user demographics, subscription plan, app usage, health metrics tracking, support tickets.

Sprint Timeline:

mermaid

gantt
    title Capstone Sprint — Healthcare Churn
    dateFormat HH:mm
    axisFormat %H:%M
    section Data Prep
        Data Audit Review      :a1, 08:00, 30m
        Data Cleaning          :a2, after a1, 90m
    section Analysis
        EDA                    :b1, after a2, 60m
        Q1-Q3 Analysis         :b2, after b1, 90m
        Q4-Q5 Analysis         :b3, after b2, 60m
    section Dashboard
        Mockup Sketch          :c1, after b3, 20m
        Dashboard Build        :c2, after c1, 120m
        Peer Review + Fix      :c3, after c2, 40m
    section Summary
        Executive Summary      :d1, after c3, 40m

Sprint Log

Time	Task	Output	Issue / Note
08:00	Review data audit	Confirmed: 4.2% missing in `age`, 8% in `income`	OK — impute median
08:30	Data cleaning	Cleaned dataset 14,370 rows (dropped 630 invalid)	`signup_date` format inconsistent → fixed 3 formats
10:00	EDA	8 key distributions, correlation matrix	`login_frequency` strongly correlated with churn (r = -0.62)
11:00	Q1: Churn trend	Monthly churn trending up: 3.1% → 5.4% over 12 months	Spike in month 8 — coincides with pricing change
11:30	Q2: Segment analysis	Free-to-Premium < 3 months: churn 12.3%. Annual: 1.8%	Monthly plan is the problem — not the product
12:00	Lunch break	—	—
13:00	Q3: Churn predictors	Top 3: login_frequency, feature_adoption, tenure	Users who log in < 2x/week → 5x churn risk
14:00	Q4: Cohort analysis	Month-6 cohort: 40% retention. Month-12: 28%	Steep drop at month 3 — onboarding issue
14:30	Q5: Recommendations	3 actionable recommendations with impact estimates	—
15:00	Dashboard mockup	Sketched 3 pages on paper	Showed to partner for quick feedback
15:20	Dashboard build	Page 1: Overview (KPIs + trend)	Used template layout from Buổi 11
16:00	Dashboard build	Page 2: Segment deep dive	Added cross-filters between plan type and churn
16:40	Dashboard build	Page 3: Recommendations + annotations	Annotated the pricing change impact on trend chart
17:00	Peer review	3 feedback items: axis labels, tooltip detail, color	Fixed all 3 in 15 minutes
17:20	Executive summary	1-page draft with 5 findings	Followed What → So What → Now What template
18:00	Done	Notebook (38 cells) + Dashboard (3 pages) + Summary	Sprint complete ✅

Analysis Deep Dive: Cohort Retention

Một trong những analyses mạnh nhất trong project này là cohort retention analysis:

python

# ============================================
# COHORT RETENTION ANALYSIS
# ============================================

# Create cohort groups by signup month
df['signup_month'] = df['signup_date'].dt.to_period('M')
df['activity_month'] = df['last_active_date'].dt.to_period('M')
df['cohort_age'] = (df['activity_month'] - df['signup_month']).apply(lambda x: x.n)

# Build cohort table
cohort = df.groupby(['signup_month', 'cohort_age'])['user_id'].nunique().reset_index()
cohort_pivot = cohort.pivot(index='signup_month', columns='cohort_age', values='user_id')

# Calculate retention rates
cohort_size = cohort_pivot[0]
retention = cohort_pivot.divide(cohort_size, axis=0) * 100

print("📊 Cohort Retention Table (%):")
print(retention.round(1))

Output:

Signup Month	M0	M1	M2	M3	M6	M12
2024-01	100%	72%	58%	44%	31%	22%
2024-04	100%	68%	51%	38%	25%	—
2024-07	100%	65%	48%	35%	—	—
2024-10	100%	61%	42%	—	—	—

Finding: Retention đang xấu đi theo thời gian — cohort mới có M1 retention thấp hơn cohort cũ (61% vs 72%). Root cause: pricing change ở tháng 7 làm tăng expectation nhưng không cải thiện onboarding. Users trả nhiều hơn nhưng vẫn không biết dùng product.

Dashboard Highlight: Annotated Trend

Churn Rate Trend — 24 Months
6% ─                                          ╱── 5.4%
    │                              📌 Price   ╱
5% ─                              increase  ╱
    │                                ↓     ╱
4% ─           ─────────── ────────╱─────╱
    │          ╱                  ╱
3% ─ ────────╱                  ╱
    │  3.1%       Stable         ╱
2% ─              period        Post-price-change
    │                            acceleration
    └────┬────┬────┬────┬────┬────┬────┬────→
        M1   M4   M7   M10  M13  M16  M19  M22

Lesson Learned

What worked	What didn't	Next time
Sprint log — track every hour	Spent too long on formatting EDA charts	Time-box chart formatting: 5 min max per chart
Paper mockup before dashboard	First attempt: too many filters	Start with 2 filters, add more if needed
Peer review before summary	Wrote summary before final dashboard → had to rewrite	Dashboard first, summary last

Case Study 3: FinTech Customer Segmentation — Data to Action

Bối cảnh

Dataset: Transaction data từ fintech app (MoMo-like) — 40,000 users, 28 columns, data 2024-2025. Features: transaction frequency, average amount, categories (F&B, bills, transfer, savings), demographics.

Business Context: FinPay — ví điện tử 2 triệu users. Marketing team cần biết: "Có mấy nhóm users? Mỗi nhóm cần chiến lược gì?"

5 Business Questions:

#	Question	Method
Q1	Users phân bổ thế nào theo behavior?	RFM Analysis
Q2	Có bao nhiêu segments tự nhiên?	K-Means Clustering
Q3	Mỗi segment có đặc điểm gì?	Segment profiling
Q4	Segment nào có highest LTV?	LTV calculation per segment
Q5	Marketing strategy cho mỗi segment?	Prescriptive recommendations

Analysis Process

Step 1: RFM Analysis

python

# ============================================
# RFM ANALYSIS
# ============================================
import pandas as pd
import numpy as np
from datetime import datetime

# Calculate RFM
snapshot_date = datetime(2025, 12, 31)
rfm = df.groupby('user_id').agg({
    'transaction_date': lambda x: (snapshot_date - x.max()).days,  # Recency
    'transaction_id': 'count',                                     # Frequency
    'amount': 'sum'                                                # Monetary
}).rename(columns={
    'transaction_date': 'Recency',
    'transaction_id': 'Frequency',
    'amount': 'Monetary'
})

# RFM Scoring (quintiles)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

print(f"RFM Table:\n{rfm.head(10)}")

Step 2: K-Means Clustering

python

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])

# Elbow method
inertias = []
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(rfm_scaled)
    inertias.append(km.inertia_)

# Optimal: k=4 (elbow at 4)
km_final = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Segment'] = km_final.fit_predict(rfm_scaled)

Step 3: Segment Profiling

Segment	Name	Users	Avg Recency	Avg Frequency	Avg Monetary	LTV	Strategy
0	💎 Champions	6,200 (15.5%)	5 days	48/month	8.2M VND	42M	Loyalty rewards, early access
1	🌟 Promising	12,800 (32%)	18 days	22/month	3.1M VND	15M	Upsell, cross-sell campaigns
2	😴 At Risk	14,000 (35%)	45 days	8/month	1.2M VND	4M	Re-engagement, push notifications
3	💤 Hibernating	7,000 (17.5%)	90 days	2/month	0.3M VND	0.8M	Win-back email, discount

Dashboard: Segment View

┌──────────────────────────────────────────────────────────────┐
│  PAGE 1: CUSTOMER SEGMENTS OVERVIEW                          │
│                                                               │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐    │
│  │ 💎 Champs │ │ 🌟 Promise│ │ 😴 At Risk│ │ 💤 Hibern │    │
│  │ 6,200     │ │ 12,800    │ │ 14,000    │ │ 7,000     │    │
│  │ LTV: 42M  │ │ LTV: 15M  │ │ LTV: 4M   │ │ LTV: 0.8M│    │
│  └───────────┘ └───────────┘ └───────────┘ └───────────┘    │
│                                                               │
│  ┌────────────────────────────────┐ ┌──────────────────┐     │
│  │  Scatter: Frequency vs         │ │ Revenue Contrib  │     │
│  │  Monetary (color = segment)    │ │ by Segment (Bar) │     │
│  │     💎💎                       │ │ Champs: 52%      │     │
│  │    💎   🌟🌟                   │ │ Promise: 28%     │     │
│  │  🌟🌟🌟 😴😴                  │ │ At Risk: 15%     │     │
│  │  😴😴  💤💤💤                  │ │ Hibern: 5%       │     │
│  └────────────────────────────────┘ └──────────────────┘     │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │  Segment Migration: How users move between segments    │   │
│  │  (Sankey or alluvial diagram)                          │   │
│  └────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘

Executive Summary Excerpt

📊 KEY FINDING:
15.5% users (Champions) contribute 52% revenue.
35% users (At Risk) have not transacted in 45+ days.

💡 RECOMMENDATION:
1. Champions: Loyalty program → retain 95%+ → protect 52% revenue
   Cost: 200M VND/year. Expected: retain 5,900 Champions → save 248B revenue

2. At Risk: Re-engagement push campaign → convert 20% back to Promising
   Cost: 50M VND. Expected: reactivate 2,800 users → +3.36B revenue

3. Promising → Champions: Cross-sell savings products → increase frequency
   Cost: 100M VND. Expected: 15% conversion → 1,920 new Champions → +19.2B LTV

What Made This Project Stand Out

mermaid

flowchart LR
    A["❌ Typical Project<br/>Descriptive only<br/>Charts without insight<br/>No actionable recs"] --> B["✅ This Project<br/>Descriptive → Diagnostic → Prescriptive<br/>Each chart answers a question<br/>Each finding has cost/revenue impact"]

Element	Average Capstone	This Capstone
Questions	Vague: "analyze customers"	Specific: "How many segments? What LTV?"
Analysis	Descriptive only	RFM + Clustering + LTV
Dashboard	Static charts	Interactive: click segment → see details
Findings	"Segment A is bigger"	"Segment A = 52% revenue, losing 5%/quarter"
Recommendations	"Should do marketing"	"Loyalty program: 200M cost → 248B revenue protection"

🔑 So sánh 3 Case Studies

Tiêu chí	E-commerce (CS1)	Healthcare (CS2)	FinTech (CS3)
Dataset size	25,000 rows	15,000 rows	40,000 rows
Key lesson	Before/After dashboard refactor	Sprint process discipline	Segmentation → Action
Analysis depth	Descriptive + Diagnostic	Cohort + Trend + Diagnostic	RFM + ML (K-Means) + LTV
Dashboard pages	3 pages	3 pages	3 pages
Strongest element	Visual improvement	Sprint log & time management	Business impact quantification
Peer review impact	Redesigned entire layout	Fixed 3 formatting issues	Added revenue impact to recs

💡 Lessons for Your Capstone

1. BEFORE/AFTER = powerful proof of learning
   → Show v1 dashboard AND v2 → demonstrates growth

2. SPRINT LOG = professional discipline
   → Track every hour → accountable → efficient

3. QUANTIFY EVERYTHING
   → Don't say "should improve retention"
   → Say "retain 2,800 users → +3.36B revenue, ROI = 67x"

4. PEER REVIEW = free consulting
   → 5 minutes feedback saves 2 hours rework

5. 3 PAGES IS ENOUGH
   → Overview + Deep Dive + Recommendations
   → Quality > Quantity, always

🧠 Case Study — Capstone Analysis Sprint: 3 Projects thực tế ​

Case Study 1: E-commerce Revenue Analysis — Before & After Dashboard ​

Bối cảnh ​

❌ Version 1: Dashboard "trước khi peer review" ​

✅ Version 2: Dashboard "sau khi refactor" ​

Key Finding ​

Case Study 2: Healthcare Churn — Sprint Process hoàn chỉnh ​

Bối cảnh ​

Sprint Log ​

Analysis Deep Dive: Cohort Retention ​

Dashboard Highlight: Annotated Trend ​

Lesson Learned ​

Case Study 3: FinTech Customer Segmentation — Data to Action ​

Bối cảnh ​

Analysis Process ​

Dashboard: Segment View ​

Executive Summary Excerpt ​

What Made This Project Stand Out ​

🔑 So sánh 3 Case Studies ​

💡 Lessons for Your Capstone ​

🔗 Liên kết ​

🧠 Case Study — Capstone Analysis Sprint: 3 Projects thực tế

Case Study 1: E-commerce Revenue Analysis — Before & After Dashboard

Bối cảnh

❌ Version 1: Dashboard "trước khi peer review"

✅ Version 2: Dashboard "sau khi refactor"

Key Finding

Case Study 2: Healthcare Churn — Sprint Process hoàn chỉnh

Bối cảnh

Sprint Log

Analysis Deep Dive: Cohort Retention

Dashboard Highlight: Annotated Trend

Lesson Learned

Case Study 3: FinTech Customer Segmentation — Data to Action

Bối cảnh

Analysis Process

Dashboard: Segment View

Executive Summary Excerpt

What Made This Project Stand Out

🔑 So sánh 3 Case Studies

💡 Lessons for Your Capstone

🔗 Liên kết