🧠 Case Study — ML for DA: Spotify, Shopee, VietCredit

In this session we covered ML fundamentals: regression, classification, evaluation metrics, and feature engineering. Now let's look at how three companies apply ML to real business problems: Spotify personalizing music for 600M+ users, Shopee predicting seller churn to protect its ecosystem, and VietCredit using Logistic Regression + Decision Tree for credit scoring.

Case Study 1: Spotify - Recommendation System for 600M+ Users

Background

Spotify (2025): 640 million monthly active users, 250 million premium subscribers, operating in 184 markets. Content library: 100+ million tracks and 6 million podcasts. Every day, users create 1.4 billion playlists and stream 1.8 billion hours of content.

Business problem: With 100+ million tracks, users CANNOT find good music on their own. If a user can't find music they like → engagement drops → they cancel premium → churn. Spotify needs to know "Which track does this person want to hear next?" for every user, at every moment, in every context (gym, commute, study, sleep).

ML Team: 200+ ML engineers and data scientists. ML is not just a "feature": ML IS the product. Spotify CEO Daniel Ek: "Spotify is a machine learning company that happens to play music."

ML Architecture

Spotify uses three ML layers for recommendation:

┌────────────────────────────────────────────────────────────────┐
│  LAYER 1: COLLABORATIVE FILTERING                              │
│  "Users like you also listen to..."                            │
│  Input: Listening history + playlist co-occurrence             │
│  Model: Matrix Factorization (ALS), Neural Collaborative       │
│  Output: Similar users → similar taste → recommend tracks      │
├────────────────────────────────────────────────────────────────┤
│  LAYER 2: CONTENT-BASED FILTERING                              │
│  "This track is LIKE the ones you love..."                     │
│  Input: Audio features (tempo, energy, valence, danceability)  │
│  Model: CNN on spectrograms, NLP on lyrics                     │
│  Output: Track similarity score                                │
├────────────────────────────────────────────────────────────────┤
│  LAYER 3: CONTEXTUAL BANDIT                                    │
│  "Right now, right here, which track fits?"                    │
│  Input: Time of day, device, activity, weather                 │
│  Model: Multi-armed bandit, Reinforcement Learning             │
│  Output: Context-aware recommendation                          │
└────────────────────────────────────────────────────────────────┘
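To make Layer 3 more concrete, here is a minimal sketch of an epsilon-greedy contextual bandit that keeps one running reward estimate per (context, arm) pair. It is a toy illustration, not Spotify's system: the class name `EpsilonGreedyBandit`, the two contexts, the two playlist "arms", and the simulated stream probabilities are all assumptions made for this example.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Toy contextual bandit: running mean reward per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of pulls
        self.values = defaultdict(float)  # (context, arm) -> mean reward

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best-known arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[(context, arm)])

    def update(self, context, arm, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Hypothetical "true" probabilities that a user streams the playlist,
# used only to simulate feedback in this example.
TRUE_STREAM_PROB = {
    ("gym", "workout_mix"): 0.8, ("gym", "chill_mix"): 0.2,
    ("sleep", "workout_mix"): 0.1, ("sleep", "chill_mix"): 0.7,
}

bandit = EpsilonGreedyBandit(arms=["workout_mix", "chill_mix"], epsilon=0.1, seed=42)
sim = random.Random(7)
for _ in range(2000):
    context = sim.choice(["gym", "sleep"])
    arm = bandit.select(context)
    reward = 1.0 if sim.random() < TRUE_STREAM_PROB[(context, arm)] else 0.0
    bandit.update(context, arm, reward)

best_for_gym = max(bandit.arms, key=lambda a: bandit.values[("gym", a)])
best_for_sleep = max(bandit.arms, key=lambda a: bandit.values[("sleep", a)])
print(best_for_gym, best_for_sleep)
```

The same recommendation slot ends up with a different preferred arm per context, which is the core idea behind "context-aware" output. Production systems typically replace the lookup table with a model over context features.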

Discover Weekly — Flagship ML Product

Discover Weekly launched in 2015. Every Monday, Spotify builds a playlist of 30 tracks that are entirely new to each user. Not top charts, not the latest hits, but tracks the user has NOT heard yet and is LIKELY to enjoy.

How it works (simplified):

Step	Description	Model
1. User Profile	Analyze 6 months of listening history: genres, artists, audio features	Feature Engineering
2. Find Taste Neighbors	Find 1,000 users with similar taste (cosine similarity on the user-track matrix)	Collaborative Filtering (ALS)
3. Candidate Generation	Take tracks the taste neighbors listen to BUT the user has not heard → 3,000 candidates	Matrix Factorization
4. Ranking	Rank the 3,000 candidates by predicted preference score → top 30	Gradient Boosted Trees
5. Diversity Filter	Ensure ≥ 5 genres, ≤ 3 tracks/artist, and a mix of known + unknown artists	Rules + ML
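Steps 2 to 4 can be sketched on a tiny made-up user-track matrix. This is only an illustration under simplifying assumptions: it uses plain cosine similarity and a similarity-weighted vote in place of ALS and Gradient Boosted Trees, and the function names `cosine_similarity_to` and `recommend` are hypothetical.

```python
import numpy as np

# Rows = users, columns = tracks; 1 means the user has listened to the track.
# Tiny made-up matrix for illustration (user 0 is the target user).
plays = np.array([
    [1, 1, 0, 0, 0, 0],   # user 0 (target)
    [1, 1, 1, 0, 0, 0],   # user 1: overlaps heavily with user 0
    [1, 0, 1, 1, 0, 0],   # user 2: partial overlap
    [0, 0, 0, 0, 1, 1],   # user 3: no overlap
], dtype=float)


def cosine_similarity_to(target_row, matrix):
    """Cosine similarity between one row vector and every row of a matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target_row)
    return (matrix @ target_row) / np.where(norms == 0, 1.0, norms)


def recommend(plays, user, n_neighbors=2, top_n=2):
    sims = cosine_similarity_to(plays[user], plays)
    sims[user] = -1.0                                  # exclude the user themself
    neighbors = np.argsort(sims)[::-1][:n_neighbors]   # step 2: taste neighbors
    # Step 3: score each track by similarity-weighted neighbor listens.
    scores = sims[neighbors] @ plays[neighbors]
    scores[plays[user] > 0] = -np.inf                  # drop tracks already heard
    ranked = np.argsort(scores)[::-1]                  # step 4: rank by score
    return [int(t) for t in ranked[:top_n] if scores[t] > 0]


print(recommend(plays, user=0))  # → [2, 3]
```

Track 2 ranks first because both neighbors listened to it; track 3 comes second because only the less similar neighbor did. Tracks 4 and 5 never appear because no taste neighbor played them.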

Features used by the ranking model:

Feature Group	Example	Type
User features	Listening hours/week, top genres, skip rate, save rate	Behavioral
Track features	Tempo, energy, valence, danceability, speechiness, popularity	Audio (CNN)
Context features	Day of week, time of day, device type, country	Contextual
Interaction features	User × Genre affinity, Artist familiarity score	Collaborative

Evaluation Metrics

Spotify measures recommendation quality with several metrics:

Metric	Definition	Target
Stream Rate	% of recommended tracks the user plays for > 30s	> 35%
Skip Rate	% of tracks skipped within the first 5s	< 25%
Save Rate	% of tracks the user saves to their library	> 8%
Discovery Rate	% of tracks from artists the user has NEVER listened to	> 40%
Retention Impact	Discover Weekly users vs non-users, 30-day churn	-15% churn
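As a DA, you would typically compute these rates from an impression log. The sketch below assumes a hypothetical log with one row per recommended track and made-up column names (`seconds_played`, `saved`, `new_artist`); the thresholds come from the definitions in the table above.

```python
import pandas as pd

# Hypothetical log: one row per recommended track shown to a user.
log = pd.DataFrame({
    "seconds_played": [45, 3, 120, 2, 31, 200, 4, 15, 90, 60],
    "saved":          [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "new_artist":     [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})

metrics = {
    "stream_rate": float((log["seconds_played"] > 30).mean()),   # played > 30s
    "skip_rate": float((log["seconds_played"] < 5).mean()),      # skipped in first 5s
    "save_rate": float(log["saved"].mean()),
    "discovery_rate": float(log["new_artist"].mean()),
}

# Targets from the table: (direction, threshold)
targets = {
    "stream_rate": (">", 0.35),
    "skip_rate": ("<", 0.25),
    "save_rate": (">", 0.08),
    "discovery_rate": (">", 0.40),
}

for name, (direction, threshold) in targets.items():
    value = metrics[name]
    met = value > threshold if direction == ">" else value < threshold
    print(f"{name}: {value:.0%} (target {direction} {threshold:.0%}) -> {'OK' if met else 'MISS'}")
```

On this toy log, the skip rate (30%) misses its target while the other three metrics pass, which is the kind of mixed picture a DA would then break down by segment.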

Impact

Metric	Value
Discover Weekly weekly users	150 million users/week
Stream Rate	38% (vs 22% for non-personalized playlists)
Discovery Rate	45% new artists discovered
Revenue attribution	~$1.5B/year incremental streaming revenue
Churn reduction	Discover Weekly users churn 15% less than non-users
Total tracks surfaced	10+ billion tracks/month successfully recommended

Key Learnings for DAs

Collaborative Filtering = "wisdom of crowds": there is no need to understand the CONTENT, only the patterns of "what similar users listen to"
Feature engineering matters more than the model: Spotify spends 60% of its effort on feature engineering (audio CNN, NLP on lyrics, behavioral features)
A/B test everything: every algorithm change goes through an A/B test on millions of users
DA role: Spotify DAs don't train models, but they define metrics, analyze experiments, segment users, and communicate insights

Case Study 2: Shopee - Churn Prediction for Seller Retention

Background

Shopee (2025): Southeast Asia's largest e-commerce platform, with 350 million monthly active users and 15 million active sellers across 7 markets (VN, TH, ID, PH, MY, SG, BR). GMV (Gross Merchandise Value) in 2024: $80 billion.

Problem: Shopee is a marketplace, so its revenue comes from the commission fees sellers pay on each sale. If sellers leave the platform → fewer SKUs → worse buyer experience → buyers churn too → death spiral. Seller retention = survival metric.

Data: Shopee holds a huge amount of data on seller behavior:

15 million sellers × 200+ features → 3 billion data points
Features: orders/month, response rate, rating, return rate, login frequency, promotion participation, competitor cross-listing...

ML Pipeline for Seller Churn Prediction

mermaid

flowchart TD
    A["1️⃣ Data Collection<br/>200+ seller features<br/>from 6 months of history"] --> B["2️⃣ Feature Engineering<br/>Behavioral: login, response time<br/>Performance: orders, rating<br/>Financial: revenue, commission"]
    B --> C["3️⃣ Labeling<br/>Churn = no orders in 30 days<br/>After being active (≥5 orders/month)"]
    C --> D["4️⃣ Model Training<br/>XGBoost (primary)<br/>Logistic Regression (baseline)"]
    D --> E["5️⃣ Evaluation<br/>Recall ≥ 0.80 (catch 80% of churners)<br/>Precision ≥ 0.60"]
    E --> F["6️⃣ Score & Rank<br/>All sellers → churn probability<br/>Top 5,000 high-risk/market"]
    F --> G["7️⃣ Action<br/>Retention team: call, voucher<br/>Auto: push notification, fee waiver"]
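
The labeling rule in step 3 is where most of the DA judgment sits, so it is worth seeing in code. The sketch below is an assumption-laden simplification: it approximates "no orders in 30 days" with "zero orders in the following calendar month", and the table layout and column names are made up.

```python
import pandas as pd

# Hypothetical monthly order counts per seller (column names are made up).
orders = pd.DataFrame({
    "seller_id": ["A", "A", "B", "B", "C", "C"],
    "month":     ["2024-01", "2024-02"] * 3,
    "orders":    [12, 0, 8, 9, 2, 0],
})

wide = orders.pivot(index="seller_id", columns="month", values="orders")
was_active = wide["2024-01"] >= 5          # active: at least 5 orders/month
no_orders_next = wide["2024-02"] == 0      # then no orders in the next period
labels = (was_active & no_orders_next).astype(int).rename("churned")

# Sellers that were never active are outside the modelling population.
labels = labels[was_active]
print(labels.to_dict())
```

Seller A (12 orders, then 0) is labeled as churned, seller B stays active, and seller C is excluded because it never met the activity threshold. Changing either threshold changes the label distribution, and therefore every metric downstream.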

Feature Engineering in Detail

Feature Category	Features	Rationale
Engagement	Login frequency, days since last login, time on platform/session	Declining engagement = churn signal
Performance	Orders/month trend, GMV trend, conversion rate	Declining sales → frustration → churn
Quality	Rating (1-5), return rate, late shipment rate	Low quality → penalties → give up
Financial	Revenue/month, commission paid, profit margin estimate	Low profit → not worth it
Competition	Cross-listing on Lazada/Tiki (detected via SKU matching)	Seller hedging bets → partial churn risk
Support	Tickets opened, resolution time, penalty count	Many penalties → demoralized
Trends	30-day vs 60-day vs 90-day rolling averages	The DECLINE matters more than the absolute value

Key insight: Shopee found that trend features (change over time) predict churn better than snapshot features (the value at a single point in time):

Feature Type	Example	Churn Prediction Power
Snapshot	Orders this month = 50	AUC = 0.72
Trend	Orders decreased 30% vs last month	AUC = 0.81
Combined	Orders = 50 AND decreased 30%	AUC = 0.86
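
A trend feature of this kind can be derived from a rolling window. The following is a minimal sketch on a made-up daily order series for one seller; the variable names are hypothetical and the feature is only modeled on the `orders_30d_vs_60d_change` idea, not taken from Shopee's actual pipeline.

```python
import pandas as pd

# Made-up seller: 3 orders/day for 30 days, then 2 orders/day for 30 days.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "orders": [3] * 30 + [2] * 30,
}).set_index("date")

rolling_30 = daily["orders"].rolling(30).sum()   # orders in each trailing 30-day window

orders_last_30d = rolling_30.iloc[-1]            # snapshot feature
orders_prev_30d = rolling_30.shift(30).iloc[-1]  # the 30-day window before that
orders_change = (orders_last_30d - orders_prev_30d) / orders_prev_30d  # trend feature

print(f"snapshot: {orders_last_30d:.0f} orders, trend: {orders_change:+.0%}")
```

Two sellers can share the same snapshot (60 orders in the last 30 days) while one is growing and the other is shrinking; only the trend column separates them.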

Model Results

Model	Accuracy	Precision	Recall	F1	AUC-ROC
Logistic Regression (baseline)	0.78	0.62	0.71	0.66	0.79
Decision Tree (max_depth=8)	0.80	0.65	0.74	0.69	0.82
Random Forest	0.84	0.71	0.78	0.74	0.87
XGBoost (production)	0.87	0.75	0.82	0.78	0.91
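
F1 is the harmonic mean of precision and recall, so the F1 column can be re-derived from the other two as a quick sanity check. The snippet below only uses the figures from the table above; the helper name `f1_from` is hypothetical.

```python
# (precision, recall, reported F1) taken from the results table above
reported = {
    "Logistic Regression": (0.62, 0.71, 0.66),
    "Decision Tree": (0.65, 0.74, 0.69),
    "Random Forest": (0.71, 0.78, 0.74),
    "XGBoost": (0.75, 0.82, 0.78),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


computed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: computed F1 = {computed[name]:.3f}, reported = {f1_reported}")
```

All four reported F1 values agree with the recomputed ones after rounding to two decimals.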

Top 10 Features (XGBoost importance):

Rank	Feature	Importance
1	orders_30d_vs_60d_change	0.15
2	days_since_last_login	0.12
3	gmv_30d_trend	0.11
4	cross_listed_competitor	0.09
5	avg_rating_30d	0.08
6	return_rate_30d	0.07
7	late_shipment_rate	0.06
8	support_tickets_30d	0.05
9	promotion_participation_rate	0.05
10	login_frequency_trend	0.04

Retention Actions

Based on the churn prediction score, Shopee applies a tiered retention strategy:

Risk Tier	Churn Probability	# Sellers/Market	Action	Cost
🔴 Critical	> 80%	~500	Account Manager calls directly, offers a 30-day fee waiver	50K VND/seller
🟡 High	60-80%	~2,000	Auto push notification + voucher for 200K marketing credits	15K VND/seller
🟠 Medium	40-60%	~5,000	Email + in-app tips to improve performance	2K VND/seller
🟢 Low	< 40%	~50,000	No special action, monitor only	0
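
The tier table translates directly into a scoring rule plus a cost estimate. In this sketch the function name `risk_tier` is hypothetical, and the handling of exact boundary values (0.80, 0.60, 0.40) is an assumption, since the table does not say which tier they fall into.

```python
# Cost per seller and approximate sellers per market, from the table above.
COST_PER_SELLER_VND = {"Critical": 50_000, "High": 15_000, "Medium": 2_000, "Low": 0}
SELLERS_PER_MARKET = {"Critical": 500, "High": 2_000, "Medium": 5_000, "Low": 50_000}


def risk_tier(churn_probability):
    """Map a churn probability in [0, 1] to a retention tier."""
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    if churn_probability > 0.80:
        return "Critical"
    if churn_probability >= 0.60:   # assumption: boundaries go to the higher tier
        return "High"
    if churn_probability >= 0.40:
        return "Medium"
    return "Low"


# Approximate campaign cost for one market, using the table's seller counts.
campaign_cost = sum(
    COST_PER_SELLER_VND[tier] * SELLERS_PER_MARKET[tier] for tier in COST_PER_SELLER_VND
)
print(risk_tier(0.85), risk_tier(0.65), risk_tier(0.10))
print(f"{campaign_cost:,} VND per market per campaign")  # → 65,000,000 VND
```

With these figures, the 500 Critical sellers cost 25 million VND and the 2,000 High sellers cost 30 million VND, which shows why cheap automated actions are reserved for the larger tiers.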

Impact (Vietnam market, 2024)

Metric	Before ML (2022)	After ML (2024)	Improvement
Monthly seller churn rate	4.2%	2.8%	-33%
Sellers retained/month	—	3,200 sellers	—
Average seller GMV	15M VND/month	18M VND/month	+20%
Estimated revenue saved	—	120 billion VND/year	Commission from retained sellers
Retention campaign ROI	—	8.5x	Campaign cost vs revenue saved

Key Learnings for DAs

Trend > Snapshot: the feature "orders down 30% vs last month" predicts churn better than "orders this month = 50"
Tiered actions: not every at-risk seller needs a phone call; tier sellers to optimize ROI
DA role at Shopee: DAs define the churn definition, engineer features, analyze model output by segment, and create dashboards tracking retention campaign effectiveness

Case Study 3: VietCredit - Credit Scoring with Logistic Regression + Decision Tree

Background

VietCredit (name changed for confidentiality) is a consumer finance company in Vietnam that offers credit cards and unsecured consumer loans. It has 500,000+ active customers and 8,000 billion VND in outstanding loans. Vietnam's consumer finance market is growing 25%/year, but bad debt (NPL, Non-Performing Loan) is the biggest risk: the industry-average NPL rate is 6-8%.

Problem: VietCredit must make a lend/don't-lend decision on 50,000+ applications/month. Manual review by a credit officer takes 15-20 minutes/application → bottleneck. An automated scoring model is needed so that credit officers only review borderline cases.

Vietnam-specific challenges:

Many customers are thin-file: little credit history (no credit bureau data)
Income verification is hard: many people have informal income
High fraud risk: document forgery, identity theft
Regulatory: the State Bank of Vietnam (NHNN) requires the model to be interpretable (able to explain why an application was rejected)

Data & Features

Feature Category	Features	Example
Demographics	Age, gender, marital status, education, province	28 years old, female, university degree, Ho Chi Minh City
Employment	Job type, tenure, company size, industry	Office worker, 3 years, 500+ employees, retail
Financial	Monthly income, existing debt, debt-to-income ratio	15M/month, 50M debt, DTI = 28%
Credit History	CIC score, existing loans, payment history	CIC 650, 2 loans, 90% on-time
Application	Loan amount, loan purpose, channel	30M, motorbike purchase, online
Alternative Data	Phone type, telco prepaid/postpaid, social media	iPhone, postpaid, verified Facebook account

Model Architecture - Why Logistic Regression + Decision Tree?

VietCredit combines two models instead of using one complex model:

mermaid

flowchart TD
    A["📋 New loan application<br/>50,000 applications/month"] --> B["🤖 Model 1: Logistic Regression<br/>Credit Score (0-1000)"]
    A --> C["🌳 Model 2: Decision Tree<br/>Risk Category + Reason Codes"]
    B --> D{"Score ≥ 650?"}
    D -->|"Yes"| E["✅ Auto-Approve<br/>(60% of applications)"]
    D -->|"350-649"| F["⚠️ Manual Review<br/>Reviewed by Credit Officer<br/>(25% of applications)"]
    D -->|"< 350"| G["❌ Auto-Reject<br/>(15% of applications)"]
    C --> F
    F --> H["Officer uses Decision Tree<br/>reasons to decide"]
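
The routing logic in the flowchart reduces to two cut-offs on the credit score. A minimal sketch, assuming a hypothetical function name `route_application` and integer scores on the 0-1000 scale:

```python
def route_application(credit_score):
    """Route an application by credit score using the flowchart's cut-offs."""
    if not 0 <= credit_score <= 1000:
        raise ValueError("credit_score must be between 0 and 1000")
    if credit_score >= 650:
        return "auto_approve"
    if credit_score >= 350:
        return "manual_review"   # officer decides, supported by reason codes
    return "auto_reject"


for score in (720, 650, 649, 350, 349):
    print(score, route_application(score))
```

Keeping the cut-offs in one small function makes it easy to test boundary scores and to re-run the approval mix (60% / 25% / 15% in the flowchart) when the business changes a threshold.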

Why Logistic Regression for scoring?

The output is a probability (0-1) → easy to convert into a score (0-1000)
Interpretable: each coefficient shows the impact of its feature
Regulatory compliance: NHNN and Basel require an explainable model
Stable: little variation across time periods

Why Decision Tree for reason codes?

The business must state WHY an application is rejected: "Income is low relative to the loan amount, debt-to-income > 50%"
A Decision Tree produces IF-THEN rules that credit officers and customers can easily understand
Regulatory: the law requires an adverse action notice with clear reasons for rejection

Implementation

python

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, classification_report

# Assumes X_train, X_test, y_train, y_test (target 1 = default) and
# feature_names are already prepared, with features on comparable scales.

# ============================================
# MODEL 1: LOGISTIC REGRESSION - CREDIT SCORE
# ============================================
lr_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',   # handle class imbalance (~6% NPL)
    C=0.1,                     # inverse regularization strength (smaller = stronger)
    random_state=42
)
lr_model.fit(X_train, y_train)

# Probability of default → score (0-1000); higher score = lower risk
prob_default = lr_model.predict_proba(X_test)[:, 1]
credit_score = ((1 - prob_default) * 1000).astype(int)

print("LOGISTIC REGRESSION - CREDIT SCORING")
print(f"AUC-ROC: {roc_auc_score(y_test, prob_default):.4f}")

# Coefficients → feature impact (a positive coefficient raises default probability)
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lr_model.coef_[0],
    'Impact': ['Increases risk' if c > 0 else 'Decreases risk'
               for c in lr_model.coef_[0]]
}).sort_values('Coefficient', ascending=False)
print("\nFeature Impact:")
print(coef_df.to_string(index=False))

python

import numpy as np

# ============================================
# MODEL 2: DECISION TREE - REASON CODES
# ============================================
dt_model = DecisionTreeClassifier(
    max_depth=5,               # limit depth for interpretability
    min_samples_leaf=500,      # avoid overfitting
    class_weight='balanced',
    random_state=42
)
dt_model.fit(X_train, y_train)

# Reason codes from the Decision Tree path
def get_rejection_reasons(model, features, sample):
    """Return the split conditions one sample satisfies from root to leaf."""
    sample = np.asarray(sample)            # accept a list, array, or pandas row
    node = 0                               # start at the root
    reasons = []
    tree = model.tree_
    while tree.children_left[node] != -1:  # -1 marks a leaf node
        feature_idx = tree.feature[node]
        threshold = tree.threshold[node]
        feature_name = features[feature_idx]
        if sample[feature_idx] <= threshold:
            reasons.append(f"{feature_name} ≤ {threshold:.1f}")
            node = tree.children_left[node]
        else:
            reasons.append(f"{feature_name} > {threshold:.1f}")
            node = tree.children_right[node]
    return reasons
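
The two snippets above rely on data that is not shown. To try the same idea end to end, the sketch below generates synthetic data with scikit-learn's `make_classification`; the feature names, the sample size, the smaller tree settings, and the helper name `decision_path_rules` are all assumptions for illustration and say nothing about VietCredit's real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: roughly 6% positives (defaults), made-up feature names.
feature_names = ["income", "dti", "tenure", "age", "loan_amount", "cic_score"]
X, y = make_classification(
    n_samples=5000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.94], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model 1: probability of default → credit score on a 0-1000 scale.
lr = LogisticRegression(max_iter=1000, class_weight="balanced", C=0.1, random_state=42)
lr.fit(X_train, y_train)
scores = ((1 - lr.predict_proba(X_test)[:, 1]) * 1000).astype(int)

# Model 2: shallow tree whose path gives human-readable reasons.
dt = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=50, class_weight="balanced", random_state=42
)
dt.fit(X_train, y_train)


def decision_path_rules(model, names, sample):
    """Collect the split conditions one sample satisfies from root to leaf."""
    tree = model.tree_
    node, rules = 0, []
    while tree.children_left[node] != -1:      # -1 marks a leaf
        idx, threshold = tree.feature[node], tree.threshold[node]
        if sample[idx] <= threshold:
            rules.append(f"{names[idx]} <= {threshold:.2f}")
            node = tree.children_left[node]
        else:
            rules.append(f"{names[idx]} > {threshold:.2f}")
            node = tree.children_right[node]
    return rules


rules = decision_path_rules(dt, feature_names, X_test[0])
print("Score:", scores[0])
print("Reasons:", rules)
```

Because the tree depth is capped, each applicant gets at most `max_depth` conditions, which keeps the explanation short enough for an adverse action notice.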

Model Results

Metric	Logistic Regression	Decision Tree	Ensemble (Combined)
AUC-ROC	0.82	0.78	0.84
Accuracy	0.79	0.76	0.81
Precision (default)	0.45	0.40	0.48
Recall (default)	0.72	0.68	0.75
F1 (default)	0.55	0.51	0.59

Confusion Matrix (Ensemble, production):

                    Predicted
                  Good (0)    Default (1)
Actual  Good (0)    8,520        980
        Default(1)    150        450

Total test: 10,100
- TP = 450: Correctly predicted default → PREVENT loss
- TN = 8,520: Correctly approved good loans → REVENUE
- FP = 980: Predicted default but actually good → LOST revenue (type I)
- FN = 150: Predicted good but actually default → BAD DEBT (type II)
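
The standard metrics can be recomputed directly from these four counts, which is a useful habit whenever a confusion matrix and a metrics table appear side by side. Note that the recall implied by these counts matches the Ensemble column (0.75), while the implied precision and accuracy do not match the table's 0.48 and 0.81; the source does not explain the gap, so the snippet simply reports what the counts give.

```python
# Counts from the confusion matrix above (positive class = default).
tn, fp, fn, tp = 8520, 980, 150, 450

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of predicted defaults, how many were real
recall = tp / (tp + fn)             # of real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)
default_rate = (tp + fn) / total    # share of actual defaults in the test set

print(f"total={total}, default rate={default_rate:.1%}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

The default rate of about 5.9% is consistent with the roughly 6% NPL figure used to justify `class_weight='balanced'`.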

Business trade-off:

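FP (980): Rejecting good customers → lost revenue of ~980 × 30M × 15% interest = 4.4 billion VND/year in opportunity cost
FN (150): Approving bad customers → bad debt of ~150 × 30M × 60% loss = 2.7 billion VND/year in actual loss
Net: FPs cost more in total here, but each FN (actual loss, 30M × 60% = 18M VND) costs 4x as much as each FP (30M × 15% = 4.5M VND) → optimize for Recall (catch more defaults)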
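
These per-case costs also imply a decision threshold. Under the simplifying assumptions that every loan is 30M VND and that the two costs above are the only ones that matter, rejecting is cheaper in expectation when p × cost_FN > (1 − p) × cost_FP, that is, when the default probability p exceeds cost_FP / (cost_FP + cost_FN). The names below are hypothetical.

```python
LOAN_VND = 30_000_000
COST_FP = LOAN_VND * 15 // 100   # lost interest on a rejected good loan: 4.5M VND
COST_FN = LOAN_VND * 60 // 100   # loss on an approved bad loan: 18M VND

total_fp_cost = 980 * COST_FP    # opportunity cost across all false positives
total_fn_cost = 150 * COST_FN    # actual loss across all false negatives

# Reject when expected loss from approving exceeds expected lost revenue.
cost_threshold = COST_FP / (COST_FP + COST_FN)


def should_reject(prob_default):
    """Cost-based rule: compare expected cost of approving vs rejecting."""
    return prob_default * COST_FN > (1 - prob_default) * COST_FP


print(f"FP total: {total_fp_cost:,} VND, FN total: {total_fn_cost:,} VND")
print(f"Reject when P(default) > {cost_threshold:.2f}")
```

With a 4:1 cost ratio the break-even probability is 0.20, well below the default 0.5 cut-off, which is the quantitative reason for favoring recall on the default class.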

Impact (2024)

Metric	Before ML (Manual)	After ML	Improvement
Processing time/application	15-20 minutes	2 minutes (auto) + 5 minutes (manual review)	-75%
Auto-decision rate	0%	75% (auto approve + auto reject)	Massive efficiency
NPL rate (30+ DPD)	7.2%	5.1%	-29%
Revenue from approved loans	450 billion VND/year	580 billion VND/year	+28.9% (more good loans approved)
Compliance audit	"Decisions could not be explained"	"Score + 3 reason codes per application"	100% compliant

Key Learnings for DAs

Interpretability is a requirement, not a nice-to-have: in finance, healthcare, and legal, the model MUST be explainable. Logistic Regression + Decision Tree > black-box XGBoost
Two models are better than one: LR for scoring (quantitative), DT for reason codes (qualitative) → complementary
FP vs FN trade-off: in credit scoring, an FN (approve a bad loan → loss) costs more than an FP (reject a good applicant → opportunity cost). Optimize the threshold by business cost
DA role: VietCredit DAs analyze model performance by segment (age, region, income), monitor model drift monthly, and create scorecards for business review

🔗 Comparing the 3 Case Studies

Dimension	Spotify	Shopee	VietCredit
Problem	Recommendation	Churn Prediction	Credit Scoring
ML Type	Collaborative Filtering + Content-Based	Classification (XGBoost)	Classification (LR + DT)
Scale	640M users × 100M tracks	15M sellers × 200 features	50,000 applications/month
Key metric	Stream Rate, Discovery Rate	Recall ≥ 0.80	AUC-ROC, NPL Rate
Interpretability	Not required (black-box OK)	Moderate (feature importance)	MUST be interpretable (regulatory)
DA Role	Define metrics, A/B test analysis	Feature engineering, segment analysis	Monitor model, create scorecards
Revenue Impact	$1.5B/year	120 billion VND/year	NPL down 29%, revenue up 28.9%
Key Learning	Feature engineering > algorithm	Trend features > snapshot	Interpretability = compliance

💡 Takeaways for DAs

You don't need to know how to train XGBoost or a neural network, but you MUST understand what the ML is doing, what the metrics mean, and how large the business impact is
DA = the bridge between the ML model and the business decision: translate an F1 score into "saves X billion VND/year"
Feature engineering (domain knowledge → features) is often more valuable than model tuning, and this is exactly where DAs are strong!
