🏆 Standards: Machine Learning for DA

These standards help you run an ML project with the right process, the right API usage, and clear documentation: CRISP-DM for the lifecycle, scikit-learn for implementation, and Model Documentation for transparency.

Overview of the Session 17 standards

Session 17 moves from A/B Testing (Session 15) to Machine Learning: building models that predict churn, revenue, and credit scores. ML takes more than knowing model.fit(). It needs a standard process, consistent API conventions, and complete documentation:

CRISP-DM: Cross Industry Standard Process for Data Mining → a 6-phase process for ML projects
scikit-learn API Conventions: the API design conventions of the most popular ML library → consistency, reproducibility
Model Documentation: standards for documenting a model for transparency, audit, and maintenance

📋 List of related standards

#	Standard	Organization / Author	Applies to Session 17
1	CRISP-DM	SPSS / NCR / DaimlerChrysler / OHRA (1996-2000)	End-to-end process for ML/data mining projects
2	scikit-learn API Conventions	scikit-learn core developers	API design conventions: Estimator, fit/predict/transform
3	Model Documentation	Google Model Cards / IBM FactSheets	Document model purpose, performance, limitations

1️⃣ CRISP-DM — Cross Industry Standard Process for Data Mining

Introduction

CRISP-DM (Cross Industry Standard Process for Data Mining) is one of the most widely used process frameworks for ML and data mining. It has been developed since 1996 by a consortium of SPSS, NCR, DaimlerChrysler, and OHRA. According to the KDnuggets poll (2014) and Data Science Central (2024), CRISP-DM remains the most commonly used methodology; in the KDnuggets poll, about 43% of respondents named it as their main methodology.

CRISP-DM has 6 phases. They are iterative (a continuous loop), not linear.

mermaid

flowchart TD
    A["1️⃣ Business<br/>Understanding"] --> B["2️⃣ Data<br/>Understanding"]
    B --> C["3️⃣ Data<br/>Preparation"]
    C --> D["4️⃣ Modeling"]
    D --> E["5️⃣ Evaluation"]
    E --> F["6️⃣ Deployment"]
    F -->|"Monitor & Iterate"| A
    B -->|"Feedback"| A
    E -->|"Need more data"| C
    D -->|"Need better features"| C

6 phases in detail

Phase	Purpose	Output	DA Responsibility
1. Business Understanding	Understand the problem, KPIs, and success criteria	Problem statement, objectives, success metrics	✅ Primary: define problem, align stakeholders
2. Data Understanding	Explore the data: sources, quality, distributions	Data exploration report, initial insights	✅ Primary: EDA, data quality assessment
3. Data Preparation	Clean, transform, feature engineering	Clean dataset, feature matrix	✅ Primary: data wrangling, feature creation
4. Modeling	Choose an algorithm, train, tune hyperparameters	Trained model(s), parameter settings	⚠️ Shared: DA picks baseline models, ML Engineer optimizes
5. Evaluation	Evaluate the model against business objectives	Model performance report, business impact estimate	✅ Primary: evaluate metrics, translate for business
6. Deployment	Integrate the model into the business process	Production model, monitoring dashboard	⚠️ Shared: DA monitors, ML Engineer deploys
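
As a rough illustration of the Phase 5 output "business impact estimate", the sketch below turns a confusion matrix into a single monetary figure. The function name `estimate_business_impact`, the toy labels, and every amount are hypothetical assumptions chosen for the example, not figures from this course.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def estimate_business_impact(y_true, y_pred, value_per_tp, cost_per_fp, cost_per_fn):
    """Net value of acting on predictions. All amounts are hypothetical inputs."""
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp * value_per_tp - fp * cost_per_fp - fn * cost_per_fn


# Toy labels: 1 = churned, 0 = stayed
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])

# Assumed economics: a saved customer is worth 100, a wasted offer costs 10,
# a missed churner costs 50
net = estimate_business_impact(
    y_true, y_pred, value_per_tp=100.0, cost_per_fp=10.0, cost_per_fn=50.0
)
print(f"Estimated net impact: {net:.2f}")
```

Running the same function on the current process (the baseline) gives a like-for-like number to compare against the model.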

CRISP-DM Checklist cho DA

PHASE 1 — BUSINESS UNDERSTANDING:
✅ What is the problem? (predict churn, forecast revenue, classify fraud)
✅ Which KPI/metric measures success? (Recall > 0.75, MAE < 5%)
✅ Stakeholder expectations aligned?
✅ Business constraint? (the model must be interpretable)
✅ Baseline performance? (current process accuracy)

PHASE 2 — DATA UNDERSTANDING:
✅ Where is the data? (DB, API, CSV, data lake)
✅ How many rows × columns? Enough for ML?
✅ Target variable distribution? (imbalanced?)
✅ Missing values? Outliers? Data quality?
✅ Feature correlation? Multicollinearity?

PHASE 3 — DATA PREPARATION:
✅ Handle missing values (drop / impute / flag)
✅ Encode categorical variables (one-hot / label)
✅ Scale numerical features (standard / min-max)
✅ Feature engineering (domain knowledge → new features)
✅ Train/test split (80/20 or 70/30, stratified); split BEFORE fitting any imputer, encoder, or scaler

PHASE 4 — MODELING:
✅ Baseline model (Logistic Regression / Linear Regression)
✅ Advanced model (Decision Tree, Random Forest)
✅ Cross-validation (5-fold or 10-fold)
✅ Hyperparameter tuning (grid search / random search)

PHASE 5 — EVALUATION:
✅ Regression: R², MAE, RMSE
✅ Classification: Accuracy, Precision, Recall, F1, AUC-ROC
✅ Confusion matrix analysis
✅ Business impact estimation ($ saved, $ earned)
✅ Model vs baseline comparison

PHASE 6 — DEPLOYMENT:
✅ Model packaging (pickle / joblib)
✅ Integration plan (API / batch scoring)
✅ Monitoring: performance drift, data drift
✅ Re-training schedule (monthly / quarterly)
✅ Documentation complete
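
The Phase 3 and Phase 4 items above can be sketched in a few lines of scikit-learn. This is a minimal sketch on synthetic data, not a reference implementation: the column names are borrowed from the Model Card template later in this page, while the data itself, the churn rule, and the chosen strategies (median imputation, one-hot encoding, logistic regression baseline) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a customer table (hypothetical data)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n).astype(float),
    "monthly_charges": rng.normal(50, 15, size=n),
    "contract_type": rng.choice(["Monthly", "Annual"], size=n),
})
# Assumed rule: short-tenure, monthly-contract customers churn more often
logit = 1.0 + 1.0 * (df["contract_type"] == "Monthly") - 0.08 * df["tenure_months"]
df["churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Inject some missing values so the imputer has work to do
df.loc[rng.choice(n, size=20, replace=False), "monthly_charges"] = np.nan

# Phase 3: stratified split first, then preprocessing fitted on train only
X, y = df.drop(columns="churn"), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
numeric = ["tenure_months", "monthly_charges"]
categorical = ["contract_type"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Phase 4: baseline model with 5-fold cross-validation on the training set
baseline = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
print(f"CV AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}, test accuracy: {test_accuracy:.3f}")
```

Because imputation, encoding, and scaling all live inside the pipeline, each cross-validation fold refits them on its own training part.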

Pros and cons of CRISP-DM

Pros	Cons
✅ Industry-standard: widely understood and accepted	❌ Does not cover modern MLOps (CI/CD, model versioning)
✅ Iterative: allows returning to earlier phases	❌ Vague in Phase 6 (Deployment): lacks detail
✅ Business-first: starts from business understanding	❌ Does not mention ethical AI, fairness, bias
✅ Technology-agnostic: works with any tool	❌ Dates from 2000: not updated for modern ML practices

💡 When to use CRISP-DM?

Any ML project, small or large. The framework is flexible enough to scale.
Communicating with non-technical stakeholders: the 6 phases are easy to understand.
A DA starting an ML project: CRISP-DM keeps you from skipping important steps (especially Phase 1 and Phase 5).

2️⃣ scikit-learn API Conventions: Consistency in ML Code

Introduction

scikit-learn is the most popular ML library for Python. It was started in 2007 by David Cournapeau as a Google Summer of Code project, and Inria (France) has led its development since 2010. It is used at Google, Spotify, Booking.com, and JPMorgan. PyPI downloads: 50+ million/month (2025).

scikit-learn is known for its highly consistent API design: learn one model and you know how to use them all. These are the API conventions every DA/DS should follow.

3 Object Types

Object	Role	Methods	Example
Estimator	Any object that learns from data	`fit(X, y)`	`LinearRegression()`, `DecisionTreeClassifier()`
Predictor	An estimator that can predict	`predict(X)`, plus `predict_proba(X)` on classifiers that support it	`model.predict(X_test)`
Transformer	An estimator that transforms data	`transform(X)`, `fit_transform(X)`	`StandardScaler()`, `OneHotEncoder()`
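
To make the roles in this table concrete, here is a minimal sketch of a custom transformer built on scikit-learn's `BaseEstimator` and `TransformerMixin`. The class `ClipOutliers` and its parameters are hypothetical, written only for this example: hyperparameters are stored at init, `fit` learns state (with a trailing underscore) and returns `self`, and `fit_transform` comes from the mixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each column to quantiles learned in fit()."""

    def __init__(self, lower=0.01, upper=0.99):
        # Hyperparameters are stored unchanged at init, never learned here
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self  # returning self allows chaining

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_new = np.array([[-50.0], [2.5], [500.0]])

clipper = ClipOutliers(lower=0.0, upper=0.75)
X_train_clipped = clipper.fit_transform(X_train)  # learns bounds from train
X_new_clipped = clipper.transform(X_new)          # reuses the train bounds
print(X_train_clipped.ravel(), X_new_clipped.ravel())
```

Because it follows the same contract as `StandardScaler`, a class like this can be used as a step inside a `Pipeline`.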

API Pattern — fit / predict / transform

python

# ============================================
# scikit-learn API PATTERN - consistent for EVERY model
# ============================================

# 1. INITIALIZE — set hyperparameters
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, C=0.1, random_state=42)

# 2. FIT — learn from training data
model.fit(X_train, y_train)    # Returns self, which allows chaining

# 3. PREDICT — predict on new data
y_pred = model.predict(X_test)           # Class labels: [0, 1, 0, 1, ...]
y_prob = model.predict_proba(X_test)     # Probabilities: [[0.7, 0.3], ...]

# 4. SCORE — evaluate on test data
accuracy = model.score(X_test, y_test)   # Default metric (accuracy for classifiers)

# ============================================
# TRANSFORMER PATTERN - Preprocessing
# ============================================
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform in one step
X_test_scaled = scaler.transform(X_test)          # transform ONLY, do NOT fit!

Important conventions

Convention	Rule	Reason
fit on train ONLY	`scaler.fit_transform(X_train)`, then `scaler.transform(X_test)`. Do NOT call `fit_transform(X_test)`	Avoids data leakage: test data must not influence preprocessing
Consistent interface	Every supervised model has `fit()`, `predict()`, `score()`	Swapping models is easy: change only the line that constructs the model
Hyperparameters at init	`LogisticRegression(C=0.1)`: set at construction, NOT at fit time	Reproducibility: same params → same results
random_state	Always set `random_state=42` (or any fixed integer) wherever an estimator or splitter is randomized	Reproducibility: identical results on every run
Return self in fit	`model.fit(X, y)` returns `self`	Allows chaining: `model.fit(X, y).predict(X)`
NumPy/Pandas input	Accepts both NumPy arrays and Pandas DataFrames	Flexibility: works with the common tabular data structures
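
A short sketch of three of these conventions in action: the consistent interface (only the constructor line differs between models), `fit` returning `self`, and `random_state` giving repeatable results. The synthetic dataset and the particular models are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical), split with a fixed seed
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Consistent interface: only the constructor differs between models
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
scores, returns_self = {}, {}
for name, model in models.items():
    fitted = model.fit(X_train, y_train)       # same call for every model
    returns_self[name] = fitted is model       # fit() returns the estimator itself
    scores[name] = model.score(X_test, y_test)  # accuracy for classifiers

# random_state: two forests with the same seed make the same predictions
pred_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train).predict(X_test)
pred_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train).predict(X_test)
same_predictions = bool((pred_a == pred_b).all())
print(scores, same_predictions)
```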

Pipeline — Best Practice

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline = chain preprocessing + model
pipeline = Pipeline([
    ('scaler', StandardScaler()),           # Step 1: scale features
    ('classifier', LogisticRegression())     # Step 2: classify
])

# Fit entire pipeline
pipeline.fit(X_train, y_train)

# Predict — automatically scales THEN predicts
y_pred = pipeline.predict(X_test)

# Cross-validate entire pipeline (no data leakage!)
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
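
A pipeline can also be tuned as a single unit. The sketch below, on assumed synthetic data, uses `GridSearchCV` with the `<step name>__<parameter>` naming that scikit-learn uses to address a parameter inside a pipeline step; the grid of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# "classifier__C" means: parameter C of the step named "classifier"
param_grid = {"classifier__C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is cross-validated; the scaler is refit inside every fold
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

best_C = search.best_params_["classifier__C"]
best_f1 = search.best_score_
print(f"Best C: {best_C}, mean CV F1: {best_f1:.4f}")
```

After fitting, `search.best_estimator_` is a full pipeline refit on all of `X`, so it can be used for prediction directly.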

⚠️ The most common data leakage

python

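from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: fit the scaler on ALL the data before splitting
scaler = StandardScaler()
scaler.fit(X)  # <-- LEAK! Test-set statistics enter the training process
X_scaled = scaler.transform(X)
X_train, X_test = train_test_split(X_scaled, random_state=42)

# ✅ RIGHT: fit the scaler on the training data ONLY
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ✅ BEST: use a Pipeline (prevents this leak automatically)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
cross_val_score(pipeline, X, y, cv=5)  # The scaler is refit on each fold's training part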
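
With a scaler the numeric effect of this leak is often small, which makes it easy to overlook. The sketch below swaps the scaler for a supervised preprocessing step (`SelectKBest`) to make the same mistake visible; that substitution, the pure-noise dataset, and the sizes are assumptions made for illustration. Since the labels are unrelated to the features, an honest evaluation should land near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: pick the "best" features using ALL rows, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: the selection step is refit inside each training fold
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_acc:.2f}")
print(f"Honest CV accuracy: {honest_acc:.2f}")
```

The leaky score looks like real predictive power even though there is none to find, which is why preprocessing belongs inside the pipeline.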

3️⃣ Model Documentation — Transparency & Accountability

Introduction

Model Documentation is the practice of recording an ML model's purpose, data, performance, limitations, and ethical considerations. Google proposed Model Cards (Mitchell et al., 2019) and IBM proposed FactSheets (2019). Both aim to increase transparency and accountability, which matters most in regulated industries (finance, healthcare).

Why does a DA need to document models?

Without documentation	With documentation
"Who built this model? When?" Nobody knows	Model Card: author, date, version, purpose
"What is the accuracy? On which data?" Must re-run	Performance metrics: accuracy, precision, recall by segment
"Is the model fair? Any bias?" Never checked	Fairness evaluation: performance by gender, age, region
"When should it be retrained?" Forgotten	Monitoring plan: retrain trigger, drift threshold
Audit fail → penalty	Audit pass → compliance ✅
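
The "recall by segment" and "performance by region" entries in this table can be produced with a small helper. This is a minimal sketch: the function `metrics_by_segment`, the column names, and the toy predictions are hypothetical, chosen only to show the shape of a per-segment report.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score


def metrics_by_segment(df, segment_col, y_true_col="y_true", y_pred_col="y_pred"):
    """Precision and recall per segment (hypothetical helper)."""
    rows = []
    for segment, group in df.groupby(segment_col):
        rows.append({
            segment_col: segment,
            "n": len(group),
            "precision": precision_score(group[y_true_col], group[y_pred_col], zero_division=0),
            "recall": recall_score(group[y_true_col], group[y_pred_col], zero_division=0),
        })
    return pd.DataFrame(rows).set_index(segment_col)


# Toy predictions for two regions (made-up values)
predictions = pd.DataFrame({
    "region": ["North"] * 4 + ["South"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 0, 1, 1, 0, 0, 0],
})

report = metrics_by_segment(predictions, "region")
print(report)
```

A large gap between segments is a prompt to investigate before deployment, and the table itself can go straight into the Model Card.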

Model Card Template for DA

markdown

# 📋 MODEL CARD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## 1. Model Overview
- **Name:** Customer Churn Prediction v2.1
- **Author:** Vy Nguyen (DA Team) + ML Engineering
- **Date:** 2025-12-15
- **Version:** 2.1 (retrained monthly)
- **Type:** Binary Classification
- **Framework:** scikit-learn 1.4.0

## 2. Purpose & Scope
- **Business purpose:** Predict which customers will churn in next 30 days
- **Users:** Retention team (action on predictions), Product team (insights)
- **NOT intended for:** Individual customer communication ("model says you'll leave")
- **Decision:** Top 500 high-risk → retention campaign

## 3. Data
- **Training data:** 18 months history (Jan 2024 - Jun 2025)
- **Size:** 180,000 customers × 25 features
- **Target:** churn_30d (binary: 0=stay, 1=churn)
- **Class balance:** 94% stay, 6% churn → class_weight='balanced'
- **Train/test split:** 80/20, stratified

## 4. Features
| Feature | Type | Description |
|---------|------|-------------|
| tenure_months | Numeric | Months as a customer |
| monthly_charges | Numeric | Monthly spend |
| sessions_per_week | Numeric | Sessions per week |
| contract_type | Categorical | Monthly / Annual |
| ... | ... | (full list in appendix) |

## 5. Performance
| Metric | Train | Test | CV (5-fold) |
|--------|-------|------|-------------|
| Accuracy | 0.83 | 0.81 | 0.80 ± 0.02 |
| Precision | 0.76 | 0.74 | 0.72 ± 0.03 |
| Recall | 0.81 | 0.79 | 0.77 ± 0.04 |
| F1 | 0.78 | 0.76 | 0.74 ± 0.03 |
| AUC-ROC | 0.89 | 0.87 | 0.86 ± 0.02 |

## 6. Limitations
- Performance lower on customers with tenure < 3 months (thin data)
- Not validated for B2B customers (only B2C training data)
- Seasonal effects not fully captured (need > 24 months data)

## 7. Fairness & Bias
- Tested across age groups: no significant disparity
- Tested across regions: performance consistent
- Gender not used as feature (ethical choice)

## 8. Monitoring & Maintenance
- **Retrain:** Monthly (1st week)
- **Drift alert:** If AUC-ROC drops below 0.80 → investigate
- **Owner:** DA Team
- **Escalation:** ML Engineering for model architecture changes
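
The Performance table in section 5 of the template does not have to be filled in by hand. One possible way to generate it, sketched on assumed synthetic data with an assumed model, is to score the fitted pipeline on train and test and take the fold scores from `cross_validate`. The numbers it prints will not match the example card above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

METRICS = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# Synthetic imbalanced data (hypothetical stand-in for the churn dataset)
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# 5-fold CV on the training set, all metrics in one pass
cv = cross_validate(model, X_train, y_train, cv=5, scoring=METRICS)
model.fit(X_train, y_train)

# Build the markdown rows in the same layout as the template
lines = [
    "| Metric | Train | Test | CV (5-fold) |",
    "|--------|-------|------|-------------|",
]
for name in METRICS:
    scorer = get_scorer(name)
    train_score = scorer(model, X_train, y_train)
    test_score = scorer(model, X_test, y_test)
    fold_scores = cv[f"test_{name}"]
    lines.append(
        f"| {name} | {train_score:.2f} | {test_score:.2f} | "
        f"{fold_scores.mean():.2f} ± {fold_scores.std():.2f} |"
    )
performance_table = "\n".join(lines)
print(performance_table)
```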

Model Documentation Checklist

MODEL DOCUMENTATION — DA CHECKLIST:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

BEFORE DEPLOYMENT:
✅ Model Card completed (purpose, data, performance, limitations)
✅ Performance metrics documented (train, test, CV)
✅ Segment analysis (model works equally across segments?)
✅ Fairness evaluation (no discriminatory bias?)
✅ Stakeholder review (business approved metrics and limitations?)

AFTER DEPLOYMENT:
✅ Monitoring dashboard (accuracy over time, drift detection)
✅ Retrain schedule documented
✅ Incident response plan (model performance drops → what to do?)
✅ Version history (v1.0 → v2.1 → changes documented)
✅ Quarterly review meeting (DA + ML Engineering + Business)
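
The monitoring items can start very small. The sketch below checks the drift rule from the Model Card example (investigate if AUC-ROC drops below 0.80) on a batch of labelled outcomes. The function name `check_auc_drift` and the toy scores are hypothetical; a real setup would also need a source of ground-truth labels and a place to send the alert.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

AUC_ALERT_THRESHOLD = 0.80  # drift threshold taken from the Model Card example


def check_auc_drift(y_true, y_score, threshold=AUC_ALERT_THRESHOLD):
    """Return the batch AUC-ROC and whether it fell below the threshold."""
    auc = roc_auc_score(y_true, y_score)
    return {"auc": auc, "alert": bool(auc < threshold)}


# Toy batches: actual outcomes plus the model's predicted churn probabilities
y_true = np.array([0, 0, 1, 1])
healthy = check_auc_drift(y_true, np.array([0.1, 0.2, 0.8, 0.9]))   # churners ranked on top
degraded = check_auc_drift(y_true, np.array([0.1, 0.9, 0.2, 0.8]))  # ranking no better than chance

print(healthy)
print(degraded)
```

Logging the returned `auc` for every scoring batch gives the "accuracy over time" series that the monitoring dashboard needs.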

🔗 How the 3 standards connect

	CRISP-DM	scikit-learn Conventions	Model Documentation
Focus	PROCESS: 6 phases	CODE: API consistency	DOCUMENT: transparency
When	Planning & executing ML project	Writing ML code	Before & after deployment
Who	DA + ML Team + Business	DA + ML Engineer	DA + Compliance + Business
Output	Project plan, milestone tracking	Clean, reproducible code	Model Card, monitoring plan
Connection	CRISP-DM Phases 4-5 → implemented with scikit-learn	scikit-learn output → documented in the Model Card	Model Card references CRISP-DM phases

💡 Rules of Thumb for DA

CRISP-DM: Don't skip Phase 1 (Business Understanding). Many ML projects fail because they solve the WRONG problem
scikit-learn: Use Pipeline to prevent data leakage, one of the most common beginner mistakes
Model Documentation: Write a Model Card EVEN IF the model is simple. Future you will thank present you
