📘 Session 17: Basic Machine Learning for DA - From Analysis to Prediction

A DA does not need to become an ML engineer, only to understand enough to know when ML is the right tool.

🎯 Learning objectives

After this session, learners will be able to:

Understand the ML workflow: data → feature → model → evaluate → deploy
Build a Linear Regression model to predict a continuous value
Build classification models: Logistic Regression and Decision Tree
Evaluate a model with accuracy, precision, recall, and the confusion matrix

📋 Overview

In earlier sessions you worked through the full DA workflow: collecting data (Python, SQL), cleaning it (Pandas), exploring it (EDA), visualizing it (charts, BI), telling the story (Storytelling), and then analyzing business metrics, industry cases, and A/B tests. All of that revolves around ONE question: "What HAS happened?"

Session 17 moves from "analyzing the past" to "predicting the future" with Machine Learning. Instead of only reporting "last month's churn rate was 8%", you will build a model that predicts WHICH customers are likely to churn next month, and recommend action BEFORE they leave.

According to McKinsey (2025), 72% of companies have adopted AI/ML in at least one business function. However, 85% of ML projects fail before reaching production, largely because of weak problem framing, poor data quality, and inadequate model evaluation. A DA does not need to become an ML engineer, but does need to understand enough to frame the problem correctly, choose a suitable model, evaluate the results, and communicate the findings.

mermaid

flowchart LR
    A["📥 Obtain<br/>Session 7: Python"] --> B["🧹 Scrub<br/>Session 8: Pandas"]
    B --> C["🔍 Explore<br/>Session 9: EDA"]
    C --> D["📊 iNterpret<br/>Sessions 10-11: Chart + BI"]
    D --> E["📖 Storytelling<br/>Session 12: Presentation"]
    E --> F["💼 Business Metrics<br/>Session 13: KPI, Funnel"]
    F --> G["🧪 A/B Testing<br/>Session 15: Experiment"]
    G --> H["🤖 Machine Learning<br/>✅ Session 17: Predict"]
    style H fill:#e8f5e9,stroke:#4caf50,stroke-width:3px

💡 Why does a DA need to know ML?

Situation	Without ML	With ML
CEO asks "which customers are about to churn?"	"Last month's churn was 8%, and the top reason was price" (reactive)	Model flags 850 customers with >70% churn probability → proactive retention campaign
Marketing wants to target the right people	Segment by demographics → spray and pray	Model predicts purchase probability → target the top 20% → ROAS up 3x
Finance wants a revenue forecast	Linear trendline in Excel → Month+1 estimate	Regression model with multiple features → Q+1 forecast ± CI
Product wants to know which features matter	Survey + gut feeling	Feature importance from a Decision Tree → data-driven roadmap
Risk team wants to assess credit	Rule-based: income > X AND age > Y	Credit scoring model: Logistic Regression + 20 features → probability of default

📌 Part 1: ML for Data Analysts - Understand it right, use it right

What is Machine Learning?

Machine Learning: algorithms that learn from data to find patterns and make predictions, WITHOUT being explicitly programmed for each situation.

How it differs from traditional programming:

mermaid

flowchart LR
    subgraph Traditional["Traditional Programming"]
        A1["📝 Rules"] --> B1["💻 Program"]
        A2["📊 Data"] --> B1
        B1 --> C1["📤 Output"]
    end
    subgraph ML["Machine Learning"]
        D1["📊 Data"] --> E1["🤖 Algorithm"]
        D2["📤 Expected Output"] --> E1
        E1 --> F1["📝 Model (Rules)"]
    end

Concept	Definition	DA example
Supervised Learning	Labels (outputs) are available → the model learns the input → output mapping	Churn (Yes/No), Revenue ($), Credit Score
Unsupervised Learning	No labels → the model finds patterns/groups	Customer segmentation, anomaly detection
Regression	Output is a continuous number	Predicting revenue, LTV, house prices
Classification	Output is a label/category	Churn (Yes/No), Fraud (Yes/No), Segment (A/B/C)
Feature	Input variable	age, tenure, monthly_charges, contract_type
Target/Label	Variable to predict (output)	churn (0/1), revenue ($)
Training set	Data used to train the model	80% of the dataset
Test set	Data used to evaluate the model (which the model has NEVER seen)	20% of the dataset

Common ML applications in DA

Problem	Type	Common models	Industry
Customer Churn Prediction	Classification	Logistic Regression, Decision Tree	Telecom, SaaS, Banking
Revenue Forecasting	Regression	Linear Regression, Random Forest	E-commerce, Retail
Credit Scoring	Classification	Logistic Regression, Gradient Boosting	Finance, Banking
Customer Segmentation	Unsupervised	K-Means, DBSCAN	Marketing, Retail
Fraud Detection	Classification	Random Forest, XGBoost	Banking, Insurance
Product Recommendation	Collaborative Filtering	Matrix Factorization, Neural CF	E-commerce, Streaming
Demand Forecasting	Regression	Time Series, XGBoost	Supply Chain, Retail

ML Workflow: CRISP-DM → KDD

mermaid

flowchart TD
    A["1️⃣ Business Understanding<br/>What problem? Which KPI?"] --> B["2️⃣ Data Understanding<br/>Where is the data? Quality?"]
    B --> C["3️⃣ Data Preparation<br/>Clean, transform, feature engineering"]
    C --> D["4️⃣ Modeling<br/>Chọn algorithm, train model"]
    D --> E["5️⃣ Evaluation<br/>Accuracy, Precision, Recall, F1"]
    E --> F["6️⃣ Deployment<br/>Integrate vào business process"]
    F -->|"Monitor & Iterate"| A
    style A fill:#fff3e0
    style D fill:#e3f2fd
    style E fill:#fce4ec

DAs usually focus on Phases 1-5: business understanding, data understanding, data preparation, modeling, and evaluation. Deployment is usually handled by an ML Engineer or Data Engineer, but a DA needs to understand enough to communicate with the engineering team.

When to USE ML and when NOT to?

Use ML	Do NOT use ML
The problem requires prediction (predict the future)	The problem only needs counting/aggregation (sum, count, average)
Enough data (thousands+ rows)	Data < 100 rows → statistics works better
Complex patterns (many interacting features)	Simple rule: "if revenue > X → premium"
Predictions must be automated for 1000+ cases	Predicting 1-2 cases → manual analysis is faster
Business impact justifies the effort	A simple heuristic is already good enough (80/20 rule)

⚠️ ML is not a silver bullet

"Sometimes a SQL query plus business logic solves the problem faster, cheaper, and more explainably than an ML model. A good DA knows when NOT to use ML." - Cassie Kozyrkov, Google Chief Decision Scientist

📌 Part 2: Linear Regression - Predicting a continuous value

Concept

Linear Regression: find the line (or plane, when there are several features) that best fits the data, and use it to predict a continuous value.

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where:

$\hat{y}$ = predicted value
$\beta_0$ = intercept (constant)
$\beta_1, \beta_2, \dots, \beta_n$ = coefficients (the weight of each feature)
$x_1, x_2, \dots, x_n$ = features (input variables)
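
To make the formula concrete, here is a minimal sketch that computes $\hat{y}$ by hand for a single observation. The intercept, coefficients, and feature values are made-up numbers chosen only for illustration; they are not the output of a fitted model.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
beta_0 = 10.0                       # intercept
betas = np.array([2.5, 0.8, 15.0])  # coefficients beta_1..beta_3
x = np.array([40.0, 500.0, 3.0])    # one observation: x_1..x_3

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
#       = 10 + 2.5*40 + 0.8*500 + 15*3 = 555
y_hat = beta_0 + np.dot(betas, x)
print(f"y_hat = {y_hat:.1f}")
```

A fitted `LinearRegression` does the same dot product internally; `model.intercept_` plays the role of $\beta_0$ and `model.coef_` holds $\beta_1, \dots, \beta_n$.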

Example: Revenue Prediction

python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ============================================
# REVENUE PREDICTION - LINEAR REGRESSION
# ============================================

# Sample data: monthly revenue prediction
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'marketing_spend': np.random.uniform(10, 100, n),        # million VND
    'num_customers': np.random.randint(100, 1000, n),
    'avg_order_value': np.random.uniform(200, 800, n),       # thousand VND
    'website_visits': np.random.randint(5000, 50000, n),
    'num_promotions': np.random.randint(0, 10, n),
})

# Revenue = f(marketing, customers, AOV, visits, promotions) + noise
data['revenue'] = (
    2.5 * data['marketing_spend'] +
    0.8 * data['num_customers'] +
    0.3 * data['avg_order_value'] +
    0.02 * data['website_visits'] +
    15 * data['num_promotions'] +
    np.random.normal(0, 50, n)   # noise
)

print(f"Dataset shape: {data.shape}")
print(data.describe().round(2))

Train & Evaluate

python

# Features & Target
X = data[['marketing_spend', 'num_customers', 'avg_order_value',
          'website_visits', 'num_promotions']]
y = data['revenue']

# Train/Test Split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {X_train.shape[0]} rows | Test: {X_test.shape[0]} rows")

# Train Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("\n📊 MODEL EVALUATION:")
print(f"  R² Score:  {r2_score(y_test, y_pred):.4f}")
print(f"  MAE:       {mean_absolute_error(y_test, y_pred):.2f}")
print(f"  RMSE:      {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

Evaluation metrics for regression

Metric	Formula	Meaning	Good when
R² (R-squared)	$1 - \frac{SS_{res}}{SS_{tot}}$	Share of variance explained by the model	The closer to 1, the better
MAE	$\frac{1}{n} \sum |y_i - \hat{y}_i|$	Mean absolute error	The smaller, the better
RMSE	$\sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}$	Root mean squared error; penalizes large errors more heavily	The smaller, the better
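
As a rough illustration of the three formulas in the table, the sketch below computes them by hand with NumPy on four made-up actual/predicted pairs. The numbers are invented for the example; in practice `r2_score`, `mean_absolute_error`, and `mean_squared_error` from scikit-learn give the same values.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 230.0])

errors = y_true - y_hat                          # residuals: -10, 10, -10, 20

mae = np.mean(np.abs(errors))                    # (10 + 10 + 10 + 20) / 4
rmse = np.sqrt(np.mean(errors ** 2))             # sqrt((100 + 100 + 100 + 400) / 4)

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"R²   = {r2:.3f}")
```

RMSE comes out larger than MAE here because the single error of 20 is squared before averaging, which is what "penalizes large errors more heavily" means.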

Feature Importance

python

# Feature Importance (coefficients)
importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)

print("\n🔍 FEATURE IMPORTANCE (Coefficients):")
print(importance.to_string(index=False))
print(f"\nIntercept: {model.intercept_:.2f}")

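# Interpretation:
# marketing_spend coefficient ≈ 2.5 → +1 million VND of marketing → revenue rises by ~2.5
# num_promotions coefficient ≈ 15 → one more promotion → revenue rises by ~15
# Caveat: raw coefficients depend on each feature's unit, so sorting by them is
# not a fair importance ranking; standardize the features first to compare them.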

💡 Reading coefficients for stakeholders

"Marketing spend has a coefficient of 2.5, meaning each additional 1 million VND of marketing is associated with about 2.5 million VND more revenue, holding the other factors constant. ROI ≈ 2.5x."

This is how a DA translates ML output into business insight, without talking about β or gradient descent.
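
One caveat when ranking features by coefficient: a raw coefficient is tied to the unit of its feature. The sketch below is an illustration under assumed synthetic data (two features with the same true coefficients and ranges as in the revenue example, generated here with a separate random seed). It fits the model once on raw features and once on standardized features to show how the ranking can flip.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): same ranges and true coefficients as the
# revenue example, restricted to two features to keep the sketch short
rng = np.random.default_rng(0)
n = 500
marketing_spend = rng.uniform(10, 100, n)      # narrow numeric range
website_visits = rng.uniform(5000, 50000, n)   # wide numeric range
revenue = 2.5 * marketing_spend + 0.02 * website_visits + rng.normal(0, 5, n)

X = np.column_stack([marketing_spend, website_visits])

# Raw coefficients: change in revenue per ONE UNIT of each feature
raw = LinearRegression().fit(X, revenue)

# Standardized coefficients: change in revenue per ONE STANDARD DEVIATION
std = LinearRegression().fit(StandardScaler().fit_transform(X), revenue)

for name, b_raw, b_std in zip(["marketing_spend", "website_visits"],
                              raw.coef_, std.coef_):
    print(f"{name:16s} raw={b_raw:8.3f}  standardized={b_std:8.2f}")
```

Raw coefficients remain the right numbers for the "per 1 million VND" statement to stakeholders; standardized coefficients are the fairer basis for saying which feature moves revenue most across its typical range.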

📌 Part 3: Classification - Classifying and predicting labels

Logistic Regression

Logistic Regression: despite "Regression" in the name, it is used as a classification model. The output is a probability (0-1), and a threshold turns it into a class.

$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$
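
The formula and the thresholding step can be sketched in a few lines of plain Python. The coefficients below (`beta_0`, `beta_tenure`, `beta_monthly_contract`) are hypothetical values picked for the illustration, not estimates from any dataset, and the 0.5 threshold is only the conventional default.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only
beta_0 = -1.0                  # intercept
beta_tenure = -0.05            # longer tenure → lower churn probability
beta_monthly_contract = 1.2    # monthly contract → higher churn probability

def churn_probability(tenure_months: float, monthly_contract: int) -> float:
    z = (beta_0
         + beta_tenure * tenure_months
         + beta_monthly_contract * monthly_contract)
    return sigmoid(z)

threshold = 0.5  # assumed default; lower it to catch more churners
for tenure, monthly in [(2, 1), (48, 0)]:
    p = churn_probability(tenure, monthly)
    label = int(p >= threshold)
    print(f"tenure={tenure:2d}, monthly={monthly} → P(churn)={p:.3f} → class {label}")
```

Moving the threshold trades precision against recall, which is why the probability output is often more useful to the business than the hard 0/1 label.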

Decision Tree

Decision Tree: splits the data into branches based on conditions, producing an easy-to-read tree of decisions.

mermaid

flowchart TD
    A["🌳 Root: Contract Type?"] --> B{"Monthly?"}
    A --> C{"1-Year / 2-Year?"}
    B --> D{"Tenure < 12 months?"}
    B --> E["Churn Risk: 45%"]
    D --> F["🔴 HIGH CHURN: 78%"]
    D --> G["🟡 MEDIUM: 52%"]
    C --> H{"Monthly Charges > 70$?"}
    H --> I["🟡 MEDIUM: 35%"]
    H --> J["🟢 LOW CHURN: 12%"]
    style F fill:#ffcdd2
    style J fill:#c8e6c9
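
A fitted scikit-learn tree can be printed as text rules that read much like the diagram above. The sketch below uses a small synthetic dataset with an assumed, deliberately simple churn rule (monthly contract and tenure under 12 months), so the learned splits are easy to check against the rule that generated the data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with a deliberately simple rule (assumption for the demo):
# a customer churns if they are on a monthly contract AND tenure < 12 months
rng = np.random.default_rng(42)
n = 1000
contract_monthly = rng.integers(0, 2, n)   # 0 = long-term, 1 = monthly
tenure = rng.integers(1, 72, n)            # months
churn = ((contract_monthly == 1) & (tenure < 12)).astype(int)

X = np.column_stack([contract_monthly, tenure])
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, churn)

# export_text renders the fitted splits as indented if/else rules
rules = export_text(tree, feature_names=["contract_monthly", "tenure"])
print(rules)
```

On real data the splits will not be this clean, but the printed rules are still a convenient artifact to paste into a stakeholder document.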

Example: Churn Prediction

python

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score,
                             confusion_matrix, classification_report)
from sklearn.model_selection import train_test_split, cross_val_score

# ============================================
# CHURN PREDICTION - CLASSIFICATION
# ============================================
np.random.seed(42)
n = 2000

churn_data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, n),               # months
    'monthly_charges': np.random.uniform(20, 110, n),     # USD
    'total_charges': np.random.uniform(100, 8000, n),
    'contract_monthly': np.random.choice([0, 1], n, p=[0.5, 0.5]),
    'tech_support': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'online_security': np.random.choice([0, 1], n, p=[0.5, 0.5]),
    'num_tickets': np.random.randint(0, 10, n),
})

# Churn logic: higher probability if short tenure, monthly contract, high charges
churn_prob = (
    -0.03 * churn_data['tenure'] +
    0.02 * churn_data['monthly_charges'] +
    0.8 * churn_data['contract_monthly'] +
    -0.5 * churn_data['tech_support'] +
    0.1 * churn_data['num_tickets'] +
    np.random.normal(0, 0.5, n)
)
churn_data['churn'] = (churn_prob > np.percentile(churn_prob, 73)).astype(int)

print(f"Dataset: {churn_data.shape}")
print(f"Churn rate: {churn_data['churn'].mean():.1%}")
print(churn_data.head())

Train & Compare Models

python

# Features & Target
features = ['tenure', 'monthly_charges', 'total_charges',
            'contract_monthly', 'tech_support', 'online_security', 'num_tickets']
X = churn_data[features]
y = churn_data['churn']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ---- Model 1: Logistic Regression ----
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

# ---- Model 2: Decision Tree ----
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

# ---- Comparison ----
print("=" * 50)
print("📊 MODEL COMPARISON")
print("=" * 50)

for name, pred in [("Logistic Regression", lr_pred), ("Decision Tree", dt_pred)]:
    print(f"\n🔹 {name}:")
    print(f"  Accuracy:  {accuracy_score(y_test, pred):.4f}")
    print(f"  Precision: {precision_score(y_test, pred):.4f}")
    print(f"  Recall:    {recall_score(y_test, pred):.4f}")
    print(f"  F1 Score:  {f1_score(y_test, pred):.4f}")

Confusion Matrix - Reading it correctly

A confusion matrix is a 2×2 table comparing actual vs predicted:

                    Predicted
                  Negative    Positive
Actual  Negative    TN          FP
        Positive    FN          TP

Cell	Name	DA meaning	Churn example
TP	True Positive	Predicted churn = 1, actual churn = 1 ✅	Correct: the customer churned and the model caught it
TN	True Negative	Predicted churn = 0, actual churn = 0 ✅	Correct: the customer stayed and the model knew it
FP	False Positive (Type I)	Predicted churn = 1, actual churn = 0 ❌	Wrong: the customer stayed but the model said churn → wasted retention budget
FN	False Negative (Type II)	Predicted churn = 0, actual churn = 1 ❌	Wrong: the customer churned but the model missed it → lost customer
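
To connect the four cells to code, here is a small sketch with ten made-up labels. It relies on the fact that, for binary 0/1 labels, scikit-learn's `confusion_matrix` returns rows as actual and columns as predicted, so flattening it yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for ten customers: 1 = churn, 0 = stay
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# Layout is [[TN, FP], [FN, TP]], so ravel() unpacks in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp} (churners caught)")
print(f"TN={tn} (stayers correctly left alone)")
print(f"FP={fp} (stayers flagged by mistake)")
print(f"FN={fn} (churners missed)")
```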

python

# Confusion matrices for both models, side by side
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, name, pred in zip(axes,
                           ["Logistic Regression", "Decision Tree"],
                           [lr_pred, dt_pred]):
    cm = confusion_matrix(y_test, pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=["Stay", "Churn"])
    disp.plot(ax=ax, cmap='Blues', values_format='d')
    ax.set_title(f"{name}\nConfusion Matrix")

plt.tight_layout()
plt.show()

Evaluation metrics for classification

Metric	Formula	When does it matter?
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$	Balanced data (50/50). Do NOT rely on it when classes are imbalanced!
Precision	$\frac{TP}{TP + FP}$	When FPs are costly: false alarms cost money (spam filter, fraud alert)
Recall	$\frac{TP}{TP + FN}$	When FNs are costly: misses are dangerous (disease, churn, fraud)
F1 Score	$2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$	When you need to balance Precision and Recall
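
The four formulas can be checked by hand from confusion-matrix counts. The counts below are invented for a hypothetical test set of 1,000 customers; the point is only to show how each metric is assembled.

```python
# Hypothetical confusion-matrix counts for 1,000 customers
tp, fp, fn, tn = 60, 20, 40, 880

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 940 / 1000
precision = tp / (tp + fp)                           # 60 / 80
recall = tp / (tp + fn)                              # 60 / 100
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
```

Accuracy looks high here mostly because stayers dominate the data, while recall shows that 40 of the 100 actual churners were missed.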

⚠️ Accuracy Paradox

A dataset has 95% "Stay" and 5% "Churn". A model that predicts "Stay" for EVERYONE gets Accuracy = 95%! But Recall = 0%: it catches no churners at all. Always check Precision and Recall, especially on imbalanced data.
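
The paradox is easy to reproduce. This sketch builds the 95/5 dataset described above and scores a "model" that always predicts "Stay"; the 100-customer size is an arbitrary choice for the demo.

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 customers: 5 churners (1) and 95 stayers (0)
y_true = [1] * 5 + [0] * 95

# A "model" that predicts Stay (0) for everyone
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)   # share of actual churners caught

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall:   {recall:.0%}")
```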

Cross-Validation

python

# Cross-Validation (5-fold)
from sklearn.model_selection import cross_val_score

lr_cv = cross_val_score(lr_model, X, y, cv=5, scoring='f1')
dt_cv = cross_val_score(dt_model, X, y, cv=5, scoring='f1')

print("📊 CROSS-VALIDATION (5-fold, F1 Score):")
print(f"  Logistic Regression: {lr_cv.mean():.4f} ± {lr_cv.std():.4f}")
print(f"  Decision Tree:       {dt_cv.mean():.4f} ± {dt_cv.std():.4f}")

Model Comparison Table

Criterion	Logistic Regression	Decision Tree
Interpretability	Coefficients → impact of each feature	Visual tree → easy to explain to the business
Speed	Very fast	Fast
Handling non-linear	No (linear boundary)	Yes (non-linear splits)
Overfitting risk	Low	High (if not pruned)
Feature scaling	Recommended (helps the solver converge and keeps regularization fair)	Not needed
Missing values	Must be handled	Some implementations can handle them
Best for DA	Baseline model, probability output	Explainable predictions, feature importance
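
Since scaling is recommended for Logistic Regression, one common pattern is to wrap the scaler and the model in a scikit-learn pipeline so that, during cross-validation, the scaler is fitted only on each training fold. The sketch below is one possible setup on assumed synthetic data, not the only way to do it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic churn data (assumption): two features on different scales
rng = np.random.default_rng(42)
n = 1000
tenure = rng.integers(1, 72, n)               # months
monthly_charges = rng.uniform(20, 110, n)     # USD
score = -0.05 * tenure + 0.03 * monthly_charges + rng.normal(0, 0.5, n)
churn = (score > np.median(score)).astype(int)
X = np.column_stack([tenure, monthly_charges])

# Scaler + model as one estimator: each CV fold refits the scaler on its
# own training part, so no information leaks from the validation part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, churn, cv=5, scoring="f1")

print(f"F1 per fold: {np.round(scores, 3)}")
print(f"Mean F1:     {scores.mean():.3f} ± {scores.std():.3f}")
```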

📌 Part 4: Practical ML for DA - Feature Engineering & Model Selection

Feature Engineering Basics

Feature Engineering: turning raw data into features that a model can understand and learn from well.

Technique	Description	Example
One-Hot Encoding	Convert categorical → dummy variables	contract_type: "monthly" → `[1, 0, 0]`
Label Encoding	Convert ordered categories → integers	satisfaction: Low=1, Medium=2, High=3
Standard Scaling	Scale features to mean=0, std=1	income: 50M → 0.3 (standardized)
Min-Max Scaling	Scale to [0, 1]	age: 25 → 0.15
Log Transform	Reduce skewness	revenue: 1000 → log(1000) = 6.9
Binning	Group a continuous number into categories	age → "18-25", "26-35", "36-45"

python

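from sklearn.preprocessing import StandardScaler
import pandas as pd

# Small illustrative frame so the snippet runs on its own (values are made up)
df = pd.DataFrame({
    'contract_type': ['monthly', 'one_year', 'two_year', 'monthly'],
    'payment_method': ['card', 'bank_transfer', 'e_wallet', 'card'],
    'tenure': [2.0, 24.0, 60.0, 8.0],
    'monthly_charges': [85.0, 55.5, 30.0, 99.9],
    'satisfaction': ['Low', 'High', 'Medium', 'Low'],
})

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['contract_type', 'payment_method'],
                            drop_first=True)

# Standard Scaling (in a real project, fit the scaler on the training set only)
scaler = StandardScaler()
df[['tenure', 'monthly_charges']] = scaler.fit_transform(
    df[['tenure', 'monthly_charges']]
)

# Label Encoding (ordinal): map explicitly, because LabelEncoder sorts labels
# alphabetically (High=0, Low=1, Medium=2) and would lose the intended order
satisfaction_order = {'Low': 1, 'Medium': 2, 'High': 3}
df['satisfaction_encoded'] = df['satisfaction'].map(satisfaction_order)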
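
The remaining techniques from the table (Min-Max scaling, log transform, binning) can be sketched with pandas and NumPy alone. The column names and values below are made up for the illustration, and the bin edges are an assumption chosen to match the age groups in the table.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
df = pd.DataFrame({
    "age": [19, 25, 31, 44, 38],
    "revenue": [1000.0, 50.0, 20000.0, 300.0, 7500.0],
})

# Min-Max scaling: (x - min) / (max - min) maps the column onto [0, 1]
age_min, age_max = df["age"].min(), df["age"].max()
df["age_minmax"] = (df["age"] - age_min) / (age_max - age_min)

# Log transform (natural log): compresses the long right tail of revenue.
# Values must be strictly positive; np.log1p is a common choice when zeros occur.
df["revenue_log"] = np.log(df["revenue"])

# Binning: right-closed intervals (17, 25], (25, 35], (35, 45]
df["age_group"] = pd.cut(df["age"], bins=[17, 25, 35, 45],
                         labels=["18-25", "26-35", "36-45"])

print(df)
```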

Overfitting vs Underfitting

mermaid

flowchart LR
    A["📉 Underfitting<br/>Model too simple<br/>High bias, Low variance<br/>Train ❌ Test ❌"] --> B["✅ Good Fit<br/>Model just complex enough<br/>Balanced bias-variance<br/>Train ✅ Test ✅"]
    B --> C["📈 Overfitting<br/>Model too complex<br/>Low bias, High variance<br/>Train ✅ Test ❌"]

	Underfitting	Good Fit	Overfitting
Train accuracy	Low	High	Very high
Test accuracy	Low	High (close to train)	Low (large gap vs train)
Cause	Model too simple, too few features	Enough complexity	Model too complex, too many features
Fix	Add features, use a more complex model	(none needed)	Prune the tree, regularization, more data; use cross-validation to detect it
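
The train/test gap in the table can be observed directly by varying the depth of a Decision Tree. The sketch below uses assumed synthetic data with noisy labels, so no model can be perfect on unseen rows; the depths 1, 4, and unlimited are arbitrary picks to show the pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): the label depends on two of five features plus noise
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1.0, n)
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for depth in [1, 4, None]:   # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={str(depth):>4}  train={train_acc:.3f}  "
          f"test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

The unlimited tree memorizes the training rows, noise included, so its training accuracy is perfect while its test accuracy is not; a large gap of this kind is the practical signal of overfitting.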

Model Selection for DA

mermaid

flowchart TD
    A["🤔 What is the problem?"] --> B{"Output type?"}
    B -->|"Continuous number<br/>(revenue, price)"| C["Regression"]
    B -->|"Label/Category<br/>(churn, fraud)"| D["Classification"]
    C --> E{"Need interpretability?"}
    E -->|"Yes"| F["Linear Regression"]
    E -->|"No, need accuracy"| G["Random Forest / XGBoost"]
    D --> H{"Data size?"}
    H -->|"< 10K rows"| I["Logistic Regression<br/>(baseline)"]
    H -->|"> 10K rows"| J{"Need to explain to business?"}
    J -->|"Yes"| K["Decision Tree<br/>(max_depth ≤ 5)"]
    J -->|"No, need performance"| L["Random Forest / XGBoost"]
    style F fill:#e3f2fd
    style I fill:#e3f2fd
    style K fill:#e3f2fd
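
The flowchart can also be written as a small helper function. `suggest_model` and its parameter names are hypothetical, introduced only for this sketch. The diagram does not say what happens at exactly 10K rows, so treating 10,000 rows and above as the larger branch is an assumption.

```python
def suggest_model(output_type: str, n_rows: int = 0,
                  needs_explanation: bool = True) -> str:
    """Return the starting model suggested by the flowchart.

    output_type: "continuous" (revenue, price) or "category" (churn, fraud).
    n_rows: dataset size, used only for classification.
    needs_explanation: whether the result must be explained to the business.
    """
    if output_type == "continuous":
        if needs_explanation:
            return "Linear Regression"
        return "Random Forest / XGBoost"

    if output_type == "category":
        if n_rows < 10_000:   # assumption: exactly 10K goes to the larger branch
            return "Logistic Regression (baseline)"
        if needs_explanation:
            return "Decision Tree (max_depth <= 5)"
        return "Random Forest / XGBoost"

    raise ValueError("output_type must be 'continuous' or 'category'")


print(suggest_model("continuous", needs_explanation=True))
print(suggest_model("category", n_rows=2_000))
print(suggest_model("category", n_rows=50_000, needs_explanation=True))
print(suggest_model("category", n_rows=50_000, needs_explanation=False))
```

A rule like this is only a starting point; the chosen model still has to be validated on held-out data.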

Communicating ML Results to the Business

A DA should not say: "F1 score = 0.82, AUC-ROC = 0.89, cross-validated with stratified k-fold."

A DA should say:

"The model predicts churn correctly in about 82% of cases. Out of 1,000 customers, it identifies 78% of those about to churn (recall = 0.78). If we run a retention campaign for the 200 highest-risk customers, we estimate retaining 120 customers × 5 million LTV = 600 million saved per year. Campaign cost: 100 million. ROI = 5x."

ML communication template:

📊 ML RESULTS SUMMARY
━━━━━━━━━━━━━━━━━━━━━
1. PROBLEM: [Predict churn/revenue/fraud]
2. MODEL: [Logistic Regression / Decision Tree]
3. PERFORMANCE: [Model is correct in X% of cases]
4. KEY DRIVERS: [Top 3 most influential features]
5. BUSINESS IMPACT: [Save/gain $X if applied]
6. RECOMMENDATION: [Specific action]
7. LIMITATION: [Model does not work well when...]
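
The business-impact figures in the example statement above follow from simple arithmetic, sketched here so the numbers can be swapped for a real campaign. The variable names are hypothetical, and ROI is taken to mean net gain divided by cost, which is the reading that reproduces the 5x in the example.

```python
# Figures from the example statement (amounts in million VND)
retained_customers = 120
ltv_per_customer = 5
campaign_cost = 100

value_saved = retained_customers * ltv_per_customer   # 120 × 5 = 600
net_gain = value_saved - campaign_cost                # 600 - 100 = 500
roi = net_gain / campaign_cost                        # 500 / 100 = 5.0

print(f"Value saved: {value_saved} million VND")
print(f"Net gain:    {net_gain} million VND")
print(f"ROI:         {roi:.1f}x")
```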

🔗 Connecting it all

ML in the DA journey

Session	Skill	How it relates to ML
Sessions 7-8	Python + Pandas	Data prep for ML: clean, transform, feature engineering
Session 9	EDA	Explore features and distributions → choose features for the model
Sessions 10-11	Visualization + BI	Visualize model results, feature importance, confusion matrix
Session 12	Data Storytelling	Present ML results to stakeholders without technical jargon
Session 13	Business Metrics	North Star metric → target variable. KPI → evaluation metric
Session 15	A/B Testing	Test the ML model's impact: A/B test "with model" vs "without model"
Session 17	Machine Learning	Build predictive models: from analysis to prediction

Checklist: "ML Literacy for DA"

✅ Understand Supervised vs Unsupervised, Regression vs Classification
✅ Know when to USE ML and when NOT to
✅ Build a Linear Regression in scikit-learn
✅ Build Logistic Regression + Decision Tree for classification
✅ Read a confusion matrix: TP, TN, FP, FN
✅ Distinguish Accuracy, Precision, Recall, F1
✅ Know train/test split and cross-validation
✅ Basic feature engineering: encoding, scaling
✅ Recognize overfitting vs underfitting
✅ Communicate ML results in business language

📚 References

Resource	Author	Main content
Hands-On Machine Learning	Aurélien Géron	The go-to book for practical ML: scikit-learn + TensorFlow
An Introduction to Statistical Learning (ISLR)	James, Witten, Hastie, Tibshirani	ML theory made accessible to non-CS backgrounds
scikit-learn Documentation	scikit-learn.org	Official docs: examples, API reference
Machine Learning for Everyone	Vas3k.com	Visual guide to ML concepts, very easy to follow
Cassie Kozyrkov's Decision Intelligence	Google	When NOT to use ML: a decision science perspective

🎯 Exercises and practice

Workshop: Build a churn prediction model: EDA, feature selection, Logistic Regression + Decision Tree, confusion matrix
Case Study: Spotify recommendation, Shopee churn prediction, VietCredit credit scoring
Mini Game: ML Explorer: 7 problems, choose the model type, Gold ≥ 80 XP
Blog: Vy's story: a healthtech DA and the lesson that "not everything needs ML"
Standards: CRISP-DM, scikit-learn API conventions, Model Documentation
