📘 Session 9: Exploratory Data Analysis (EDA) and Finding Insights

EDA is a "general health check-up" for a dataset. It is a mandatory step before any analysis: skipping EDA means diving into the analysis with your eyes shut.

🎯 Session objectives

After this session, learners will be able to:

Run a complete EDA workflow: univariate → bivariate → multivariate
Create distribution plots, scatter plots, and a correlation matrix
Detect patterns, anomalies, and relationships in the data
Write an EDA summary with clear, actionable insights

📋 Overview

In Session 8 you got comfortable with Pandas & Numpy: reading files, handling missing values, removing duplicates, merging DataFrames. The data is now clean, but clean does not mean understood. Suppose you have 500,000 rows of order data. What is the average value? Is the distribution skewed? Which column is most strongly associated with revenue? Are outliers distorting the analysis? These are the questions that EDA (Exploratory Data Analysis) answers.

EDA is the systematic exploration of data (observing, summarizing, visualizing) to understand its structure and to detect patterns and anomalies before drawing conclusions or building a model. John Tukey, the pioneer of EDA, wrote: "The greatest value of a picture is when it forces us to notice what we never expected to see." A good chart makes you see what you were not expecting.

Recall the OSEMN journey: in Session 7 you Obtained data with Python, and in Session 8 you Scrubbed it with Pandas. Session 9 is the Explore step, where a Data Analyst really starts "talking" to the data.

mermaid

flowchart LR
    A["📥 Obtain<br/>Session 7: Reading files with Python"] --> B["🧹 Scrub<br/>Session 8: Data cleaning with Pandas"]
    B --> C["🔍 Explore<br/>✅ Session 9: EDA"]
    C --> D["📊 iNterpret<br/>Sessions 10-11: Visualization"]
    D --> E["🤖 Model<br/>Session 12+"]
    style C fill:#e8f5e9,stroke:#4caf50,stroke-width:3px

💡 EDA in real-world business

Who needs EDA?	Situation	Expected output
DA checking newly received data	Receives a 200K-row file from the operations team	Summary: shape, dtypes, missing %, distributions
Product Manager asking about trends	"Was there anything unusual in Q4 revenue?"	Time series plot + highlighted anomalies
Marketing wanting segments	"Who are our VIP customers?"	Distribution + scatter plot of spending vs frequency
DS before building a model	Checking feature relationships	Correlation matrix, feature importance

Tools used: Jupyter Notebook, Pandas, Matplotlib, Seaborn. These are all you need to turn data into meaningful pictures.

📌 Part 1: The EDA Framework, a Structured Exploration Workflow

OSEMN Pipeline: Focus on the Explore Step

EDA is not "look at the data and come up with an insight"; it is a structured process. Each step leads to the next, from the general to the specific:

mermaid

flowchart TD
    A["1️⃣ Overview<br/>shape, dtypes, head, info"] --> B["2️⃣ Univariate<br/>Analyze each variable on its own"]
    B --> C["3️⃣ Bivariate<br/>Relationship between 2 variables"]
    C --> D["4️⃣ Multivariate<br/>Several variables at once"]
    D --> E["5️⃣ Summary<br/>Insight + recommendation"]

First Step: Dataset Overview

Every EDA starts with 5 "must-run" commands that give you the big picture before you go deeper:

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the plotting style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Load the e-commerce dataset
df = pd.read_csv("ecommerce_orders.csv")

# 5 overview commands: run them EVERY TIME you receive new data
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
# Shape: 50,000 rows × 12 columns

print(df.head())
df.info()            # info() prints its own report and returns None, so no print() here
print(df.describe())
print(df.dtypes)

python

# Check missing values: how many, what %, and in which columns
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": missing_pct
}).query("missing_count > 0").sort_values("missing_pct", ascending=False)

print(missing_report)
#                missing_count  missing_pct
# review_score            3500         7.00
# shipping_date           1200         2.40
# discount                 800         1.60

📌 Quick rules for handling missing values during EDA

< 5% missing → fill with the median (numerical) or mode (categorical), or use dropna
5–30% missing → analyze the missing pattern first, then apply fillna deliberately
> 30% missing → consider dropping the whole column, since it is hard to trust
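
The thresholds above can be sketched as a small helper that labels each column with a suggested action. This is only an illustration: `suggest_missing_strategy` and the `demo` DataFrame are hypothetical names, and the 5% / 30% cut-offs are the rules of thumb from this section rather than fixed standards.

```python
import numpy as np
import pandas as pd


def suggest_missing_strategy(df, low=5.0, high=30.0):
    """Label each column that has missing values with a rule-of-thumb action."""
    missing_pct = df.isnull().mean() * 100
    rows = []
    for col, pct in missing_pct[missing_pct > 0].items():
        if pct < low:
            numeric = pd.api.types.is_numeric_dtype(df[col])
            action = "fill with median" if numeric else "fill with mode"
        elif pct <= high:
            action = "inspect missing pattern first"
        else:
            action = "consider dropping column"
        rows.append({"column": col, "missing_pct": round(pct, 2), "action": action})
    return pd.DataFrame(rows, columns=["column", "missing_pct", "action"])


# Hypothetical 50-row sample, only to exercise the three branches
demo = pd.DataFrame({
    "discount": [np.nan] + [0.1] * 49,          # 2% missing
    "review_score": [np.nan] * 5 + [4.0] * 45,  # 10% missing
    "coupon_code": [None] * 40 + ["A"] * 10,    # 80% missing
    "quantity": [1] * 50,                       # complete, so not reported
})
report = suggest_missing_strategy(demo)
print(report)
```

The helper only suggests an action; the decision still depends on what the column means for the business.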

Classifying Variables: An Important Step Before Analysis

The type of a variable determines which chart and which calculation you should use:

python

# Classify columns automatically
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print(f"Numerical ({len(numerical_cols)}): {numerical_cols}")
# Numerical (6): ['order_value', 'quantity', 'discount', 'shipping_fee',
#                  'review_score', 'customer_age']

print(f"Categorical ({len(categorical_cols)}): {categorical_cols}")
# Categorical (5): ['order_id', 'customer_id', 'product_category',
#                    'city', 'payment_method']
# (assumes order_date was already parsed as datetime; read as plain text,
#  it would also show up in this list)

# Check cardinality: the number of unique values in each categorical column
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
# (order_id and customer_id lines omitted)
# product_category: 15 unique values
# city: 63 unique values
# payment_method: 4 unique values
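
Cardinality also tells you which "categorical" columns are worth plotting. The sketch below separates ID-like columns (almost every value unique) from columns that have too many levels for a bar chart. The function name `split_by_cardinality`, the 0.9 ratio, and the 20-level limit are assumptions chosen for illustration, not values from the lesson.

```python
import pandas as pd


def split_by_cardinality(df, cols, id_ratio=0.9, max_plot_levels=20):
    """Group categorical columns by how many distinct values they hold."""
    n_rows = len(df)
    groups = {"id_like": [], "high_cardinality": [], "plot_ready": []}
    for col in cols:
        n_unique = df[col].nunique()
        if n_rows > 0 and n_unique / n_rows >= id_ratio:
            groups["id_like"].append(col)           # identifiers: skip in charts
        elif n_unique > max_plot_levels:
            groups["high_cardinality"].append(col)  # plot only the top N levels
        else:
            groups["plot_ready"].append(col)        # safe for bar / pie charts
    return groups


# Hypothetical 100-row sample
demo = pd.DataFrame({
    "order_id": [f"O{i:04d}" for i in range(100)],
    "city": [f"city_{i % 30}" for i in range(100)],
    "payment_method": ["COD", "Credit Card", "E-wallet", "Bank Transfer"] * 25,
})
groups = split_by_cardinality(demo, demo.columns)
print(groups)
```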

💡 Comparison with Excel

Task	Excel	Pandas EDA
Quick look at the data	Scroll through the sheet	`df.head()`, `df.sample(5)`
Summary statistics	Descriptive Statistics add-in	`df.describe()`
Count missing values	`COUNTBLANK` column by column	`df.isnull().sum()` covers all columns in one line
Classify variables	Inspect each column by eye	`df.select_dtypes()` does it automatically

📌 Part 2: Univariate Analysis, One Variable at a Time

Univariate means analyzing one variable at a time. The goal is to understand its distribution, its central tendency (mean, median, mode), and any anomalies (outliers, skewness).

Numerical Variables — Histogram, Box Plot, KDE

python

# Histogram: inspect the distribution of order_value
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Histogram
axes[0].hist(df["order_value"], bins=50, color="steelblue", edgecolor="white")
axes[0].set_title("Distribution of Order Value")
axes[0].set_xlabel("Order Value (VND)")
axes[0].axvline(df["order_value"].mean(), color="red", linestyle="--", label=f'Mean: {df["order_value"].mean():,.0f}')
axes[0].axvline(df["order_value"].median(), color="green", linestyle="--", label=f'Median: {df["order_value"].median():,.0f}')
axes[0].legend()

# 2. Box plot: spot outliers
axes[1].boxplot(df["order_value"].dropna(), vert=True)
axes[1].set_title("Box Plot — Order Value")

# 3. KDE (Kernel Density Estimation): a smooth density curve
sns.kdeplot(df["order_value"], ax=axes[2], fill=True, color="coral")
axes[2].set_title("KDE — Order Value")

plt.tight_layout()
plt.show()

python

# Detailed descriptive statistics, going beyond describe()
col = "order_value"
stats = {
    "count": df[col].count(),
    "mean": df[col].mean(),
    "median": df[col].median(),
    "std": df[col].std(),
    "skewness": df[col].skew(),      # > 0: right-skewed, < 0: left-skewed
    "kurtosis": df[col].kurtosis(),  # excess kurtosis (normal = 0): > 0 heavy tails, < 0 light tails
    "min": df[col].min(),
    "max": df[col].max(),
    "Q1": df[col].quantile(0.25),
    "Q3": df[col].quantile(0.75),
    "IQR": df[col].quantile(0.75) - df[col].quantile(0.25)
}
for k, v in stats.items():
    print(f"{k:>12}: {v:>15,.2f}")
#        count:       50,000.00
#         mean:    1,250,000.50
#       median:      980,000.00
#          std:      850,000.30
#     skewness:            1.85   ← Right-skewed: a long tail of high-value orders
#     kurtosis:            4.20   ← Heavy tails (many extreme values)

⚠️ Skewness: When Mean ≠ Median

If skewness > 1 or < -1, the distribution is strongly skewed. In that case:

Do not use the mean as the representative value; use the median instead
The histogram will be asymmetric, and the box plot will show many outlier dots
Consider a log transform before analysis: df["log_value"] = np.log1p(df["order_value"])
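
To see what the log transform does to skewness, here is a minimal sketch on synthetic data. The lognormal parameters are assumptions picked to give values around one million, roughly like VND order values; they are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "order values" (lognormal), purely illustrative
order_value = pd.Series(rng.lognormal(mean=13.8, sigma=0.6, size=10_000))
log_value = np.log1p(order_value)  # log(1 + x), safe when x can be 0

print(f"Skewness before log1p: {order_value.skew():.2f}")
print(f"Skewness after log1p:  {log_value.skew():.2f}")
print(f"Mean / median before:  {order_value.mean() / order_value.median():.2f}")
```

On data like this, the skewness drops close to 0 after the transform, and the mean sits well above the median before it, which is why the median is the safer summary for the raw values.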

Categorical Variables — Bar Chart, Value Counts

python

# Analyze a categorical variable: product_category
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Bar chart: the 10 most common categories
top_cats = df["product_category"].value_counts().head(10)
top_cats.plot(kind="barh", ax=axes[0], color="steelblue")
axes[0].set_title("Top 10 Product Categories")
axes[0].set_xlabel("Number of Orders")

# 2. Payment method: pie chart (suitable when there are only a few categories)
df["payment_method"].value_counts().plot(
    kind="pie", ax=axes[1], autopct="%1.1f%%",
    colors=["#4CAF50", "#2196F3", "#FF9800", "#F44336"]
)
axes[1].set_title("Payment Method Distribution")
axes[1].set_ylabel("")

plt.tight_layout()
plt.show()

python

# Detailed value_counts: share (%) + cumulative %
cat_analysis = df["product_category"].value_counts()
cat_pct = df["product_category"].value_counts(normalize=True) * 100
cat_cumsum = cat_pct.cumsum()

cat_summary = pd.DataFrame({
    "count": cat_analysis,
    "pct": cat_pct.round(2),
    "cumulative_pct": cat_cumsum.round(2)
})
print(cat_summary.head(8))
#                   count    pct  cumulative_pct
# Điện thoại         8500  17.00           17.00
# Thời trang         7200  14.40           31.40
# Điện tử            6800  13.60           45.00
# Gia dụng           5500  11.00           56.00
# Mỹ phẩm            4200   8.40           64.40
# Thực phẩm          3800   7.60           72.00
# Sách               3100   6.20           78.20  ← 7 categories cover ~78% of orders
# Thể thao           2900   5.80           84.00

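📖 The Pareto Principle in EDA

Look at the cumulative_pct column: 7 of the 15 categories (47% of categories) account for about 78% of orders, and the 8th pushes the share past 80%. This is not a strict 80/20 split, but it is the same concentration effect the Pareto Principle describes: a deep dive into the top 7 or 8 categories covers most of the business. In Excel, you would need to build a PivotTable, sort it, and add a cumulative-percentage column by hand.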
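
The same cumulative reading can be automated. The sketch below uses the eight category counts shown in the table and the 50,000-order total from the dataset overview. The helper name `categories_to_cover` is hypothetical.

```python
import pandas as pd


def categories_to_cover(counts, total, threshold=80.0):
    """Return how many top categories are needed to reach `threshold` % of `total`."""
    cumulative = (counts.sort_values(ascending=False) / total * 100).cumsum()
    needed = int((cumulative < threshold).sum()) + 1
    return min(needed, len(counts)), cumulative


# Counts of the 8 largest categories from the table; total is all 50,000 orders
top_counts = pd.Series({
    "Điện thoại": 8500, "Thời trang": 7200, "Điện tử": 6800, "Gia dụng": 5500,
    "Mỹ phẩm": 4200, "Thực phẩm": 3800, "Sách": 3100, "Thể thao": 2900,
})
needed, cumulative = categories_to_cover(top_counts, total=50_000)
print(cumulative.round(1))
print(f"Categories needed to reach 80% of orders: {needed}")
```

With these numbers, the first seven categories stop at 78.2% and the eighth is the one that crosses 80%.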

📌 Part 3: Bivariate & Multivariate Analysis, Relationships Between Variables

Bivariate means two variables; multivariate means several. This is where the real insights tend to appear: which variables move together, which customer segment spends the most, which features are strongly related to the target.

Numerical vs Numerical — Scatter Plot & Correlation

python

# Scatter plot: order_value vs quantity
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Basic scatter plot
axes[0].scatter(df["quantity"], df["order_value"],
                alpha=0.3, s=10, color="steelblue")
axes[0].set_xlabel("Quantity")
axes[0].set_ylabel("Order Value (VND)")
axes[0].set_title("Order Value vs Quantity")

# 2. Scatter + regression line (seaborn)
sns.regplot(x="customer_age", y="order_value", data=df,
            ax=axes[1], scatter_kws={"alpha": 0.2, "s": 10},
            line_kws={"color": "red"})
axes[1].set_title("Order Value vs Customer Age")

plt.tight_layout()
plt.show()

python

# Correlation: Pearson vs Spearman
# Pearson: linear correlation (sensitive to outliers and strong skew)
# Spearman: rank correlation (captures any monotonic relationship, robust to outliers)

pearson_corr = df[numerical_cols].corr(method="pearson")
spearman_corr = df[numerical_cols].corr(method="spearman")

print("=== Pearson Correlation ===")
print(pearson_corr.round(3))
#                order_value  quantity  discount  shipping_fee  review_score  customer_age
# order_value          1.000     0.720    -0.150         0.450         0.120         0.050
# quantity             0.720     1.000    -0.080         0.380         0.090         0.030
# discount            -0.150    -0.080     1.000        -0.050        -0.020         0.010
# shipping_fee         0.450     0.380    -0.050         1.000         0.060         0.020
# review_score         0.120     0.090    -0.020         0.060         1.000         0.180
# customer_age         0.050     0.030     0.010         0.020         0.180         1.000
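
To make the Pearson vs Spearman difference concrete, here is a small sketch on synthetic data where y always increases with x but not in a straight line. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
curve = pd.DataFrame({"x": x, "y": np.exp(x)})  # strictly increasing, but not linear

pearson = curve.corr(method="pearson").loc["x", "y"]
spearman = curve.corr(method="spearman").loc["x", "y"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman comes out at 1 because the ranks of x and y agree perfectly, while Pearson is lower because the relationship is curved. A large gap between the two is a hint to look at the scatter plot before trusting a single number.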

📌 Reading Correlation: Quick Rules

Value (absolute)	Meaning	Example
0.7 – 1.0	Strong correlation	`order_value` ↔ `quantity` (0.72) → more items tend to mean a higher order value
0.4 – 0.7	Moderate correlation	`order_value` ↔ `shipping_fee` (0.45)
0.0 – 0.4	Weak or no correlation	`customer_age` ↔ `order_value` (0.05) → almost no linear relationship with age
Negative sign	Inverse relationship	`discount` ↔ `order_value` (-0.15) → higher discount, slightly lower value (weak)

⚠️ Correlation ≠ Causation! Ice cream sales correlate with the drowning rate, but ice cream does not cause drowning (both rise in summer).
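
The ice cream example can be simulated. In the sketch below a hidden variable (temperature) drives both series, which are otherwise unrelated. All numbers are invented for the simulation. Removing the linear effect of temperature from both series and correlating what is left is one simple way to check for such a common cause.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
temperature = rng.normal(28, 5, n)  # hidden common cause

sim = pd.DataFrame({
    "temperature": temperature,
    "ice_cream": 10 * temperature + rng.normal(0, 20, n),
    "drownings": 0.5 * temperature + rng.normal(0, 1, n),
})

raw = sim.corr().loc["ice_cream", "drownings"]


def residual(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


controlled = np.corrcoef(
    residual(sim["ice_cream"], sim["temperature"]),
    residual(sim["drownings"], sim["temperature"]),
)[0, 1]

print(f"Raw correlation:                  {raw:.2f}")
print(f"After controlling for temperature: {controlled:.2f}")
```

The raw correlation is high, yet it vanishes once temperature is accounted for, so the two series were never driving each other.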

Correlation Matrix Heatmap

python

# Heatmap of the correlation matrix (one of the most widely used EDA charts)
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(pearson_corr, dtype=bool))  # Hide the upper triangle and diagonal (redundant)

sns.heatmap(pearson_corr, mask=mask, annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1,
            square=True, linewidths=0.5)
plt.title("Correlation Matrix — E-commerce Orders", fontsize=14)
plt.tight_layout()
plt.show()

Numerical vs Categorical — Box Plot Comparison

python

# Compare order_value across payment methods
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Box plot comparison
sns.boxplot(x="payment_method", y="order_value", data=df, ax=axes[0],
            palette="Set2")
axes[0].set_title("Order Value by Payment Method")
axes[0].set_ylabel("Order Value (VND)")

# 2. Violin plot: more detail than a box plot (shows the shape of the distribution)
sns.violinplot(x="payment_method", y="order_value", data=df, ax=axes[1],
               palette="Set2", inner="quartile")
axes[1].set_title("Order Value Distribution by Payment Method")

plt.tight_layout()
plt.show()

# Group statistics, similar to a PivotTable in Excel
group_stats = df.groupby("payment_method")["order_value"].agg(
    ["count", "mean", "median", "std"]
).round(0)
print(group_stats)
# (groupby sorts the group keys alphabetically)
#                  count       mean     median        std
# Bank Transfer     4500  1800000.0  1500000.0  1100000.0
# COD              12500  1050000.0   850000.0   720000.0
# Credit Card      15000  1450000.0  1200000.0   950000.0
# E-wallet         18000  1100000.0   900000.0   780000.0

Categorical vs Categorical — Crosstab & Heatmap

python

# Crosstab, similar to a counting PivotTable in Excel
ct = pd.crosstab(df["product_category"], df["payment_method"],
                 normalize="index") * 100  # row percentages
ct = ct.round(1)

# Heatmap of the crosstab
plt.figure(figsize=(10, 8))
# Note: head(8) takes the first 8 categories in index (alphabetical) order, not the 8 largest
sns.heatmap(ct.head(8), annot=True, fmt=".1f", cmap="YlOrRd",
            cbar_kws={"label": "% of orders"})
plt.title("Payment Method by Product Category (%)")
plt.ylabel("Product Category")
plt.xlabel("Payment Method")
plt.tight_layout()
plt.show()

💡 Excel equivalents

Analysis	Excel	Pandas
Correlation	Data Analysis ToolPak → Correlation	`df.corr()`
Scatter plot	Insert Chart → Scatter	`plt.scatter()` / `sns.regplot()`
Comparative box plot	Not straightforward	`sns.boxplot(x=cat, y=num)`
Counting PivotTable	PivotTable → Count	`pd.crosstab()`
Conditional formatting	Format → Color Scales	`sns.heatmap()`

📌 Part 4: Patterns & Anomalies, Detecting the Unusual

Parts 1–3 help you describe the data; Part 4 helps you detect what is unusual: outliers dragging the mean, hidden trends, seasonality in revenue, and missing data that follows a pattern rather than occurring at random.

Outlier Detection — Visual + Statistical

python

# IQR method: one of the most common statistical rules for flagging outliers
def detect_outliers_iqr(series, multiplier=1.5):
    """Detect outliers using the IQR method."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    outliers = series[(series < lower) | (series > upper)]
    return outliers, lower, upper

outliers, lower, upper = detect_outliers_iqr(df["order_value"])
print(f"Outlier boundaries: [{lower:,.0f} — {upper:,.0f}]")
print(f"Number of outliers: {len(outliers):,} ({len(outliers)/len(df)*100:.1f}%)")
# Outlier boundaries: [50,000 — 3,200,000]
# Number of outliers: 2,350 (4.7%)

# Visualize the outliers
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Box plot with the outlier bounds highlighted
sns.boxplot(y=df["order_value"], ax=axes[0], color="lightblue")
axes[0].axhline(upper, color="red", linestyle="--", label=f"Upper: {upper:,.0f}")
axes[0].axhline(lower, color="red", linestyle="--", label=f"Lower: {lower:,.0f}")
axes[0].legend()
axes[0].set_title("Outlier Detection — IQR Method")

# Histogram with vs without outlier
df_no_outlier = df[(df["order_value"] >= lower) & (df["order_value"] <= upper)]
axes[1].hist(df_no_outlier["order_value"], bins=50, color="steelblue",
             edgecolor="white", alpha=0.7, label="Normal")
axes[1].hist(outliers, bins=20, color="red", edgecolor="white",
             alpha=0.5, label="Outliers")
axes[1].legend()
axes[1].set_title("Distribution — Normal vs Outliers")

plt.tight_layout()
plt.show()

⚠️ Outliers: Remove or Keep?

Remove when: data-entry errors (an order of -500,000 VND), test data, data corruption
Keep when: the value is real (a VIP customer spending 50 million VND) and is itself an important insight
Separate when: you want to analyze two groups, normal customers vs VIP/whale customers

Rule: always investigate the cause before deciding to remove or keep. An outlier may be a bug, or it may be gold.
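
One way to follow this rule in code is to flag outliers instead of deleting them, so each group can be inspected before any decision. The sketch below is an assumption about how that could look: `flag_outliers`, the `outlier_flag` column, and the eight sample orders are all hypothetical.

```python
import numpy as np
import pandas as pd


def flag_outliers(df, col, multiplier=1.5):
    """Return a copy of df with an 'outlier_flag' column: low / normal / high."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    out = df.copy()
    out["outlier_flag"] = np.select(
        [out[col] < lower, out[col] > upper], ["low", "high"], default="normal"
    )
    return out


# Hypothetical orders: one negative value (likely an error), one very large (maybe a VIP)
orders = pd.DataFrame({"order_value": [
    -500_000, 800_000, 900_000, 950_000, 1_000_000, 1_100_000, 1_200_000, 50_000_000,
]})
flagged = flag_outliers(orders, "order_value")
print(flagged)

# Inspect each group separately before deciding what to do with it
print(flagged.groupby("outlier_flag")["order_value"].agg(["count", "min", "max"]))
```

Nothing is dropped here: the "low" row can be checked as a possible data-entry error and the "high" row as a possible VIP order.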

Trend & Seasonality Detection

python

# Trend over time: monthly revenue
df["order_date"] = pd.to_datetime(df["order_date"])
monthly = df.groupby(df["order_date"].dt.to_period("M")).agg(
    total_revenue=("order_value", "sum"),
    order_count=("order_id", "count"),
    avg_order_value=("order_value", "mean")
).reset_index()
monthly["order_date"] = monthly["order_date"].dt.to_timestamp()

fig, ax1 = plt.subplots(figsize=(14, 6))

# Revenue (left axis)
ax1.plot(monthly["order_date"], monthly["total_revenue"] / 1e9,
         color="steelblue", marker="o", linewidth=2, label="Revenue (billion VND)")
ax1.set_ylabel("Revenue (billion VND)", color="steelblue")
ax1.set_xlabel("Month")

# Order count (right axis)
ax2 = ax1.twinx()
ax2.bar(monthly["order_date"], monthly["order_count"],
        alpha=0.3, color="coral", width=20, label="Order Count")
ax2.set_ylabel("Order Count", color="coral")

ax1.set_title("Monthly Revenue & Order Count — Trend Analysis")
fig.legend(loc="upper left", bbox_to_anchor=(0.12, 0.88))
plt.tight_layout()
plt.show()
# Insight: revenue peaks in Nov-Dec (year-end sales season) and dips in Feb (Tết holiday)
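
Reading peaks and dips off a chart can be backed by a few numbers. The sketch below uses a made-up monthly revenue series (the values are illustrative, not from the dataset) and computes month-over-month change, a 3-month rolling mean, and the peak and dip months.

```python
import pandas as pd

# Hypothetical monthly revenue in billion VND (illustrative numbers only)
monthly_rev = pd.Series(
    [4.0, 3.1, 4.2, 4.4, 4.5, 4.7, 4.6, 4.8, 5.0, 5.3, 6.8, 7.2],
    index=pd.period_range("2025-01", periods=12, freq="M"),
    name="revenue_bn_vnd",
)

trend = pd.DataFrame({
    "revenue": monthly_rev,
    "mom_pct": monthly_rev.pct_change() * 100,    # month-over-month change in %
    "rolling_3m": monthly_rev.rolling(3).mean(),  # smooths short-term noise
})
print(trend.round(2))

peak_month = monthly_rev.idxmax()
dip_month = monthly_rev.idxmin()
print(f"Peak: {peak_month}, dip: {dip_month}")
```

With a single year of data a peak is only a candidate for seasonality; seeing it repeat across several years is what would confirm it.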

Missing Data Patterns

python

# Is the missing data patterned or random?
# MCAR: Missing Completely At Random (unrelated to any data)
# MAR:  Missing At Random (missingness depends on other observed variables)
# MNAR: Missing Not At Random (missingness depends on the missing value itself)

# Visualize the missing pattern
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Missing matrix: each line is one row of df; with gray_r, dark cells = NaN
axes[0].imshow(df.isnull().values[:200], aspect="auto", cmap="gray_r",
               interpolation="none")
axes[0].set_title("Missing Data Pattern (first 200 rows)")
axes[0].set_xlabel("Columns")
axes[0].set_yticks([])

# 2. Missing correlation: which columns tend to be missing together?
# Keep only columns that actually have missing values; the others have zero
# variance and would fill the matrix with NaN.
cols_with_na = df.columns[df.isnull().any()]
missing_corr = df[cols_with_na].isnull().corr()
sns.heatmap(missing_corr, annot=True, fmt=".2f", ax=axes[1],
            cmap="RdBu_r", center=0)
axes[1].set_title("Missing Value Correlation")

plt.tight_layout()
plt.show()
# Insight: review_score and shipping_date have a missing correlation of 0.65
# → when an order has not shipped yet (no shipping_date), the customer has not reviewed it (no review_score)
# → This looks like MAR rather than MCAR: the missingness follows business logic!
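
A simple numeric check for this kind of pattern is to compare the missing rate of one column between rows where another column is missing and rows where it is present. The sketch below simulates the shipping/review situation described above; the 5% and 95% rates are assumptions chosen for the simulation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5_000

# Simulated orders: about 5% have not shipped yet
not_shipped = rng.random(n) < 0.05
demo = pd.DataFrame({
    "shipping_date": pd.to_datetime(["2025-06-01"] * n),
    "review_score": rng.integers(1, 6, n).astype(float),
})
demo.loc[not_shipped, "shipping_date"] = pd.NaT

# Unshipped orders almost never have a review; shipped orders skip it about 5% of the time
no_review = np.where(not_shipped, rng.random(n) < 0.95, rng.random(n) < 0.05)
demo.loc[no_review, "review_score"] = np.nan

# Missing rate of review_score, split by whether shipping_date is missing
group = np.where(demo["shipping_date"].isnull(),
                 "shipping_date missing", "shipping_date present")
review_missing_rate = demo["review_score"].isnull().groupby(group).mean()
print(review_missing_rate.round(3))
```

If the two rates were similar, the missingness would look unrelated to shipping status; a large gap like the one simulated here points to MAR.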

EDA Summary: Writing Actionable Insights

python

# EDA Summary template: the final output of every EDA
eda_summary = """
========================================
       EDA SUMMARY: E-commerce Orders
========================================

📊 DATASET OVERVIEW:
- 50,000 orders | 12 columns | Period: 01/2025 – 12/2025
- Missing: review_score (7%), shipping_date (2.4%), discount (1.6%)

📈 KEY FINDINGS:
1. Order Value: median 980K VND, right-skewed (skew=1.85)
   → Report the median, NOT the mean, as the "typical order"

2. Top 3 categories (Điện thoại, Thời trang, Điện tử) = 45% of all orders
   → Concentrated demand: prioritize marketing on these 3 categories

3. Correlation: order_value ↔ quantity = 0.72 (strong)
   → Hypothesis to test: encouraging multi-item purchases raises order value

4. Payment: Bank Transfer has the highest avg order value (1.8M)
   → High-value orders tend to be paid by bank transfer

5. Seasonality: peak in Nov-Dec, dip in Feb
   → Prepare inventory & staffing for Q4

⚠️ ANOMALIES:
- 4.7% outliers (> 3.2M VND): need checking, VIP customers or data errors?
- Missing review_score correlates with missing shipping_date (MAR)

🎯 RECOMMENDATIONS:
- Segment customers by order_value (VIP > 3M, Standard, Low)
- Dig deeper into the differences between payment methods
- Build a forecasting model for monthly revenue (clear seasonality)
"""
print(eda_summary)

📖 Complete EDA checklist: copy and reuse

□ Shape, dtypes, head(): dataset overview
□ Missing values: count, %, pattern (MCAR/MAR/MNAR)
□ Duplicates check
□ Univariate: histogram + boxplot for each numerical column
□ Univariate: value_counts + bar chart for each categorical column
□ describe() + skew() + kurtosis() for numerical columns
□ Bivariate: scatter plots for the most correlated pairs
□ Correlation matrix + heatmap
□ Box plot comparison (numerical vs categorical)
□ Crosstab (categorical vs categorical) if needed
□ Time series trend (if there is a date column)
□ Outlier detection: IQR method
□ EDA Summary: key findings + recommendations

🔑 Key terms

Vietnamese	English	Explanation
Phân tích khám phá	EDA (Exploratory Data Analysis)	Exploring data before the main analysis; descriptive and visual
Tương quan	Correlation	Strength of association between 2 variables: Pearson (linear), Spearman (rank)
Phân phối	Distribution	How values are spread: histogram, KDE, skewness
Ma trận tương quan	Correlation Matrix	Table of correlations between all pairs of variables, visualized with a heatmap
Insight	Insight	A valuable finding from data; it must be actionable, not merely descriptive
Phân tích đơn biến	Univariate Analysis	Analysis of a single variable: histogram, box plot, value_counts
Phân tích hai biến	Bivariate Analysis	Analysis of the relationship between 2 variables: scatter plot, box plot comparison
Giá trị ngoại lai	Outlier	An unusual value outside the normal range; detected here with the IQR method
Độ lệch	Skewness	Degree of asymmetry of a distribution: > 0 right-skewed, < 0 left-skewed
Xu hướng	Trend	Pattern of change over time: increase, decrease, seasonality

📊 Session wrap-up

✅ Checklist: what you have learned

[ ] Run a dataset overview: shape, info, describe, dtypes, isnull().sum()
[ ] Classify variables as numerical vs categorical with select_dtypes()
[ ] Univariate analysis: histogram, box plot, KDE for numerical variables
[ ] Univariate analysis: bar chart, value_counts, pie chart for categorical variables
[ ] Compute skew() and kurtosis() and interpret them
[ ] Bivariate: scatter plot, corr() (Pearson & Spearman)
[ ] Correlation matrix + heatmap with Seaborn
[ ] Box plot comparison: numerical vs categorical
[ ] Crosstab: categorical vs categorical
[ ] Outlier detection with the IQR method
[ ] Trend & seasonality detection with a time series plot
[ ] Missing data pattern analysis (MCAR, MAR, MNAR)
[ ] Write an EDA summary with actionable insights

🗺️ Learning journey

Sessions 3-4: Excel (Pivot Tables, basic charts)
    ↓
Sessions 5-6: SQL (Query, JOIN, GROUP BY, Window Functions)
    ↓
Session 7: Python (programming basics, reading/writing files)
    ↓
Session 8: Pandas & Numpy (professional data cleaning)
    ↓
✅ Session 9: EDA (exploring data, finding insights)
    ↓
→ Session 10: Data Visualization (Matplotlib & Seaborn in depth)

You have just completed what is arguably the most important skill of a Data Analyst: turning raw data into valuable insight. EDA is not just running describe() and drawing a few charts; it is a process of asking questions, exploring, discovering, and concluding. From the next session on, you will bring your visualization skills to a professional level with Matplotlib & Seaborn, turning insight into visual storytelling.

