🏆 Standards: EDA (Exploratory Data Analysis)

Standards and frameworks that help you run EDA systematically and reproducibly, instead of drawing charts at random and then "looking for insights".

Overview of the Session 9 standards

Session 9 is the Explore step in the data analysis journey. After cleaning the data in Session 8, you start "talking" to the data through histograms, scatter plots, and heatmaps. But drawing charts does not mean you are doing EDA properly. EDA is not "draw lots of charts and see which one looks nice"; it is a methodical process that follows a clear framework, with the goal of exploring structure and detecting patterns and anomalies in the data.

Without a framework, data analysts (DAs) tend to make three classic mistakes:

Confirmation bias: drawing only the charts that confirm an existing hypothesis and ignoring the ones that contradict it
Chart overload: producing 50 charts without a specific question, which leads to "paralysis by analysis"
Not reproducible: messy notebooks, cells run out of order, colleagues cannot follow the work

According to the Kaggle State of Data Science survey (2025), 72% of data professionals report that EDA is the most time-consuming step in the analysis pipeline, and 58% admit that their EDA notebooks are hard to share with others. Mastering the standards below helps you do EDA faster, more accurately, and reproducibly.

This session focuses on three core standards for EDA:

CRISP-DM: the international framework for data mining, whose Data Understanding phase is EDA
OSEMN: a five-step pipeline in which the Explore step is where EDA happens
Reproducible Analysis: principles for writing an EDA notebook that others can rerun

📋 Related standards

| # | Standard | Organization / Author | Applies to Session 9 |
|---|---|---|---|
| 1 | CRISP-DM | Consortium including SPSS, NCR, and Daimler-Benz (1996–2000); documentation now hosted by IBM | Data Understanding phase: exploring and assessing the data |
| 2 | OSEMN | Hilary Mason & Chris Wiggins (2010) | Explore step: univariate, bivariate, multivariate analysis |
| 3 | Reproducible Analysis | Data science community (Jupyter, RStudio, Google) | EDA notebook that is reusable, shareable, reproducible |

1️⃣ CRISP-DM — Data Understanding Phase

Introduction

CRISP-DM (Cross-Industry Standard Process for Data Mining) is the most widely used framework for Data Science and Data Analytics projects. It was developed from 1996 onward by an industry consortium that included SPSS, NCR, and DaimlerChrysler; IBM later acquired SPSS and now hosts the CRISP-DM documentation. According to KDnuggets polls (2014–2022), CRISP-DM consistently ranks #1, used by more than 40% of data professionals.

CRISP-DM has six phases arranged in a cycle:

┌─────────────────────────────────────────┐
│  1. Business Understanding              │
│  2. Data Understanding ← ✅ EDA here   │
│  3. Data Preparation                    │
│  4. Modeling                            │
│  5. Evaluation                          │
│  6. Deployment                          │
└─────────────────────────────────────────┘

The Data Understanding phase is where EDA happens. Its goals are to collect initial data, describe the data, explore the data, and verify data quality. This is exactly what you do when you call df.describe(), draw a histogram, or compute correlations; all of it belongs to Data Understanding.

Applying it in this session

CRISP-DM Data Understanding consists of four tasks that map directly onto the EDA workflow:

| CRISP-DM Task | EDA Action | Pandas / Seaborn Code |
|---|---|---|
| Collect Initial Data | Load the dataset, check its shape | `pd.read_csv()`, `df.shape` |
| Describe Data | Summary statistics, dtypes | `df.describe()`, `df.info()`, `df.dtypes` |
| Explore Data | Univariate: distribution; bivariate: correlation | `sns.histplot()`, `sns.scatterplot()`, `df.corr(numeric_only=True)` |
| Verify Data Quality | Missing values, outliers, anomalies | `df.isnull().sum()`, `sns.boxplot()` |

Example: Describe Data

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# CRISP-DM Task 2: Describe Data — summary statistics
df = pd.read_csv("hr_employees.csv")

print("📊 Shape:", df.shape)
print("\n📋 Data Types:")
print(df.dtypes)
print("\n📈 Statistical Summary:")
print(df.describe())
print("\n🔍 Categorical Columns:")
for col in df.select_dtypes(include="object").columns:
    print(f"  {col}: {df[col].nunique()} unique — {df[col].value_counts().head(3).to_dict()}")

Example: Explore Data

python

# CRISP-DM Task 3: Explore Data — univariate distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.histplot(df["salary"], kde=True, ax=axes[0], color="steelblue")
axes[0].set_title("Salary distribution")

sns.histplot(df["age"], kde=True, ax=axes[1], color="coral")
axes[1].set_title("Age distribution")

sns.boxplot(x=df["tenure_years"], ax=axes[2], color="mediumseagreen")
axes[2].set_title("Box Plot Tenure")

plt.tight_layout()
plt.show()
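
The fourth task, Verify Data Quality, can be sketched in the same spirit. The snippet below is a minimal illustration and not a complete quality audit: `df_demo` and its values are made up, and `iqr_outlier_mask` is a hypothetical helper that applies the 1.5 × IQR rule commonly used for box-plot whiskers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the HR dataset; the values are invented for illustration.
df_demo = pd.DataFrame({
    "salary": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0, np.nan],
    "age": [25, 31, 28, 45, 39, 33, 29, 41],
})

# Missing values per column
missing_counts = df_demo.isnull().sum()
print(missing_counts)


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN values are never flagged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Rows whose salary falls outside the whisker range
salary_outliers = df_demo.loc[iqr_outlier_mask(df_demo["salary"]), "salary"]
print(salary_outliers)
```

A flagged value is a prompt to investigate (data entry error or a genuine executive salary?), not a reason to delete the row automatically.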

💡 CRISP-DM is a cycle, not a straight line

In practice you will return to Data Understanding many times: after building a model you may notice an odd feature and go back to EDA to check it. Do not think "EDA is done, so I can move on for good"; EDA is an iterative step in every project.

CRISP-DM Data Understanding checklist

[ ] Collect: Dataset loaded successfully; shape and head() checked
[ ] Describe: Run df.describe() for numeric columns and value_counts() for categorical columns
[ ] Explore: Draw at least 1 histogram, 1 scatter plot, and 1 correlation matrix
[ ] Verify: Check missing values, outliers (with box plots), and any prominent anomalies
[ ] Document: Every chart has a title, axis labels, and an interpretation written in Markdown
[ ] Business link: Connect the findings to the original business question

2️⃣ OSEMN — Explore Step

Introduction

OSEMN (pronounced "awesome") is a five-step framework for the data science pipeline, proposed by Hilary Mason (former Chief Scientist at Bitly) and Chris Wiggins (Columbia University). OSEMN is simple, easy to remember, and maps directly onto a Data Analyst's day-to-day workflow.

The five OSEMN steps:

O → S → E → M → N
│   │   │   │   │
│   │   │   │   └── iNterpret: present the results
│   │   │   └────── Model: build models
│   │   └────────── Explore: ✅ EDA, exploring the data
│   └────────────── Scrub: clean the data (Session 8)
└────────────────── Obtain: get the data (Session 7)

The Explore step is the core of Session 9: this is where you move from "clean data" to "understanding the data". Explore covers three levels of analysis:

Univariate: analyze each variable on its own (distribution, central tendency)
Bivariate: analyze the relationship between two variables (correlation, comparison)
Multivariate: analyze several variables at once (heatmap, pair plot)

Applying it in this session

Level 1: Univariate Analysis

python

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram: distribution of a continuous variable
sns.histplot(df["salary"], bins=30, kde=True, color="steelblue")
plt.title("Employee salary distribution")
plt.xlabel("Salary (VND)")
plt.ylabel("Number of employees")
plt.show()

# Bar chart: distribution of a categorical variable
dept_counts = df["department"].value_counts()
# hue + legend=False avoids the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.barplot(x=dept_counts.index, y=dept_counts.values,
            hue=dept_counts.index, palette="viridis", legend=False)
plt.title("Number of employees by department")
plt.xticks(rotation=45)
plt.show()
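
The shape seen in a histogram can be cross-checked with a few numbers. The sketch below is illustrative only: the salary figures (in millions of VND) are invented, and the "mean above median plus positive skewness" rule is a rough heuristic for right skew, not a formal test.

```python
import pandas as pd

# Hypothetical salaries (millions of VND): most are low, a few are very high
salary = pd.Series([8, 9, 9, 10, 10, 11, 12, 13, 15, 40, 60], name="salary")

mean_salary = salary.mean()
median_salary = salary.median()
skewness = salary.skew()  # sample skewness; a positive value suggests a longer right tail

print(f"mean={mean_salary:.1f}, median={median_salary:.1f}, skew={skewness:.2f}")

# Rough heuristic: the right tail pulls the mean above the median
is_right_skewed = skewness > 0 and mean_salary > median_salary
print("Right-skewed:", is_right_skewed)
```

When the heuristic fires, the median is usually the safer summary of a "typical" value.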

Level 2: Bivariate Analysis

python

# Scatter plot: two numeric variables
sns.scatterplot(data=df, x="experience_years", y="salary",
                hue="department", alpha=0.6)
plt.title("Salary vs Experience (by Department)")
plt.xlabel("Experience (years)")
plt.ylabel("Salary (VND)")
plt.show()

# Box plot: numeric vs categorical
# hue + legend=False avoids the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.boxplot(data=df, x="department", y="salary",
            hue="department", palette="Set2", legend=False)
plt.title("Salary distribution by department")
plt.xticks(rotation=45)
plt.show()
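
Bivariate charts are easier to write up when the numbers behind them are computed as well. A minimal sketch, assuming a small made-up frame `df_demo` with the same column names as the examples above:

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
df_demo = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Marketing",
                   "Marketing", "Sales", "Sales"],
    "experience_years": [2, 8, 3, 7, 1, 5],
    "salary": [18.0, 30.0, 14.0, 22.0, 9.0, 15.0],
})

# Numeric vs categorical: median salary per department, highest first
median_by_dept = (
    df_demo.groupby("department")["salary"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_dept)

# Numeric vs numeric: Pearson correlation between the two columns
r_exp_salary = df_demo["experience_years"].corr(df_demo["salary"])
print(f"r(experience_years, salary) = {r_exp_salary:.2f}")
```

The grouped medians correspond to the center lines of the box plot, and the correlation coefficient summarizes the trend seen in the scatter plot.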

Level 3: Multivariate Analysis

python

# Correlation matrix + heatmap
corr_matrix = df[["salary", "age", "experience_years",
                   "tenure_years", "performance_score"]].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0,
            fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix: Numeric Variables")
plt.tight_layout()
plt.show()
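
With more than a handful of variables, a heatmap gets hard to read by eye. One option is to list the strongest pairs directly. The helper `top_correlations` below is a hypothetical name and the data is made up; it is a sketch of the idea, not part of pandas or seaborn.

```python
import numpy as np
import pandas as pd


def top_correlations(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n variable pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Hypothetical numeric data, for illustration only
df_demo = pd.DataFrame({
    "salary": [10, 14, 19, 25, 30],
    "experience_years": [1, 3, 5, 8, 10],
    "age": [40, 23, 35, 29, 51],
})

strongest = top_correlations(df_demo.corr(), n=2)
print(strongest)
```

Sorting by absolute value keeps strong negative relationships near the top as well, which a quick glance at a color scale can miss.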

Real-world example

OSEMN Explore in an HR analytics context:

| Business question | Level | Chart type | Expected insight |
|---|---|---|---|
| "How are employee salaries distributed?" | Univariate | Histogram + KDE | Right-skewed, median below mean → many employees at lower salaries |
| "Which department pays the most?" | Bivariate | Box plot | Engineering > Marketing > Sales (by median) |
| "Does experience affect salary?" | Bivariate | Scatter plot | Positive correlation ~0.65 |
| "Which variables are most strongly related?" | Multivariate | Heatmap | experience_years ↔ salary: r = 0.72 |

⚠️ Explore ≠ Interpret

The Explore step (EDA) is about discovery: finding patterns, anomalies, and relationships. The iNterpret step (visualization) is about presentation: building polished charts to communicate with stakeholders. Do not mix the two: an EDA chart can be "ugly" yet informative, while a presentation chart must be "good-looking" and clear.

OSEMN Explore checklist

[ ] Univariate: draw a histogram/KDE for every important numeric variable
[ ] Univariate: draw a bar chart/count plot for every categorical variable
[ ] Bivariate: scatter plots for pairs of numeric variables that are likely to be related
[ ] Bivariate: box plots/violin plots comparing a numeric variable across categories
[ ] Multivariate: correlation matrix heatmap for all numeric variables
[ ] Summary: write at least 3 key findings as text (not just charts)
[ ] Questions: note the new questions raised by the EDA (to explore next)

3️⃣ Reproducible Analysis — EDA Notebook Reusable

Introduction

Reproducible Analysis is the principle that anyone (a colleague, a manager, or you three months from now) can rerun the notebook and get the same results. It is not a standard issued by a specific organization, but a best practice that the data science community (Jupyter, Google, Netflix, Airbnb) has converged on over the years.

Why does Reproducible Analysis matter especially for EDA?

EDA is usually the most experimental step: you draw dozens of charts, change parameters, and try many angles
The result: a messy notebook, cells run out of order, overwritten variables, charts with no clear context
Two weeks later you come back: "Which data did I draw this chart from? What is df2? Does cell 15 have to run before cell 8?"

"Your most important collaborator is future-you — and future-you doesn't remember any of this." — Hadley Wickham

Applying it in this session

Standard EDA notebook structure

python

# ✅ Cell 1: Metadata (Markdown)
"""
# EDA Report: HR Employee Dataset
- Author: Nguyễn Văn A
- Date: 2026-02-18
- Data source: HR database export (Jan 2026)
- Objective: Explore patterns in attrition, salary, and performance
"""

python

# ✅ Cell 2: Imports, with every library in one single cell
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Config
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 12
sns.set_style("whitegrid")
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:,.2f}".format)

print("✅ Libraries loaded — Pandas", pd.__version__)

python

# ✅ Cell 3: Load data from an explicit source
DATA_PATH = "data/hr_employees.csv"
df_raw = pd.read_csv(DATA_PATH)
print(f"✅ Loaded {DATA_PATH}: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} cols")

Naming convention for EDA

python

# ✅ Descriptive names: easy to trace where each frame came from
df_raw = pd.read_csv("hr_employees.csv")       # Raw data: do NOT modify
df_clean = df_raw.dropna(subset=["salary"])     # Cleaned data
df_active = df_clean[df_clean["status"] == "Active"]  # Subset: active employees only

# ❌ Bad: df, df2, temp, data, x
df = pd.read_csv("hr_employees.csv")
df2 = df.dropna()
temp = df2[df2["status"] == "Active"]
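
One way to see why the naming scheme helps is to check that the raw frame survives later steps unchanged. The sketch below uses a tiny made-up frame, and the `salary_band` column is purely hypothetical.

```python
import pandas as pd

# Hypothetical raw data, for illustration only
df_raw = pd.DataFrame({
    "salary": [10.0, None, 15.0],
    "status": ["Active", "Active", "Inactive"],
})
raw_shape = df_raw.shape

# Each step returns a new object and gets its own descriptive name
df_clean = df_raw.dropna(subset=["salary"])
# .copy() makes the subset independent, so adding a column does not touch df_clean
df_active = df_clean[df_clean["status"] == "Active"].copy()
df_active["salary_band"] = "standard"

# The raw frame still has its original shape and columns
assert df_raw.shape == raw_shape
assert "salary_band" not in df_raw.columns
print(len(df_raw), len(df_clean), len(df_active))
```

Because every stage keeps its own name, any chart can state which stage it was drawn from.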

Charts with context

python

# ✅ Chart with full context: title, labels, interpretation
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(df_clean["salary"], bins=30, kde=True, color="steelblue", ax=ax)
ax.set_title("Employee salary distribution: right-skewed, median < mean", fontsize=14)
ax.set_xlabel("Salary (VND)")
ax.set_ylabel("Number of employees")
ax.axvline(df_clean["salary"].median(), color="red", linestyle="--",
           label=f"Median: {df_clean['salary'].median():,.0f}")
ax.axvline(df_clean["salary"].mean(), color="orange", linestyle="--",
           label=f"Mean: {df_clean['salary'].mean():,.0f}")
ax.legend()
plt.tight_layout()
plt.show()

# → Markdown cell after the chart: "Salaries are right-skewed.
#    Median = 12M < Mean = 15M → the high-salary group pulls the mean up.
#    Action: use the median when reporting 'average salary'."

💡 Golden rule: Restart & Run All

Before sharing an EDA notebook, always run Restart Kernel → Run All Cells. If the notebook fails at any cell, it is not yet reproducible. Cells must run in order from top to bottom, with no cell needing to be run by hand.
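
The same check can be scripted, which is handy before a submission or in a shared repository. The sketch below assumes the `jupyter nbconvert` command-line tool is installed and on PATH; the function names and the notebook filename are hypothetical.

```python
import subprocess
from pathlib import Path


def build_run_all_command(notebook: Path, output_name: str) -> list[str]:
    """Build the nbconvert command that executes every cell in a fresh kernel."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", str(notebook),
        "--output", output_name,
    ]


def run_all(notebook: Path, output_name: str = "executed.ipynb") -> None:
    """Run the notebook top to bottom; raises CalledProcessError if the command fails."""
    subprocess.run(build_run_all_command(notebook, output_name), check=True)


# Usage (hypothetical notebook name):
# run_all(Path("eda_hr_employees.ipynb"))
```

A non-zero exit code here corresponds to a failed Restart & Run All in the notebook interface.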

Real-world example

| Anti-pattern | Consequence | Best practice |
|---|---|---|
| Cells run out of order | Variables get overwritten, results are wrong | Restart & Run All before submitting |
| Hardcoded file path: `"C:/Users/me/data.csv"` | Colleagues cannot run the notebook | Use a relative path: `"data/hr.csv"` |
| Chart without title/labels | You forget what the chart shows | Always call `set_title()`, `set_xlabel()` |
| `df` overwritten 10 times | Data origin cannot be traced | `df_raw`, `df_clean`, `df_subset` |
| Magic numbers: `df[df["salary"] > 50000000]` | Why 50M? No explanation given | `SALARY_THRESHOLD = 50_000_000` + comment |
| No Markdown between code cells | The notebook reads like source code | Markdown header + interpretation after each chart |
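
Two of the best practices in the table, relative paths and named constants, can be combined in a small loading cell. This is a sketch under assumptions: the `data/` folder layout, the helper names `load_raw` and `high_earners`, and the 50M cutoff are all illustrative.

```python
from pathlib import Path

import pandas as pd

# Named constants replace hardcoded absolute paths and magic numbers.
DATA_DIR = Path("data")                     # relative to the project root (assumed layout)
DATA_PATH = DATA_DIR / "hr_employees.csv"
SALARY_THRESHOLD = 50_000_000               # hypothetical cutoff in VND; record the business reason here


def load_raw(path: Path = DATA_PATH) -> pd.DataFrame:
    """Load the raw export, failing early with a readable message."""
    if not path.exists():
        raise FileNotFoundError(f"Expected data file at {path.resolve()}")
    return pd.read_csv(path)


def high_earners(df: pd.DataFrame, threshold: int = SALARY_THRESHOLD) -> pd.DataFrame:
    """Return the rows whose salary exceeds the threshold."""
    return df[df["salary"] > threshold]
```

Changing the threshold or the data location then means editing one line at the top of the notebook instead of hunting through cells.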

Reproducible Analysis checklist

[ ] Metadata cell at the top of the notebook: title, author, date, data source, objective
[ ] Single import cell: all imports in cell 2, not scattered around
[ ] Restart & Run All succeeds 100%: no errors, nothing run by hand
[ ] Relative paths: no hardcoded absolute paths (C:/Users/...)
[ ] Naming convention: df_raw, df_clean, df_subset, not df, df2
[ ] Constants: no magic numbers, use UPPER_CASE variables
[ ] Chart context: every chart has a title, axis labels, a legend, and a Markdown interpretation
[ ] Markdown headers: clear sections such as ## 1. Data Overview, ## 2. Univariate, ...
[ ] No debug cells: delete test cells and empty cells before submitting

📊 Summary table

| Standard | Scope | Applies to Session 9 | How mandatory |
|---|---|---|---|
| CRISP-DM | Process: methodology for data mining | Data Understanding: describe + explore + verify data | ⭐⭐⭐ Recommended: the #1 framework globally |
| OSEMN | Pipeline: five steps of data analysis | Explore: univariate → bivariate → multivariate | ⭐⭐⭐ Close to mandatory: keeps EDA systematic |
| Reproducible Analysis | Notebook: best practices for reproducibility | EDA notebook that is clean, shareable, and passes Restart & Run All | ⭐⭐ Recommended: especially for team work |

⚠️ Important

These standards are not rigid procedures. They are frameworks that help you run EDA systematically instead of "drawing charts at random and hoping to find an insight". In practice, the best DA is not the one who draws the prettiest charts, but the one who asks the right questions, picks the right chart type, and writes clear, actionable insights.

📚 References

| Resource | Link | Notes |
|---|---|---|
| CRISP-DM Guide (IBM) | ibm.com/docs/en/spss-modeler/saas?topic=dm-crisp-help-overview | CRISP-DM documentation hosted by IBM |
| OSEMN Framework, Hilary Mason | datasciencemasters.org | The framework from Mason & Wiggins |
| Exploratory Data Analysis, John Tukey (1977) | Wikipedia | The book that founded EDA |
| Pandas User Guide | pandas.pydata.org/docs/user_guide | Official pandas user guide, useful for EDA |
| Seaborn Tutorial | seaborn.pydata.org/tutorial.html | Visualization guide for Seaborn |
| Reproducible Research Best Practices | turing.ac.uk/reproducible-research | The Turing Way, a widely used reference for reproducibility |
| Ten Simple Rules for Reproducible Research | PLOS Computational Biology | Ten easy-to-remember rules for reproducible analysis |
