
🛠 Workshop: Building a Professional Chart Set

Starting from the Session 9 EDA notebook, create 5 publication-quality charts (bar, line, scatter, heatmap, box), customize colors, annotations, and labels, combine them into a multi-panel figure, and export to PNG and SVG. Everything happens in a Jupyter Notebook.

🎯 Workshop goals

After completing this workshop, you will be able to:

  1. Create 5 different chart types (bar, line, scatter, heatmap, box plot) with the data from the Session 9 EDA
  2. Customize them professionally: color palette, annotations, labels, Tufte-style decluttering
  3. Build a multi-panel figure: a 2×2 or 2×3 dashboard that fits on one page
  4. Export in high quality: PNG (300 dpi) and SVG (vector) for reports and presentations
  5. Apply IBCS and accessibility practices: consistent notation and a colorblind-safe palette

🧰 Requirements

| Requirement | Details |
|---|---|
| Knowledge | Completed Session 9 (EDA) and the Session 10 theory part (Matplotlib & Seaborn) |
| Tools | Jupyter Notebook (local) or Google Colab (online) |
| Python | Python 3.8+ |
| Libraries | pandas, numpy, matplotlib, seaborn (preinstalled on Colab) |
| Input | HR Employee dataset from the Session 9 workshop (or recreate it below) |
| Duration | 75–100 minutes |

💡 Naming convention

Name the notebook HoTen_Buoi10_Visualization.ipynb, replacing HoTen with your full name. Split the notebook into clearly labeled Markdown sections, following the Reproducible Analysis practice from Session 9.


📦 Dataset: HR Employee Analytics (from Session 9)

Using the Session 9 dataset

If you completed the Session 9 workshop, reuse that dataset. If not, copy the following code into Cell 1 to create a new one:

python
# Cell 1: Setup — Libraries + Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import warnings
warnings.filterwarnings("ignore")

# === CONFIG VISUALIZATION ===
plt.rcParams["figure.figsize"] = (12, 7)
plt.rcParams["font.size"] = 12
plt.rcParams["axes.titlesize"] = 14
plt.rcParams["axes.labelsize"] = 12
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")
sns.set_palette("colorblind")  # Accessibility: colorblind-safe
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:,.2f}".format)

# === PROFESSIONAL COLOR PALETTE ===
# Colorblind-safe palette (IBM Design Library; distinct from the Wong 2011 palette)
COLORS = {
    "primary": "#648FFF",     # Blue
    "secondary": "#DC267F",   # Magenta
    "accent1": "#FE6100",     # Orange
    "accent2": "#FFB000",     # Gold
    "accent3": "#785EF0",     # Violet
    "positive": "#2E7D32",    # Green (IBCS: positive variance)
    "negative": "#C62828",    # Red (IBCS: negative variance)
    "neutral": "#333333",     # Dark gray
    "light": "#B0B0B0",       # Light gray
}
PALETTE_5 = [COLORS["primary"], COLORS["secondary"],
             COLORS["accent1"], COLORS["accent2"], COLORS["accent3"]]

# === CREATE HR EMPLOYEES DATASET (1,500 employees) ===
np.random.seed(42)
n = 1500
departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
dept_weights = [0.35, 0.20, 0.25, 0.10, 0.10]
positions = ["Junior", "Mid", "Senior", "Lead", "Manager"]

data = []
for i in range(n):
    dept = np.random.choice(departments, p=dept_weights)
    exp = np.random.randint(0, 26)
    age = max(22, min(58, exp + np.random.randint(22, 28)))

    if exp <= 2: pos = "Junior"
    elif exp <= 5: pos = "Mid"
    elif exp <= 10: pos = "Senior"
    elif exp <= 15: pos = "Lead"
    else: pos = "Manager"

    base_salary = {
        "Junior": 10, "Mid": 16, "Senior": 25, "Lead": 35, "Manager": 45
    }[pos]

    dept_bonus = {
        "Engineering": 1.15, "Finance": 1.10, "Marketing": 1.0,
        "Sales": 0.95, "HR": 0.90
    }[dept]

    salary = base_salary * dept_bonus * (1 + np.random.normal(0, 0.15))
    salary = max(8, round(salary, 1))

    perf = round(np.clip(np.random.normal(3.5, 0.8), 1, 5), 1)
    satisfaction = round(np.clip(5.5 - perf * 0.5 + np.random.normal(0, 0.6), 1, 5), 1)
    tenure = round(min(exp, max(0.5, np.random.exponential(4))), 1)
    training = int(np.clip(np.random.normal(40, 20), 0, 120))

    attrition_prob = 0.15
    if satisfaction < 2.5: attrition_prob += 0.25
    if perf > 4.0 and satisfaction < 3.0: attrition_prob += 0.15
    if exp > 5 and salary < 15: attrition_prob += 0.20
    attrition = "Yes" if np.random.random() < attrition_prob else "No"

    data.append({
        "employee_id": f"E{i+1:04d}",
        "department": dept,
        "position": pos,
        "age": age,
        "gender": np.random.choice(["Male", "Female"], p=[0.6, 0.4]),
        "salary": salary,
        "experience_years": exp,
        "tenure_years": tenure,
        "performance_score": perf,
        "satisfaction_score": satisfaction,
        "training_hours": training,
        "attrition": attrition,
    })

df = pd.DataFrame(data)

# Monthly revenue data (for line chart)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
monthly_revenue = [25, 28, 32, 30, 35, 38, 36, 40, 42, 46, 38, 58]

print(f"✅ Dataset loaded: {df.shape[0]:,} employees × {df.shape[1]} columns")
print(f"✅ Color palette: {len(PALETTE_5)} colorblind-safe colors")
print(f"✅ Monthly revenue: {len(months)} months")
print(f"\n📋 Departments: {df['department'].value_counts().to_dict()}")
print(f"📋 Attrition: {df['attrition'].value_counts().to_dict()}")
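
The workshop goals call for consistent notation, but slicing PALETTE_5 by sort order can give the same department a different color in each chart. A minimal sketch of one way to pin the colors down, assuming the department list and palette from Cell 1. The names DEPT_ORDER, DEPT_COLORS, and colors_for are hypothetical helpers, and the counts are made-up illustration values:

```python
import pandas as pd

# Hypothetical names for illustration; the original notebook does not define them.
DEPT_ORDER = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
IBM_PALETTE = ["#648FFF", "#DC267F", "#FE6100", "#FFB000", "#785EF0"]

# One fixed color per department, independent of any sort order.
DEPT_COLORS = dict(zip(DEPT_ORDER, IBM_PALETTE))


def colors_for(labels):
    """Return the fixed color of each department label, in the order given."""
    return [DEPT_COLORS[label] for label in labels]


# Made-up counts: after sorting, each department still gets its own color.
counts = pd.Series({"HR": 150, "Engineering": 525, "Sales": 375}).sort_values()
bar_colors = colors_for(counts.index)
print(dict(zip(counts.index, bar_colors)))
```

In the notebook, the same idea would apply as `color=colors_for(dept_counts.index)` in a `barh` call.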

Part 1: Setup & Data Overview

Verify dataset, define helper functions, set professional defaults

Step 1.1: Helper Functions

python
# Cell 2: Helper functions for professional charts
def format_vnd(x, _):
    """Format number as million VND"""
    return f"{x:.0f}M"

def format_billion(x, _):
    """Format number as Billion VND"""
    return f"{x:.0f}B"

def save_chart(fig, filename, formats=("png", "svg")):
    """Save chart in multiple formats — high quality"""
    for fmt in formats:
        filepath = f"charts/{filename}.{fmt}"
        fig.savefig(filepath, dpi=300, bbox_inches="tight",
                    facecolor="white", edgecolor="none")
        print(f"  💾 Saved: {filepath}")

def add_data_labels(ax, bars, fmt="{:.0f}", offset=0.5, fontsize=10):
    """Add data labels on top of bar chart"""
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + offset,
                fmt.format(height), ha="center", va="bottom",
                fontsize=fontsize, fontweight="bold")

def tufte_style(ax, keep_left=True):
    """Apply Tufte-style declutter to axes"""
    ax.spines[["top", "right"]].set_visible(False)
    if not keep_left:
        ax.spines["left"].set_visible(False)
        ax.tick_params(left=False)

print("✅ Helper functions defined: format_vnd, format_billion, save_chart, add_data_labels, tufte_style")
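
As an alternative to a hand-written label helper such as add_data_labels, Matplotlib 3.4+ provides Axes.bar_label. A small sketch, assuming a vertical bar chart with illustrative quarterly values:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(["Q1", "Q2", "Q3", "Q4"], [42, 48, 45, 52], color="#648FFF")

# Axes.bar_label (Matplotlib 3.4+) adds one text label per bar and returns them.
labels = ax.bar_label(bars, fmt="%.0f", padding=3, fontsize=10, fontweight="bold")

ax.spines[["top", "right"]].set_visible(False)
ax.set_ylabel("Revenue (B VND)")
print([label.get_text() for label in labels])
```

The same call works for horizontal bars created with `barh`, where the label is placed at the end of each bar.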

Step 1.2: Create the output directory

python
# Cell 3: Create the charts/ directory
import os
os.makedirs("charts", exist_ok=True)
print("✅ Created directory: charts/")

Part 2: Chart 1 - Bar Chart (Comparison)

Purpose: Compare headcount and average salary across the 5 departments, applying IBCS notation and Tufte-style decluttering.

Step 2.1: Bar Chart - Headcount by Department

python
# Cell 4: Chart 1 — Bar Chart: Headcount + Avg Salary by Department
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# --- Panel A: Headcount by Department ---
dept_counts = df["department"].value_counts().sort_values(ascending=True)
bars1 = ax1.barh(dept_counts.index, dept_counts.values,
                 color=PALETTE_5[:len(dept_counts)], edgecolor="white", linewidth=0.5)

# Data labels
for bar, val in zip(bars1, dept_counts.values):
    ax1.text(val + 5, bar.get_y() + bar.get_height()/2,
             f"{val} ({val/len(df)*100:.0f}%)",
             va="center", fontsize=11, fontweight="bold")

ax1.set_title("Headcount by Department", fontsize=14, fontweight="bold", loc="left")
ax1.set_xlabel("Number of Employees")
tufte_style(ax1)
ax1.set_xlim(0, max(dept_counts.values) * 1.25)

# --- Panel B: Average Salary by Department ---
dept_salary = df.groupby("department")["salary"].agg(["mean", "median"]).sort_values("mean")

x = np.arange(len(dept_salary))
width = 0.35

# IBCS-style: Mean = solid, Median = outline
bars_mean = ax2.barh(x + width/2, dept_salary["mean"], width,
                     color=COLORS["neutral"], label="Mean")
bars_median = ax2.barh(x - width/2, dept_salary["median"], width,
                       facecolor="white", edgecolor=COLORS["neutral"],
                       linewidth=1.5, label="Median")

ax2.set_yticks(x)
ax2.set_yticklabels(dept_salary.index)
ax2.set_title("Avg Salary by Department — Mean vs Median",
              fontsize=14, fontweight="bold", loc="left")
ax2.set_xlabel("Salary (Million VND)")
ax2.xaxis.set_major_formatter(FuncFormatter(format_vnd))
ax2.legend(frameon=False, loc="lower right")
tufte_style(ax2)

# No emoji in figure text: Matplotlib's default font has no glyph for it
plt.suptitle("Chart 1: Department Overview", fontsize=16, y=1.02, fontweight="bold")
plt.tight_layout()
save_chart(fig, "01_bar_department")
plt.show()

Markdown interpretation (Cell 5):

Add a Markdown cell after Chart 1:

Interpretation Chart 1:

  • Engineering has the largest headcount (~35%), which fits a tech company
  • Engineering also has the highest average salary; mean > median indicates a right-skewed distribution (highly paid Senior/Lead staff pull the mean up)
  • HR has mean ≈ median, so its distribution is more symmetric
  • The mean-median gap is largest in Engineering and Sales, a sign of pay dispersion that should be investigated
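
The mean-versus-median reading above can be checked numerically rather than by eye. A rough sketch on toy data (the salaries below are invented for illustration, not taken from the workshop dataset); in the notebook the same groupby would run on df:

```python
import pandas as pd

# Toy salaries in million VND, invented purely for illustration.
toy = pd.DataFrame({
    "department": ["Engineering"] * 5 + ["HR"] * 5,
    "salary": [12, 14, 15, 18, 51, 9, 10, 11, 12, 13],
})

gap = toy.groupby("department")["salary"].agg(["mean", "median"])
gap["gap"] = gap["mean"] - gap["median"]            # > 0 hints at a right tail
gap["gap_pct"] = gap["gap"] / gap["median"] * 100   # gap relative to the median
print(gap.round(1))
```

A positive gap only hints at right skew; a histogram or the violin plot in Chart 5 remains the better check of the actual shape.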

Part 3: Chart 2 - Line Chart (Trend)

Purpose: Show the monthly revenue trend, spot the November anomaly, and highlight it with an annotation.

Step 3.1: Line Chart - Monthly Revenue Trend

python
# Cell 6: Chart 2 — Line Chart: Monthly Revenue Trend
fig, ax = plt.subplots(figsize=(14, 6))

# Main line
ax.plot(months, monthly_revenue, color=COLORS["primary"],
        linewidth=2.5, marker="o", markersize=8, markerfacecolor="white",
        markeredgecolor=COLORS["primary"], markeredgewidth=2, zorder=3)

# Fill area under line
ax.fill_between(months, monthly_revenue, alpha=0.1, color=COLORS["primary"])

# Highlight anomaly: November (46B -> 38B is a 17.4% drop)
ax.plot("Nov", 38, "o", markersize=14, markerfacecolor=COLORS["negative"],
        markeredgecolor="white", markeredgewidth=2, zorder=4)
ax.annotate("Anomaly: -17%\nvs Oct (46B → 38B)",
            xy=(10, 38), xytext=(7.5, 30),
            fontsize=10, fontweight="bold", color=COLORS["negative"],
            arrowprops=dict(arrowstyle="->", color=COLORS["negative"], lw=1.5),
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#FFEBEE", edgecolor=COLORS["negative"]))

# Highlight peak: December
ax.plot("Dec", 58, "o", markersize=14, markerfacecolor=COLORS["positive"],
        markeredgecolor="white", markeredgewidth=2, zorder=4)
ax.annotate("Peak: 58B\n+53% recovery",
            xy=(11, 58), xytext=(9, 62),
            fontsize=10, fontweight="bold", color=COLORS["positive"],
            arrowprops=dict(arrowstyle="->", color=COLORS["positive"], lw=1.5),
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#E8F5E9", edgecolor=COLORS["positive"]))

# Data labels for all points
for i, (m, r) in enumerate(zip(months, monthly_revenue)):
    if m not in ["Nov", "Dec"]:  # Don't double-label annotated points
        ax.text(i, r + 1.5, f"{r}B", ha="center", fontsize=9, color=COLORS["neutral"])

# Trend line (linear regression)
x_numeric = np.arange(len(months))
z = np.polyfit(x_numeric, monthly_revenue, 1)
p = np.poly1d(z)
ax.plot(months, p(x_numeric), "--", color=COLORS["light"],
        linewidth=1.5, label=f"Trend: +{z[0]:.1f}B/month")

# Styling
ax.set_title("Monthly Revenue Trend — FY2025", fontsize=14, fontweight="bold", loc="left")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (Billion VND)")
ax.yaxis.set_major_formatter(FuncFormatter(format_billion))
ax.legend(frameon=False, loc="upper left")
tufte_style(ax)
ax.set_ylim(15, 70)

# Source footnote
ax.text(0, -0.12, "Source: Finance Dept | Note: Nov dip under investigation (seasonal or one-time event)",
        transform=ax.transAxes, fontsize=8, color="gray", style="italic")

plt.tight_layout()
save_chart(fig, "02_line_revenue_trend")
plt.show()

Markdown interpretation (Cell 7):

Interpretation Chart 2:

  • Revenue climbs steadily from Jan to Oct, at about +2.1B per month on a least-squares fit
  • November anomaly: revenue drops 17% (46B → 38B), breaking the trend. This needs investigation: seasonal dip or one-time event?
  • December rebounds strongly (+53%), so the November anomaly may be temporary
  • Strong growth within the year: Jan 25B → Dec 58B (+132%). This is Jan-to-Dec growth rather than YoY, since only one year of data is plotted
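
The percentages quoted in an interpretation are worth recomputing from the data before they go into a report. A short sketch, assuming the monthly_revenue list from Cell 1; pct_change is a hypothetical helper name:

```python
import numpy as np

# Monthly revenue in billion VND, copied from Cell 1.
monthly_revenue = [25, 28, 32, 30, 35, 38, 36, 40, 42, 46, 38, 58]


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


oct_rev, nov_rev, dec_rev = monthly_revenue[9:12]
nov_drop = pct_change(oct_rev, nov_rev)
dec_rebound = pct_change(nov_rev, dec_rev)
jan_to_dec = pct_change(monthly_revenue[0], dec_rev)

# Least-squares slope over Jan..Oct only (indices 0..9), before the anomaly.
slope_jan_oct = np.polyfit(np.arange(10), monthly_revenue[:10], 1)[0]

print(f"Oct -> Nov: {nov_drop:.1f}%")
print(f"Nov -> Dec: {dec_rebound:.1f}%")
print(f"Jan -> Dec: {jan_to_dec:.1f}%")
print(f"Jan..Oct slope: {slope_jan_oct:.2f}B per month")
```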

Part 4: Chart 3 - Scatter Plot (Relationship)

Purpose: Show the relationship between salary and experience, colored by department, and highlight the underpaid group.

Step 4.1: Scatter Plot - Salary vs Experience

python
# Cell 8: Chart 3 — Scatter Plot: Salary vs Experience
fig, ax = plt.subplots(figsize=(14, 8))

# Scatter plot — color by department
for i, dept in enumerate(departments):
    subset = df[df["department"] == dept]
    ax.scatter(subset["experience_years"], subset["salary"],
               c=PALETTE_5[i], label=dept, alpha=0.5, s=30,
               edgecolor="white", linewidth=0.3)

# Trend line (all data)
x_all = df["experience_years"].values
y_all = df["salary"].values
z = np.polyfit(x_all, y_all, 1)
p = np.poly1d(z)
x_line = np.linspace(0, 25, 100)
ax.plot(x_line, p(x_line), "--", color=COLORS["negative"],
        linewidth=2, label=f"Trend: salary ≈ {z[1]:.1f} + {z[0]:.1f} × exp")

# Correlation annotation
corr = df["experience_years"].corr(df["salary"])
ax.text(0.02, 0.95, f"r = {corr:.2f} (strong positive)",
        transform=ax.transAxes, fontsize=11, fontweight="bold",
        bbox=dict(boxstyle="round", facecolor="lightyellow", edgecolor="orange"))

# Highlight underpaid group: experience >= 8 years but salary < 15M
underpaid = df[(df["experience_years"] >= 8) & (df["salary"] < 15)]
if len(underpaid) > 0:
    ax.scatter(underpaid["experience_years"], underpaid["salary"],
               facecolor="none", edgecolor=COLORS["negative"],
               s=100, linewidth=2, zorder=5, label=f"Underpaid ({len(underpaid)} employees)")
    ax.annotate(f"{len(underpaid)} underpaid employees\nExp ≥ 8yr but salary < 15M",
                xy=(10, 12), xytext=(16, 8),
                fontsize=10, fontweight="bold", color=COLORS["negative"],
                arrowprops=dict(arrowstyle="->", color=COLORS["negative"], lw=1.5),
                bbox=dict(boxstyle="round,pad=0.3", facecolor="#FFEBEE",
                          edgecolor=COLORS["negative"]))

# Styling
ax.set_title("Salary vs Experience — by Department",
             fontsize=14, fontweight="bold", loc="left")
ax.set_xlabel("Experience (years)")
ax.set_ylabel("Salary (Million VND)")
ax.yaxis.set_major_formatter(FuncFormatter(format_vnd))
ax.legend(frameon=True, facecolor="white", edgecolor="lightgray",
          loc="upper left", fontsize=9)
tufte_style(ax)

plt.tight_layout()
save_chart(fig, "03_scatter_salary_experience")
plt.show()

Markdown interpretation (Cell 9):

Interpretation Chart 3:

  • Strong positive correlation: salary rises with experience. The exact r is printed on the chart; with the Cell 1 data expect roughly r ≈ 0.86
  • Engineering (blue) clusters above the trend line, so it pays above average
  • Underpaid group detected: employees with ≥ 8 years of experience but salary < 15M sit far below the trend line, so HR should review them
  • The highest salaries (> 50M) belong to Manager-level employees, which is consistent with their position (the dataset has no C-level role)
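
The fixed cut-off used for the underpaid group (experience ≥ 8 and salary < 15M) is one possible rule. Another option, sketched here under the assumption that "far below the trend line" means a residual more than two standard deviations below the fit, is to flag employees by their distance from the regression line. The data is synthetic and the 2-sigma threshold is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the workshop dataset).
rng = np.random.default_rng(0)
exp = rng.integers(0, 26, size=200)
salary = 10 + 1.5 * exp + rng.normal(0, 3, size=200)
toy = pd.DataFrame({"experience_years": exp, "salary": salary})

# Fit the same kind of linear trend line as the chart.
slope, intercept = np.polyfit(toy["experience_years"], toy["salary"], 1)
toy["expected"] = intercept + slope * toy["experience_years"]
toy["residual"] = toy["salary"] - toy["expected"]

# Flag rows whose salary is more than 2 residual standard deviations below the fit.
threshold = -2 * toy["residual"].std()
flagged = toy[toy["residual"] < threshold]
print(f"{len(flagged)} of {len(toy)} rows fall below the threshold")
```

Unlike a fixed salary cut-off, this rule scales with experience, so a senior employee can be flagged even when the absolute salary is above 15M.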

Part 5: Chart 4 - Heatmap (Correlation)

Purpose: Correlation matrix of all numeric variables, with a masked triangle and annotated values.

Step 5.1: Heatmap - Correlation Matrix

python
# Cell 10: Chart 4 — Heatmap: Correlation Matrix
fig, ax = plt.subplots(figsize=(10, 8))

# Select numeric columns
numeric_cols = ["salary", "age", "experience_years", "tenure_years",
                "performance_score", "satisfaction_score", "training_hours"]
corr_matrix = df[numeric_cols].corr()

# Mask upper triangle (Tufte: remove redundant data-ink)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Heatmap
hm = sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f",
                  cmap="RdBu_r", center=0, vmin=-1, vmax=1,
                  linewidths=0.5, linecolor="white",
                  square=True, ax=ax,
                  cbar_kws={"shrink": 0.8, "label": "Correlation Coefficient"},
                  annot_kws={"fontsize": 11, "fontweight": "bold"})

# Highlight strong correlations
for i in range(len(corr_matrix)):
    for j in range(i):
        val = corr_matrix.iloc[i, j]
        if abs(val) > 0.7:
            ax.add_patch(plt.Rectangle((j, i), 1, 1, fill=False,
                                        edgecolor="gold", linewidth=3))

ax.set_title("Correlation Matrix — HR Employee Variables\n"
             "Gold border = |r| > 0.7 (strong correlation)",
             fontsize=14, fontweight="bold", loc="left")

# Rename labels for readability
short_labels = ["Salary", "Age", "Experience", "Tenure",
                "Performance", "Satisfaction", "Training"]
ax.set_xticklabels(short_labels, rotation=45, ha="right")
ax.set_yticklabels(short_labels, rotation=0)

plt.tight_layout()
save_chart(fig, "04_heatmap_correlation")
plt.show()

Markdown interpretation (Cell 11):

Interpretation Chart 4:

  • Strong positive: experience ↔ salary and age ↔ experience, both highlighted in gold. With the Cell 1 data expect roughly r ≈ 0.86 and r ≈ 0.97; read the exact values off your heatmap
  • Multicollinearity warning: age and experience are nearly collinear (age is generated as experience plus 22 to 27 years), so if you build a model, keep only one of the two
  • Negative correlation: performance ↔ satisfaction (moderate, around -0.5 with the Cell 1 data), so high performers may be burning out
  • Weak: training_hours barely correlates with salary or performance, so the training program's effectiveness needs review
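
Strong pairs can also be listed programmatically instead of being read off the heatmap. A minimal sketch on synthetic data (the columns mimic how Cell 1 builds age from experience; the 0.7 threshold matches the gold-border rule):

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only.
rng = np.random.default_rng(1)
exp = rng.integers(0, 26, size=300)
toy = pd.DataFrame({
    "experience_years": exp,
    "age": exp + rng.integers(22, 28, size=300),
    "training_hours": rng.normal(40, 20, size=300),
})

corr = toy.corr()

# Keep each pair once: strictly-lower triangle only, everything else becomes NaN.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack().dropna()   # MultiIndex (row, column) -> r

strong = pairs[pairs.abs() > 0.7].sort_values(key=abs, ascending=False)
print(strong)
```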

Part 6: Chart 5 - Box Plot (Distribution Comparison)

Purpose: Compare the salary distribution across departments and examine spread, outliers, and pay equity.

Step 6.1: Box Plot - Salary by Department

python
# Cell 12: Chart 5 — Box Plot: Salary Distribution by Department + Violin
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

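# --- Panel A: Box Plot ---
# Note: seaborn >= 0.13 deprecates palette= without hue=. On those versions,
# also pass hue="department" and legend=False to get the same result.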
dept_order = df.groupby("department")["salary"].median().sort_values().index.tolist()

bp = sns.boxplot(data=df, x="department", y="salary", order=dept_order,
                 palette=PALETTE_5, ax=ax1, width=0.6, linewidth=1.2,
                 flierprops=dict(marker="o", markerfacecolor=COLORS["negative"],
                                 markeredgecolor="white", markersize=6))

# Add median labels
medians = df.groupby("department")["salary"].median().reindex(dept_order)
for i, median in enumerate(medians):
    ax1.text(i, median + 0.5, f"{median:.1f}M",
             ha="center", fontsize=10, fontweight="bold", color="white",
             bbox=dict(boxstyle="round,pad=0.2", facecolor=COLORS["neutral"]))

ax1.set_title("Salary Distribution by Department — Box Plot",
              fontsize=13, fontweight="bold", loc="left")
ax1.set_xlabel("Department")
ax1.set_ylabel("Salary (Million VND)")
ax1.yaxis.set_major_formatter(FuncFormatter(format_vnd))
tufte_style(ax1)
ax1.tick_params(axis="x", rotation=30)

# --- Panel B: Violin Plot (more detail on distribution shape) ---
vp = sns.violinplot(data=df, x="department", y="salary", order=dept_order,
                    palette=PALETTE_5, ax=ax2, inner="quartile",
                    linewidth=1.2, cut=0)

ax2.set_title("Salary Distribution by Department — Violin Plot",
              fontsize=13, fontweight="bold", loc="left")
ax2.set_xlabel("Department")
ax2.set_ylabel("Salary (Million VND)")
ax2.yaxis.set_major_formatter(FuncFormatter(format_vnd))
tufte_style(ax2)
ax2.tick_params(axis="x", rotation=30)

plt.suptitle("Chart 5: Pay Equity Analysis", fontsize=16, y=1.02, fontweight="bold")
plt.tight_layout()
save_chart(fig, "05_boxplot_salary_department")
plt.show()

Markdown interpretation (Cell 13):

Interpretation Chart 5:

  • Engineering has the highest median salary and the widest IQR, i.e. large dispersion (Junior vs Lead/Manager)
  • HR has the narrowest IQR: salaries are concentrated, with little variance
  • Sales has many upper outliers, possibly commission-based high earners, so the pay structure should be audited
  • The violin plot shows a bimodal distribution (2 peaks) for Engineering: two distinct salary groups (a Junior cluster and a Senior cluster)
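
The IQR and outlier statements can be backed with numbers using the same 1.5 × IQR rule that box plot whiskers use by default. A sketch on toy salaries invented for illustration; iqr and n_outliers are hypothetical helper names:

```python
import pandas as pd

# Toy salaries in million VND, invented purely for illustration.
toy = pd.DataFrame({
    "department": ["Sales"] * 8 + ["HR"] * 8,
    "salary": [10, 11, 12, 13, 14, 15, 16, 40,
               9, 10, 10, 11, 11, 12, 12, 13],
})


def iqr(s):
    """Interquartile range: Q3 minus Q1."""
    return s.quantile(0.75) - s.quantile(0.25)


def n_outliers(s):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    spread = q3 - q1
    outside = (s < q1 - 1.5 * spread) | (s > q3 + 1.5 * spread)
    return int(outside.sum())


summary = toy.groupby("department")["salary"].agg(["median", iqr, n_outliers])
print(summary)
```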

Part 7: Multi-Panel Figure - Summary Dashboard

Combine the 4 most important charts into one figure with a 2×2 layout: a professional dashboard for the CEO.

Step 7.1: Multi-Panel Dashboard

python
# Cell 14: Multi-Panel Dashboard — 2×2
fig, axes = plt.subplots(2, 2, figsize=(18, 14))

# ===== Panel 1 (Top-Left): Bar Chart — Headcount =====
dept_counts = df["department"].value_counts().sort_values(ascending=True)
bars = axes[0, 0].barh(dept_counts.index, dept_counts.values,
                        color=PALETTE_5[:len(dept_counts)], edgecolor="white")
for bar, val in zip(bars, dept_counts.values):
    axes[0, 0].text(val + 5, bar.get_y() + bar.get_height()/2,
                     f"{val}", va="center", fontsize=10, fontweight="bold")
axes[0, 0].set_title("A. Headcount by Department", fontsize=13, fontweight="bold", loc="left")
axes[0, 0].set_xlabel("Employees")
tufte_style(axes[0, 0])

# ===== Panel 2 (Top-Right): Line Chart — Revenue Trend =====
axes[0, 1].plot(months, monthly_revenue, color=COLORS["primary"],
                linewidth=2.5, marker="o", markersize=6,
                markerfacecolor="white", markeredgecolor=COLORS["primary"])
axes[0, 1].fill_between(months, monthly_revenue, alpha=0.1, color=COLORS["primary"])
# Anomaly highlight
axes[0, 1].plot("Nov", 38, "o", markersize=12,
                markerfacecolor=COLORS["negative"], markeredgecolor="white", zorder=4)
axes[0, 1].set_title("B. Monthly Revenue Trend — FY2025",
                      fontsize=13, fontweight="bold", loc="left")
axes[0, 1].set_ylabel("Revenue (B VND)")
axes[0, 1].tick_params(axis="x", rotation=45)
tufte_style(axes[0, 1])

# ===== Panel 3 (Bottom-Left): Scatter — Salary vs Experience =====
# Reuse PALETTE_5 (instead of cmap="Set2") so department colors match Chart 3
dept_color_map = dict(zip(departments, PALETTE_5))
scatter = axes[1, 0].scatter(df["experience_years"], df["salary"],
                              c=df["department"].map(dept_color_map).tolist(),
                              alpha=0.4, s=15, edgecolor="white", linewidth=0.2)
# Trend line
z = np.polyfit(df["experience_years"], df["salary"], 1)
p = np.poly1d(z)
x_line = np.linspace(0, 25, 100)
axes[1, 0].plot(x_line, p(x_line), "--", color=COLORS["negative"], linewidth=1.5)
corr = df["experience_years"].corr(df["salary"])
axes[1, 0].text(0.02, 0.92, f"r = {corr:.2f}", transform=axes[1, 0].transAxes,
                fontsize=10, fontweight="bold",
                bbox=dict(boxstyle="round", facecolor="lightyellow"))
axes[1, 0].set_title("C. Salary vs Experience", fontsize=13, fontweight="bold", loc="left")
axes[1, 0].set_xlabel("Experience (years)")
axes[1, 0].set_ylabel("Salary (M VND)")
tufte_style(axes[1, 0])

# ===== Panel 4 (Bottom-Right): Box Plot — Salary by Dept =====
dept_order = df.groupby("department")["salary"].median().sort_values().index.tolist()
sns.boxplot(data=df, x="department", y="salary", order=dept_order,
            palette=PALETTE_5, ax=axes[1, 1], width=0.6, linewidth=1)
axes[1, 1].set_title("D. Salary Distribution by Department",
                      fontsize=13, fontweight="bold", loc="left")
axes[1, 1].set_xlabel("Department")
axes[1, 1].set_ylabel("Salary (M VND)")
axes[1, 1].tick_params(axis="x", rotation=30)
tufte_style(axes[1, 1])

# ===== Global Title & Layout =====
fig.suptitle("HR Analytics Dashboard - Q4/2025\n"
             "TechVN | 1,500 Employees | Prepared by: Data Analytics Team",
             fontsize=18, fontweight="bold", y=1.02)

# Footnote
fig.text(0.5, -0.02,
         "Data source: HR Database (Jan 2026) | Charts follow IBCS & Tufte standards | Colorblind-safe palette",
         ha="center", fontsize=9, color="gray", style="italic")

plt.tight_layout()
save_chart(fig, "dashboard_hr_analytics")
plt.show()

Markdown interpretation (Cell 15):

Dashboard Summary — Key Takeaways:

  1. Workforce: Engineering dominates headcount (35%) — aligned with tech company profile
  2. Revenue: Strong upward trend with an anomaly in Nov (−17% vs Oct) that needs investigation
  3. Compensation: Strong experience-salary correlation (roughly r ≈ 0.86 with the Cell 1 data), underpaid group detected
  4. Pay equity: Engineering widest salary spread; Sales has outlier high-earners
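
The dashboard cell repeats plotting code from the individual charts. One way to avoid that duplication, sketched here with a hypothetical helper plot_headcount and made-up counts, is to write each chart as a function that draws on a given Axes, so the same function serves the standalone figure and the dashboard panel:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_headcount(ax, counts, title):
    """Draw a horizontal headcount bar chart on the given Axes (hypothetical helper)."""
    bars = ax.barh(counts.index, counts.values, color="#648FFF", edgecolor="white")
    ax.bar_label(bars, padding=3, fontweight="bold")
    ax.set_title(title, loc="left", fontweight="bold")
    ax.set_xlabel("Employees")
    ax.spines[["top", "right"]].set_visible(False)
    return bars


counts = pd.Series({"HR": 150, "Finance": 150, "Marketing": 300})  # illustrative values

# Standalone chart
fig1, ax1 = plt.subplots(figsize=(8, 4))
plot_headcount(ax1, counts, "Headcount by Department")

# Same function reused as panel A of a 2x2 dashboard
fig2, axes = plt.subplots(2, 2, figsize=(12, 9), constrained_layout=True)
plot_headcount(axes[0, 0], counts, "A. Headcount by Department")
fig2.suptitle("HR Analytics Dashboard", fontsize=16, fontweight="bold")
```

The sketch uses constrained_layout=True, which reserves room for the suptitle without a manual y offset.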

Part 8: High-Quality Export

Make sure all 5 charts and the dashboard are exported as PNG (300 dpi) and SVG.

Step 8.1: Verify Exports

python
# Cell 16: Verify all exports
import os

print("=" * 60)
print("📁 EXPORT VERIFICATION")
print("=" * 60)

expected_files = [
    "01_bar_department",
    "02_line_revenue_trend",
    "03_scatter_salary_experience",
    "04_heatmap_correlation",
    "05_boxplot_salary_department",
    "dashboard_hr_analytics",
]

for name in expected_files:
    for fmt in ["png", "svg"]:
        filepath = f"charts/{name}.{fmt}"
        if os.path.exists(filepath):
            size_kb = os.path.getsize(filepath) / 1024
            print(f"  ✅ {filepath} ({size_kb:.0f} KB)")
        else:
            print(f"  ❌ MISSING: {filepath}")

total_files = len([f for f in os.listdir("charts") if f.endswith((".png", ".svg"))])
print(f"\n📊 Total chart files: {total_files}")
print(f"📋 Expected: {len(expected_files) * 2} (6 charts × 2 formats)")

Step 8.2: Export Summary Image

python
# Cell 17: Summary — all charts overview
from matplotlib.image import imread

fig, axes = plt.subplots(2, 3, figsize=(20, 12))
chart_pngs = [f"charts/{name}.png" for name in expected_files]

for ax, png_path, name in zip(axes.flat, chart_pngs, expected_files):
    if os.path.exists(png_path):
        img = imread(png_path)
        ax.imshow(img)
        ax.set_title(name.replace("_", " ").title(), fontsize=10)
    ax.axis("off")

fig.suptitle("All Charts - Workshop Session 10", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("charts/00_overview_all_charts.png", dpi=150, bbox_inches="tight")
plt.show()

print("✅ Overview image saved: charts/00_overview_all_charts.png")
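
File existence does not confirm resolution. A quick sketch of a stricter check, assuming a throwaway figure saved without bbox_inches="tight" so that the pixel size is exactly figsize × dpi (with tight bounding boxes the dimensions vary with the content):

```python
import os
import tempfile

import matplotlib.pyplot as plt
from matplotlib.image import imread

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1], [0, 1])

with tempfile.TemporaryDirectory() as out_dir:
    png_path = os.path.join(out_dir, "check.png")
    svg_path = os.path.join(out_dir, "check.svg")

    # 4 x 3 inches at 300 dpi should give 1200 x 900 pixels.
    fig.savefig(png_path, dpi=300)
    fig.savefig(svg_path)

    height_px, width_px = imread(png_path).shape[:2]
    with open(svg_path, encoding="utf-8") as f:
        is_svg = "<svg" in f.read()

print(f"PNG size: {width_px} x {height_px} px, SVG markup found: {is_svg}")
```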

🌟 Bonus Challenges

Bonus 1: Attrition Rate by Department (Horizontal Bar)

python
# Cell 18: Bonus 1 — Attrition Analysis
fig, ax = plt.subplots(figsize=(12, 6))

# Calculate attrition rate by department
attrition_rate = df.groupby("department")["attrition"].apply(
    lambda x: (x == "Yes").mean() * 100
).sort_values(ascending=True)

# Color by risk level
colors = [COLORS["positive"] if r < 20 else
          COLORS["accent2"] if r < 30 else
          COLORS["negative"] for r in attrition_rate]

bars = ax.barh(attrition_rate.index, attrition_rate.values, color=colors, edgecolor="white")

# Data labels + risk indicators
for bar, rate in zip(bars, attrition_rate.values):
    # Plain-text risk labels: emoji glyphs are missing from Matplotlib's default font
    risk = "Low" if rate < 20 else "Medium" if rate < 30 else "High"
    ax.text(rate + 0.5, bar.get_y() + bar.get_height()/2,
            f"{rate:.1f}% ({risk})", va="center", fontsize=11, fontweight="bold")

ax.set_title("Attrition Rate by Department — Risk Assessment",
             fontsize=14, fontweight="bold", loc="left")
ax.set_xlabel("Attrition Rate (%)")
ax.axvline(20, color=COLORS["accent2"], linestyle="--", linewidth=1, alpha=0.5, label="Warning threshold (20%)")
ax.legend(frameon=False)
tufte_style(ax)

plt.tight_layout()
save_chart(fig, "bonus_attrition_rate")
plt.show()
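
The cell above draws a single bar per department. If a true stacked bar is wanted (stayed vs left, summing to 100%), one possible sketch uses a row-normalized crosstab. The toy data is invented for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "department": ["Sales"] * 4 + ["HR"] * 5,
    "attrition": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "No"],
})

# Row-normalized crosstab: each department's No/Yes shares sum to 100%.
share = pd.crosstab(toy["department"], toy["attrition"], normalize="index") * 100
share = share[["No", "Yes"]].sort_values("Yes")

fig, ax = plt.subplots(figsize=(8, 3))
share.plot(kind="barh", stacked=True, ax=ax,
           color=["#B0B0B0", "#C62828"], width=0.7)
ax.set_xlabel("Share of employees (%)")
ax.set_ylabel("")
ax.legend(title="Attrition", frameon=False, loc="upper left", bbox_to_anchor=(1, 1))
ax.spines[["top", "right"]].set_visible(False)
print(share.round(1))
```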

Bonus 2: Custom Style Theme

python
# Cell 19: Bonus 2 — Custom Matplotlib Style
def apply_corporate_theme():
    """Apply corporate/professional Matplotlib theme"""
    plt.rcParams.update({
        # Figure
        "figure.figsize": (12, 7),
        "figure.dpi": 100,
        "figure.facecolor": "white",

        # Font
        "font.size": 12,
        "font.family": "sans-serif",
        "axes.titlesize": 14,
        "axes.labelsize": 12,

        # Axes
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.facecolor": "white",
        "axes.edgecolor": "#333333",
        "axes.linewidth": 0.8,

        # Grid
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linewidth": 0.5,
        "grid.color": "#CCCCCC",

        # Ticks
        "xtick.color": "#333333",
        "ytick.color": "#333333",

        # Legend
        "legend.frameon": False,
        "legend.fontsize": 10,

        # Save
        "savefig.dpi": 300,
        "savefig.bbox": "tight",
        "savefig.facecolor": "white",
    })
    sns.set_palette("colorblind")
    print("✅ Corporate theme applied!")

apply_corporate_theme()

# Test theme
fig, ax = plt.subplots()
ax.bar(["Q1", "Q2", "Q3", "Q4"], [42, 48, 45, 52], color=COLORS["primary"])
ax.set_title("Theme Test — Corporate Style")
ax.set_ylabel("Revenue (B VND)")
plt.show()
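
apply_corporate_theme changes rcParams globally for the rest of the session. If the theme should apply to one figure only, plt.rc_context restores the previous settings on exit. A minimal sketch; CORPORATE_RC is a hypothetical subset of the theme:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Hypothetical subset of the corporate theme, for illustration.
CORPORATE_RC = {
    "axes.spines.top": False,
    "axes.spines.right": False,
    "axes.grid": True,
    "grid.alpha": 0.3,
}

before = mpl.rcParams["axes.spines.top"]

# The settings apply only to artists created inside the with-block.
with plt.rc_context(CORPORATE_RC):
    fig, ax = plt.subplots()
    ax.bar(["Q1", "Q2", "Q3", "Q4"], [42, 48, 45, 52])
    top_visible_inside = ax.spines["top"].get_visible()

after = mpl.rcParams["axes.spines.top"]
print(f"top spine visible inside context: {top_visible_inside}")
print(f"rcParams restored afterwards: {after == before}")
```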

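Bonus 3: Chart Style Comparison (Optional, advanced)

python
# Cell 20: Bonus 3: Chart Style Comparison
# Style settings are read when an Axes is created, so each subplot is created
# inside its own style context. Styling axes that already exist has little effect.
# The "seaborn-v0_8-*" style names require Matplotlib 3.6+.
fig = plt.figure(figsize=(18, 5))
styles = ["default", "seaborn-v0_8-whitegrid", "ggplot"]

for idx, style_name in enumerate(styles, start=1):
    with plt.style.context(style_name):
        ax = fig.add_subplot(1, 3, idx)
        ax.bar(["A", "B", "C", "D"], [25, 40, 30, 35])
        ax.set_title(f"Style: {style_name}", fontsize=12)
        ax.set_ylabel("Value")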

plt.suptitle("Matplotlib Style Gallery — Choose Your Favorite", fontsize=15, fontweight="bold")
plt.tight_layout()
save_chart(fig, "bonus_style_comparison")
plt.show()

📋 Deliverables

After completing the workshop, submit:

| # | File | Description |
|---|---|---|
| 1 | HoTen_Buoi10_Visualization.ipynb | Complete Jupyter Notebook; Restart & Run All succeeds |
| 2 | charts/01_bar_department.png | Chart 1: Bar chart, Headcount & Avg Salary |
| 3 | charts/02_line_revenue_trend.png | Chart 2: Line chart, Monthly Revenue Trend |
| 4 | charts/03_scatter_salary_experience.png | Chart 3: Scatter plot, Salary vs Experience |
| 5 | charts/04_heatmap_correlation.png | Chart 4: Heatmap, Correlation Matrix |
| 6 | charts/05_boxplot_salary_department.png | Chart 5: Box plot, Salary Distribution |
| 7 | charts/dashboard_hr_analytics.png | Combined multi-panel dashboard |

💡 Checklist before submitting

  • [ ] Restart & Run All: the notebook runs from start to finish without errors
  • [ ] 5 chart files in the charts/ directory (PNG, 300 dpi)
  • [ ] Dashboard has 4 summary panels
  • [ ] Every chart has a title, axis labels, annotations, and an interpretation
  • [ ] Colorblind-safe palette, e.g. seaborn's colorblind palette or the Wong palette
  • [ ] Markdown cells: an interpretation after every chart

📊 Rubric

| Criterion | Points | Description |
|---|---|---|
| Chart 1: Bar Chart (Part 2) | 15 | Horizontal bar, data labels, mean vs median comparison, IBCS-style |
| Chart 2: Line Chart (Part 3) | 15 | Trend line, anomaly annotation, data labels, fill area |
| Chart 3: Scatter Plot (Part 4) | 15 | Color by department, trend line, correlation value, underpaid highlight |
| Chart 4: Heatmap (Part 5) | 15 | Masked triangle, annotated values, strong correlation highlight |
| Chart 5: Box Plot (Part 6) | 15 | Sorted by median, median labels, violin bonus, department comparison |
| Multi-Panel Dashboard (Part 7) | 15 | 2×2 layout, consistent style, global title, footnote |
| Export & Verify (Part 8) | 5 | PNG (300 dpi) + SVG export, verification cell |
| Notebook Quality | 5 | Markdown sections, Restart & Run All OK, interpretation cells, no debug cells |
| Bonus | +15 | Attrition analysis (+5), Corporate theme (+5), Style comparison (+5) |
| Total | 100 + 15 bonus | |

⚠️ Important notes

  • Restart & Run All before submitting: the notebook must run from start to finish without errors
  • Every chart needs a Markdown interpretation; a chart without one scores 0 points
  • Charts must be exported as PNG files in the charts/ directory, not just displayed inline
  • Use a colorblind-safe palette; 3 points are deducted for the default red-green scheme
  • Chart titles should state an insight (the finding), not just a label (a description of the data)
  • Code must include comments explaining the important logic
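
A sketch of the difference between a label title and an insight title, deriving the finding from the data so the title stays correct when the data changes. The counts are illustrative (what the Cell 1 sampling weights would give on average for 1,500 employees):

```python
import pandas as pd

# Illustrative headcounts: 1,500 employees split by the Cell 1 weights.
counts = pd.Series({
    "Engineering": 525, "Sales": 375, "Marketing": 300, "HR": 150, "Finance": 150,
})

top_dept = counts.idxmax()
share = counts.max() / counts.sum() * 100

label_title = "Headcount by Department"   # describes the data
insight_title = f"{top_dept} is the largest department with {share:.0f}% of headcount"

print(label_title)
print(insight_title)
```

In a chart cell, the insight string would be passed to `ax.set_title(...)` in place of the fixed label.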