🧠 Case Study — Pandas & Numpy Data Cleaning

Trong buổi học này, chúng ta đã nắm được Pandas DataFrame, Series, loc/iloc, xử lý missing values, duplicates, và Numpy vectorized operations. Bây giờ hãy xem các kỹ năng đó được áp dụng trong thực tế như thế nào — từ một Kaggle competition kinh điển, đến production pipeline tại VNG, và cuối cùng là bài toán merge dữ liệu đa nguồn trong e-commerce Việt Nam.

Case Study 1: Kaggle Titanic — Data Cleaning Pipeline cho ML Dataset Kinh Điển

Bối cảnh

Titanic dataset là bộ dữ liệu "Hello World" của Machine Learning trên Kaggle, với hơn 15,000 teams đã tham gia competition. Dataset chứa thông tin 891 hành khách trên tàu Titanic — mục tiêu là dự đoán ai sống sót dựa trên các đặc điểm như tuổi, giới tính, hạng vé.

Tuy nhiên, trước khi xây dựng bất kỳ model nào, data cleaning chiếm 60-70% thời gian của một Data Analyst. Dataset Titanic có đủ các vấn đề kinh điển: missing values, inconsistent data types, và cần feature engineering.

Vấn đề

Khi load dataset lần đầu, một DA mới sẽ thấy ngay các vấn đề:

python

import pandas as pd
import numpy as np

df = pd.read_csv('titanic.csv')
print(df.shape)           # (891, 12)
print(df.info())
print(df.isnull().sum())

Kết quả cho thấy:

Cột	Missing	Tỷ lệ	Đánh giá
Age	177	19.87%	Có thể xử lý
Cabin	687	77.10%	Quá nhiều missing → cân nhắc drop
Embarked	2	0.22%	Dễ xử lý

Ngoài ra, cột Name chứa title (Mr, Mrs, Miss...) có thể extract, cột Ticket format không nhất quán, và Fare có outliers rõ rệt.

Giải pháp

Team xây dựng một data cleaning pipeline có hệ thống với Pandas và Numpy:

Bước 1 — Xử lý Missing Values theo chiến lược phù hợp:

python

# Age: fillna bằng median theo Pclass (hành khách cùng hạng có tuổi tương đồng)
df['Age'] = df.groupby('Pclass')['Age'].transform(
    lambda x: x.fillna(x.median())
)

# Cabin: 77% missing → tạo binary feature "HasCabin" thay vì drop hoàn toàn
df['HasCabin'] = np.where(df['Cabin'].notna(), 1, 0)
df.drop(columns=['Cabin'], inplace=True)

# Embarked: chỉ 2 missing → fillna bằng mode
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)

Nguyên tắc chọn chiến lược fillna

< 5% missing → fillna bằng mean/median/mode tùy distribution
5-30% missing → fillna có điều kiện (groupby rồi fill) hoặc model-based imputation
> 50% missing → cân nhắc drop cột hoặc tạo binary feature "HasX"

Bước 2 — Feature Engineering với Numpy:

python

# Extract Title từ Name bằng str.extract
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.')

# Gom nhóm Title hiếm bằng np.where
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
df['Title'] = np.where(df['Title'].isin(rare_titles), 'Rare', df['Title'])
df['Title'] = np.where(df['Title'] == 'Mlle', 'Miss', df['Title'])
df['Title'] = np.where(df['Title'] == 'Ms', 'Miss', df['Title'])
df['Title'] = np.where(df['Title'] == 'Mme', 'Mrs', df['Title'])

# Tạo FamilySize và IsAlone
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = np.where(df['FamilySize'] == 1, 1, 0)

Bước 3 — Chuyển đổi dtypes và loại bỏ duplicates:

python

# Kiểm tra duplicates
print(f"Duplicates: {df.duplicated().sum()}")  # 0 trong trường hợp này

# Chuyển đổi dtypes cho phù hợp
df['Survived'] = df['Survived'].astype('category')
df['Pclass'] = df['Pclass'].astype('category')
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype('int8')

# Kiểm tra data quality sau cleaning
print(f"Missing values: {df.isnull().sum().sum()}")  # 0
print(f"Shape after cleaning: {df.shape}")

Kết quả

Metric	Trước cleaning	Sau cleaning
Missing values tổng	866	0
Số cột	12	13
Data quality score	58%	100%
Memory usage	83.7 KB	62.4 KB
Kaggle accuracy (sau modeling)	~72%	~80%

INFO

Chỉ riêng việc cleaning và feature engineering đúng cách đã giúp tăng accuracy từ 72% → 80% — một improvement đáng kể mà không cần thay đổi model.

Bài học cho DA

Hiểu data trước khi clean — dùng df.info(), df.describe(), df.isnull().sum() là bước đầu tiên bắt buộc.
Missing values không phải lúc nào cũng cần drop — Cabin 77% missing nhưng vẫn trích xuất được thông tin hữu ích (HasCabin).
Groupby + fillna chính xác hơn fillna toàn bộ — tuổi trung bình hạng 1 khác hạng 3 rất nhiều.
np.where là công cụ cực mạnh cho conditional transformation — gọn hơn nhiều so với if-else loops.

Case Study 2: VNG — Pandas Pipeline Xử Lý Log Data Game Zalo Mini App

Bối cảnh

VNG Corporation là một trong những công ty công nghệ lớn nhất Việt Nam, vận hành hệ sinh thái Zalo với hơn 75 triệu người dùng. Bộ phận Zalo Mini App cho phép developers tạo mini games ngay trong Zalo — mỗi ngày sinh ra hàng triệu dòng log data từ các game sessions.

Team Data Analytics gồm 5 người chịu trách nhiệm phân tích hành vi người chơi, phát hiện anomalies, và cung cấp insights cho product team. Trước đây, pipeline chạy bằng Python thuần với for-loops, mất 45 phút để xử lý log 1 ngày (~3 triệu rows).

Vấn đề

Log data game có các vấn đề đặc trưng:

Volume lớn: 3-5 triệu rows/ngày, mỗi row là 1 event (login, play, purchase, logout)
Timestamps không nhất quán: một số log ghi UTC+0, một số UTC+7, format lẫn lộn giữa ISO 8601 và Unix timestamp
Duplicate events: do retry mechanism, ~8% events bị gửi 2-3 lần
Missing fields: khoảng 12% rows thiếu user_id hoặc session_id do bug client
Performance: pipeline cũ dùng for-loop chạy quá chậm cho daily report deadline 9h sáng

Giải pháp

Team refactor toàn bộ pipeline sang Pandas + Numpy vectorized operations:

Bước 1 — Load và chuẩn hóa timestamps:

python

import pandas as pd
import numpy as np

# Load chunked để tránh memory overflow
chunks = pd.read_json('game_logs_20260217.json', lines=True, chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)
print(f"Raw rows: {len(df):,}")  # 3,247,891

# Chuẩn hóa timestamp: detect & convert mixed formats
def normalize_timestamp(ts_series):
    # Unix timestamps (số nguyên > 1e9) → convert
    mask_unix = ts_series.apply(lambda x: isinstance(x, (int, float)) and x > 1e9)
    ts_series = ts_series.copy()
    ts_series[mask_unix] = pd.to_datetime(ts_series[mask_unix], unit='s', utc=True)
    ts_series[~mask_unix] = pd.to_datetime(ts_series[~mask_unix], utc=True)
    return pd.to_datetime(ts_series, utc=True).dt.tz_convert('Asia/Ho_Chi_Minh')

df['event_time'] = normalize_timestamp(df['event_time'])

Bước 2 — Loại bỏ duplicates thông minh:

python

# Duplicate detection: cùng user_id + event_type + timestamp (within 2 seconds)
df = df.sort_values(['user_id', 'event_time'])
df['time_diff'] = df.groupby(['user_id', 'event_type'])['event_time'].diff()
duplicate_mask = df['time_diff'] < pd.Timedelta(seconds=2)

print(f"Duplicates detected: {duplicate_mask.sum():,}")  # 267,412 (8.2%)
df = df[~duplicate_mask].drop(columns=['time_diff'])
print(f"After dedup: {len(df):,}")  # 2,980,479

Bước 3 — Xử lý missing values và tính metrics bằng Numpy:

python

# Missing user_id: drop (không thể phân tích nếu không biết user)
df = df.dropna(subset=['user_id'])

# Missing session_id: generate từ user_id + event_time (30-min window)
mask_no_session = df['session_id'].isna()
df.loc[mask_no_session, 'session_id'] = (
    df.loc[mask_no_session, 'user_id'].astype(str) + '_' +
    df.loc[mask_no_session, 'event_time'].dt.strftime('%Y%m%d%H%M')
)

# Numpy vectorized: tính session duration, spend per session
session_stats = df.groupby('session_id').agg(
    duration_sec=('event_time', lambda x: (x.max() - x.min()).total_seconds()),
    total_spend=('amount', 'sum'),
    event_count=('event_type', 'count')
)

# Numpy statistical functions cho anomaly detection
mean_duration = np.mean(session_stats['duration_sec'])
std_duration = np.std(session_stats['duration_sec'])
anomaly_threshold = mean_duration + 3 * std_duration
anomalies = session_stats[session_stats['duration_sec'] > anomaly_threshold]
print(f"Anomalous sessions: {len(anomalies):,}")

Lưu ý với dữ liệu lớn

Khi xử lý > 1 triệu rows, tuyệt đối tránh dùng for-loop hoặc df.apply() row-by-row. Numpy vectorized operations nhanh hơn 50-100x. Nếu vẫn chậm, hãy cân nhắc chunked processing hoặc chuyển sang Dask/Polars.

Kết quả

Metric	Pipeline cũ (for-loop)	Pipeline mới (Pandas + Numpy)
Thời gian xử lý 3M rows	45 phút	3 phút 20 giây
Memory peak	8.2 GB	2.1 GB
Duplicates removed	Không xử lý	267,412 rows (8.2%)
Missing sessions filled	Không xử lý	358,000 sessions
Anomalies detected/ngày	0	~150 sessions

Pipeline mới giúp team hoàn thành daily report trước 8h sáng thay vì chạy deadline 9h30. Product team nhận insights sớm hơn 90 phút mỗi ngày.

Bài học cho DA

Vectorized operations là bắt buộc khi xử lý data lớn — performance gap giữa loop và vectorized là 10-100x.
Duplicate detection cần logic, không chỉ đơn giản drop_duplicates() — trong log data, "duplicate" nghĩa là events quá gần nhau chứ không phải identical rows.
Missing data có thể reconstruct — session_id thiếu nhưng có thể tạo lại từ user_id + time window.
np.mean, np.std kết hợp với threshold rule là cách anomaly detection đơn giản nhưng hiệu quả.

Case Study 3: E-commerce Việt Nam — Merge Data Đa Nguồn Thành Customer 360°

Bối cảnh

Một sàn e-commerce top 5 Việt Nam (doanh thu ~2,000 tỷ VNĐ/năm) bán hàng qua 3 kênh: website, mobile app, và call center. Mỗi kênh sử dụng hệ thống riêng biệt với database schema khác nhau. Ban lãnh đạo yêu cầu team DA xây dựng Customer 360° view — hợp nhất toàn bộ dữ liệu khách hàng từ 3 nguồn để phân tích cross-channel behavior.

Team DA gồm 3 người, deadline: 2 tuần để có được unified dataset phục vụ báo cáo chiến lược quý.

Vấn đề

Dữ liệu 3 nguồn có schema và format rất khác nhau:

Đặc điểm	Website	Mobile App	Call Center
Customer ID	`email`	`phone`	`phone` hoặc `email`
Order date	`created_at` (ISO)	`order_ts` (Unix)	`ngay_dat` (dd/mm/yyyy)
Tổng tiền	`total_usd` (USD)	`amount` (VND)	`gia_tri` (VND, có dấu chấm)
Địa chỉ	`shipping_address`	`addr`	`dia_chi_giao`
Rows/tháng	~120,000	~85,000	~15,000

Vấn đề chính: không có chung 1 customer ID giữa 3 hệ thống. Cùng 1 khách hàng có thể dùng email trên website, phone trên app, và cả hai trên call center.

Giải pháp

Team xây dựng multi-source merge pipeline gồm 3 giai đoạn:

Giai đoạn 1 — Chuẩn hóa schema từng nguồn:

python

import pandas as pd
import numpy as np

# Load data từ 3 nguồn
web = pd.read_csv('website_orders.csv')       # 120,847 rows
app = pd.read_json('app_orders.json')          # 84,521 rows
call = pd.read_excel('callcenter_orders.xlsx') # 15,233 rows

# --- Chuẩn hóa Website ---
web = web.rename(columns={
    'created_at': 'order_date', 'total_usd': 'amount_vnd',
    'shipping_address': 'address', 'email': 'customer_email'
})
web['order_date'] = pd.to_datetime(web['order_date'])
web['amount_vnd'] = (web['amount_vnd'] * 25_400).astype(int)  # USD → VND
web['source'] = 'website'

# --- Chuẩn hóa Mobile App ---
app = app.rename(columns={
    'order_ts': 'order_date', 'amount': 'amount_vnd',
    'addr': 'address', 'phone': 'customer_phone'
})
app['order_date'] = pd.to_datetime(app['order_date'], unit='s')
app['source'] = 'app'

# --- Chuẩn hóa Call Center ---
call = call.rename(columns={
    'ngay_dat': 'order_date', 'gia_tri': 'amount_vnd',
    'dia_chi_giao': 'address'
})
call['order_date'] = pd.to_datetime(call['order_date'], format='%d/%m/%Y')
# Xử lý format tiền: "1.250.000" → 1250000
call['amount_vnd'] = (
    call['amount_vnd'].astype(str).str.replace('.', '', regex=False).astype(int)
)
call['source'] = 'callcenter'

Giai đoạn 2 — Concat tất cả orders và xử lý data quality:

python

# Chọn cột chung và concat
common_cols = ['order_date', 'amount_vnd', 'address', 'source',
               'customer_email', 'customer_phone']

# Đảm bảo tất cả DataFrame có đủ cột (fill NaN nếu thiếu)
for df in [web, app, call]:
    for col in common_cols:
        if col not in df.columns:
            df[col] = np.nan

orders = pd.concat(
    [web[common_cols], app[common_cols], call[common_cols]],
    ignore_index=True
)
print(f"Total orders: {len(orders):,}")  # 220,601

# Chuẩn hóa email và phone
orders['customer_email'] = orders['customer_email'].str.lower().str.strip()
orders['customer_phone'] = (
    orders['customer_phone'].astype(str)
    .str.replace(r'[^0-9]', '', regex=True)   # xóa ký tự đặc biệt
    .str.replace(r'^84', '0', regex=True)      # 84xxx → 0xxx
)
orders.loc[orders['customer_phone'] == 'nan', 'customer_phone'] = np.nan

# Xóa duplicates (cùng email/phone + cùng ngày + cùng amount)
before_dedup = len(orders)
orders = orders.drop_duplicates(
    subset=['customer_email', 'customer_phone', 'order_date', 'amount_vnd']
)
print(f"Removed {before_dedup - len(orders):,} duplicates")  # 3,412

Giai đoạn 3 — Tạo Customer 360° bằng merge:

python

# Bước 3a: Tạo mapping email ↔ phone từ call center (có cả 2 thông tin)
email_phone_map = (
    call[['customer_email', 'customer_phone']]
    .dropna()
    .drop_duplicates()
)
print(f"Email-phone mappings: {len(email_phone_map):,}")  # 8,742

# Bước 3b: Merge mapping vào orders để fill missing identifiers
orders = orders.merge(
    email_phone_map, on='customer_phone', how='left', suffixes=('', '_mapped')
)
orders['customer_email'] = orders['customer_email'].fillna(
    orders['customer_email_mapped']
)
orders.drop(columns=['customer_email_mapped'], inplace=True)

# Bước 3c: Tạo unified customer_id
orders['customer_id'] = (
    orders['customer_email']
    .fillna(orders['customer_phone'])
    .fillna('unknown_' + orders.index.astype(str))
)

# Bước 3d: Aggregate thành Customer 360°
customer_360 = orders.groupby('customer_id').agg(
    total_orders=('order_date', 'count'),
    total_revenue=('amount_vnd', 'sum'),
    avg_order_value=('amount_vnd', 'mean'),
    first_order=('order_date', 'min'),
    last_order=('order_date', 'max'),
    channels_used=('source', 'nunique'),
    channel_list=('source', lambda x: ', '.join(sorted(x.unique())))
).reset_index()

# Phân loại khách hàng multi-channel
customer_360['is_multichannel'] = np.where(
    customer_360['channels_used'] > 1, True, False
)
print(f"Total unique customers: {len(customer_360):,}")
print(f"Multi-channel customers: {customer_360['is_multichannel'].sum():,}")

Tại sao dùng Call Center làm "cầu nối" email-phone?

Call center là kênh duy nhất thường thu thập cả email lẫn số điện thoại khi xác nhận đơn hàng. Dù volume nhỏ nhất (15K orders/tháng), nó đóng vai trò bridge table để link website customers (chỉ có email) với app customers (chỉ có phone). Đây là kỹ thuật identity resolution phổ biến trong data analytics.

Kết quả

Metric	Giá trị
Tổng orders hợp nhất	217,189
Unique customers identified	142,367
Multi-channel customers	18,421 (12.9%)
Revenue multi-channel	31% tổng doanh thu
Customer match rate	94.7%
Unknown customers	5.3%

Insight quan trọng

Khách hàng multi-channel chỉ chiếm 12.9% số lượng nhưng đóng góp 31% doanh thu. AOV (Average Order Value) của họ cao gấp 2.4 lần so với single-channel customers. Insight này giúp ban lãnh đạo quyết định đầu tư vào chiến lược omnichannel.

Bài học cho DA

pd.concat để stack, pd.merge để join — hiểu rõ khi nào dùng cái nào là skill quan trọng nhất khi làm việc với multi-source data.
Schema normalization phải làm trước khi merge — rename columns, convert dtypes, standardize formats.
Identity resolution là bài toán thực tế mà hầu hết công ty Việt Nam đều gặp — không có unified customer ID.
Call center data nhỏ nhưng giá trị lớn vì chứa cross-reference giữa email và phone.

So sánh & Tổng hợp

Tiêu chí	Case 1: Kaggle Titanic	Case 2: VNG Game Logs	Case 3: E-commerce Merge
Scale	891 rows	3+ triệu rows/ngày	220K rows/tháng
Kỹ năng chính	fillna, np.where, dtypes	vectorized ops, dedup	pd.merge, pd.concat
Thách thức	Missing values strategy	Performance	Schema mismatch
Pandas functions	groupby, fillna, map, astype	sort_values, diff, groupby.agg	rename, merge, concat, agg
Numpy functions	np.where	np.mean, np.std	np.where, np.nan
Impact	+8% model accuracy	13x faster pipeline	Customer 360° view

Pandas & Numpy Skill Progression

Level 1 (Titanic): Làm sạch 1 dataset nhỏ — fillna, dropna, astype, np.whereLevel 2 (VNG): Xử lý data lớn hiệu quả — vectorized ops, chunked loading, groupby.aggLevel 3 (E-commerce): Tích hợp nhiều nguồn — pd.merge, pd.concat, identity resolution, schema normalization

Bài tập tư duy

1. Chiến lược fillna nào phù hợp?

Bạn có dataset 50,000 nhân viên với cột salary bị missing 25%. Ba cách tiếp cận:

A: df['salary'].fillna(df['salary'].median())
B: df.groupby('department')['salary'].transform(lambda x: x.fillna(x.median()))
C: Drop toàn bộ rows missing salary

Cách nào tốt nhất và tại sao? (Gợi ý: nghĩ về sự khác biệt lương giữa các phòng ban — kế toán vs engineering vs marketing)

2. Loop vs Vectorized — Khi nào chấp nhận loop?

Team VNG chuyển từ for-loop sang vectorized operations và tăng speed 13x. Nhưng có trường hợp nào bạn buộc phải dùng loop không? Cho ví dụ cụ thể.

(Gợi ý: nghĩ về trường hợp giá trị row hiện tại phụ thuộc vào kết quả tính toán từ row trước đó — stateful computation)

3. Thiết kế merge pipeline

Bạn cần merge dữ liệu từ 4 nguồn: Shopee (API JSON), Lazada (CSV), Tiki (Excel), và POS cửa hàng (SQL database). Hãy viết pseudo-code cho pipeline, bao gồm:

Load data từ 4 nguồn
Chuẩn hóa schema
Xử lý customer matching (chỉ có phone là common field)
Tạo unified orders table
Tính KPIs theo channel

(Áp dụng pattern từ Case Study 3 nhưng mở rộng thêm 1 nguồn)

4. Data quality scoring

Được hỗ trợ bởi Case Study 1, hãy viết hàm data_quality_score(df) trả về điểm từ 0-100 dựa trên:

% missing values (trọng số 40%)
% duplicates (trọng số 20%)
Đúng dtypes (trọng số 20%)
Không có outliers extreme (trọng số 20%)

python

def data_quality_score(df):
    # Viết implementation ở đây
    pass

🧠 Case Study — Pandas & Numpy Data Cleaning ​

Case Study 1: Kaggle Titanic — Data Cleaning Pipeline cho ML Dataset Kinh Điển ​

Bối cảnh ​

Vấn đề ​

Giải pháp ​

Kết quả ​

Bài học cho DA ​

Case Study 2: VNG — Pandas Pipeline Xử Lý Log Data Game Zalo Mini App ​

Bối cảnh ​

Vấn đề ​

Giải pháp ​

Kết quả ​

Bài học cho DA ​

Case Study 3: E-commerce Việt Nam — Merge Data Đa Nguồn Thành Customer 360° ​

Bối cảnh ​

Vấn đề ​

Giải pháp ​

Kết quả ​

Bài học cho DA ​

So sánh & Tổng hợp ​

Bài tập tư duy ​

🧠 Case Study — Pandas & Numpy Data Cleaning

Case Study 1: Kaggle Titanic — Data Cleaning Pipeline cho ML Dataset Kinh Điển

Bối cảnh

Vấn đề

Giải pháp

Kết quả

Bài học cho DA

Case Study 2: VNG — Pandas Pipeline Xử Lý Log Data Game Zalo Mini App

Bối cảnh

Vấn đề

Giải pháp

Kết quả

Bài học cho DA

Case Study 3: E-commerce Việt Nam — Merge Data Đa Nguồn Thành Customer 360°

Bối cảnh

Vấn đề

Giải pháp

Kết quả

Bài học cho DA

So sánh & Tổng hợp

Bài tập tư duy