Data Mining Project Guide

This README explains how to run the data preprocessing, model training, and evaluation steps in the Jupyter notebook DM.ipynb. The project covers preprocessing of phone data, training price-prediction models (XGBoost and MLP), and clustering (KMeans, GMM, and Hierarchical Clustering).

Requirements

Software and Libraries

- Python 3.11 (or a compatible version, such as the one provided by Colab)
- Google Colab (recommended for GPU support and Google Drive integration)

Install the following Python libraries:

    pip install pandas numpy scikit-learn xgboost matplotlib scipy

Directory Structure

Make sure the following files and folders are set up in Google Drive under /content/drive/MyDrive/KHDL_CK:

Input files:
- gsmarena_phone_data.csv: the original dataset
- process_train.py: script that processes the training data
- process_test.py: script that processes the test data
- drop.py: script that cleans the processed data
Output folder:
- Create the folder /content/drive/MyDrive/KHDL_CK to store output files (for example, processed CSV files and plots).

Google Drive Setup

Mount Google Drive in Colab:

    from google.colab import drive
    drive.mount('/content/drive')

Make sure all required files have been uploaded to /content/drive/MyDrive/KHDL_CK.
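
Before running the notebook, it can help to confirm that the input files are actually present. The sketch below is a minimal check, not part of DM.ipynb: the helper name `missing_files` is hypothetical, and the file list simply mirrors the input files named in this README.

```python
import os

# Input files this README expects in the project folder.
REQUIRED_FILES = [
    'gsmarena_phone_data.csv',
    'process_train.py',
    'process_test.py',
    'drop.py',
]

def missing_files(folder, names=REQUIRED_FILES):
    """Return the names from `names` that do not exist inside `folder`."""
    return [n for n in names if not os.path.isfile(os.path.join(folder, n))]

drive_path = '/content/drive/MyDrive/KHDL_CK'
missing = missing_files(drive_path)
if missing:
    print('Missing files:', ', '.join(missing))
else:
    print('All required files found.')
```

Running this right after mounting Drive reports any file that still needs to be uploaded.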

How to Run

Step 1: Data Preprocessing

The preprocessing steps split the data into training and test sets, then process and clean it.

Splitting the data:

Run the first cell in DM.ipynb to:
- Read gsmarena_phone_data.csv.
- Split the data into a training set (80%) and a test set (20%).
- Save the results as raw_data_train.csv and raw_data_test.csv (both locally and on Google Drive).

Reference code:

    from google.colab import drive
    drive.mount('/content/drive')

    import os
    import pandas as pd
    from sklearn.model_selection import train_test_split

    drive_path = '/content/drive/MyDrive/KHDL_CK'
    csv_input = os.path.join(drive_path, 'gsmarena_phone_data.csv')

    df = pd.read_csv(csv_input)

    # 80/20 split; the fixed random_state keeps the split reproducible.
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    train_df.to_csv(os.path.join(drive_path, 'raw_data_train.csv'), index=False)
    test_df.to_csv(os.path.join(drive_path, 'raw_data_test.csv'), index=False)

Processing the data:
- Run the scripts process_train.py and process_test.py to create processed_data_train.csv and processed_data_test.csv.
- Reference code:
```
import subprocess

# drive_path is defined in the splitting step above.
# check=True raises an error if a script exits with a non-zero status.
for script in ['process_train.py', 'process_test.py']:
    subprocess.run(['python', script], cwd=drive_path, check=True)
```

Cleaning the data:

Run the script drop.py to clean the processed data and create clean_data_train.csv and clean_data_test.csv.

Reference code:

    !python drop.py -i /content/drive/MyDrive/KHDL_CK/processed_data_train.csv -o /content/drive/MyDrive/KHDL_CK/clean_data_train.csv
    !python drop.py -i /content/drive/MyDrive/KHDL_CK/processed_data_test.csv -o /content/drive/MyDrive/KHDL_CK/clean_data_test.csv

These commands refer to drop.py by a relative path, so they work only when the notebook's working directory contains the script; otherwise, pass the script's full path.

Summarizing the data:

Run the summary code to inspect the cleaned datasets (clean_data_train.csv and clean_data_test.csv).

Reference code:

    import pandas as pd

    def summarize(df, label):
        """Print the shape, descriptive statistics, and null counts of df."""
        print(f"\n=== Report for: {label} ===")
        print("Shape (rows, cols):", df.shape)
        print("\n--- Descriptive statistics ---")
        print(df.describe())
        print("\n--- Null values per column ---")
        print(df.isnull().sum())

    paths = {
        'clean_data_train': '/content/drive/MyDrive/KHDL_CK/clean_data_train.csv',
        'clean_data_test': '/content/drive/MyDrive/KHDL_CK/clean_data_test.csv',
    }

    for label, path in paths.items():
        df = pd.read_csv(path)
        summarize(df, label)

Step 2: Advanced Data Processing

This step adds the cpu_score feature and handles missing values.

Running the processing code:

Run the code to:
- Add cpu_score, computed from total_cores, pref_cores, and max_freq_ghz.
- Handle missing values:
  - Fill cpu_cols (total_cores, max_freq_ghz, pref_cores) with 0.
  - Drop rows where Price_VND is null.
  - Fill brand and GPU_brand with 'NON'.
  - Fill the remaining numeric columns with their mean.
- Save the processed files as clean_data_train_processed.csv and clean_data_test_processed.csv.
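
To make the order of operations concrete, here is a small worked example on a made-up three-row table (the values are illustrative, not taken from the dataset). It applies the same cpu_score formula as the reference code and shows that a row with a missing CPU field ends up with a missing cpu_score when the score is computed before the zero-fill.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    'total_cores':  [8, 4, np.nan],
    'pref_cores':   [2, 0, 2],
    'max_freq_ghz': [3.0, 2.0, 2.5],
})

# Same formula as the reference code: (total_cores + pref_cores) * max_freq_ghz
df['cpu_score'] = (df['total_cores'] + df['pref_cores']) * df['max_freq_ghz']

# Zero-fill the raw CPU columns afterwards, in the same order as the reference code.
cpu_cols = ['total_cores', 'max_freq_ghz', 'pref_cores']
df[cpu_cols] = df[cpu_cols].fillna(0)

print(df)
```

The first two rows get scores of 30.0 and 8.0, while the third keeps a NaN score even though its total_cores is now 0. If every row needs a numeric cpu_score (for example, for models that reject NaN), one option is to fill the CPU columns before computing the score; whether that matches the notebook's intent is an assumption.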

Reference code:

    import pandas as pd

    base_path = '/content/drive/MyDrive/KHDL_CK'
    files = {
        'train': f'{base_path}/clean_data_train.csv',
        'test': f'{base_path}/clean_data_test.csv',
    }

    def process_file(path, label):
        df = pd.read_csv(path)

        # cpu_score is computed before the zero-fill below, so rows with a
        # missing CPU field keep a NaN cpu_score.
        df['cpu_score'] = (df['total_cores'] + df['pref_cores']) * df['max_freq_ghz']

        cpu_cols = ['total_cores', 'max_freq_ghz', 'pref_cores']
        df[cpu_cols] = df[cpu_cols].fillna(0)

        # Rows without a target value cannot be used for training or evaluation.
        df = df.dropna(subset=['Price_VND'])

        df['brand'] = df['brand'].fillna('NON')
        df['GPU_brand'] = df['GPU_brand'].fillna('NON')

        # Mean-fill every numeric column except the CPU columns, the target, and cpu_score.
        numeric = df.select_dtypes(include='number').columns
        others = numeric.difference(cpu_cols + ['Price_VND', 'cpu_score'])
        df[others] = df[others].fillna(df[others].mean())

        out_path = path.replace('.csv', '_processed.csv')
        df.to_csv(out_path, index=False)
        print(f"→ Saved processed to {out_path}")

    for label, fp in files.items():
        process_file(fp, label)

Step 3: Model Training and Evaluation

Train and evaluate the XGBoost and MLP models for price prediction.

XGBoost Model

Training XGBoost:

- Use RandomizedSearchCV to tune hyperparameters and train the XGBoost model.
- Evaluate with R² and MSE.
- Save the plots (Actual vs Predicted, Residuals, Histogram, QQ-Plot, Feature Importance) to /content/drive/MyDrive/KHDL_CK.
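
For reference, both metrics can be computed by hand. The sketch below uses made-up numbers and checks the manual formulas against scikit-learn's `r2_score` and `mean_squared_error`: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up prices for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.0])

residuals = y_true - y_pred

# MSE: mean of squared residuals.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - SS_res / SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print('manual :', mse_manual, r2_manual)
print('sklearn:', mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Because the target is a price in VND, MSE is expressed in squared VND and can be very large; taking its square root (RMSE) gives an error in the same unit as the price.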

Reference code:

    import os
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from xgboost import XGBRegressor, plot_importance
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.metrics import r2_score, mean_squared_error
    import scipy.stats as stats

    output_dir = '/content/drive/MyDrive/KHDL_CK'
    os.makedirs(output_dir, exist_ok=True)

    # Load the data (replace with your processed data).
    train_df = pd.read_csv('/content/drive/MyDrive/KHDL_CK/clean_data_train_processed.csv')
    test_df = pd.read_csv('/content/drive/MyDrive/KHDL_CK/clean_data_test_processed.csv')

    # Note: XGBRegressor expects numeric features, so string columns such as
    # brand and GPU_brand need to be encoded before fitting.
    X_train = train_df.drop(columns=['Price_VND'])
    y_train = train_df['Price_VND']
    X_test = test_df.drop(columns=['Price_VND'])
    y_test = test_df['Price_VND']

    # Search space for the randomized hyperparameter search.
    param_dist = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
        'max_depth': [3, 4, 5, 6, 7, 8],
        'n_estimators': [100, 200, 300, 500],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        'gamma': [0, 0.1, 0.2],
        'reg_alpha': [0, 0.1, 0.5],
        'reg_lambda': [1.0, 1.5, 2.0],
    }

    xgb = XGBRegressor(objective='reg:squarederror', random_state=42)
    rs = RandomizedSearchCV(
        estimator=xgb,
        param_distributions=param_dist,
        n_iter=50,
        cv=5,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_jobs=-1,
        verbose=1,
    )
    rs.fit(X_train, y_train)

    # Evaluate the best estimator on the held-out test set.
    best_xgb = rs.best_estimator_
    y_pred = best_xgb.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print("Best XGB params:", rs.best_params_)
    print(f"Test R²: {r2:.4f}, Test MSE: {mse:.2f}")

    residuals = y_test - y_pred

    # Save the plots (Actual vs Predicted, Residuals, Histogram, QQ-Plot, Feature Importance)
    # ... (see the plotting code in DM.ipynb)

MLP Model

Training the MLP:

- Use RandomizedSearchCV to tune the hyperparameters of MLPRegressor.
- Evaluate with R² and MSE.
- Save the plots (Actual vs Predicted, Residuals, Histogram, QQ-Plot, Loss Curve) to /content/drive/MyDrive/KHDL_CK.
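
MLPRegressor is generally sensitive to feature scale, and the reference code as shown in this README does not include a scaling step. One possible variant, not taken from DM.ipynb, wraps the model in a scikit-learn Pipeline with StandardScaler so that scaling is fitted inside each cross-validation fold. The synthetic data from `make_regression` stands in for the real features, and the small search space is only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the phone features and prices (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPRegressor(max_iter=300, random_state=42)),
])

# Parameters of a pipeline step are addressed as '<step>__<param>'.
param_dist = {
    'mlp__hidden_layer_sizes': [(50,), (100,)],
    'mlp__alpha': np.logspace(-5, -1, 5),
    'mlp__learning_rate_init': [1e-3, 1e-2],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist,
    n_iter=4,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
search.fit(X, y)
print('Best params:', search.best_params_)
```

Keeping the scaler inside the pipeline means the test folds never influence the scaling statistics during the search.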

Reference code:

    # Reuses np, RandomizedSearchCV, the metrics, and X_train / y_train /
    # X_test / y_test from the XGBoost code above.
    from sklearn.neural_network import MLPRegressor

    param_dist = {
        'hidden_layer_sizes': [(50,), (100,), (100, 50), (100, 100, 50)],
        'activation': ['relu', 'tanh'],
        'alpha': np.logspace(-5, -1, 5),
        'learning_rate_init': [1e-4, 1e-3, 1e-2],
        'learning_rate': ['constant', 'adaptive'],
    }

    mlp = MLPRegressor(max_iter=1000, random_state=42)
    rs_mlp = RandomizedSearchCV(
        estimator=mlp,
        param_distributions=param_dist,
        n_iter=30,
        cv=5,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_jobs=-1,
        verbose=1,
    )
    rs_mlp.fit(X_train, y_train)

    # Evaluate the best estimator on the held-out test set.
    best_mlp = rs_mlp.best_estimator_
    y_pred = best_mlp.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print("Best MLP params:", rs_mlp.best_params_)
    print(f"Test R²: {r2:.4f}, Test MSE: {mse:.2f}")

    # Save the plots (Actual vs Predicted, Residuals, Histogram, QQ-Plot, Loss Curve)
    # ... (see the plotting code in DM.ipynb)

Step 4: Clustering

Perform clustering with KMeans, Gaussian Mixture Model (GMM), and Hierarchical Clustering.

KMeans Clustering

Running KMeans:

- Load the normalized data (train_norm.csv, test_norm.csv).
- Select the numeric columns and standardize them.
- Use the Elbow method and the Silhouette score to choose the number of clusters (k).
- Fit KMeans with the best k and visualize the result.
- Save the plots (Elbow, Silhouette, PCA Scatter, Price Boxplot) to /content/drive/MyDrive/KHDL_CK.
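
The PCA scatter plot needs a two-dimensional projection of the scaled features. Below is a minimal sketch of that projection on synthetic blobs; the data and the choice of three clusters are assumptions for illustration, and the plotting itself is left to DM.ipynb.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled phone features (assumption).
X, _ = make_blobs(n_samples=150, centers=3, n_features=6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Project onto the first two principal components for a 2-D scatter plot.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_scaled)

proj = pd.DataFrame(coords, columns=['pc1', 'pc2'])
proj['cluster'] = labels

print(proj.groupby('cluster')[['pc1', 'pc2']].mean())
print('Explained variance ratio:', pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original spread the two plotted axes retain, which helps when judging how faithful the scatter plot is.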

Reference code:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # The location of the normalized files is not specified in this README;
    # adjust the paths to where they are stored.
    train_norm = pd.read_csv('train_norm.csv')
    test_norm = pd.read_csv('test_norm.csv')
    df_all = pd.concat([train_norm, test_norm], ignore_index=True)

    numeric_cols = ['Price_VND', 'Announced_year', 'total_cores', 'max_freq_ghz',
                    'pref_cores', 'length_mm', 'width_mm', 'height_mm', 'internal_gb',
                    'resolution_px_w', 'resolution_px_h', 'size_in', 'battery_mAh',
                    'weight_g', 'ram_gb', 'cpu_score', 'brand_encoded', 'GPU_brand_encoded']
    X = df_all[numeric_cols].copy()

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Collect inertia (for the Elbow plot) and Silhouette score for each k.
    sse = []
    sil_scores = []
    K_range = range(2, 11)
    for k in K_range:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X_scaled)
        sse.append(km.inertia_)
        sil_scores.append(silhouette_score(X_scaled, labels))

    # Pick the k with the highest Silhouette score.
    best_k = K_range[int(np.argmax(sil_scores))]
    kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
    df_all['cluster'] = kmeans.fit_predict(X_scaled)

    # Save the plots (Elbow, Silhouette, PCA Scatter, Price Boxplot)
    # ... (see the plotting code in DM.ipynb)

Gaussian Mixture Model (GMM)

Running GMM:

- Load the normalized data and select the numeric columns.
- Use BIC and the Silhouette score to choose the number of components.
- Fit the GMM and visualize the result.
- Save the plots (BIC, Silhouette, PCA Scatter, Price Boxplot) to /content/drive/MyDrive/KHDL_CK.
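
BIC trades goodness of fit against model complexity, and lower values are better, which is why the reference code picks the minimum. Here is a small self-contained illustration on synthetic data with three well-separated groups; the data and the candidate range are invented, and on real data the BIC minimum may be much less clear-cut.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic groups (assumption, for illustration).
centers = [[-5, -5], [0, 5], [5, -5]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

candidates = list(range(1, 7))
bics = []
for n in candidates:
    gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))  # lower BIC is better

best_n = candidates[int(np.argmin(bics))]

for n, b in zip(candidates, bics):
    print(f'n_components={n}: BIC={b:.1f}')
print('Selected n_components:', best_n)
```

Printing the whole BIC curve, rather than only the minimum, makes it easier to see whether several component counts score almost the same.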

Reference code:

    # Reuses np, df_all, numeric_cols, scaler, and silhouette_score from the KMeans code above.
    from sklearn.mixture import GaussianMixture

    X = df_all[numeric_cols].values
    X_scaled = scaler.fit_transform(X)

    # Collect BIC and Silhouette score for each candidate number of components.
    n_components = range(2, 11)
    bics = []
    sils = []
    for n in n_components:
        gmm = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
        gmm.fit(X_scaled)
        labels = gmm.predict(X_scaled)
        bics.append(gmm.bic(X_scaled))
        sils.append(silhouette_score(X_scaled, labels))

    # Pick the number of components with the lowest BIC.
    best_n = n_components[int(np.argmin(bics))]
    gmm_final = GaussianMixture(n_components=best_n, covariance_type='full', random_state=42)
    df_all['cluster'] = gmm_final.fit_predict(X_scaled)

    # Save the plots (BIC, Silhouette, PCA Scatter, Price Boxplot)
    # ... (see the plotting code in DM.ipynb)

Hierarchical Clustering

Running Hierarchical Clustering:

- Load the normalized data and select the numeric columns.
- Use the dendrogram and the Silhouette score to choose the number of clusters.
- Fit AgglomerativeClustering with Ward linkage and visualize the result.
- Save the plots (Dendrogram, Silhouette, PCA Scatter, Price Boxplot) to /content/drive/MyDrive/KHDL_CK.
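
The linkage matrix used to draw the dendrogram can also be cut directly into flat clusters with SciPy's `fcluster`, as an alternative to refitting AgglomerativeClustering. The sketch below is not from DM.ipynb; it uses synthetic data and an assumed cluster count of 3.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled phone features (assumption).
centers = [[-5, -5], [0, 5], [5, -5]]
X, _ = make_blobs(n_samples=120, centers=centers, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage matrix: one row per merge, so n_samples - 1 rows.
Z = linkage(X_scaled, method='ward')

# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
labels = fcluster(Z, t=3, criterion='maxclust')

print('Cluster sizes:', np.bincount(labels)[1:])
print('Silhouette:', round(silhouette_score(X_scaled, labels), 3))
```

Cutting the existing tree avoids recomputing the hierarchy for every candidate cluster count.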

Reference code:

    # Reuses np, df_all, X_scaled, and silhouette_score from the code above.
    from sklearn.cluster import AgglomerativeClustering
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Linkage matrix used to draw the dendrogram.
    Z = linkage(X_scaled, method='ward')

    # Collect the Silhouette score for each candidate number of clusters.
    range_n = range(2, 11)
    sil_scores = []
    for n in range_n:
        hc = AgglomerativeClustering(n_clusters=n, linkage='ward')
        labels = hc.fit_predict(X_scaled)
        sil_scores.append(silhouette_score(X_scaled, labels))

    # Pick the number of clusters with the highest Silhouette score.
    best_n = range_n[int(np.argmax(sil_scores))]
    hc_final = AgglomerativeClustering(n_clusters=best_n, linkage='ward')
    df_all['cluster'] = hc_final.fit_predict(X_scaled)

    # Save the plots (Dendrogram, Silhouette, PCA Scatter, Price Boxplot)
    # ... (see the plotting code in DM.ipynb)

Step 5: Viewing the Results

Processed files:
- Check clean_data_train_processed.csv and clean_data_test_processed.csv in /content/drive/MyDrive/KHDL_CK.
Plots:
- View the saved plots (for example, XGB_Actual_vs_Predicted.jpg and KMeans_scatter_pca.jpg) in the same folder.
Metrics:
- Review R² and MSE for the XGBoost and MLP models.
- Review the Silhouette and BIC scores for the clustering methods.

Notes

- Make sure train_norm.csv and test_norm.csv are available before running the clustering steps.
- The XGBoost and MLP models assume the data has already been processed (for example, clean_data_train_processed.csv).
- Adjust the paths if your directory structure differs.
- If you run locally instead of in Colab, skip the drive.mount step and update the file paths.
- The scripts drop.py, process_train.py, and process_test.py must be present in the specified folder and compatible with the data.
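
One way to keep the same notebook usable both in Colab and locally is to pick the base folder at runtime. This is only a sketch: the function names `in_colab` and `resolve_base_path` and the local folder `./KHDL_CK` are hypothetical and should be adapted to your setup.

```python
import importlib.util
import os

COLAB_BASE = '/content/drive/MyDrive/KHDL_CK'

def in_colab():
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec('google.colab') is not None
    except ImportError:
        # Raised when the parent package 'google' is not installed at all.
        return False

def resolve_base_path(local_base='./KHDL_CK'):
    """Mount Drive and return the Colab folder, or fall back to a local folder."""
    if in_colab():
        from google.colab import drive
        drive.mount('/content/drive')
        return COLAB_BASE
    return os.path.abspath(local_base)

base_path = resolve_base_path()
print('Using base path:', base_path)
```

The rest of the code can then build file paths from `base_path` with `os.path.join` instead of hard-coding the Drive location.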

Troubleshooting

- Missing files: make sure all input files and scripts are in /content/drive/MyDrive/KHDL_CK.
- Library errors: install any missing library with pip install.
- Memory issues: if Colab runs out of memory, reduce n_iter in RandomizedSearchCV or use a smaller dataset.
- Plotting issues: make sure matplotlib is installed and that the output_dir folder exists.

Output Files

Processed data:
- raw_data_train.csv, raw_data_test.csv
- processed_data_train.csv, processed_data_test.csv
- clean_data_train.csv, clean_data_test.csv
- clean_data_train_processed.csv, clean_data_test_processed.csv
Plots:
- XGBoost: XGB_Actual_vs_Predicted.jpg, XGB_Residuals_vs_Predicted.jpg, etc.
- MLP: MLP_Actual_vs_Predicted.jpg, MLP_Loss_Curve.jpg, etc.
- KMeans: KMeans_elbow.jpg, KMeans_silhouette.jpg, etc.
- GMM: GMM_bic.jpg, GMM_scatter_pca.jpg, etc.
- Hierarchical: hier_dendrogram.jpg, hier_scatter_pca.jpg, etc.
