Part 1: The Core Idea
Machine Learning = Finding patterns in data to make predictions.
```python
# Simple pattern: height predicts weight
height = 170  # cm
weight = height * 0.5  # Simple rule: weight = height * 0.5
print(f"Predicted weight: {weight} kg")
```
Intuition: We find mathematical relationships between inputs (height) and outputs (weight).
Part 2: Working with Data
```python
import numpy as np

# Data is just numbers in arrays
heights = np.array([160, 170, 180, 190])  # Input features
weights = np.array([60, 70, 80, 90])      # Target values
print(f"Average height: {heights.mean()}")
```
What happened: NumPy arrays store our data efficiently and provide math operations.
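To make "math operations" concrete, here is a small sketch (continuing with the `heights` and `weights` arrays above) applying arithmetic to whole arrays at once:

```python
# Element-wise arithmetic on whole arrays, no explicit loops needed
bmi = weights / (heights / 100) ** 2
print(f"BMI values: {bmi.round(1)}")
print(f"Tallest: {heights.max()} cm, height spread (std): {heights.std():.1f} cm")
```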
Part 3: Finding Patterns
```python
# Measure how strongly height and weight move together
correlation = np.corrcoef(heights, weights)[0, 1]  # Pearson correlation
print(f"Correlation: {correlation:.2f}")  # Close to 1 = strong linear pattern
```
Intuition: Correlation tells us how strongly two variables are linearly related; it measures the strength of the pattern, not the slope of the line.
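To see where that number comes from, here is a small sketch computing the Pearson correlation by hand on the same arrays and comparing it with NumPy's result:

```python
# Pearson correlation by hand: covariance divided by the product of the spreads
h_dev = heights - heights.mean()
w_dev = weights - weights.mean()
corr_manual = (h_dev * w_dev).sum() / np.sqrt((h_dev ** 2).sum() * (w_dev ** 2).sum())
print(f"Manual correlation: {corr_manual:.2f}")  # 1.00: this toy data is perfectly linear
print(f"NumPy correlation:  {np.corrcoef(heights, weights)[0, 1]:.2f}")
```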
Part 4: Making Predictions
```python
# Simple linear prediction
def predict_weight(height):
    return height - 100  # The pattern in our data: weight = height - 100

new_height = 175
predicted = predict_weight(new_height)
print(f"Height {new_height}cm → Weight {predicted}kg")
```
Key insight: Once we find the pattern, we can predict new values.
Part 5: Measuring Errors
```python
# How wrong are our predictions?
actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])
error = np.mean((actual - predicted) ** 2)  # Mean Squared Error
print(f"Mean squared error: {error}")
```
Why this matters: We need to know how good our predictions are.
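To see how the choice of error measure changes the story, here is a small sketch (reusing the `actual` and `predicted` arrays above) comparing mean absolute error with mean squared error:

```python
# Two common summaries of the same prediction errors
errors = actual - predicted
mae = np.mean(np.abs(errors))  # Mean Absolute Error: the average miss in kg (5 here)
mse = np.mean(errors ** 2)     # Mean Squared Error: squares each miss, so big misses count more (25 here)
print(f"MAE: {mae} kg, MSE: {mse} kg²")
```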
Part 6: Learning from Data
```python
from sklearn.linear_model import LinearRegression

# Let the computer find the pattern
model = LinearRegression()
model.fit(heights.reshape(-1, 1), weights)  # Learn from data

# Make predictions
prediction = model.predict([[175]])
print(f"Learned prediction: {prediction[0]:.1f}kg")
```
Magic moment: The computer automatically finds the best line through our data.
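If you want to see the line it found, the fitted model exposes its slope and intercept; a quick peek, assuming `model` is the LinearRegression fitted above:

```python
# Inspect the learned line: slope (coef_) and intercept (intercept_)
print(f"Learned slope: {model.coef_[0]:.2f}")        # ≈ 1.00 for this data
print(f"Learned intercept: {model.intercept_:.2f}")  # ≈ -100.00
# So the learned rule is roughly: weight = height - 100
```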
Part 7: Train/Test Split
```python
from sklearn.model_selection import train_test_split

# Split data: some for learning, some for testing
X_train, X_test, y_train, y_test = train_test_split(
    heights.reshape(-1, 1), weights, test_size=0.5
)
model.fit(X_train, y_train)          # Learn from training data
score = model.score(X_test, y_test)  # Test on unseen data
print(f"Test R²: {score:.2f}")       # score() returns R² for regressors, not accuracy
```
Why split: We test on data the model hasn't seen to avoid cheating.
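To make the "cheating" concrete, here is a small sketch (reusing the split above) with a decision tree, a model that can memorize its training data:

```python
from sklearn.tree import DecisionTreeRegressor

# A tree can memorize the training points exactly
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
print(f"Train R²: {tree.score(X_train, y_train):.2f}")  # typically a perfect 1.00
print(f"Test R²:  {tree.score(X_test, y_test):.2f}")    # usually noticeably lower
```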
Part 8: Multiple Features
```python
# Use multiple inputs for better predictions
data = np.array([[170, 25],   # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]])
weights = np.array([70, 80, 60, 75])

model.fit(data, weights)                 # Learn from height AND age
prediction = model.predict([[172, 28]])  # Predict using both features
```
Power of ML: Use many features to make better predictions.
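With a linear model you can also inspect how much each feature contributes; a quick check, assuming `model` is the regression just fitted on `data`:

```python
# One learned coefficient per input feature
print(f"Coefficients (height, age): {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Prediction for [172, 28]: {prediction[0]:.1f}kg")
```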
Part 9: Classification vs Regression
```python
# Regression: Predict numbers (weight, price, temperature)
regressor = LinearRegression()

# Classification: Predict categories (spam/not spam, cat/dog)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# Same interface, different problems
```
Two main types: Predicting numbers vs predicting categories.
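Here is a tiny classification sketch (the data and labels are invented for illustration) showing the same fit/predict interface producing a category instead of a number:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented example: classify people as 'adult' or 'minor' from [height, age]
X_people = np.array([[150, 12], [160, 15], [175, 30], [180, 45]])
labels = np.array(['minor', 'minor', 'adult', 'adult'])

clf = RandomForestClassifier(random_state=0)
clf.fit(X_people, labels)        # same fit() call as regression
print(clf.predict([[170, 22]]))  # but the output is a category, e.g. ['adult']
```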
Part 10: Real Data with Pandas
```python
import pandas as pd

# Build a small DataFrame (real projects usually load data with pd.read_csv)
df = pd.DataFrame({
    'height': [160, 170, 180, 190, 165],
    'weight': [60, 70, 80, 90, 65],
    'age': [25, 30, 35, 40, 28]
})
print(df.head())      # See the first few rows
print(df.describe())  # Summary statistics per column
```
Pandas power: Handle real-world messy data with ease.
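As a small taste of "messy", here is a sketch with deliberately missing values and two common ways to deal with them:

```python
import pandas as pd

# A deliberately messy frame with missing values
messy = pd.DataFrame({
    'height': [160, None, 180, 175],
    'weight': [60, 70, None, 72]
})
print(messy.isna().sum())          # count missing values per column
print(messy.dropna())              # option 1: drop incomplete rows
print(messy.fillna(messy.mean()))  # option 2: fill gaps with column means
```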
Part 11: Data Preprocessing
```python
# Clean and prepare data
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2  # Create new feature
df = df.dropna()                                      # Remove missing values

# Separate features and target
X = df[['height', 'age']]  # Features
y = df['weight']           # Target
```
Essential step: Clean data before feeding to algorithms.
Part 12: Different Algorithms
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Try different algorithms
models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(),
    'Forest': RandomForestRegressor(),
    'SVM': SVR()
}
# Each finds patterns differently
```
Algorithm zoo: Different algorithms are suited to different problems.
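One way to compare them is to run each through the same cross-validation; a sketch, assuming the `models` dict plus the `X` and `y` from Part 11 (scores on such a tiny toy dataset will be rough):

```python
from sklearn.model_selection import cross_val_score

# Same data, same scoring, four different algorithms
for name, candidate in models.items():
    scores = cross_val_score(candidate, X, y, cv=2)  # small cv because the toy dataset is tiny
    print(f"{name:7s} mean R²: {scores.mean():.2f}")
```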
Part 13: Model Evaluation
```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate model performance
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")  # Lower is better
print(f"R²: {r2:.2f}")      # Higher is better (max 1.0)
```
Metrics matter: Different ways to measure how good your model is.
Part 14: Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Test the model on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(f"Average score: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
Robust testing: Get more reliable estimate of model performance.
Part 15: Feature Engineering
```python
# Create better features
df['height_squared'] = df['height'] ** 2
df['age_height'] = df['age'] * df['height']  # Interaction feature
# Sometimes simple transformations improve predictions
```
Domain knowledge: Understanding your data helps create better features.
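If you don't want to hand-craft every term, recent scikit-learn versions can generate squared and interaction features automatically; a small sketch (the names passed to get_feature_names_out are just for readable output):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Automatically generate height², age² and height·age from [height, age]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_small = np.array([[170, 25], [180, 30]])
print(poly.fit_transform(X_small))
print(poly.get_feature_names_out(['height', 'age']))  # ['height' 'age' 'height^2' 'height age' 'age^2']
```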
Part 16: Handling Categorical Data
```python
# Text categories need special handling
df['gender'] = ['M', 'F', 'M', 'F', 'M']

# Convert to numbers
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded.columns)  # gender_F, gender_M columns
```
Encoding: Convert text to numbers for ML algorithms.
Part 17: Scaling Features
```python
from sklearn.preprocessing import StandardScaler

# Scale features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Mean=0, Std=1
# Some algorithms work better with scaled data
```
Why scale: Prevents features with large values from dominating.
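A quick before/after check makes the effect visible; a sketch on a small made-up array where the columns are centimetres and years:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[170.0, 25.0], [180.0, 30.0], [160.0, 20.0]])  # [height_cm, age_years]
print("means before:", X_demo.mean(axis=0))  # [170.  25.]
print("stds before: ", X_demo.std(axis=0))   # height varies by ~8, age by ~4

X_demo_scaled = StandardScaler().fit_transform(X_demo)
print("means after: ", X_demo_scaled.mean(axis=0).round(2))  # ~0 for both columns
print("stds after:  ", X_demo_scaled.std(axis=0).round(2))   # 1 for both columns
```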
Part 18: Pipeline
```python
from sklearn.pipeline import Pipeline

# Chain preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])
pipeline.fit(X_train, y_train)  # Scaling and training in one step
```
Clean workflow: Combines preprocessing and modeling automatically.
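Prediction goes through the same chain, so the scaler fitted on the training data is applied to new data automatically; a short follow-up, assuming the `pipeline`, `X_test`, and `y_test` from the snippets above:

```python
# predict() runs the data through the scaler first, then the model
y_pred = pipeline.predict(X_test)
print(f"Pipeline test R²: {pipeline.score(X_test, y_test):.2f}")
```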
Part 19: Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

# Find the best model settings
param_grid = {'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```
Optimization: Automatically find best settings for your model.
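After the search, the best configuration is refit on the full training data and ready to use; a small follow-up, assuming `grid_search` and `X_test` from above:

```python
# The tuned model and its cross-validated score
best_model = grid_search.best_estimator_
print(f"Best cross-validation R²: {grid_search.best_score_:.2f}")
print(best_model.predict(X_test[:1]))  # predict with the tuned model
```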
Part 20: Putting It All Together
```python
# Complete ML workflow
def ml_workflow(data, target_column):
    # 1. Split features and target
    X = data.drop(target_column, axis=1)
    y = data[target_column]

    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 3. Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor())
    ])

    # 4. Train model
    pipeline.fit(X_train, y_train)

    # 5. Evaluate (R² on the held-out test set)
    score = pipeline.score(X_test, y_test)
    return pipeline, score

# Usage (df_encoded, not df, because the pipeline needs all-numeric columns)
model, r2 = ml_workflow(df_encoded, 'weight')
print(f"Test R²: {r2:.2f}")
```
Complete solution: From raw data to trained model in one function.
Key Takeaways
- Data = Numbers: Everything must be converted to numbers
- Patterns = Models: Algorithms find mathematical relationships
- Train/Test = Validation: Always test on unseen data
- Features = Input: Good features make good predictions
- Metrics = Evaluation: Measure how well your model works
- Pipeline = Workflow: Combine steps for clean, reproducible ML
This foundation gives you the tools to solve real machine learning problems!