Part 1: The Core Idea
Machine Learning = Finding patterns in data to make predictions.
```python
# Simple pattern: height predicts weight
height = 170  # cm
weight = height * 0.5  # Simple rule: weight = height * 0.5
print(f"Predicted weight: {weight} kg")
```
Intuition: We find mathematical relationships between inputs (height) and outputs (weight).
Part 2: Working with Data
```python
import numpy as np

# Data is just numbers in arrays
heights = np.array([160, 170, 180, 190])  # Input features
weights = np.array([60, 70, 80, 90])      # Target values
print(f"Average height: {heights.mean()}")
```
What happened: NumPy arrays store our data efficiently and provide math operations.
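To make "math operations" concrete, here is a small sketch (continuing with the `heights` and `weights` arrays above) applying arithmetic to whole arrays at once:

```python
# Element-wise arithmetic on whole arrays, no explicit loops needed
bmi = weights / (heights / 100) ** 2
print(f"BMI values: {bmi.round(1)}")
print(f"Tallest: {heights.max()} cm, height spread (std): {heights.std():.1f} cm")
```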
Part 3: Finding Patterns
```python
# Measure how strongly height and weight move together
correlation = np.corrcoef(heights, weights)[0, 1]  # Pearson correlation
print(f"Correlation: {correlation:.2f}")  # Close to 1 = strong linear pattern
```
Intuition: Correlation tells us how strongly two variables are linearly related; it measures the strength of the pattern, not the slope of the line.
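To see where that number comes from, here is a small sketch computing the Pearson correlation by hand on the same arrays and comparing it with NumPy's result:

```python
# Pearson correlation by hand: covariance divided by the product of the spreads
h_dev = heights - heights.mean()
w_dev = weights - weights.mean()
corr_manual = (h_dev * w_dev).sum() / np.sqrt((h_dev ** 2).sum() * (w_dev ** 2).sum())
print(f"Manual correlation: {corr_manual:.2f}")  # 1.00: this toy data is perfectly linear
print(f"NumPy correlation:  {np.corrcoef(heights, weights)[0, 1]:.2f}")
```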
Part 4: Making Predictions
```python
# Simple linear prediction
def predict_weight(height):
    return height - 100  # The pattern in our data: weight = height - 100

new_height = 175
predicted = predict_weight(new_height)
print(f"Height {new_height}cm → Weight {predicted}kg")
```
Key insight: Once we find the pattern, we can predict new values.
Part 5: Measuring Errors
```python
# How wrong are our predictions?
actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])
error = np.mean((actual - predicted) ** 2)  # Mean Squared Error
print(f"Mean squared error: {error}")
```
Why this matters: We need to know how good our predictions are.
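To see how the choice of error measure changes the story, here is a small sketch (reusing the `actual` and `predicted` arrays above) comparing mean absolute error with mean squared error:

```python
# Two common summaries of the same prediction errors
errors = actual - predicted
mae = np.mean(np.abs(errors))  # Mean Absolute Error: the average miss in kg (5 here)
mse = np.mean(errors ** 2)     # Mean Squared Error: squares each miss, so big misses count more (25 here)
print(f"MAE: {mae} kg, MSE: {mse} kg²")
```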
Part 6: Learning from Data
```python
from sklearn.linear_model import LinearRegression

# Let the computer find the pattern
model = LinearRegression()
model.fit(heights.reshape(-1, 1), weights)  # Learn from data

# Make predictions
prediction = model.predict([[175]])
print(f"Learned prediction: {prediction[0]:.1f}kg")
```
Magic moment: The computer automatically finds the best line through our data.
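If you want to see the line it found, the fitted model exposes its slope and intercept; a quick peek, assuming `model` is the LinearRegression fitted above:

```python
# Inspect the learned line: slope (coef_) and intercept (intercept_)
print(f"Learned slope: {model.coef_[0]:.2f}")        # ≈ 1.00 for this data
print(f"Learned intercept: {model.intercept_:.2f}")  # ≈ -100.00
# So the learned rule is roughly: weight = height - 100
```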
Part 7: Train/Test Split
```python
from sklearn.model_selection import train_test_split

# Split data: some for learning, some for testing
X_train, X_test, y_train, y_test = train_test_split(
    heights.reshape(-1, 1), weights, test_size=0.5
)
model.fit(X_train, y_train)          # Learn from training data
score = model.score(X_test, y_test)  # Test on unseen data
print(f"Test R²: {score:.2f}")       # score() returns R² for regressors, not accuracy
```
Why split: We test on data the model hasn't seen to avoid cheating.
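To make the "cheating" concrete, here is a small sketch (reusing the split above) with a decision tree, a model that can memorize its training data:

```python
from sklearn.tree import DecisionTreeRegressor

# A tree can memorize the training points exactly
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
print(f"Train R²: {tree.score(X_train, y_train):.2f}")  # typically a perfect 1.00
print(f"Test R²:  {tree.score(X_test, y_test):.2f}")    # usually noticeably lower
```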
Part 8: Multiple Features
```python
# Use multiple inputs for better predictions
data = np.array([[170, 25],   # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]])
weights = np.array([70, 80, 60, 75])

model.fit(data, weights)                 # Learn from height AND age
prediction = model.predict([[172, 28]])  # Predict using both features
```
Power of ML: Use many features to make better predictions.
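With a linear model you can also inspect how much each feature contributes; a quick check, assuming `model` is the regression just fitted on `data`:

```python
# One learned coefficient per input feature
print(f"Coefficients (height, age): {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Prediction for [172, 28]: {prediction[0]:.1f}kg")
```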
Part 9: Classification vs Regression
```python
# Regression: Predict numbers (weight, price, temperature)
regressor = LinearRegression()

# Classification: Predict categories (spam/not spam, cat/dog)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# Same interface, different problems
```
Two main types: Predicting numbers vs predicting categories.
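Here is a tiny classification sketch (the data and labels are invented for illustration) showing the same fit/predict interface producing a category instead of a number:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented example: classify people as 'adult' or 'minor' from [height, age]
X_people = np.array([[150, 12], [160, 15], [175, 30], [180, 45]])
labels = np.array(['minor', 'minor', 'adult', 'adult'])

clf = RandomForestClassifier(random_state=0)
clf.fit(X_people, labels)        # same fit() call as regression
print(clf.predict([[170, 22]]))  # but the output is a category, e.g. ['adult']
```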
Part 10: Real Data with Pandas
```python
import pandas as pd

# Build a small DataFrame (real projects usually load data with pd.read_csv)
df = pd.DataFrame({
    'height': [160, 170, 180, 190, 165],
    'weight': [60, 70, 80, 90, 65],
    'age': [25, 30, 35, 40, 28]
})
print(df.head())      # See the first few rows
print(df.describe())  # Summary statistics per column
```
Pandas power: Handle real-world messy data with ease.
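As a small taste of "messy", here is a sketch with deliberately missing values and two common ways to deal with them:

```python
import pandas as pd

# A deliberately messy frame with missing values
messy = pd.DataFrame({
    'height': [160, None, 180, 175],
    'weight': [60, 70, None, 72]
})
print(messy.isna().sum())          # count missing values per column
print(messy.dropna())              # option 1: drop incomplete rows
print(messy.fillna(messy.mean()))  # option 2: fill gaps with column means
```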
Part 11: Data Preprocessing
```python
# Clean and prepare data
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2  # Create new feature
df = df.dropna()                                      # Remove missing values

# Separate features and target
X = df[['height', 'age']]  # Features
y = df['weight']           # Target
```
Essential step: Clean data before feeding to algorithms.
Part 12: Different Algorithms
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Try different algorithms
models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(),
    'Forest': RandomForestRegressor(),
    'SVM': SVR()
}
# Each finds patterns differently
```
Algorithm zoo: Different algorithms are suited to different problems.
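One way to compare them is to run each through the same cross-validation; a sketch, assuming the `models` dict plus the `X` and `y` from Part 11 (scores on such a tiny toy dataset will be rough):

```python
from sklearn.model_selection import cross_val_score

# Same data, same scoring, four different algorithms
for name, candidate in models.items():
    scores = cross_val_score(candidate, X, y, cv=2)  # small cv because the toy dataset is tiny
    print(f"{name:7s} mean R²: {scores.mean():.2f}")
```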
Part 13: Model Evaluation
```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate model performance
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")  # Lower is better
print(f"R²: {r2:.2f}")      # Higher is better (max 1.0)
```
Metrics matter: Different ways to measure how good your model is.
Part 14: Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Test the model on multiple train/test splits
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(f"Average score: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
Robust testing: Get more reliable estimate of model performance.
Part 15: Feature Engineering
```python
# Create better features
df['height_squared'] = df['height'] ** 2
df['age_height'] = df['age'] * df['height']  # Interaction feature
# Sometimes simple transformations improve predictions
```
Domain knowledge: Understanding your data helps create better features.
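If you don't want to hand-craft every term, recent scikit-learn versions can generate squared and interaction features automatically; a small sketch (the names passed to get_feature_names_out are just for readable output):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Automatically generate height², age² and height·age from [height, age]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_small = np.array([[170, 25], [180, 30]])
print(poly.fit_transform(X_small))
print(poly.get_feature_names_out(['height', 'age']))  # ['height' 'age' 'height^2' 'height age' 'age^2']
```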
Part 16: Handling Categorical Data
```python
# Text categories need special handling
df['gender'] = ['M', 'F', 'M', 'F', 'M']

# Convert to numbers
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded.columns)  # gender_F, gender_M columns
```
Encoding: Convert text to numbers for ML algorithms.
Part 17: Scaling Features
```python
from sklearn.preprocessing import StandardScaler

# Scale features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Mean=0, Std=1
# Some algorithms work better with scaled data
```
Why scale: Prevents features with large values from dominating.
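A quick before/after check makes the effect visible; a sketch on a small made-up array where the columns are centimetres and years:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[170.0, 25.0], [180.0, 30.0], [160.0, 20.0]])  # [height_cm, age_years]
print("means before:", X_demo.mean(axis=0))  # [170.  25.]
print("stds before: ", X_demo.std(axis=0))   # height varies by ~8, age by ~4

X_demo_scaled = StandardScaler().fit_transform(X_demo)
print("means after: ", X_demo_scaled.mean(axis=0).round(2))  # ~0 for both columns
print("stds after:  ", X_demo_scaled.std(axis=0).round(2))   # 1 for both columns
```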
Part 18: Pipeline
```python
from sklearn.pipeline import Pipeline

# Chain preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])
pipeline.fit(X_train, y_train)  # Scaling and training in one step
```
Clean workflow: Combines preprocessing and modeling automatically.
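Prediction goes through the same chain, so the scaler fitted on the training data is applied to new data automatically; a short follow-up, assuming the `pipeline`, `X_test`, and `y_test` from the snippets above:

```python
# predict() runs the data through the scaler first, then the model
y_pred = pipeline.predict(X_test)
print(f"Pipeline test R²: {pipeline.score(X_test, y_test):.2f}")
```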
Part 19: Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

# Find the best model settings
param_grid = {'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```
Optimization: Automatically find best settings for your model.
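After the search, the best configuration is refit on the full training data and ready to use; a small follow-up, assuming `grid_search` and `X_test` from above:

```python
# The tuned model and its cross-validated score
best_model = grid_search.best_estimator_
print(f"Best cross-validation R²: {grid_search.best_score_:.2f}")
print(best_model.predict(X_test[:1]))  # predict with the tuned model
```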
Part 20: Putting It All Together
```python
# Complete ML workflow
def ml_workflow(data, target_column):
    # 1. Split features and target
    X = data.drop(target_column, axis=1)
    y = data[target_column]

    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 3. Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor())
    ])

    # 4. Train model
    pipeline.fit(X_train, y_train)

    # 5. Evaluate (R² on the held-out test set)
    score = pipeline.score(X_test, y_test)
    return pipeline, score

# Usage (df_encoded, not df, because the pipeline needs all-numeric columns)
model, r2 = ml_workflow(df_encoded, 'weight')
print(f"Test R²: {r2:.2f}")
```
Complete solution: From raw data to trained model in one function.
Key Takeaways
- Data = Numbers: Everything must be converted to numbers
- Patterns = Models: Algorithms find mathematical relationships
- Train/Test = Validation: Always test on unseen data
- Features = Input: Good features make good predictions
- Metrics = Evaluation: Measure how well your model works
- Pipeline = Workflow: Combine steps for clean, reproducible ML
This foundation gives you the tools to solve real machine learning problems!