Machine Learning has never been more accessible — and with tools like Amazon SageMaker, you can go from raw data to a trained model in just a few steps. In this post, I’ll walk you through how I used Amazon SageMaker to train an ML model with a dataset I uploaded to an S3 bucket. Whether you’re a student, researcher, or builder working on a cool AI project, this guide is for you.
📦 Prerequisites
Before we begin, make sure you have:
- An AWS account
- Access to Amazon SageMaker and S3
- An IAM role with the necessary permissions (see the note in Step 1)
- A dataset ready to upload (CSV, JSON, Parquet, etc.)
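If you plan to run the code below on your own machine instead of a notebook instance (where most of this comes preinstalled), install the libraries first; a minimal setup, assuming a Python 3 environment with pip:

```bash
pip install sagemaker boto3 pandas scikit-learn s3fs
```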
🪣 Step 1: Upload Your Dataset to S3
Go to the S3 Console:
- Create a new bucket or use an existing one.
- Upload your dataset file.
- Make note of the S3 URI, e.g., `s3://my-ml-bucket/datasets/my-data.csv`.
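If you'd rather upload programmatically than click through the console, boto3 can do the same thing; a minimal sketch, assuming the CSV sits next to your script and using the example bucket and key from this post:

```python
import boto3

s3 = boto3.client('s3')

# Upload the local CSV to the example bucket/key used throughout this post
s3.upload_file('my-data.csv', 'my-ml-bucket', 'datasets/my-data.csv')
```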
📌 Permissions Note:
Ensure that your SageMaker execution role has access to the S3 bucket:
{ "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": "arn:aws:s3:::my-ml-bucket/*" }
🧠 Step 2: Set Up SageMaker Notebook Instance
- Go to the SageMaker Console.
- Create a Notebook Instance.
- Attach the IAM role with S3 access.
- Once the instance is running, open Jupyter Notebook.
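The console is the quickest route, but if you prefer to script your setup, the same instance can be created with boto3; a minimal sketch where the instance name and role ARN are placeholders to replace with your own:

```python
import boto3

sm = boto3.client('sagemaker')

# Placeholder name and role ARN; substitute your own execution role
sm.create_notebook_instance(
    NotebookInstanceName='ml-training-notebook',
    InstanceType='ml.t3.medium',
    RoleArn='arn:aws:iam::123456789012:role/MySageMakerExecutionRole',
)
```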
🧪 Step 3: Load and Explore the Data
Inside a Jupyter notebook, load the dataset with pandas:
```python
import pandas as pd

# Reading s3:// paths with pandas requires the s3fs package
s3_path = 's3://my-ml-bucket/datasets/my-data.csv'
df = pd.read_csv(s3_path)
df.head()
```
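If s3fs isn't installed in your environment, you can fetch the object with boto3 instead and read straight from the response body:

```python
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-ml-bucket', Key='datasets/my-data.csv')

# 'Body' is a streaming file-like object that pandas can read directly
df = pd.read_csv(obj['Body'])
df.head()
```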
🧰 Step 4: Preprocess and Prepare for Training
Prepare your data as needed:
```python
from sklearn.model_selection import train_test_split

# Separate the features from the label column (assumed here to be named 'target')
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
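One caveat: SageMaker's built-in XGBoost expects purely numeric features, so encode any categorical columns before splitting. A minimal sketch, assuming a hypothetical string column named 'category' (your column names will differ):

```python
import pandas as pd

# Run this before the train/test split above:
# one-hot encode the hypothetical 'category' column into numeric dummies
df = pd.get_dummies(df, columns=['category'], dtype=int)
```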
🛠️ Step 5: Use SageMaker Built-in Algorithms (Optional)
SageMaker provides prebuilt algorithms like XGBoost:
```python
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

role = get_execution_role()
session = sagemaker.Session()
bucket = 'my-ml-bucket'

# SageMaker's built-in XGBoost expects CSV input with the label in the
# FIRST column and no header row
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv('train.csv', index=False, header=False)
train_s3_path = session.upload_data('train.csv', bucket=bucket, key_prefix='train')

# Look up the XGBoost container image for the current region
xgboost_container = sagemaker.image_uris.retrieve(
    'xgboost', session.boto_region_name, version='1.7-1'
)

# Set up the estimator
xgb = Estimator(
    xgboost_container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/output',
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100,
)

# Start training
xgb.fit({'train': TrainingInput(train_s3_path, content_type='csv')})
```
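Instead of hand-picking hyperparameters, you can let SageMaker search for them with the SDK's HyperparameterTuner. A minimal sketch, continuing from the cells above; the metric, ranges, and job counts are illustrative assumptions, and it reuses the test split as a validation channel purely for brevity:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Prepare a validation CSV in the same label-first, headerless format
val_data = pd.concat([y_test, X_test], axis=1)
val_data.to_csv('validation.csv', index=False, header=False)
val_s3_path = session.upload_data('validation.csv', bucket=bucket, key_prefix='validation')

# Ensure the estimator emits the metric the tuner optimizes
xgb.set_hyperparameters(objective='binary:logistic', num_round=100, eval_metric='auc')

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name='validation:auc',
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    max_jobs=6,
    max_parallel_jobs=2,
)
tuner.fit({
    'train': TrainingInput(train_s3_path, content_type='csv'),
    'validation': TrainingInput(val_s3_path, content_type='csv'),
})
```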
✅ Step 6: Deploy and Test
```python
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time endpoint
predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=CSVSerializer(),  # the XGBoost container expects CSV input
)

# Make predictions on the held-out features (no label column)
result = predictor.predict(X_test.to_numpy())
```
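With binary:logistic, the endpoint responds with raw bytes of separated probability scores rather than class labels. A minimal sketch of decoding them and scoring against the held-out labels; the 0.5 threshold is an assumption for a balanced binary problem:

```python
import numpy as np

# Decode the byte response and split on commas/newlines into floats
scores = np.array(
    result.decode('utf-8').replace('\n', ',').strip(',').split(','),
    dtype=float,
)

# Threshold the probabilities at 0.5 (assumption) to get hard labels
preds = (scores > 0.5).astype(int)
accuracy = (preds == y_test.to_numpy()).mean()
print(f'Held-out accuracy: {accuracy:.3f}')
```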
🔒 Clean Up
To avoid unnecessary charges, delete the endpoint as soon as you're done testing:

```python
predictor.delete_endpoint()
```
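If you're completely finished, you can also remove the model object and stop the notebook instance so nothing keeps billing; a minimal sketch, using the placeholder instance name from Step 2:

```python
import boto3

# Remove the model that deploy() registered
predictor.delete_model()

# Stop the notebook instance so it stops accruing compute charges
sm = boto3.client('sagemaker')
sm.stop_notebook_instance(NotebookInstanceName='ml-training-notebook')
```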
🚀 Wrapping Up
Using Amazon SageMaker with an S3-hosted dataset is a powerful, scalable way to train ML models without worrying about infrastructure. With just a few lines of code, you’re able to upload data, preprocess it, train a model, and deploy it into production.
💬 Let's Connect!
If you're building something with SageMaker or just getting into ML/AI, drop a comment below or reach out on Twitter/X at x.com/SimonNungwa. I'd love to connect and collaborate!