Machine Learning has never been more accessible — and with tools like Amazon SageMaker, you can go from raw data to a trained model in just a few steps. In this post, I’ll walk you through how I used Amazon SageMaker to train an ML model with a dataset I uploaded to an S3 bucket. Whether you’re a student, researcher, or builder working on a cool AI project, this guide is for you.
📦 Prerequisites
Before we begin, make sure you have:
- An AWS account
- Access to Amazon SageMaker and S3
- An IAM role with the necessary permissions (see the note in Step 1)
- A dataset ready to upload (CSV, JSON, Parquet, etc.)
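If you plan to run the code below on your own machine instead of a notebook instance (where most of this comes preinstalled), install the libraries first; a minimal setup, assuming a Python 3 environment with pip:

```bash
pip install sagemaker boto3 pandas scikit-learn s3fs
```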
🪣 Step 1: Upload Your Dataset to S3
Go to the S3 Console:
- Create a new bucket or use an existing one.
- Upload your dataset file.
- Make note of the S3 URI, e.g., `s3://my-ml-bucket/datasets/my-data.csv`.
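If you'd rather upload programmatically than click through the console, boto3 can do the same thing; a minimal sketch, assuming the CSV sits next to your script and using the example bucket and key from this post:

```python
import boto3

s3 = boto3.client('s3')

# Upload the local CSV to the example bucket/key used throughout this post
s3.upload_file('my-data.csv', 'my-ml-bucket', 'datasets/my-data.csv')
```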
📌 Permissions Note:
Ensure that your SageMaker execution role has access to the S3 bucket:
{ "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": "arn:aws:s3:::my-ml-bucket/*" }
🧠 Step 2: Set Up SageMaker Notebook Instance
- Go to the SageMaker Console.
- Create a Notebook Instance.
- Attach the IAM role with S3 access.
- Once the instance is running, open Jupyter Notebook.
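The console is the quickest route, but if you prefer to script your setup, the same instance can be created with boto3; a minimal sketch where the instance name and role ARN are placeholders to replace with your own:

```python
import boto3

sm = boto3.client('sagemaker')

# Placeholder name and role ARN; substitute your own execution role
sm.create_notebook_instance(
    NotebookInstanceName='ml-training-notebook',
    InstanceType='ml.t3.medium',
    RoleArn='arn:aws:iam::123456789012:role/MySageMakerExecutionRole',
)
```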
🧪 Step 3: Load and Explore the Data
Inside a Jupyter notebook, load the dataset with pandas:
```python
import pandas as pd

# Reading s3:// paths with pandas requires the s3fs package
s3_path = 's3://my-ml-bucket/datasets/my-data.csv'
df = pd.read_csv(s3_path)
df.head()
```
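If s3fs isn't installed in your environment, you can fetch the object with boto3 instead and read straight from the response body:

```python
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-ml-bucket', Key='datasets/my-data.csv')

# 'Body' is a streaming file-like object that pandas can read directly
df = pd.read_csv(obj['Body'])
df.head()
```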
🧰 Step 4: Preprocess and Prepare for Training
Prepare your data as needed:
```python
from sklearn.model_selection import train_test_split

# Separate the features from the label column (assumed here to be named 'target')
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
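One caveat: SageMaker's built-in XGBoost expects purely numeric features, so encode any categorical columns before splitting. A minimal sketch, assuming a hypothetical string column named 'category' (your column names will differ):

```python
import pandas as pd

# Run this before the train/test split above:
# one-hot encode the hypothetical 'category' column into numeric dummies
df = pd.get_dummies(df, columns=['category'], dtype=int)
```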
🛠️ Step 5: Use SageMaker Built-in Algorithms (Optional)
SageMaker provides prebuilt algorithms like XGBoost:
```python
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

role = get_execution_role()
session = sagemaker.Session()
bucket = 'my-ml-bucket'

# SageMaker's built-in XGBoost expects CSV input with the label in the
# FIRST column and no header row
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv('train.csv', index=False, header=False)
train_s3_path = session.upload_data('train.csv', bucket=bucket, key_prefix='train')

# Look up the XGBoost container image for the current region
xgboost_container = sagemaker.image_uris.retrieve(
    'xgboost', session.boto_region_name, version='1.7-1'
)

# Set up the estimator
xgb = Estimator(
    xgboost_container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/output',
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100,
)

# Start training
xgb.fit({'train': TrainingInput(train_s3_path, content_type='csv')})
```
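Instead of hand-picking hyperparameters, you can let SageMaker search for them with the SDK's HyperparameterTuner. A minimal sketch, continuing from the cells above; the metric, ranges, and job counts are illustrative assumptions, and it reuses the test split as a validation channel purely for brevity:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Prepare a validation CSV in the same label-first, headerless format
val_data = pd.concat([y_test, X_test], axis=1)
val_data.to_csv('validation.csv', index=False, header=False)
val_s3_path = session.upload_data('validation.csv', bucket=bucket, key_prefix='validation')

# Ensure the estimator emits the metric the tuner optimizes
xgb.set_hyperparameters(objective='binary:logistic', num_round=100, eval_metric='auc')

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name='validation:auc',
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    max_jobs=6,
    max_parallel_jobs=2,
)
tuner.fit({
    'train': TrainingInput(train_s3_path, content_type='csv'),
    'validation': TrainingInput(val_s3_path, content_type='csv'),
})
```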
✅ Step 6: Deploy and Test
```python
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time endpoint
predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=CSVSerializer(),  # the XGBoost container expects CSV input
)

# Make predictions on the held-out features (no label column)
result = predictor.predict(X_test.to_numpy())
```
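With binary:logistic, the endpoint responds with raw bytes of separated probability scores rather than class labels. A minimal sketch of decoding them and scoring against the held-out labels; the 0.5 threshold is an assumption for a balanced binary problem:

```python
import numpy as np

# Decode the byte response and split on commas/newlines into floats
scores = np.array(
    result.decode('utf-8').replace('\n', ',').strip(',').split(','),
    dtype=float,
)

# Threshold the probabilities at 0.5 (assumption) to get hard labels
preds = (scores > 0.5).astype(int)
accuracy = (preds == y_test.to_numpy()).mean()
print(f'Held-out accuracy: {accuracy:.3f}')
```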
🔒 Clean Up
To avoid unnecessary charges, delete the endpoint as soon as you're done testing:

```python
predictor.delete_endpoint()
```
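If you're completely finished, you can also remove the model object and stop the notebook instance so nothing keeps billing; a minimal sketch, using the placeholder instance name from Step 2:

```python
import boto3

# Remove the model that deploy() registered
predictor.delete_model()

# Stop the notebook instance so it stops accruing compute charges
sm = boto3.client('sagemaker')
sm.stop_notebook_instance(NotebookInstanceName='ml-training-notebook')
```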
🚀 Wrapping Up
Using Amazon SageMaker with an S3-hosted dataset is a powerful, scalable way to train ML models without worrying about infrastructure. With just a few lines of code, you’re able to upload data, preprocess it, train a model, and deploy it into production.
💬 Let's Connect!
If you're building something with SageMaker or just getting into ML/AI, drop a comment below or reach out on Twitter/X at x.com/SimonNungwa. I'd love to connect and collaborate!