Latest News 🔥
- [2025/11] Please fill out this survey to help shape the future of the Kubeflow SDK.
- [2025/11] The Kubeflow SDK v0.2 is officially released. Check out the announcement blog post.
The Kubeflow SDK is a set of unified Pythonic APIs that let you run any AI workload at any scale, without the need to learn Kubernetes. It provides simple and consistent APIs across the Kubeflow ecosystem, enabling users to focus on building AI applications rather than managing complex infrastructure.
- Unified Experience: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
- Simplified AI Workloads: Abstract away Kubernetes complexity and work effortlessly across all Kubeflow projects using familiar Python APIs
- Built for Scale: Seamlessly scale any AI workload, from a local laptop to a large-scale production cluster with thousands of GPUs, using the same APIs
- Rapid Iteration: Reduced friction between development and production environments
- Local Development: First-class support for local development without a Kubernetes cluster, requiring only a pip installation
Install the SDK:

```bash
pip install -U kubeflow
```

Create and run a distributed PyTorch TrainJob with the Trainer client:

```python
from kubeflow.trainer import TrainerClient, CustomTrainer, TrainJobTemplate


def get_torch_dist(learning_rate: str, num_epochs: str):
    import os

    import torch
    import torch.distributed as dist

    # Initialize the distributed process group
    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    lr = float(learning_rate)
    epochs = int(num_epochs)
    loss = 1.0 - (lr * 2) - (epochs * 0.01)
    if dist.get_rank() == 0:
        print(f"loss={loss}")


# Create the TrainJob template
template = TrainJobTemplate(
    runtime=TrainerClient().get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=get_torch_dist,
        func_args={"learning_rate": "0.01", "num_epochs": "5"},
        num_nodes=3,
        resources_per_node={"cpu": 2},
    ),
)

# Create the TrainJob
job_id = TrainerClient().train(**template)

# Wait for the TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print the TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
```

Reuse the same template to run a hyperparameter search with the Optimizer client:

```python
from kubeflow.optimizer import OptimizerClient, Search, TrialConfig

# Create an OptimizationJob with the same template
optimization_id = OptimizerClient().optimize(
    trial_template=template,
    trial_config=TrialConfig(num_trials=10, parallel_trials=2),
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "num_epochs": Search.choice([5, 10, 15]),
    },
)
print(f"OptimizationJob created: {optimization_id}")
```
The Kubeflow Trainer client supports local development without needing a Kubernetes cluster, through three execution backends:
- `KubernetesBackend` (default) - Production training on Kubernetes
- `ContainerBackend` - Local development with Docker/Podman isolation
- `LocalProcessBackend` - Quick prototyping with Python subprocesses (see the sketch below)
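For the quickest iteration loop, here is a minimal sketch of the subprocess backend. It assumes `LocalProcessBackendConfig` is exported from `kubeflow.trainer` alongside the other backend configs; verify the name against your SDK version:

```python
from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient


def hello() -> None:
    print("Hello from a local Python subprocess")


# Assumption: LocalProcessBackendConfig runs trainers as plain subprocesses,
# with no Kubernetes cluster or container runtime required
client = TrainerClient(backend_config=LocalProcessBackendConfig())
job_id = client.train(trainer=CustomTrainer(func=hello))
```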
Quick Start: install container support with `pip install kubeflow[docker]` or `pip install kubeflow[podman]`, then point the client at the container backend:
```python
from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

# Switch to local container execution
client = TrainerClient(backend_config=ContainerBackendConfig())

# Your training runs locally in isolated containers;
# train_fn is any training function, e.g. get_torch_dist from the Quick Start
job_id = client.train(trainer=CustomTrainer(func=train_fn))
```
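Because the client API is backend-agnostic, the job-management calls shown in the Quick Start work unchanged against the local backend. A brief follow-up, assuming the `job_id` from the snippet above:

```python
# Monitor the local job with the same client methods used on Kubernetes
client.wait_for_job_status(job_id)
print("\n".join(client.get_job_logs(name=job_id)))
```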
Supported Kubeflow Projects

| Project | Status | Version Support | Description |
|---|---|---|---|
| Kubeflow Trainer | ✅ Available | v2.0.0+ | Train and fine-tune AI models with various frameworks |
| Kubeflow Katib | ✅ Available | v0.19.0+ | Hyperparameter optimization |
| Kubeflow Pipelines | 🚧 Planned | TBD | Build, run, and track AI workflows |
| Kubeflow Model Registry | 🚧 Planned | TBD | Manage model artifacts, versions, and ML artifact metadata |
| Kubeflow Spark Operator | 🚧 Planned | TBD | Manage Spark applications for data processing and feature engineering |
- Slack: Join our #kubeflow-ml-experience Slack channel
- Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings
- GitHub: Discussions, issues and contributions at kubeflow/sdk
The Kubeflow SDK is a community project under active development. We welcome contributions! Please see our CONTRIBUTING Guide for details.
- Blog Post Announcement: Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale
- Design Document: Kubeflow SDK design proposal
- Component Guides: Individual component documentation
- DeepWiki: AI-powered repository documentation
We couldn't have done it without these incredible people: