PyTorch Training (PyTorchJob)
Old Version
This page is about the Kubeflow Training Operator V1. For the latest information, check the Kubeflow Trainer V2 documentation.
This page describes PyTorchJob for training a machine learning model with PyTorch.
The PyTorchJob is a Kubernetes custom resource used to run PyTorch training jobs on Kubernetes. The Kubeflow implementation of the PyTorchJob is in the training-operator.
Note: PyTorchJob doesn’t work in a user namespace by default because of Istio automatic sidecar injection. To get it running, add the annotation sidecar.istio.io/inject: "false" to either the PyTorchJob pods or the namespace to disable injection. For an example of adding this annotation to your YAML file, see the TFJob documentation; a sketch is also shown below.
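As an illustrative excerpt (not a complete manifest), the pod-level annotation sits in each replica's pod template metadata:

```yaml
# Excerpt of a PyTorchJob spec: disable Istio sidecar injection per pod.
# Repeat the annotation under each replica type you define (Master, Worker).
spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
```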
Creating a PyTorch training job
You can create a training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
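For orientation, a minimal config file has roughly the shape sketched below. This is a skeleton under stated assumptions, not the MNIST manifest itself: the image and command are placeholders, and the container is named pytorch, which the operator expects for this job type.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                    # expected container name
              image: <your-training-image>     # placeholder: substitute a real image
              command: ["python3", "train.py"] # placeholder entrypoint
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>
              command: ["python3", "train.py"]
```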
Deploy the PyTorchJob resource to start training:
```shell
kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/refs/heads/release-1.9/examples/pytorch/simple.yaml
```

You should now be able to see the created pods matching the specified number of replicas.
```shell
kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple -n kubeflow
```

Training takes 5-10 minutes on a CPU cluster. You can inspect the logs to follow training progress.
```shell
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=pytorch-simple,training.kubeflow.org/replica-type=master,training.kubeflow.org/replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow
```

Monitoring a PyTorchJob
```shell
kubectl get -o yaml pytorchjobs pytorch-simple -n kubeflow
```

See the status section to monitor the job status. Here is sample output when the job is successfully completed.
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
```
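To check just the terminal condition instead of reading the full object, a jsonpath query is handy. A minimal sketch, assuming the job name and namespace used above:

```shell
# Prints "True" once the Succeeded condition is set on the job.
kubectl get pytorchjobs pytorch-simple -n kubeflow \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
```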
Next steps
Learn about distributed training in the Training Operator.
See how to run a job with gang-scheduling.