This operator has been merged into the Kubeflow Training Operator. This repository is no longer maintained and has been archived.
This repository contains the specification and implementation of the PyTorchJob custom resource definition (CRD). Using this custom resource, users can create and manage PyTorch jobs like other built-in resources in Kubernetes. See the CRD definition.
- Kubernetes >= 1.8
- kubectl
Please refer to the installation instructions in the Kubeflow user guide. This installs the pytorchjob CRD and the pytorch-operator controller, which manages the lifecycle of PyTorch jobs.
You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
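
As a rough sketch of what such a config file contains, the helper below prints a minimal PyTorchJob manifest. The field names, image, and arguments are copied from the sample job output in this README. The function name `print_pytorchjob_manifest` and its parameters are hypothetical, and the example manifest in `examples/mnist/v1/` may set additional fields (such as GPU resource limits) that are left out here.

```shell
# Hypothetical helper: print a minimal PyTorchJob manifest to stdout.
# $1 = job name, $2 = number of Worker replicas (defaults to 1).
# Field names mirror the sample job output; resource limits are omitted.
print_pytorchjob_manifest() {
  name="$1"
  workers="${2:-1}"
  cat <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ${name}
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
            args: ["--backend", "gloo"]
    Worker:
      replicas: ${workers}
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
            args: ["--backend", "gloo"]
EOF
}
```

For instance, `print_pytorchjob_manifest my-job 2 > my_job.yaml` would write a manifest with one Master and two Workers, which could then be passed to `kubectl create -f`. The names `my-job` and `my_job.yaml` are placeholders.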

Print the example manifest:

    cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml

Deploy the PyTorchJob resource to start training:

    kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml

You should now be able to see the created pods matching the specified number of replicas.

    kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. You can inspect the logs to follow training progress.

    PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
    kubectl logs -f ${PODNAME}

Inspect the PyTorchJob resource itself:

    kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo

See the status section to monitor the job status. Here is sample output when the job is successfully completed.

    apiVersion: v1
    items:
    - apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      metadata:
        creationTimestamp: 2019-01-11T00:51:48Z
        generation: 1
        name: pytorch-dist-mnist-gloo
        namespace: default
        resourceVersion: "2146573"
        selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
        uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - args:
                  - --backend
                  - gloo
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  name: pytorch
                  resources:
                    limits:
                      nvidia.com/gpu: "1"
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - args:
                  - --backend
                  - gloo
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  name: pytorch
                  resources:
                    limits:
                      nvidia.com/gpu: "1"
      status:
        completionTime: 2019-01-11T01:03:15Z
        conditions:
        - lastTransitionTime: 2019-01-11T00:51:48Z
          lastUpdateTime: 2019-01-11T00:51:48Z
          message: PyTorchJob pytorch-dist-mnist-gloo is created.
          reason: PyTorchJobCreated
          status: "True"
          type: Created
        - lastTransitionTime: 2019-01-11T00:57:22Z
          lastUpdateTime: 2019-01-11T00:57:22Z
          message: PyTorchJob pytorch-dist-mnist-gloo is running.
          reason: PyTorchJobRunning
          status: "False"
          type: Running
        - lastTransitionTime: 2019-01-11T01:03:15Z
          lastUpdateTime: 2019-01-11T01:03:15Z
          message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
          reason: PyTorchJobSucceeded
          status: "True"
          type: Succeeded
        replicaStatuses:
          Master:
            succeeded: 1
          Worker:
            succeeded: 1
        startTime: 2019-01-11T00:57:22Z

Please refer to the developer_guide.
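
When monitoring a job from a script, the `status.conditions` list in the sample output can be reduced to a single state. The following is a minimal sketch, not part of this repository: the function names `latest_true_condition` and `list_job_conditions` are hypothetical, and it assumes the conditions are listed in the order they occurred, as they are in the sample output.

```shell
# Reads lines of the form TYPE=STATUS on stdin and prints the last TYPE
# whose STATUS is "True" (prints nothing if no condition is "True").
latest_true_condition() {
  awk -F= '$2 == "True" { last = $1 } END { if (last != "") print last }'
}

# Hypothetical wrapper: list a job's conditions as TYPE=STATUS lines.
# Requires kubectl and access to the cluster; $1 is the PyTorchJob name.
list_job_conditions() {
  kubectl get pytorchjobs "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

For a job whose status matches the sample output, `list_job_conditions pytorch-dist-mnist-gloo | latest_true_condition` should print `Succeeded`, because `Running` has already transitioned to `"False"`.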