Kubernetes Custom Resource and Operator for PyTorch jobs

⚠️ kubeflow/pytorch-operator is not maintained

This operator has been merged into the Kubeflow Training Operator. This repository is no longer maintained and has been archived.

Overview

This repository contains the specification and implementation of the PyTorchJob custom resource definition (CRD). Using this custom resource, users can create and manage PyTorch jobs in the same way as built-in Kubernetes resources. See the CRD definition.

Prerequisites

Kubernetes >= 1.8
kubectl

Installing PyTorch Operator

Please refer to the installation instructions in the Kubeflow user guide. This installs the pytorchjob CRD and the pytorch-operator controller, which manages the lifecycle of PyTorch jobs.

Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.

cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml
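As a rough illustration of what such a config contains, the sketch below builds a PyTorchJob manifest as a Python dictionary and prints it as JSON. The field names and values are taken from the sample output in the monitoring section of this README; the helper name build_pytorch_job is hypothetical, GPU resource limits are left out for brevity, and the actual file under examples/ may contain additional fields.

```python
import json


def build_pytorch_job(name, image, backend, workers=1):
    """Build a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica(count):
        # Master and Worker share the same pod template in this sketch.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "args": ["--backend", backend],
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


job = build_pytorch_job(
    name="pytorch-dist-mnist-gloo",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    backend="gloo",
)
print(json.dumps(job, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and passed to kubectl create -f in place of the YAML example.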

Deploy the PyTorchJob resource to start training:

kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. Inspect the logs to follow the training progress.

PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
kubectl logs -f ${PODNAME}
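The pod lookup above relies on two labels, pytorch-job-name and pytorch-replica-type. A minimal sketch of how the same selector could be assembled for other jobs or replica types follows; the function names pod_selector and get_pods_command are hypothetical and only build the command line, they do not contact a cluster.

```python
def pod_selector(job_name, replica_type=None):
    """Build the label selector used to find the pods of a PyTorchJob."""
    labels = [f"pytorch-job-name={job_name}"]
    if replica_type:
        # For example "master", as in the log command above.
        labels.append(f"pytorch-replica-type={replica_type}")
    return ",".join(labels)


def get_pods_command(job_name, replica_type=None):
    """Return the kubectl argument list that prints the matching pod names."""
    selector = pod_selector(job_name, replica_type)
    return ["kubectl", "get", "pods", "-l", selector, "-o", "name"]


print(" ".join(get_pods_command("pytorch-dist-mnist-gloo", "master")))
```

The returned list can be handed to subprocess.run on a machine where kubectl is configured; omitting replica_type selects every pod of the job.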

Monitoring a PyTorch Job

kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo

See the status section of the output to monitor the job status. Here is sample output from a job that completed successfully.

apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: 2019-01-11T00:51:48Z
    generation: 1
    name: pytorch-dist-mnist-gloo
    namespace: default
    resourceVersion: "2146573"
    selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
    uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
      Worker:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-01-11T01:03:15Z
    conditions:
    - lastTransitionTime: 2019-01-11T00:51:48Z
      lastUpdateTime: 2019-01-11T00:51:48Z
      message: PyTorchJob pytorch-dist-mnist-gloo is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-01-11T00:57:22Z
      lastUpdateTime: 2019-01-11T00:57:22Z
      message: PyTorchJob pytorch-dist-mnist-gloo is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-01-11T01:03:15Z
      lastUpdateTime: 2019-01-11T01:03:15Z
      message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master:
        succeeded: 1
      Worker:
        succeeded: 1
    startTime: 2019-01-11T00:57:22Z
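One way to read the conditions list is to take the most recent condition whose status is "True" as the job's current phase. The sketch below applies that rule to the conditions from the sample output; the function name current_phase is hypothetical, and the rule itself is an interpretation of the sample rather than documented operator behavior. The status dictionary is assumed to come from the same kubectl get command run with -o json.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def current_phase(status):
    """Return the type of the latest condition with status "True", or None."""
    active = [c for c in status.get("conditions", []) if c.get("status") == "True"]
    if not active:
        return None
    latest = max(
        active,
        key=lambda c: datetime.strptime(c["lastTransitionTime"], TIME_FORMAT),
    )
    return latest["type"]


# Conditions copied from the sample output above (messages omitted).
sample_status = {
    "conditions": [
        {"type": "Created", "status": "True",
         "lastTransitionTime": "2019-01-11T00:51:48Z"},
        {"type": "Running", "status": "False",
         "lastTransitionTime": "2019-01-11T00:57:22Z"},
        {"type": "Succeeded", "status": "True",
         "lastTransitionTime": "2019-01-11T01:03:15Z"},
    ]
}

print(current_phase(sample_status))  # Succeeded
```

Note that the Running condition stays in the list with status "False" once the job finishes, which is why the sketch filters on status before comparing timestamps.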

Contributing

Please refer to the developer_guide.
