[draft pr][RayJob] Use timeout to prevent RayCluster leak #4090
Draft
Future-Outlier wants to merge 2 commits into ray-project:master from Future-Outlier:rayjob-raycluster-leak-idea-2-timeout
+44 −0
Signed-off-by: Future-Outlier <eric901201@gmail.com>
cc @machichima

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-external-redis-2-49-0-2
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # submissionMode: "K8sJobMode"
  entrypoint: python -c "import os, time; print(os.environ.get('HOSTNAME')); [time.sleep(1) or print(i) for i in range(1000)]"
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.46.0' # should match the Ray version in the image of the containers
    gcsFaultToleranceOptions:
      # In most cases, you don't need to set `externalStorageNamespace` because KubeRay will
      # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
      # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
      # [Example]:
      # externalStorageNamespace: "my-raycluster-storage"
      redisAddress: "redis:6379"
      redisPassword:
        valueFrom:
          secretKeyRef:
            name: redis-password-secret
            key: password
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          # terminationGracePeriodSeconds: 1203
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              lifecycle:
                preStop:
                  exec:
                    command: ["python -c 'import ray; ray.shutdown()'"]
              ports:
                - containerPort: 6379
                  name: redis
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /home/ray/samples
                  name: ray-example-configmap
          volumes:
            - name: ray-logs
              emptyDir: {}
            - name: ray-example-configmap
              configMap:
                name: ray-example
                defaultMode: 0777
                items:
                  - key: detached_actor.py
                    path: detached_actor.py
                  - key: increment_counter.py
                    path: increment_counter.py
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 10
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: ray-logs
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
            volumes:
              - name: ray-logs
                emptyDir: {}
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.4.0
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray

    @ray.remote(num_cpus=1)
    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray

    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
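The manifest above leaves its timeout and cleanup fields commented out. A minimal sketch of the same RayJob with those fields enabled follows, for anyone who wants to observe the deadline path. It rests on assumptions: the values are copied from the commented defaults in the sample, the metadata name is hypothetical, and nothing here is confirmed to be how this PR itself handles the leak. The entrypoint loops for roughly 1000 seconds, so a 120-second `activeDeadlineSeconds` would be expected to expire before the job finishes.

```yaml
# Sketch only: a fragment, not a complete manifest. The rayClusterSpec from
# the sample above still has to be supplied for this to be applied.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-timeout-sketch  # hypothetical name, not from the PR
spec:
  entrypoint: python -c "import os, time; print(os.environ.get('HOSTNAME')); [time.sleep(1) or print(i) for i in range(1000)]"
  # Delete the RayCluster once the RayJob finishes (the sample's stated default is false).
  shutdownAfterJobFinishes: true
  # Seconds to wait after the RayJob finishes before deleting the RayCluster.
  ttlSecondsAfterFinished: 10
  # Seconds the RayJob may stay active before KubeRay tries to terminate it;
  # must be a positive integer. 120 is shorter than the ~1000 s entrypoint.
  activeDeadlineSeconds: 120
```

All three fields are set together so that a job that hits the deadline also has its cluster cleaned up, on the assumption that `ttlSecondsAfterFinished` takes effect only when `shutdownAfterJobFinishes` is true.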
Why are these changes needed?
Related issue number
#3860 (comment)
Checks