@spencer-p
Contributor

Why are these changes needed?

When deploying a RayJob, the RayCluster status may contain valuable information about runtime failures that occur before the cluster becomes ready. In such cases, the RayJob stays in Initializing and its status is never updated.

This is a common failure mode and may confuse new users who don't realize they need to inspect the RayCluster. We should use the existing RayCluster status subfield to reflect this information.
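As a rough illustration of the intended behavior, here is a minimal sketch that models the idea in plain Python rather than the operator's actual Go code. Every name in it (`ClusterStatus`, `JobStatus`, `sync_job_status`, the `reason` field, the `"Running"` value) is a hypothetical stand-in, not KubeRay's API. The point is only that the job keeps mirroring the cluster status while it waits for the cluster to become ready.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ClusterStatus:
    # Hypothetical, simplified stand-in for a RayCluster status.
    ready: bool = False
    reason: str = ""


@dataclass(frozen=True)
class JobStatus:
    # Hypothetical, simplified stand-in for a RayJob status.
    deployment_status: str = "Initializing"
    cluster_status: ClusterStatus = field(default_factory=ClusterStatus)


def sync_job_status(job: JobStatus, cluster: ClusterStatus) -> JobStatus:
    """Return the job status to persist after observing the cluster."""
    # Always mirror the latest cluster status, even before the cluster is
    # ready, so early runtime failures are visible on the job itself.
    updated = replace(job, cluster_status=cluster)
    # Only leave Initializing once the cluster reports ready.
    if job.deployment_status == "Initializing" and cluster.ready:
        updated = replace(updated, deployment_status="Running")
    return updated


# Usage: the cluster is failing before it becomes ready.
failing = ClusterStatus(ready=False, reason="ImagePullBackOff")
job = sync_job_status(JobStatus(), failing)
print(job.deployment_status, job.cluster_status.reason)
```

In this model the job is still Initializing, but the failure reason is now readable from the job status without inspecting the cluster separately.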

Related issue number

n/a

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
Signed-off-by: Spencer Peterson <spencerjp@google.com>
@spencer-p
Contributor Author

I also noticed that an updated RayCluster status is silently discarded unless the job status has changed, which may further obscure runtime issues after job submission. IMO this is another opportunity for improvement, and I would be happy to contribute that as well.
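To make the gap concrete, here is a small sketch of the two change-detection strategies, written in plain Python as a model of the behavior described in this comment and not taken from the operator's code. `ObservedStatus`, `changed_job_only`, and `changed_any` are hypothetical names, and `cluster_reason` is a simplified stand-in for the nested cluster status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedStatus:
    # Hypothetical, simplified stand-in for a RayJob status.
    job_status: str = ""
    cluster_reason: str = ""  # stands in for the nested cluster status


def changed_job_only(old: ObservedStatus, new: ObservedStatus) -> bool:
    # Models the behavior described above: only the job status is compared,
    # so a cluster-only change does not trigger a status update.
    return old.job_status != new.job_status


def changed_any(old: ObservedStatus, new: ObservedStatus) -> bool:
    # Possible improvement: compare the whole status, cluster info included.
    return old != new


old = ObservedStatus(job_status="RUNNING")
new = ObservedStatus(job_status="RUNNING", cluster_reason="worker pod failed")
print(changed_job_only(old, new), changed_any(old, new))
```

With the first check the new cluster information would be dropped; with the second it would be written back.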
