REP: Ray History Server #62
1940bb0 to 0cf9789
MengjinYan
left a comment
Thanks for coming up with the REP!
For Beta:
* A user can specify a top-level API in RayCluster to enable the history server.
* A local Ray dashboard can use the history server as an API backend to view the state of a terminated Ray cluster.
There seems to be a very similar point in alpha. So just curious, what's the difference between the point in alpha and beta?
They're the same. I added it here to indicate that using a local Ray dashboard should still be supported in Beta. Let me know if you think otherwise.
For Beta:
* A user can specify a top-level API in RayCluster to enable the history server.
* A local Ray dashboard can use the history server as an API backend to view the state of a terminated Ray cluster.
* A remote Ray dashboard running on Kubernetes (managed by KubeRay) can be used to view the state of a terminated Ray cluster.
I think that between beta and GA, based on our experience with the existing online dashboard, we might need to adjust the dashboard so that it better shows the history information of a cluster.
I'm assuming the dashboard changes are actually needed in alpha. I think @KunWuLuan's fork of the dashboard has changes that need to somehow be incorporated into the upstream dashboard to unblock even Alpha-level support. Let me know if you think otherwise.
Yes, in the alpha version the dashboard changes are needed and the dashboard cannot be used independently. We will not try to merge the alpha dashboard changes back upstream, because in the beta version there will be no dashboard changes.
Got it. Apart from the dashboard-serving perspective, my point is more about the actual content of the dashboard. Basically, in GA we might want to adjust the dashboard so that it better showcases the history information, and remove the components/fields that are not applicable. But the details can be discussed once we have more experience running the dashboards.
## (Optional) Follow-on Work

We will start with a naive approach to event processing on the history server. However, we may need to explore more optimal strategies if processing events introduces significant latency overhead or memory usage.
Wondering if we should link the original design doc from @KunWuLuan?
Added a reference to the doc in this section.
cc: @alanwguo for awareness
2224da1 to 92a6859
Signed-off-by: Andrew Sy Kim <[email protected]>
92a6859 to fac40c8
Hi, will /api/jobs/{job_id} be supported in v1.7? We have not yet discussed how to rebuild these pages, and I am not sure we can finish before the v1.7 release.
is responsible for grouping the events.

All events will initially be partitioned by Job ID. Specifically, task events associated with the same Job ID will be stored in the same directory.
* Node-level events will be stored in: cluster_name_cluster_uid/session_id/node_events/<nodeName>-<time>
Should this be nodeName or nodeID?
cc @KunWuLuan
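For reference while this is discussed, here is a minimal sketch of how the node-level events path quoted above might be assembled. It is only an illustration, not the history server's implementation: the function name `node_events_path`, the neutral `node_label` parameter (standing in for whichever of nodeName or nodeID is chosen), and the timestamp format are all assumptions, since the REP only specifies `<time>`.

```python
from datetime import datetime, timezone


def node_events_path(cluster_name, cluster_uid, session_id, node_label, when):
    """Build a node-level events path following the layout quoted in the REP.

    `node_label` is deliberately neutral: the thread has not settled whether
    this component should be the node name or the node ID.
    """
    # Hypothetical timestamp format; the REP only says <time>.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Assumes the first component is the cluster name and UID joined by "_".
    return f"{cluster_name}_{cluster_uid}/{session_id}/node_events/{node_label}-{stamp}"


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(
        node_events_path(
            "raycluster-demo",
            "1234",
            "session_1",
            "node-a",
            datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        )
    )
```

If node names can be reused across restarts while node IDs cannot, the choice affects whether two different nodes could write under the same prefix, which may be worth stating explicitly in the REP.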
I'm not sure about the release version, but I think if it is part of the dashboard, we should support it.
ray-project/kuberay#3966