[RayService] Migrate from Endpoints API to EndpointSlice API for RayService #4245
win5923
left a comment
Could you also open a PR in the Ray doc to mention that starting from v1.6.0, the KubeRay Operator will use EndpointSlice for RayService? Thanks!
Yes, will open up a PR to add the info in the Ray doc.

Hi @rueian @Future-Outlier @andrewsykim , please help review the PR when you get a chance. Thanks!
```go
// Count ready endpoints across all EndpointSlices.
numReadyEndpoints := 0
for _, endpointSlice := range endpointSliceList.Items {
```
If the Service happens to be IPv4/IPv6 dual-stack, then counting each individual address from all EndpointSlices will double-count Pods that have both an IPv4 and IPv6 address. Not sure if dual-stack is supported, but it would be inconsistent with how the RayService status field is documented:
```go
// NumServeEndpoints indicates the number of Ray Pods that are actively serving or have been selected by the serve service.
```
Thanks for pointing this out! You are absolutely right. Although KubeRay doesn't fully support dual-stack clusters yet (see issue #4205), the logic should strictly align with the API definition of counting Pods, not addresses. I will update the implementation to deduplicate endpoints based on their Pod UID to ensure the count remains accurate regardless of the networking stack.
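The deduplication idea can be sketched in plain Go. This is a minimal illustration, not the actual KubeRay implementation: the `endpointSlice` and `endpoint` structs below are stand-ins for the real `k8s.io/api/discovery/v1` types, and `targetUID` stands in for the Pod UID carried in each endpoint's `TargetRef`.

```go
package main

import "fmt"

// endpoint is a stand-in for discovery/v1 Endpoint: a ready flag plus the
// UID of the backing Pod (from TargetRef; empty when no TargetRef is set).
type endpoint struct {
	ready     bool
	targetUID string
}

// endpointSlice is a stand-in for discovery/v1 EndpointSlice.
type endpointSlice struct {
	endpoints []endpoint
}

// countReadyPods counts distinct ready Pods across all slices, deduplicating
// by Pod UID so dual-stack Services (one IPv4 and one IPv6 endpoint for the
// same Pod) are not double-counted.
func countReadyPods(slices []endpointSlice) int {
	seen := map[string]struct{}{}
	for _, s := range slices {
		for _, ep := range s.endpoints {
			if !ep.ready || ep.targetUID == "" {
				continue
			}
			seen[ep.targetUID] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	// Dual-stack case: pod-a appears in both an IPv4 and an IPv6 slice.
	slices := []endpointSlice{
		{endpoints: []endpoint{{ready: true, targetUID: "pod-a"}, {ready: true, targetUID: "pod-b"}}},
		{endpoints: []endpoint{{ready: true, targetUID: "pod-a"}, {ready: false, targetUID: "pod-c"}}},
	}
	fmt.Println(countReadyPods(slices)) // 2
}
```

Counting UIDs rather than addresses keeps the result aligned with the documented semantics of `NumServeEndpoints` (Pods, not addresses).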
Future-Outlier
left a comment
Overall LGTM, please ping me when all the comments are addressed, and sorry for the late review.
Hi @Future-Outlier , no worries at all! Thanks for the review. I have addressed all your comments. Please help take a look when you get a chance!
```yaml
*/}}
{{- define "role.consistentRules" -}}
rules:
# TODO: Remove endpoints permission in a future release.
```
Tracked the removal progress in #4255.
```yaml
# TODO: Remove endpoints permission in a future release.
# This permission is retained for backward compatibility to support users deploying older operator
# images with the new Helm chart. Older operator versions still use the deprecated Endpoints API.
# This can be safely removed once those older versions are no longer supported.
```
thank you!!
I think it's OK to remove the endpoints permission in this PR, since when you use helm install, your KubeRay operator will have the old release's RBAC permissions.
So there are 3 files to change:
- helm-chart/kuberay-operator/templates/_helpers.tpl
- ray-operator/config/rbac/role.yaml
- ray-operator/controllers/ray/rayservice_controller.go (the kubebuilder part)
But if they use the new Helm chart with old images, there will be an error. I think this is a breaking change, and it might be better to decouple the Helm chart version from the image version.
I think this is okay, since we upgrade the KubeRay operator via the Helm chart like this. In this case, it will not be a breaking change, because we will have a Helm chart version that maps to each KubeRay operator version.

```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.4.2.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.2
```
OK, will remove the endpoints RBAC. We might need to mention this in the release doc, as it will be a breaking change for users who use the Helm chart with value overrides.
no problem, thank you!
Let me add it to my release draft notes now.
Removed and closed the tracking issue as well.
Please help review when you get a chance. Thank you!
```go
// +kubebuilder:rbac:groups=core,resources=pods/status,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods/proxy,verbs=get;update;patch
// +kubebuilder:rbac:groups=core,resources=endpoints,verbs=get;list;watch
// +kubebuilder:rbac:groups=discovery.k8s.io,resources=endpointslices,verbs=get;list;watch
```
Maybe move this line to the bottom of the RBAC section so that related permissions stay grouped together.
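For context on why the operator needs the `endpointslices` list/watch permission: consumers find the slices belonging to a Service via the well-known `kubernetes.io/service-name` label that the EndpointSlice controller sets on every slice. The sketch below illustrates that lookup in plain Go; the `slice` struct and `slicesForService` helper are hypothetical stand-ins for a client-go List call with a label selector.

```go
package main

import "fmt"

// serviceNameLabel is the well-known label the EndpointSlice controller
// sets to tie each slice back to its owning Service.
const serviceNameLabel = "kubernetes.io/service-name"

// slice is an illustrative stand-in for an EndpointSlice's metadata.
type slice struct {
	name   string
	labels map[string]string
}

// slicesForService returns the slices that belong to the given Service,
// mirroring a List call filtered by the service-name label selector.
func slicesForService(all []slice, svc string) []slice {
	var out []slice
	for _, s := range all {
		if s.labels[serviceNameLabel] == svc {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	all := []slice{
		{name: "serve-svc-abc12", labels: map[string]string{serviceNameLabel: "serve-svc"}},
		{name: "other-xyz99", labels: map[string]string{serviceNameLabel: "other"}},
	}
	fmt.Println(len(slicesForService(all, "serve-svc"))) // 1
}
```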
cc @rueian to merge
```go
numPods := len(uniqueNumReadyPods)
```

```go
logger.Info("Counted serve-ready pods via EndpointSlices",
```
Wouldn't this result in too many logs? Using the debug level could be better.
Sure, updated by changing the log level.
Future-Outlier
left a comment
LGTM again, thank you!
Why are these changes needed?
The Kubernetes Endpoints API has been deprecated since v1.21 in favor of the more scalable EndpointSlice API.
The Endpoints API stores all of a Service's endpoints in a single object, which runs into object size limits for Services with many endpoints, while the EndpointSlice API groups endpoints into slices (100 endpoints per slice by default) and scales horizontally by adding more slices.
This migration ensures KubeRay follows Kubernetes best practices and prepares for future Kubernetes versions where the Endpoints API may be removed.
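The horizontal-scaling property can be sketched in a few lines of Go. This is an illustration of the chunking idea only, not Kubernetes controller code; `sliceEndpoints` is a hypothetical helper, and 100 mirrors the Kubernetes default (configurable via `--max-endpoints-per-slice` on kube-controller-manager):

```go
package main

import "fmt"

// maxEndpointsPerSlice mirrors the Kubernetes default of 100 endpoints
// per EndpointSlice.
const maxEndpointsPerSlice = 100

// sliceEndpoints splits a Service's endpoints into chunks the way the
// EndpointSlice controller does, so a change to one endpoint touches only
// the affected slice instead of rewriting one huge Endpoints object.
func sliceEndpoints(endpoints []string) [][]string {
	var slices [][]string
	for len(endpoints) > maxEndpointsPerSlice {
		slices = append(slices, endpoints[:maxEndpointsPerSlice])
		endpoints = endpoints[maxEndpointsPerSlice:]
	}
	if len(endpoints) > 0 {
		slices = append(slices, endpoints)
	}
	return slices
}

func main() {
	eps := make([]string, 250)
	for i := range eps {
		eps[i] = fmt.Sprintf("10.0.0.%d", i)
	}
	fmt.Println(len(sliceEndpoints(eps))) // 3 slices: 100 + 100 + 50
}
```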
Note:
The release notes should mention that the endpoints RBAC has been deprecated and will be removed in a future release.
Related issue number
Closes #4127
Checks