Skip to content

Conversation

@seanlaii
Copy link
Contributor

@seanlaii seanlaii commented Dec 1, 2025

Why are these changes needed?

The Kubernetes Endpoints API has been deprecated since v1.21 in favor of the more scalable EndpointSlice API.
The Endpoints API has limitations when handling services with many endpoints (single object size limits), while EndpointSlice supports up to 100 endpoints per slice and can scale horizontally.
This migration ensures KubeRay follows Kubernetes best practices and prepares for future Kubernetes versions where the Endpoints API may be removed.

Note:
We need to mention that the endpoints RBAC has been deprecated, and we will remove the endpoints RBAC in the new future release in the release note.

Related issue number

Closes #4127

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@seanlaii seanlaii changed the title Migrate from Endpoints API to EndpointSlice API for RayService [RayService] Migrate from Endpoints API to EndpointSlice API for RayService Dec 1, 2025
@seanlaii seanlaii marked this pull request as ready for review December 1, 2025 23:54
Copy link
Collaborator

@win5923 win5923 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also open a PR in the Ray doc to mention that starting from v1.6.0, the KubeRay Operator will use EndpointSlice for RayService? Thanks!

@seanlaii
Copy link
Contributor Author

seanlaii commented Dec 3, 2025

Could you also open a PR in the Ray doc to mention that starting from v1.6.0, the KubeRay Operator will use EndpointSlice for RayService? Thanks!

Yes, will open up a PR to add the info in the Ray doc.

@seanlaii
Copy link
Contributor Author

seanlaii commented Dec 3, 2025

Hi @rueian @Future-Outlier @andrewsykim , please help review the PR when you get a chance. Thanks!


// Count ready endpoints across all EndpointSlices.
numReadyEndpoints := 0
for _, endpointSlice := range endpointSliceList.Items {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Service happens to be IPv4/IPv6 dual-stack, then counting each individual address from all EndpointSlices will double-count Pods that have both an IPv4 and IPv6 address. Not sure if dual-stack is supported, but it would be inconsistent with how the RayService status field is documented:

// NumServeEndpoints indicates the number of Ray Pods that are actively serving or have been selected by the serve service.

Copy link
Contributor Author

@seanlaii seanlaii Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! You are absolutely right. Although KubeRay doesn't fully support dual-stack clusters yet (see issue #4205), the logic should strictly align with the API definition of counting Pods, not addresses. I will update the implementation to deduplicate endpoints based on their Pod UID to ensure the count remains accurate regardless of the networking stack.

cc @Future-Outlier @rueian

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, plz ping me when you all comments are addressed, and sorry for the late review.

@seanlaii
Copy link
Contributor Author

seanlaii commented Dec 6, 2025

Hi @Future-Outlier , no worries at all! Thanks for the review. I have addressed all your comments. Please help take a look when you get a chance!

*/}}
{{- define "role.consistentRules" -}}
rules:
# TODO: Remove endpoints permission in a future release.
Copy link
Contributor Author

@seanlaii seanlaii Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked the removal progress in #4255.

Comment on lines 122 to 125
# TODO: Remove endpoints permission in a future release.
# This permission is retained for backward compatibility to support users deploying older operator
# images with the new Helm chart. Older operator versions still use the deprecated Endpoints API.
# This can be safely removed once those older versions are no longer supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!!
I think it's ok to remove endpoints permission in this PR, since when you use helm install, your kuberay operator will have the old release's rbac permission.

so 3 files to change.

  1. helm-chart/kuberay-operator/templates/_helpers.tpl
  2. ray-operator/config/rbac/role.yaml
  3. ray-operator/controllers/ray/rayservice_controller.go
    (the kubebuilder part)

Copy link
Contributor Author

@seanlaii seanlaii Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if they use the new helm chart with old images, there will be an error. I think this is a breaking change and it might be better to decouple the helm chart version and the image version.

Copy link
Member

@Future-Outlier Future-Outlier Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay, since we upgrade the KubeRay operator via the Helm chart like this.
In this case, it will not be a breaking change, because we will have a Helm chart version that maps to each KubeRay operator version.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.4.2.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will remove the endpoint RBAC. We might need to add this in the release doc as it will be a breaking change for users who use the helm chart with value override.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem, thank you!
let me add to my release draft notes now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and closed the tracking issue as well.
Please help review when you get a chance. Thank you!

// +kubebuilder:rbac:groups=core,resources=pods/status,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods/proxy,verbs=get;update;patch
// +kubebuilder:rbac:groups=core,resources=endpoints,verbs=get;list;watch
// +kubebuilder:rbac:groups=discovery.k8s.io,resources=endpointslices,verbs=get;list;watch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this line to the bottom of the RBAC section so that related permissions stay grouped together.

@Future-Outlier
Copy link
Member

cc @rueian to merge


numPods := len(uniqueNumReadyPods)

logger.Info("Counted serve-ready pods via EndpointSlices",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this result in too many logs? Using the debug level could be better.

Copy link
Contributor Author

@seanlaii seanlaii Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated by changing the log level.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again, thank you!

@rueian rueian merged commit e32405e into ray-project:master Dec 8, 2025
27 checks passed
@seanlaii seanlaii deleted the endpointslices branch December 8, 2025 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

transition from endpoints to endpointslices

5 participants