Description
Currently, the XDS Translator may encounter several types of failures that would lead to an error being returned from the Translate function:
- Failure to update Routes, Network/HTTP filters, Clusters and Secrets due to unexpectedly missing XDS resources or unexpected, conflicting existing XDS resources (existing route config, filters, default filter chain, ...).
- Failure to translate IR config to XDS due to missing/unsupported IR input, which was not properly verified in earlier validation phases (CEL, gateway-api translator validations, ...).
- Failure to apply Envoy Patch Policies: errors in input validation of patches or proto validation of the patched resources
- Failures in validation of EG-generated envoy protos during translation, leading to resources being dropped from the xds resource table.
- Failures in validation of all XDS resources post-translation, leading to errors being reported without resources being dropped.
- Failures in translation by EG Extension Server.
The XDS translator works in a "best-effort" manner: when errors are encountered, some resources may be omitted or changed. However, the resulting configuration is still published and saved to the XDS cache. This configuration may be partial or invalid. As a result:
- The proxy may not reflect the user's intent.
- The proxy may reject some or all of the config. This may be due to partial configuration that contains invalid resource references or invalid resources. Rejection of config creates a risk that new proxy instances would not be programmed at all.
- Users may not notice that translation is partially failing, if the impact is limited.
- There's no way for users to understand which part of their desired state is "really" applied.
- The proxies remain in an "incorrect" state while users work to identify which issue/configuration change triggered this state.
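The best-effort behavior described above can be sketched roughly as follows. This is a minimal illustration, not EG's actual translator code: `IRRoute`, `translateRoute` and `TranslateBestEffort` are hypothetical names, and the "missing backend" check stands in for any of the failure types listed earlier. The point it shows is that failed resources are dropped, the errors are accumulated, and a partial result is still returned alongside the error.

```go
package main

import (
	"errors"
	"fmt"
)

// IRRoute is a hypothetical, simplified IR input for one route.
type IRRoute struct {
	Name    string
	Backend string
}

// translateRoute is a hypothetical per-resource translation step.
// It fails when the IR input is incomplete.
func translateRoute(r IRRoute) (string, error) {
	if r.Backend == "" {
		return "", fmt.Errorf("route %q: missing backend", r.Name)
	}
	return "xds-route/" + r.Name, nil
}

// TranslateBestEffort translates every route it can. Routes that fail
// are omitted from the result, and their errors are joined into a
// single error that is returned together with the partial output.
func TranslateBestEffort(routes []IRRoute) ([]string, error) {
	var out []string
	var errs error
	for _, r := range routes {
		x, err := translateRoute(r)
		if err != nil {
			errs = errors.Join(errs, err) // remember the failure, keep going
			continue
		}
		out = append(out, x)
	}
	return out, errs
}
```

A caller that publishes `out` even when the returned error is non-nil ends up with exactly the partial configuration discussed above.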
In #4155, users requested the following:
Ideally, similar to #3873, there would be an option added to ExtensionManager which would allow either "fail open" (current behavior of best effort) or "fail closed" (alternate behavior of disabling the resource associated with the failed hook).
This option was implemented in #4936. The FailClosed option in #4936 provides greater reliability that traffic is either handled according to configuration provided by the extension server or rejected entirely.
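As a rough sketch of the difference between the two modes, assuming a hypothetical route-level hook: `FailureMode`, `Route`, `RouteHook` and `ApplyHook` below are illustrative names and do not reflect the ExtensionManager API or the implementation in #4936. Modeling "rejected entirely" as a route whose action is `"reject"` is also an assumption made for the example.

```go
package main

import "errors"

// FailureMode is a hypothetical setting for how hook errors are handled.
type FailureMode int

const (
	FailOpen   FailureMode = iota // keep the unmodified resource (best effort)
	FailClosed                    // disable the resource tied to the failed hook
)

// Route is a hypothetical, simplified translated route.
type Route struct {
	Name   string
	Action string
}

// RouteHook is a hypothetical extension hook that may modify a route.
type RouteHook func(Route) (Route, error)

// ErrHook is a sample error used to simulate a failing extension server.
var ErrHook = errors.New("extension hook failed")

// ApplyHook runs the hook and applies the failure mode when it errors.
func ApplyHook(r Route, hook RouteHook, mode FailureMode) Route {
	modified, err := hook(r)
	if err == nil {
		return modified
	}
	if mode == FailClosed {
		// Assumption: a disabled route rejects traffic instead of
		// forwarding it without the extension's changes.
		return Route{Name: r.Name, Action: "reject"}
	}
	return r // FailOpen: fall back to the route as it was before the hook
}
```

With `FailOpen`, a failing hook leaves traffic flowing through a route that lacks the extension's changes. With `FailClosed`, the same failure makes the route reject traffic.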
#4936 is a partial solution:
- It only addresses the extension server use case
- Users may prefer to handle translation errors by pausing XDS publishes until the issue is resolved and the translator produces error-free XDS resources:
  - Existing proxies keep serving a recent version of the XDS config, which was valid at the time of its translation.
  - By persisting a valid XDS snapshot, there is a guarantee that new proxies can still be programmed.
An alternative approach, supported by other projects, is to store and use the last successful translation results, and continue using them when translation errors occur:
- Kong: https://docs.konghq.com/kubernetes-ingress-controller/latest/guides/high-availability/last-known-good-config/
- Gloo Mesh: https://docs.solo.io/gloo-mesh-gateway/main/traffic_management/concepts/route-tables/route-failure-modes/#freeze-configuration
Already today, EG behaves this way in some edge cases:
- If an empty translation result is produced by the translator, EG falls back to the last known translation result (the `if result == nil {` check).
- If a panic is raised, EG translation is halted and the existing cache is used.
By not publishing a snapshot when any XDS translator error occurs, EG can offer a similar error handling strategy that some users might prefer.
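A minimal sketch of this "last known good" strategy, under stated assumptions: `Snapshot`, `SnapshotCache`, `Publish` and `Current` are hypothetical names chosen for illustration and are not the types used by EG or by the go-control-plane snapshot cache. The cache is only updated when translation returned no error and a non-nil result.

```go
package main

import (
	"errors"
	"sync"
)

// Snapshot is a hypothetical stand-in for a set of translated XDS resources.
type Snapshot struct {
	Version   string
	Resources []string
}

// ErrTranslate is a sample error used to simulate a failed translation.
var ErrTranslate = errors.New("translation failed")

// SnapshotCache is a hypothetical cache that only accepts error-free results.
type SnapshotCache struct {
	mu      sync.RWMutex
	current *Snapshot
}

// Publish stores snap only when translation succeeded and produced a
// result. It reports whether the cache was updated.
func (c *SnapshotCache) Publish(snap *Snapshot, translateErr error) bool {
	if translateErr != nil || snap == nil {
		return false // keep serving the last known good snapshot
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current = snap
	return true
}

// Current returns the last known good snapshot, or nil if no
// successful translation has been published yet.
func (c *SnapshotCache) Current() *Snapshot {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.current
}
```

Note that `Current` returns nil until the first successful translation, which corresponds to the empty-cache-at-startup case: this sketch does not address it.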
Concerns
- How would this interact with existing strategies for handling failures in the XDS translation layer?
- What would happen if EG starts up and produces an empty cache - is there an impact on live proxies that have an existing config?
- There's no guarantee that the last "successful" translation is preferable (from an end user perspective).
Tracking related work:
- Avoid updating XDS snapshot in case of translator error: Skip snapshot updates on XDS translator error #5540
- Improve Configurability of extension server retry policy: Improve configurability of retry policy for extension server #5612
- Avoid missing route configs when moving from FailClosed to recovered state: fix: use rds instead of inline route config in fail closed mode #5611