Description
I am trying to reduce the latency of detecting failed channel connections so that connections are reattempted sooner.
I am using the Cloud Spanner client library, which uses the Google API gax library, to connect to the Spanner service. There was a period where RPCs sent to a newly created channel all failed after their deadline with errors like:

```
DEADLINE_EXCEEDED: CallOptions deadline exceeded after 29.996032979s. Name resolution delay 0.002361332 seconds. [closed=[], open=[[buffered_nanos=29996563395, waiting_for_connection]]]
```

The `waiting_for_connection` entry indicates that the connection had not yet been established and that RPCs were instead buffered waiting for it until they failed on their deadlines. This state lasted for 20 minutes, until the process was restarted.
The following options are used when creating the channel:
- `keepAliveTime=2m`
- `keepAliveTimeout=default`
- `keepAliveWithoutCalls=false`
- `idleTimeout=default (30m)`
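For reference, here is a minimal sketch of roughly this configuration expressed directly on a `NettyChannelBuilder`. This is not the Spanner client's actual setup (the Spanner/gax stack configures its channels through its own builders), and the target string is a placeholder:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public final class ChannelConfigSketch {
  // Roughly the channel options described above, set directly on the builder.
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder
        .forTarget("spanner.googleapis.com:443")  // hypothetical target, for illustration only
        .keepAliveTime(2, TimeUnit.MINUTES)       // keepAliveTime=2m
        // keepAliveTimeout left at its default
        .keepAliveWithoutCalls(false)             // keepAliveWithoutCalls=false
        .idleTimeout(30, TimeUnit.MINUTES)        // idleTimeout=default (30m)
        .useTransportSecurity()
        .build();
  }
}
```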
I see that with `keepAliveWithoutCalls` disabled, keepalives are not sent when the transport is started (see `KeepAliveManager`). It therefore seems that `keepAliveTime` won't help detect slow/stuck connection attempts. It may then take up to the `idleTimeout` of 30 minutes for the channel to move from CONNECTING to IDLE, detect the stuck CONNECTING state, and reattempt the connection on the next call.
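As a possible application-level workaround (only a sketch using the public `ManagedChannel` connectivity API, not something the Spanner client or gax does today), one could watch for a channel that stays in CONNECTING longer than some budget and force it back to IDLE so the next call triggers a fresh connection attempt:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public final class StuckConnectingWatcher {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Re-arms on every state change; if CONNECTING outlives the budget, abandon the attempt. */
  public void watch(ManagedChannel channel, long maxConnectingSeconds) {
    ConnectivityState state = channel.getState(false);
    if (state == ConnectivityState.CONNECTING) {
      scheduler.schedule(() -> {
        if (channel.getState(false) == ConnectivityState.CONNECTING) {
          // enterIdle() is an experimental API that moves the channel back to IDLE,
          // discarding the in-flight connection attempt; the next RPC should trigger
          // a new connect instead of buffering behind a stuck one.
          channel.enterIdle();
        }
      }, maxConnectingSeconds, TimeUnit.SECONDS);
    }
    // Watch the next state transition as well.
    channel.notifyWhenStateChanged(state, () -> watch(channel, maxConnectingSeconds));
  }
}
```

This is coarse (the timer cannot distinguish one long CONNECTING period from two shorter ones back to back), but it bounds how long calls can sit buffered behind a stuck connection attempt.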
- Are there other mechanisms that would detect and exit a long CONNECTING state sooner with this configuration? I see that Netty has `ChannelOption.CONNECT_TIMEOUT_MILLIS`, but this doesn't appear to be set by grpc itself (or gax); see the sketch after this list.
- Though setting `keepAliveWithoutCalls=true` may help reduce the timeout to the keep-alive time, it changes the behavior when the channel is not connecting. As a user, since there are attempted calls, it seems like keep-alives should be sent in this case. Is this a bug that should be addressed?
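For the first question, here is a sketch of what setting the Netty connect timeout could look like if the channel were built directly. As far as I can tell this is not exposed through the Spanner/gax builders, so the target and timeout value below are placeholders for illustration only:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

public final class ConnectTimeoutSketch {
  // Caps the low-level (TCP) connect attempt at 10 seconds so a stuck connect fails fast
  // instead of leaving RPCs buffered until their deadlines expire.
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder
        .forTarget("spanner.googleapis.com:443")                  // hypothetical target
        .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000) // placeholder timeout
        .useTransportSecurity()
        .build();
  }
}
```

Note that this option only bounds the TCP connect phase; it would not cover a handshake that hangs after the socket is established.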