Description
I am trying to reduce the latency of detecting failed channel connections so that connections are reattempted sooner.
I am using the Cloud Spanner client library, which uses the Google API gax library, to connect to the Spanner service. There was a period where RPCs sent to a newly created channel all failed after their deadline with errors like:

```
DEADLINE_EXCEEDED: CallOptions deadline exceeded after 29.996032979s. Name resolution delay 0.002361332 seconds. [closed=[], open=[[buffered_nanos=29996563395, waiting_for_connection]]]
```

The `waiting_for_connection` entry indicates that the connection had not yet been established and that RPCs were instead buffered waiting for it until they failed on their deadlines. This state lasted for 20 minutes, until the process was restarted.
The following options are used when creating the channel:
- `keepAliveTime=2m`
- `keepAliveTimeout=default`
- `keepAliveWithoutCalls=false`
- `idleTimeout=default (30m)`
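For reference, here is a minimal sketch of roughly this configuration expressed directly on a `NettyChannelBuilder`. This is not the Spanner client's actual setup (the Spanner/gax stack configures its channels through its own builders), and the target string is a placeholder:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public final class ChannelConfigSketch {
  // Roughly the channel options described above, set directly on the builder.
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder
        .forTarget("spanner.googleapis.com:443")  // hypothetical target, for illustration only
        .keepAliveTime(2, TimeUnit.MINUTES)       // keepAliveTime=2m
        // keepAliveTimeout left at its default
        .keepAliveWithoutCalls(false)             // keepAliveWithoutCalls=false
        .idleTimeout(30, TimeUnit.MINUTES)        // idleTimeout=default (30m)
        .useTransportSecurity()
        .build();
  }
}
```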
I see that with `keepAliveWithoutCalls` disabled, keepalives are not sent when the transport is started (see `KeepAliveManager`). It therefore seems that `keepAliveTime` won't help detect slow/stuck connection attempts. It may then take up to the `idleTimeout` of 30 minutes for the channel to move from CONNECTING to IDLE, detect the stuck CONNECTING state, and reattempt the connection on the next call.
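As a possible application-level workaround (only a sketch using the public `ManagedChannel` connectivity API, not something the Spanner client or gax does today), one could watch for a channel that stays in CONNECTING longer than some budget and force it back to IDLE so the next call triggers a fresh connection attempt:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public final class StuckConnectingWatcher {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Re-arms on every state change; if CONNECTING outlives the budget, abandon the attempt. */
  public void watch(ManagedChannel channel, long maxConnectingSeconds) {
    ConnectivityState state = channel.getState(false);
    if (state == ConnectivityState.CONNECTING) {
      scheduler.schedule(() -> {
        if (channel.getState(false) == ConnectivityState.CONNECTING) {
          // enterIdle() is an experimental API that moves the channel back to IDLE,
          // discarding the in-flight connection attempt; the next RPC should trigger
          // a new connect instead of buffering behind a stuck one.
          channel.enterIdle();
        }
      }, maxConnectingSeconds, TimeUnit.SECONDS);
    }
    // Watch the next state transition as well.
    channel.notifyWhenStateChanged(state, () -> watch(channel, maxConnectingSeconds));
  }
}
```

This is coarse (the timer cannot distinguish one long CONNECTING period from two shorter ones back to back), but it bounds how long calls can sit buffered behind a stuck connection attempt.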
- Are there other mechanisms that would detect and exit a long CONNECTING state sooner with this configuration? I see that Netty has `ChannelOption.CONNECT_TIMEOUT_MILLIS`, but this doesn't appear to be set by grpc itself (or gax); see the sketch after this list.
- Though setting `keepAliveWithoutCalls=true` may help reduce the timeout to the keep-alive time, it changes the behavior when the channel is not connecting. As a user, since there are attempted calls, it seems like keep-alives should be sent in this case. Is this a bug that should be addressed?
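For the first question, here is a sketch of what setting the Netty connect timeout could look like if the channel were built directly. As far as I can tell this is not exposed through the Spanner/gax builders, so the target and timeout value below are placeholders for illustration only:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

public final class ConnectTimeoutSketch {
  // Caps the low-level (TCP) connect attempt at 10 seconds so a stuck connect fails fast
  // instead of leaving RPCs buffered until their deadlines expire.
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder
        .forTarget("spanner.googleapis.com:443")                  // hypothetical target
        .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000) // placeholder timeout
        .useTransportSecurity()
        .build();
  }
}
```

Note that this option only bounds the TCP connect phase; it would not cover a handshake that hangs after the socket is established.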