Skip to content

Conversation

@luislavena
Copy link
Contributor

Replace Go-native SSH implementation with SSH CLI execution to support diverse SSH configurations and agents. Implements 'uncloudd dial-stdio' subcommand that proxies gRPC connections over stdin/stdout, similar to Docker's approach.

This change addresses compatibility issues with:

  • SSH agents exposing many keys (1Password, causing "too many authentication failures")
  • Tailscale SSH (which doesn't support advanced SSH channel types like direct-streamlocal)
  • Custom SSH configurations in ~/.ssh/config

The dial-stdio approach reduces SSH feature requirements by streaming the unix socket connection over stdin/stdout instead of using SSH channel forwarding.

Changes:

  • Add 'uncloudd dial-stdio' hidden subcommand for socket proxy
  • Add SSHCLIConnector using ssh command + dial-stdio
  • Update connection logic to use SSH CLI connector
  • Maintain backward compatibility with SSHKeyFile config

Resolves #131

@luislavena
Copy link
Contributor Author

Hola!

Took a quick stab at this as it was a bit painful for me having to unlock 1Password every single time when Tailscale manages it transparently and keyless 😅

I didn't remove the current SSH connector, but I replaced it with SSHCLI one for the CLI to use it.

This is a very naive implementation. Blindly stole the logic from Docker's dial stdio implementation, but is not exactly the same as I didn't fully understand the code (I'm not versed in all the dark incantations of Go) 😬

Since there was no SSH test, I wasn't sure if we should be testing it or not, so I added a kinda pointless test (actually was Claude Code who did it, but I take responsibility for that) 😁

I've tested it against my fleet of machines connected to my tailnet and my finger is much better that I don't have to unlock 1Password on every deployment 🤣

Let me know what do you think!

Cheers
❤️ ❤️ ❤️

Copy link
Owner

@psviderski psviderski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on the new SSH connector! 👏

Looks much simpler than what I expected to see. While it's on the right track, please see a few comments and questions we need to resolve before we can merge it.

@luislavena
Copy link
Contributor Author

Hello @psviderski, thank you for your patience and looking into this!

In the case of Tailscale, it does support both port forwarding and IO forwarding transiently, so the Dialer should work.

Am I right that the new connector is only used for already provisioned machines? So for provisioning new machines you will still need to change your ssh agent setup, right?

Ideally the connector should replace all the flows, including that to avoid discrepancies.

What do you think about optionally enabling this connector using the connection schema --connect ssh+cli:// in additional to the existing ssh:// and tcp://?

That is a great idea, just for testing I did ssh_cli: for the configuration file, so we can test these side-by-side.

I went ahead and completed the Dialer so uc image push works.

Adding responses do your other comments now 😊

Important note: I run out of brain for some stuff so decided to use Claude Code to assist in some cleanup and address some of the feedback. Note that it was heavily guided as the generated code was not verbatim as it generated it. Anyhow, wanted to disclose that as I'm not sure what would be your position towards AI-assisted contributions.

Thank you.
❤️ ❤️ ❤️

Copy link
Owner

@psviderski psviderski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Dialer looks good to me! 👍 Great work! ❤️
So it looks like we need to address half-closing in dial-stdio and we can merge this?
I think the Executor could be addressed in the next one. What do you think?

Replace Go-native SSH implementation with SSH CLI execution
to support diverse SSH configurations and agents. Implements
'uncloudd dial-stdio' subcommand that proxies gRPC connections
over stdin/stdout, similar to Docker's approach.

This change addresses compatibility issues with:
- SSH agents exposing many keys (1Password, causing "too many
  authentication failures")
- Tailscale SSH (which doesn't support advanced SSH channel
  types like direct-streamlocal)
- Custom SSH configurations in ~/.ssh/config

The dial-stdio approach reduces SSH feature requirements by
streaming the unix socket connection over stdin/stdout instead
of using SSH channel forwarding.

Changes:
- Add 'uncloudd dial-stdio' hidden subcommand for socket proxy
- Add SSHCLIConnector using ssh command + dial-stdio
- Update connection logic to use SSH CLI connector
- Maintain backward compatibility with SSHKeyFile config

Resolves psviderski#131
Introduced short connection timeout on the first change but forgot to
update tests to match.
- Add SSHCLI field to support ssh_cli YAML configuration
- Update String() to use URI-like format (ssh://, ssh+cli://, tcp://)
- Add Validate() to ensure connection methods are mutually exclusive
- Add tests for validation and String() method
Revert provisionOrConnectRemoteMachine to use Go SSH connector:
- Root users: reuse SSH connection from provisioning
- Non-root users: establish new connection for group membership
- Remove SSH CLI as default connector

SSH CLI connector remains available via ssh_cli config field.
Implement proxy.ContextDialer for SSHCLIConnector using SSH -W flag.
Each dial spawns a new SSH process for TCP forwarding, enabling
independent connections separate from the gRPC dial-stdio connection.
Return sshCLIDialer instead of error, enabling uc image push
functionality with SSHCLIConnector. Validates connector is
configured before returning dialer.
This way I can skip the configuration file while testing things out,
and confirm it works correctly:

  $ unset SSH_AUTH_SOCK

  $ ./uncloud --connect ssh://provision@blatta11 machine ls
  Error: connect to cluster: connect to machine: SSH login to
    provision@blatta11:22: connect using SSH agent: connect to SSH
    agent: dial unix: missing address

  $ ./uncloud --connect ssh+cli://provision@blatta11 machine ls
  NAME       STATE   ADDRESS         PUBLIC IP   WIREGUARD ENDPOINTS                                                                                                                  MACHINE ID
  blatta11   Up      10.210.0.1/24   -           100.64.0.22:51820, ...
No longer attempt to validate the connection when instantiating a new
client.

Later on we could validate it in different places.
There were some serious slop in those tests, so took the time to clean
them up and kept only the relevant ones.

There is some repetition between buildSSHArgs and buildDialArgs but can
be tackled at a later stage.
@luislavena luislavena force-pushed the exp-ssh-cli-dial-stdio branch from f25b320 to 1129a43 Compare November 2, 2025 17:41
@luislavena
Copy link
Contributor Author

Hello @psviderski, thank you for your patience.

Finally I was able to take a look to the feedback and address it.

  • Kept both ssh and ssh+cli as connection schema, and made it work with --connect
  • Dealt similar to Docker with stdio/socket connection (introduced halfCloser interface)
  • Got rid of the inspect during connection, we can deal with that later
  • Corrected the AI-slop associated with tests. I take full responsibility about that 😅

re: sshexec, interestingly, I didn't change anything and it works out of the box with Tailscale SSH (no need for SSH keys or anything) but then either init or add fail as it tries to use the SSH connector to initialize the node:

$ ./uncloud --uncloud-config ./blatta.yaml machine init root@blatta11 --no-dns --public-ip none
Downloading Uncloud install script: https://raw.githubusercontent.com/psviderski/uncloud/refs/heads/main/scripts/install.sh
⏳ Running Uncloud install script...
✓ Docker is already installed.
Client: Docker Engine - Community
 Version:           28.5.1
 API version:       1.51
 Go version:        go1.24.8
 Git commit:        e180ab8
 Built:             Wed Oct  8 12:17:24 2025
 OS/Arch:           linux/amd64
 Context:           default
...
✓ Linux user 'uncloud' already exists.
✓ uncloudd binary is already installed.
✓ Systemd unit file created: /etc/systemd/system/uncloud.service
✓ uncloud-corrosion binary is already installed.
✓ Systemd unit file created: /etc/systemd/system/uncloud-corrosion.service
⏳ Starting Uncloud machine daemon (uncloud.service)...
✓ Uncloud machine daemon started.
✓ Uncloud installed on the machine successfully! 🎉
Error: inspect machine: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: connect to machine API socket '/run/uncloud/uncloud.sock' through SSH tunnel (is uncloud.service running on the remote machine and does the SSH user 'root' have permissions to access the socket?): ssh: rejected: unknown channel type (unsupported channel type)"

Note that blatta.yaml does not exist, this is the first machine to be initialized but is not possible with this scenario.

I would love to tackle the sshexec/Executor aspect in another PR, for the time being, I can rely on 1Password SSH Agent socket and the connection for the time being.

Apologies for the force push, I wanted to ensure I was testing against the latest changes from main but I didn't wanted to merge main inside (I prefer to avoid the merges-on-merges scenarios).

Thank you in advance for looking into this and I hope this looks much better now 😊

Cheers,
❤️ ❤️ ❤️

Prefix connection with ssh ssh+cli respectively.
}
case err = <-socket2stdout:
// return immediately, matching Docker's approach
// (stdin is never closed when TTY)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment mean that _, err := io.Copy(conn, stdin) may never finish if stdin is a TTY, hence err = <-stdin2socket never succeeds?

case err = <-stdin2socket:
if err != nil {
return err
}
Copy link
Owner

@psviderski psviderski Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand the intended behaviour correctly?

If we finished reading from stdin and there is an error, return it and complete the dial-stdio command.
If there is no error, complete the dial-stdio command, potentially not finishing reading everything from socket to stdout. Note that cases in the select statements don't fallback to the next one.

Ah, I checked the docker implementation, and it indeed contains the missing logic to wait for another goroutine before returning in this case:
https://github.com/docker/cli/blob/7414d73fc119a6b97ff240b7c7df91f1f642e323/cli/command/system/dial_stdio.go#L66-L67

Copy link
Owner

@psviderski psviderski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work! 👏 👏 👏
I left a comment about a tiny edge cases we need to address in dial-stdio and we can merge this one 💪

re: sshexec, interestingly, I didn't change anything and it works out of the box with Tailscale SSH (no need for SSH keys or anything) but then either init or add fail as it tries to use the SSH connector to initialize the node:

Hm, this is indeed interesting as sshexec.Connect is what is directly used in the SSHConnector that doesn't seem to work:

c.client, err = sshexec.Connect(c.config.User, c.config.Host, c.config.Port, c.config.KeyPath)

As you said, tackling sshexec/Executor aspect in another PR would make a lot of sense.

Thanks again for a lot of effort you put into this work! ❤️

Missed copy & pasta from Docker dial-stdio implementation (this happens
when you stare at the code for too long that it burns your eyes).
@luislavena luislavena marked this pull request as ready for review November 5, 2025 23:32
@luislavena
Copy link
Contributor Author

Hello @psviderski, addressed the comment about stdio, I completely missed it! 🤦

Ready for last review! I've already worked a bit on a SSH CLI connector, but I will explain that more in detail in a separate PR.

Thank you!
❤️ ❤️ ❤️

@psviderski psviderski merged commit 48d2239 into psviderski:main Nov 5, 2025
4 checks passed
@psviderski psviderski mentioned this pull request Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible for Uncloud to use SSH configuration? (or provide an specific identity?)

3 participants