V2 certificate format #1216

nbrownus · 2024-09-13T04:05:57Z

This is a near-complete implementation of an IPv6-enabled overlay network. A handful of pull requests target this branch, and a number of issues within this PR are still incomplete.

I have tried to highlight the main trouble spots with comments on the PR. In addition, here are my internal notes:

nebula-cert defaults to minting both v1 and v2 certificates, and the nebula config defaults to transmitting v1 certificates if both are present. As a result, net new adopters will be stuck using v1 certificates. Should this be the default behavior?
Currently nebula won't allow you to remove a v1 certificate on reload, which means a restart is required to complete a migration to v2. A comment on this review highlights this; we should allow it.
Currently there are a few spots (control, ssh) where we return the certificate in use, but they only return one cert. Should we leave them returning the default or force folks to consume both? How do we indicate which one is the default?
Every time we encounter a certificate we need to sort the networks and record only the union of applicable addresses. See [cert-v2] punchy-respond on an address in common with the querying host #1261
We need to take extra care to ensure we never allow an IPv4-mapped IPv6 address. This is not currently addressed in this PR.
There may be an opportunity to improve the lighthouse protocol by directly using the MarshalBinary and UnmarshalBinary functions for netip objects. This is more of a nit than anything, but we should agree on it before merging.
The nebula-cert sign help text needs to explain that the primary vpn address (or the union of effective addresses) is the first one after sorting, and what that means for tunnels.
The firewall config still uses naming like ca-sha; should we alias this to ca-fingerprint to normalize the verbiage?
The firewall config still uses naming like ip; should we alias this to network to normalize the verbiage?
Normalize all names for Marshal* and Unmarshal*; we have some with From*/To* and others without.
Should we use this moment to swap P256 keys to their compressed form? Saves ~32 bytes.
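
Two of the notes above (sorting the networks and refusing IPv4-mapped IPv6 addresses) can be illustrated with the standard net/netip package. This is a minimal sketch, not code from this PR: validateNetworks is a hypothetical helper, and the sort order (by address, then by prefix length) is an assumption rather than the ordering Nebula actually uses.

```go
package main

import (
	"cmp"
	"fmt"
	"net/netip"
	"slices"
)

// validateNetworks is a hypothetical helper (not part of Nebula).
// It parses CIDR strings, rejects IPv4-mapped IPv6 addresses such as
// ::ffff:10.0.0.5, and returns the prefixes in a deterministic order.
func validateNetworks(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, s := range cidrs {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", s, err)
		}
		// Is4In6 reports whether the address is an IPv4-mapped IPv6 address.
		if p.Addr().Is4In6() {
			return nil, fmt.Errorf("%s is an IPv4-mapped IPv6 address", s)
		}
		out = append(out, p)
	}
	// Assumed ordering: by address (IPv4 sorts before IPv6), then by prefix length.
	slices.SortFunc(out, func(a, b netip.Prefix) int {
		if c := a.Addr().Compare(b.Addr()); c != 0 {
			return c
		}
		return cmp.Compare(a.Bits(), b.Bits())
	})
	return out, nil
}

func main() {
	nets, err := validateNetworks([]string{"fd00::5/64", "10.0.0.5/24"})
	fmt.Println(nets, err)

	_, err = validateNetworks([]string{"::ffff:10.0.0.5/120"})
	fmt.Println(err)
}
```

With a fixed order, "the first network" means the same thing on every host that sees the certificate, which is what the primary vpn address note relies on.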

Once this is merged the upgrade path should likely be:

Give every host dual certs and leave pki.default_version at 1. Everything continues to use v1 protocols.
Switch pki.default_version to 2 on all am_relay: true hosts. Relays can now initiate v2 tunnels, but this is an uncommon flow.
Switch pki.default_version to 2 on all non-lighthouse hosts. Lighthouses will reply to v2 hosts using v2 protocols.
Switch pki.default_version to 2 on all lighthouses.
Remove v1 certs from all hosts.

…didn't work on the first try (#1268)

…st (#1261)

JackDoanRivian · 2024-12-12T15:10:19Z

TODO: nebula-cert verify only checks the first cert -- see #1291
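
For context, a file holding both a v1 and a v2 certificate contains more than one PEM block, so a verifier has to loop until the input is exhausted. The sketch below only shows that loop with the standard encoding/pem package; pemBlockTypes and sampleBundle are hypothetical helpers, the block type strings are assumed banners, and the payloads are placeholders rather than real certificates.

```go
package main

import (
	"encoding/pem"
	"fmt"
)

// pemBlockTypes returns the type banner of every PEM block in data, in order.
// A verifier that stops after the first pem.Decode call would only ever see
// the first element of this list.
func pemBlockTypes(data []byte) []string {
	var types []string
	for {
		block, rest := pem.Decode(data)
		if block == nil {
			break // no further PEM blocks
		}
		types = append(types, block.Type)
		data = rest
	}
	return types
}

// sampleBundle builds a two-block bundle for illustration only.
// The banners are assumptions and the bytes are not valid certificates.
func sampleBundle() []byte {
	v1 := pem.EncodeToMemory(&pem.Block{Type: "NEBULA CERTIFICATE", Bytes: []byte("v1 placeholder")})
	v2 := pem.EncodeToMemory(&pem.Block{Type: "NEBULA CERTIFICATE V2", Bytes: []byte("v2 placeholder")})
	return append(v1, v2...)
}

func main() {
	for i, t := range pemBlockTypes(sampleBundle()) {
		fmt.Printf("block %d: %s\n", i, t)
	}
}
```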

maggie44 · 2024-12-29T16:30:35Z

This PR no longer exports the certificate structs, neither v1 nor v2:

type certificateV2 struct {
	details detailsV2

	// RawDetails contains the entire asn.1 DER encoded Details struct
	// This is to benefit forwards compatibility in signature checking.
	// signature(RawDetails + Curve + PublicKey) == Signature
	rawDetails []byte
	curve      Curve
	publicKey  []byte
	signature  []byte
}

type detailsV2 struct {
	name           string
	networks       []netip.Prefix
	unsafeNetworks []netip.Prefix
	groups         []string
	isCA           bool
	notBefore      time.Time
	notAfter       time.Time
	issuer         string
}

type certificateV1 struct {
	details   detailsV1
	signature []byte
}

type detailsV1 struct {
	name           string
	networks       []netip.Prefix
	unsafeNetworks []netip.Prefix
	groups         []string
	notBefore      time.Time
	notAfter       time.Time
	publicKey      []byte
	isCA           bool
	issuer         string

	curve Curve
}

Before:


type NebulaCertificate struct {
	Details   NebulaCertificateDetails
	Signature []byte

	// the cached hex string of the calculated sha256sum
	// for VerifyWithCache
	sha256sum atomic.Pointer[string]

	// the cached public key bytes if they were verified as the signer
	// for VerifyWithCache
	signatureVerified atomic.Pointer[[]byte]
}

type NebulaCertificateDetails struct {
	Name      string
	Ips       []*net.IPNet
	Subnets   []*net.IPNet
	Groups    []string
	NotBefore time.Time
	NotAfter  time.Time
	PublicKey []byte
	IsCA      bool
	Issuer    string

	// Map of groups for faster lookup
	InvertedGroups map[string]struct{}

	Curve Curve
}

This is going to be a big breaking change and an inconvenience for anyone who uses the library. I would prefer to keep these exported. I see that this is a breaking-change PR, but it should at least allow restoring the functionality that was available before.

maggie44 · 2024-12-30T09:30:23Z

I see now it’s switched to a factory approach, with TBSCertificate that is exported. 👍

A few thoughts to consider:

A helper function for converting a Certificate back to a TBSCertificate would be useful to complete the factory loop and allow manipulation of existing signed certificates (as unsigned ones).

There are also some Marshal functions like Marshal(), but without a corresponding Unmarshal(). I imagine most people are using PEM-encoded certs, but if Marshal() is exposed there should probably be a corresponding Unmarshal().
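
To make the first suggestion concrete, here is a rough sketch of what such a helper could look like. Everything in it is a stand-in: signedCert, unsignedCert, toUnsigned, fakeCert and demoCert are hypothetical names defined locally so the example runs on its own, and the method set is an assumption rather than the actual interface in this PR.

```go
package main

import (
	"fmt"
	"net/netip"
	"slices"
	"time"
)

// signedCert stands in for a read-only signed certificate.
// The method set is an assumption made for this sketch.
type signedCert interface {
	Name() string
	Networks() []netip.Prefix
	UnsafeNetworks() []netip.Prefix
	Groups() []string
	IsCA() bool
	NotBefore() time.Time
	NotAfter() time.Time
}

// unsignedCert stands in for an exported, mutable "to be signed" struct.
type unsignedCert struct {
	Name           string
	Networks       []netip.Prefix
	UnsafeNetworks []netip.Prefix
	Groups         []string
	IsCA           bool
	NotBefore      time.Time
	NotAfter       time.Time
}

// toUnsigned copies the fields of a signed certificate into a mutable struct.
// Slices are cloned so edits to the result never touch the signed original.
func toUnsigned(c signedCert) unsignedCert {
	return unsignedCert{
		Name:           c.Name(),
		Networks:       slices.Clone(c.Networks()),
		UnsafeNetworks: slices.Clone(c.UnsafeNetworks()),
		Groups:         slices.Clone(c.Groups()),
		IsCA:           c.IsCA(),
		NotBefore:      c.NotBefore(),
		NotAfter:       c.NotAfter(),
	}
}

// fakeCert is a toy signedCert used only to exercise toUnsigned.
type fakeCert struct{ u unsignedCert }

func (f fakeCert) Name() string                   { return f.u.Name }
func (f fakeCert) Networks() []netip.Prefix       { return f.u.Networks }
func (f fakeCert) UnsafeNetworks() []netip.Prefix { return f.u.UnsafeNetworks }
func (f fakeCert) Groups() []string               { return f.u.Groups }
func (f fakeCert) IsCA() bool                     { return f.u.IsCA }
func (f fakeCert) NotBefore() time.Time           { return f.u.NotBefore }
func (f fakeCert) NotAfter() time.Time            { return f.u.NotAfter }

// demoCert returns a sample certificate with made-up values.
func demoCert() signedCert {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	return fakeCert{u: unsignedCert{
		Name:      "host1",
		Networks:  []netip.Prefix{netip.MustParsePrefix("10.0.0.5/24")},
		Groups:    []string{"servers"},
		NotBefore: start,
		NotAfter:  start.AddDate(1, 0, 0),
	}}
}

func main() {
	signed := demoCert()
	tbs := toUnsigned(signed)
	tbs.Groups = append(tbs.Groups, "laptops") // edit the copy, then re-sign it
	fmt.Println(tbs.Name, tbs.Groups, signed.Groups())
}
```

The edited copy would then go back through the normal signing path, which is the "factory loop" the comment refers to.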

nbdd0121 · 2024-12-30T16:11:38Z

cert/cert_v2.go

+	networks       []netip.Prefix
+	unsafeNetworks []netip.Prefix


I mentioned this in a reply to a resolved comment already, but in case it's folded and not seen: I think the names networks and unsafeNetworks are confusing. The former is an address plus its on-link network prefix, while the latter is just a network prefix. For the former, only a single address is routable to the host; for the latter, all addresses in the network are routable to the host.

I would propose the names addresses (I think it's fine even though this expects a CIDR, as the input is the same as what the ip address add command expects), which clearly indicates that the address part of the network prefix is significant, and routable_networks (or maybe unsafe_routable_networks), which clearly indicates that these are CIDRs for routing only.

I share concerns of confusion with the naming here. I'm also comfortable sticking with the old name which at least creates no new confusion (especially with respect to existing documentation and discussions): #1216 (comment)

I'd also be open to the name addresses for the networks field. I think the name networks came up when we were considering it in the context of a CA certificate? Using that same field name for two very-similar-but-different purposes muddies the waters a little bit, but the intent was that it shows which networks (as V2 certs let you have more than one!) the host participates in, as well as the actual address assignment on non-CA certs. However, the ip address add argument is very persuasive and probably much more obvious to new users.

Naming unsafeNetworks is a little trickier. When the addresses field is named networks, I think it's a pretty logical name, but unsafeAddresses doesn't really work. I'd propose something like unsafeRouterFor: the prefix unsafe links it mentally to unsafe_routes in the config, and I think RouterFor shows that the field expresses a capability, rather than something that might affect how "regular" Nebula traffic is routed.

I think addresses places emphasis on the use of the ip; networks on the prefix use. But ultimately both are used and the content passed in is CIDRs? Perhaps better to reflect what it is and what is passed in by the user rather than the things it’s used for (similar to before where it was called Ips but CIDRs would probably be more accurate).

The bigger challenge before was having to manually change the ip when using ParseCIDR, because it returned the base address instead of the provided IP, along with the confusion around the certificate field ‘subnets’. Neither of those is an issue any longer, so this sounds better.

As Jack said, the issue is that the field is used for dual purposes. For a non-CA cert, addresses and unsafeRouterFor make more sense, but for a CA cert, networks and unsafeNetworks make more sense. I would say it makes more sense to cater for the non-CA cert case since there are more of them, or perhaps they can have different names for different nebula-cert commands?
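
The distinction this thread keeps returning to (a host address plus its on-link prefix, versus a bare network prefix) is visible directly in the standard library. The sketch below uses only net and net/netip; splitPrefix is a hypothetical helper written for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// splitPrefix is a hypothetical helper. It returns the host address carried
// by a CIDR string and the network that CIDR belongs to.
func splitPrefix(cidr string) (addr, network string, err error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", "", err
	}
	// netip.Prefix keeps the address exactly as written;
	// Masked() zeroes the host bits to give the network.
	return p.Addr().String(), p.Masked().String(), nil
}

func main() {
	addr, network, err := splitPrefix("10.0.0.5/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr, network) // 10.0.0.5 10.0.0.0/24

	// The older net.ParseCIDR returns the host IP separately, and the
	// *net.IPNet it returns holds only the base address of the network.
	ip, ipnet, err := net.ParseCIDR("10.0.0.5/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip, ipnet) // 10.0.0.5 10.0.0.0/24
}
```

For an entry in networks, both halves matter: the address is assigned to the host and the masked prefix is the on-link network. For an entry in unsafeNetworks, only the masked prefix is meaningful.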

* enforce certificate correctness in TBSCertificate.SignWith
* check length, not nil
* Address review comments
* github hates me

---------

Co-authored-by: Nate Brown <[email protected]>
Co-authored-by: Jack Doan <[email protected]>

…Os (#1319)

wadey

Approved to merge and fix forward. Left a few comments we can fix in the future

wadey · 2025-03-05T15:57:20Z

.gitignore

+**/coverage.out
+**/cover.out


Suggested change

-**/coverage.out
-**/cover.out
+coverage.out
+cover.out

You can just remove the leading **/ and the pattern still matches that file name in any folder.

wadey · 2025-03-06T15:50:14Z

cmd/nebula-cert/ca.go

 	cf := caFlags{set: flag.NewFlagSet("ca", flag.ContinueOnError)}
 	cf.set.Usage = func() {}
 	cf.name = cf.set.String("name", "", "Required: name of the certificate authority")
+	cf.version = cf.set.Uint("version", uint(cert.Version2), "Optional: version of the certificate format to use")


Should we default to version 1 for now?

wadey · 2025-03-06T16:01:38Z

e2e/handshakes_test.go

-// Race loser renews and handshakes
-// Does race winner repin the cert to old?
-//TODO: add a test with many lies
+func TestV2NonPrimaryWithLighthouse(t *testing.T) {


It looks like all of the other e2e handshake tests are for V1 certs? Do we need more tests with V2?

wadey · 2025-03-06T16:20:55Z

firewall.go

+			droppedLocalAddr:  metrics.GetOrRegisterCounter("firewall.incoming.dropped.local_addr", nil),
+			droppedRemoteAddr: metrics.GetOrRegisterCounter("firewall.incoming.dropped.remote_addr", nil),


We might want to indicate in the release notes that these metric names are changing.

salesforce-cla bot added the cla:signed label Sep 13, 2024

johnmaguire reviewed Sep 13, 2024

cert/cert_v2.asn1

JackDoanRivian reviewed Sep 13, 2024

cert/cert_v2.go

JackDoanRivian reviewed Sep 16, 2024

cert/cert_v2.asn1 Outdated

JackDoanRivian reviewed Sep 16, 2024

nebula.proto

JackDoanRivian reviewed Sep 16, 2024

cert/pem.go

JackDoanRivian reviewed Sep 20, 2024

cert/cert_v2.asn1

JackDoanRivian reviewed Sep 21, 2024

lighthouse.go Outdated

JackDoanRivian reviewed Sep 21, 2024

overlay/tun_linux.go

nbrownus mentioned this pull request Oct 4, 2024

Cert interface #1212

Merged

nbrownus force-pushed the cert-v2 branch from 9d15808 to f1fca4d October 8, 2024 03:40

Base automatically changed from cert-interface to master October 10, 2024 23:00

nbrownus force-pushed the cert-v2 branch from c4ef068 to d6f1b51 October 11, 2024 21:44

nbrownus commented Oct 17, 2024

udp/udp_rio_windows.go

nbrownus commented Oct 17, 2024

cmd/nebula-cert/sign.go Outdated

nbrownus commented Oct 17, 2024

cmd/nebula-cert/sign.go

nbrownus force-pushed the cert-v2 branch 3 times, most recently from 9f75b80 to a09f39f October 24, 2024 03:17

Support for ipv6 in the overlay with v2 certificates

f2c3242

--------- Co-authored-by: Jack Doan <[email protected]>

nbrownus force-pushed the cert-v2 branch from a09f39f to f2c3242 October 24, 2024 03:25