Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: don't fail channel/server startup when xds creds is specified, but bootstrap is missing certificate providers #6848

Merged
merged 7 commits into from
Dec 20, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Dec 11, 2023

We currently have checks in the xDS resolver and xDS-enabled gRPC server that verify that if xDS credentials are specified by the user, then the associated certificate providers configuration in the bootstrap is non-empty. If the check fails, then channel/server start-up fails. This is broken for a number of reasons:

  • On the client, we could have multiple clusters, some of which specify valid security configuration (i.e certificate provider instance name that is present in the bootstrap) and some of which specify invalid security configuration (i.e certificate provider instance name that is not present in the bootstrap). In this scenario, the clusters with valid security configuration need to work and the ones with invalid security configuration should fail with TRANSIENT_FAILURE. The same applies on the server side where the security configuration is part of the listener resource. FYI: We currently handle this scenario correctly by NACKing the appropriate resource.
  • Checking for the presence of the certificate provider configuration at channel/server startup is a very weak check and does not protect us from anything that the above check does not.
  • C-core and Java do not have this check.

This PR fixes #6756, and brings the grpc-go implementation to be in sync with C-core and Java.

RELEASE NOTES:

  • xds: don't fail channel/server startup when xds creds is specified, but bootstrap is missing certificate providers

…ut bootstrap is missing certificate providers
@easwars easwars requested a review from zasweq December 11, 2023 23:12
@easwars easwars added this to the 1.61 Release milestone Dec 11, 2023
Copy link

codecov bot commented Dec 11, 2023

Codecov Report

Merging #6848 (25050b6) into master (52baf16) will increase coverage by 0.09%.
Report is 12 commits behind head on master.
The diff coverage is 14.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6848      +/-   ##
==========================================
+ Coverage   83.43%   83.53%   +0.09%     
==========================================
  Files         286      286              
  Lines       30801    30756      -45     
==========================================
- Hits        25699    25692       -7     
+ Misses       4028     3993      -35     
+ Partials     1074     1071       -3     
Files Coverage Δ
xds/internal/resolver/xds_resolver.go 78.93% <ø> (-0.04%) ⬇️
xds/internal/server/listener_wrapper.go 67.88% <ø> (-0.15%) ⬇️
xds/server.go 86.50% <ø> (+0.51%) ⬆️
xds/internal/balancer/cdsbalancer/cdsbalancer.go 80.80% <25.00%> (+0.43%) ⬆️
xds/internal/server/conn_wrapper.go 71.64% <0.00%> (-1.97%) ⬇️

... and 18 files with indirect coverage changes

@zasweq
Copy link
Contributor

zasweq commented Dec 12, 2023

Can you please add an e2e test as to what the behavior should be e2e? (i.e. not fail, resolver NACKs, we handle in a graceful case). It used to be to fail, now what should it be? I see the component test for resolver only

@zasweq
Copy link
Contributor

zasweq commented Dec 12, 2023

Server and Client side please for e2e test. Also, we only process security config from top level cluster incorrectly now, it needs to be for each specific cluster.

@easwars easwars force-pushed the xds_certificate_providers_fix branch from 1363ae5 to 6d26f6f Compare December 12, 2023 20:51
@easwars
Copy link
Contributor Author

easwars commented Dec 15, 2023

Added e2e style tests for both client and server now. PTAL.

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to rebase my server fix onto this PR. Hopefully I can get that PR out by Thurs so you can take at least once pass before your leave for me to get back to next year.

test/xds/xds_server_certificate_providers_test.go Outdated Show resolved Hide resolved
test/xds/xds_server_certificate_providers_test.go Outdated Show resolved Hide resolved
test/xds/xds_server_certificate_providers_test.go Outdated Show resolved Hide resolved
Comment on lines 165 to 173
// Initialize an xDS-enabled gRPC server and register the stubServer on it.
// Pass it a mode change server option that pushes on a channel the mode
// changes to "not serving".
modeCh := make(chan struct{})
modeChangeOpt := xds.ServingModeCallback(func(addr net.Addr, args xds.ServingModeChangeArgs) {
if args.Mode == connectivity.ServingModeServing {
close(modeCh)
}
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rather call modeCh. Also, this sends only once by closing it (and will panic if you close twice). Thinking about it, I made this an event for my server side e2e tests for the server side fix serving := grpcsync.NewEvent()...if args.Mode == connectivity.ServingModeServing {
serving.Fire()
}. I think that this semantically makes more sense here (and below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rather call modeCh

It is called modeCh already. Don't understand this comment.

Also, this sends only once by closing it (and will panic if you close twice).

Since, we don't expect this to even happen once, I'm not too worried about a double close.

I think that this semantically makes more sense here (and below)

The grpcsync.Event provides a way to spot check if the event has fired and a way to block on the event firing. Since we only need the latter here, I feel a channel is more appropriate and it does not make the code harder to read in anyway. So, I would like to stick to using a channel here, unless you have a very good reason for using a grpcsync.Event instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry servingModeCh. Hmmmm I guess that a channel does make sense here since this is something we expect not to happen vs. to happen (i.e. wait till serving until RPC happens)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to servingModeCh.

test/xds/xds_server_certificate_providers_test.go Outdated Show resolved Hide resolved
test/xds/xds_server_certificate_providers_test.go Outdated Show resolved Hide resolved
@@ -86,71 +83,36 @@ func (s) TestResolverBuilder_ClientCreationFails_NoBootstap(t *testing.T) {
}
}

// Tests the resolver builder's Build() method with different xDS bootstrap
// configurations.
func (s) TestResolverBuilder_DifferentBootstrapConfigs(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: leave a t-test in case we add more failure modes for resolver construction in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a t-test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - Menghan/Doug described the functionality of defining test cases, then looping through each of them as a unity test as a "t-test'. So keep a slice of length 1. Optional though.

@zasweq zasweq assigned easwars and unassigned zasweq Dec 19, 2023
@easwars easwars assigned zasweq and unassigned easwars Dec 19, 2023
Comment on lines +336 to +446
}}}}},
},
HttpFilters: []*v3httppb.HttpFilter{e2e.RouterHTTPFilter},
}
ts := &v3corepb.TransportSocket{
Name: "envoy.transport_sockets.tls",
ConfigType: &v3corepb.TransportSocket_TypedConfig{
TypedConfig: testutils.MarshalAny(t, &v3tlspb.DownstreamTlsContext{
RequireClientCertificate: &wrapperspb.BoolValue{Value: true},
CommonTlsContext: &v3tlspb.CommonTlsContext{
TlsCertificateCertificateProviderInstance: &v3tlspb.CommonTlsContext_CertificateProviderInstance{
InstanceName: "non-existent-certificate-provider",
},
ValidationContextType: &v3tlspb.CommonTlsContext_ValidationContextCertificateProviderInstance{
ValidationContextCertificateProviderInstance: &v3tlspb.CommonTlsContext_CertificateProviderInstance{
InstanceName: "non-existent-certificate-provider",
},
},
},
}),
},
}
resource2 := &v3listenerpb.Listener{
Name: fmt.Sprintf(e2e.ServerListenerResourceNameTemplate, net.JoinHostPort(host2, strconv.Itoa(int(port2)))),
Address: &v3corepb.Address{
Address: &v3corepb.Address_SocketAddress{
SocketAddress: &v3corepb.SocketAddress{
Address: host2,
PortSpecifier: &v3corepb.SocketAddress_PortValue{
PortValue: port2,
},
},
},
},
FilterChains: []*v3listenerpb.FilterChain{
{
Name: "v4-wildcard",
FilterChainMatch: &v3listenerpb.FilterChainMatch{
PrefixRanges: []*v3corepb.CidrRange{
{
AddressPrefix: "0.0.0.0",
PrefixLen: &wrapperspb.UInt32Value{
Value: uint32(0),
},
},
},
SourceType: v3listenerpb.FilterChainMatch_SAME_IP_OR_LOOPBACK,
SourcePrefixRanges: []*v3corepb.CidrRange{
{
AddressPrefix: "0.0.0.0",
PrefixLen: &wrapperspb.UInt32Value{
Value: uint32(0),
},
},
},
},
Filters: []*v3listenerpb.Filter{
{
Name: "filter-1",
ConfigType: &v3listenerpb.Filter_TypedConfig{TypedConfig: testutils.MarshalAny(t, hcm)},
},
},
TransportSocket: ts,
},
{
Name: "v6-wildcard",
FilterChainMatch: &v3listenerpb.FilterChainMatch{
PrefixRanges: []*v3corepb.CidrRange{
{
AddressPrefix: "::",
PrefixLen: &wrapperspb.UInt32Value{
Value: uint32(0),
},
},
},
SourceType: v3listenerpb.FilterChainMatch_SAME_IP_OR_LOOPBACK,
SourcePrefixRanges: []*v3corepb.CidrRange{
{
AddressPrefix: "::",
PrefixLen: &wrapperspb.UInt32Value{
Value: uint32(0),
},
},
},
},
Filters: []*v3listenerpb.Filter{
{
Name: "filter-1",
ConfigType: &v3listenerpb.Filter_TypedConfig{TypedConfig: testutils.MarshalAny(t, hcm)},
},
},
TransportSocket: ts,
},
},
}
resources := e2e.UpdateOptions{
NodeID: nodeID,
Listeners: []*v3listenerpb.Listener{resource1, resource2},
SkipValidation: true,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super long inline block of xDS resources. Is there not e2e helpers for these? Should we refactor some of the e2e helpers if not to pull out common functionality for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used anywhere else so far, so we haven't needed helpers for this. But if it turns out that you will need something like this for your server side tests, maybe we can pull it out at that point. But its going to be a bit of work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 235 to 237
if _, err := client.EmptyCall(sCtx, &testpb.Empty{}); status.Code(err) != codes.DeadlineExceeded {
t.Fatalf("EmptyCall() failed: %v", err)
}
Copy link
Contributor

@zasweq zasweq Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhhh right, I was thinking as if my change was already in master with respect to Accept() + Close(), and not blocking as it is now. I'll probably switch this over (and I have tests for this Accept() + Close() - I thought hard about the invariant of a signal that a connection was Accepted + Closed - UNAVAILABLE with error string sgtm).

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// server. The management server responds with two listener resources:
// 1. contains valid security configuration pointing to the certificate provider
// instance specified in the bootstrap
// 3. contains invalid security configuration pointing to a non-existent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne.

@zasweq zasweq assigned easwars and unassigned zasweq Dec 19, 2023
@easwars easwars merged commit bb0d32f into grpc:master Dec 20, 2023
14 checks passed
@easwars easwars deleted the xds_certificate_providers_fix branch December 20, 2023 00:30
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xDS: Use fallback server credentials if no certificate_providers in bootstrap
2 participants