Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROX-15088: Use integration health report in declarative config manager #5105

Merged
merged 8 commits into from
Mar 10, 2023

Conversation

dhaus67
Copy link
Contributor

@dhaus67 dhaus67 commented Mar 6, 2023

Description

This PR is a follow up for pull/5048, in which the API of the /v1/integrationhealth service was extended to cover declarative configurations.

Within this PR, the integrationhealth.Reporter is now used within the declarativeconfig.Manager to report the health status of resources created from declarative configuration as well as the overall health of the config maps.

For now, the following is registered and updated:

  • For each watch handler (i.e. each config map), an integration health is created. In case any issues occurred during either unmarshalling the config file or transforming the contents, the health status is set to unhealthy and errors are surfaced.
  • For each transformed resource (i.e. proto message), an integration health is created. In case any issues occurred during upserting the resource, the health status is set to unhealthy and errors are surfaced.

Notes:

  • What's currently not covered, but already possible, is the deletion of the health status once a resource is deleted. This will be done within ROX-14694.

Caveats within the PR:

  • the handleMissingTypeUpdater called stopSignal.Done(), while it should have called stopSignal.Signal() (currently, not stop signal would be sent).
  • the release tests for ^^ didn't fail because the signals (both stopSignal & shortCircuitSignal) were not initialized via concurrency.NewSignal(), hence stopSignal.IsDone() always yielded true

Both things were fixed within this PR.

Checklist

  • Investigated and inspected CI test results
  • Unit test and regression tests added
  • Evaluated and added CHANGELOG entry if required
  • Determined and documented upgrade steps
  • Documented user facing changes (create PR based on openshift/openshift-docs and merge into rhacs-docs)

If any of these don't apply, please comment below.

Testing Performed

  • see added unit tests.

Manual tests:

  1. Create the central-services chart
roxctl helm output central-services --image-defaults=opensource
  1. Install the central services (a.k.a. central) within the stackrox namespace. Set the ROX_DECLARATIVE_CONFIGURATION feature flag to true, add declarative config mounts, and optionally set loglevel to debug
helm install stackrox-central-services -n stackrox -f - <<EOF
central:
  declarativeConfiguration:
    mounts:
      configMaps:
        - permission-set
        - access-scope
        - auth-provider
        - role
customize:
  envVars:
    ROX_DECLARATIVE_CONFIGURATION: true
    LOGLEVEL: Debug
EOF ./stackrox-central-services-chart
  1. Once initialized, each configuration mount should be included within the response of the integration health service. They all should be UNINITIALIZED.
roxcurl /v1/integrationhealth/declarativeconfigs | jq
{
  "integrationHealth": [
    {
      "id": "/run/stackrox.io/declarative-configuration/access-scope",
      "name": "Config Map access-scope",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNINITIALIZED",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:38:05.541191670Z"
    },
    {
      "id": "/run/stackrox.io/declarative-configuration/auth-provider",
      "name": "Config Map auth-provider",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNINITIALIZED",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:38:05.593600136Z"
    },
    {
      "id": "/run/stackrox.io/declarative-configuration/permission-set",
      "name": "Config Map permission-set",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNINITIALIZED",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:38:05.646174401Z"
    },
    {
      "id": "/run/stackrox.io/declarative-configuration/role",
      "name": "Config Map role",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNINITIALIZED",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:38:05.698478865Z"
    }
  ]
}
  1. Create a valid declarative config file containing a permission set
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: permission-set
  namespace: stackrox
data:
  permission-set: |
    name: hello-world
    description: the first declaratively created permission set
    resources:
    - resource: Integration
      access: READ_ACCESS
    - resource: Administration
      access: READ_ACCESS
    - resource: Access
      access: READ_ACCESS
EOF
  1. The integration health list should now include the permission set, referencing the config map name, and the config map should be marked as healthy
roxcurl /v1/integrationhealth/declarativeconfigs | jq
{
  "integrationHealth": [
...
    {
      "id": "/run/stackrox.io/declarative-configuration/permission-set",
      "name": "Config Map permission-set",
      "type": "DECLARATIVE_CONFIG",
      "status": "HEALTHY",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:44:45.699022645Z"
    },
    {
      "id": "89e43e8c-a50b-5429-b53d-13c26b9d5080",
      "name": "hello-world in config map permission-set",
      "type": "DECLARATIVE_CONFIG",
      "status": "HEALTHY",
      "errorMessage": "",
      "lastTimestamp": "2023-03-06T23:46:05.723449047Z"
    }
  ]
}
  1. Create a declarative configuration which contains invalid format
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: auth-provider
  namespace: stackrox
data:
  auth-provider: |
    {
      "some": "funny",
      "key": "value",
      "pairs": "here"
    }
EOF
  1. The associated config map auth-provider should be marked unhealthy, including the error
roxcurl /v1/integrationhealth/declarativeconfigs | jq
{
  "integrationHealth": [
...
    {
      "id": "/run/stackrox.io/declarative-configuration/auth-provider",
      "name": "Config Map auth-provider",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNHEALTHY",
      "errorMessage": "unmarshalling raw configuration: raw configuration {\n  \"some\": \"funny\",\n  \"key\": \"value\",\n  \"pairs\": \"here\"\n}\n didn't match any of the given configurations: 4 errors occurred:\n\t* yaml: unmarshal errors:\n  line 2: field some not found in type declarativeconfig.AuthProvider\n  line 3: field key not found in type declarativeconfig.AuthProvider\n  line 4: field pairs not found in type declarativeconfig.AuthProvider\n\t* yaml: unmarshal errors:\n  line 2: field some not found in type declarativeconfig.AccessScope\n  line 3: field key not found in type declarativeconfig.AccessScope\n  line 4: field pairs not found in type declarativeconfig.AccessScope\n\t* yaml: unmarshal errors:\n  line 2: field some not found in type declarativeconfig.PermissionSet\n  line 3: field key not found in type declarativeconfig.PermissionSet\n  line 4: field pairs not found in type declarativeconfig.PermissionSet\n\t* yaml: unmarshal errors:\n  line 2: field some not found in type declarativeconfig.Role\n  line 3: field key not found in type declarativeconfig.Role\n  line 4: field pairs not found in type declarativeconfig.Role\n\n",
      "lastTimestamp": "2023-03-06T23:49:30.645077233Z"
    }
  ]
}
  1. Create a declarative configuration which contains an invalid reference (i.e. a role referencing a non-existing access scope)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: role
  namespace: stackrox
data:
  role: |
    name: hello-world
    description: the first declaratively created role
    permissionSet: hello-world
    accessScope: hello-world
EOF
  1. The associated config map should be marked healthy, as configurations could be unmarshalled from it, but the created resources should be marked unhealthy (note: this shouldn't be the case immediately, but only after the error occurred 5 times)
roxcurl /v1/integrationhealth/declarativeconfigs | jq
{
  "integrationHealth": [
...
    {
      "id": "hello-world",
      "name": "hello-world in config map role",
      "type": "DECLARATIVE_CONFIG",
      "status": "UNHEALTHY",
      "errorMessage": "referenced access scope df5d408f-e083-5920-b35a-774684481925 does not exist: invalid arguments",
      "lastTimestamp": "2023-03-06T23:57:05.725916471Z"
    }
  ]
}

@dhaus67 dhaus67 requested a review from ivan-degtiarenko March 6, 2023 03:15
@dhaus67
Copy link
Contributor Author

dhaus67 commented Mar 6, 2023

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@ghost
Copy link

ghost commented Mar 6, 2023

Images are ready for the commit at 485047c.

To use with deploy scripts, first export MAIN_IMAGE_TAG=3.74.x-338-g485047cbe4.

central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
@dhaus67 dhaus67 force-pushed the master-dh/rox-15088-report-errors-during-creation branch from 80656e3 to 1abf2f1 Compare March 6, 2023 21:48
@dhaus67 dhaus67 requested a review from ivan-degtiarenko March 6, 2023 21:48
@dhaus67 dhaus67 force-pushed the master-dh/rox-15088-report-errors-during-creation branch from 1abf2f1 to 67c340c Compare March 6, 2023 22:36
@dhaus67
Copy link
Contributor Author

dhaus67 commented Mar 7, 2023

/retest

central/declarativeconfig/manager_impl.go Outdated Show resolved Hide resolved
for _, message := range messages {
if err := typeUpdater.Upsert(m.reconciliationCtx, message); err != nil {
m.reconciliationErrorReporter.ProcessError(message, err)
for handler, protoMessagesByType := range transformedMessagesByHandler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm eager to keep global order of proto types to avoid extra errors and displaying unhealthy configs when it's not needed. How about we use a structure like map[protoType]map[handlerID][]proto.Message?

P.S. Another benefit of this structure is that it will allow to easily flag all handlers which contain unsupported type as unhealthy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, the global order is kept right now, no?

We go through each proto type, and for each proto type we consolidate all handlers, retrieving only the proto messages for this specific type. Once done, we continue with the next proto type. The global order is kept (also, ensured within the tests via gomock.InOrder that the order is actually kept).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry - didn't notice the outer loop. No questions to this part

}

func extractNameFromProtoMessage(message proto.Message) string {
messageWithName, ok := message.(interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative for groups that comes to mind is to create a fake name based on group properties - something like provider_id:key:value:roleName - WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we are back to good old serialized key for groups 🥴

I'm not sure how much it's better, given that the auth provider ID will be non-telling for customers to identify the auth provider (imagine you have two groups sharing the same attributes / role name, but for different auth providers). Guess the same applies right now with the group ID, so I changed it to be the tuple of these values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we are back to good old serialized key for groups 🥴

“The more you know about the past, the better prepared you are for the future.” —Theodore Roosevelt

given that the auth provider ID will be non-telling for customers to identify the auth provider

I guess you're right, they should know what they are looking at. What do you think about terrifying
auth_provider_id:<id>:key:<key>:value:<value>:role_name:<roleName> format? A more pleasant alternative would be group <key>:<value>:<roleName> for auth provider ID <authProviderID>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm fine doing it either way, since we can be flexible with the name.
The root issue we have is that the group has no name and the auth provider is in ID-format instead of name, so looking at the declarative config it may be non-trivial identifying the groups.

Updated the names to the group <key>:<value>:<roleName> for auth provider ID <authProviderID> format.

central/declarativeconfig/manager_impl_test.go Outdated Show resolved Hide resolved
Name: "Config Map my-cool-config-map",
Type: storage.IntegrationHealth_DECLARATIVE_CONFIG,
Status: storage.IntegrationHealth_UNHEALTHY,
ErrorMessage: "unmarshalling raw configuration: raw configuration \n{\"cool\": \"key\", \"value\": \"pairs\"}\n didn't match any of the given configurations: 4 errors occurred:\n\t* yaml: unmarshal errors:\n line 2: field cool not found in type declarativeconfig.AuthProvider\n line 2: field value not found in type declarativeconfig.AuthProvider\n\t* yaml: unmarshal errors:\n line 2: field cool not found in type declarativeconfig.AccessScope\n line 2: field value not found in type declarativeconfig.AccessScope\n\t* yaml: unmarshal errors:\n line 2: field cool not found in type declarativeconfig.PermissionSet\n line 2: field value not found in type declarativeconfig.PermissionSet\n\t* yaml: unmarshal errors:\n line 2: field cool not found in type declarativeconfig.Role\n line 2: field value not found in type declarativeconfig.Role\n\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can specify matcher to check not exact error message here, but rather that it contains an unmarshalling error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could, but I don't want to make the gomock.Matcher more complex than it needs to be. I'd just keep it for now, maybe later we can change it if another long-ish error exists.

@dhaus67 dhaus67 requested a review from ivan-degtiarenko March 7, 2023 21:54
@dhaus67 dhaus67 force-pushed the master-dh/rox-15088-report-errors-during-creation branch 3 times, most recently from aaa15f1 to d975301 Compare March 8, 2023 12:54
Copy link
Contributor

@ivan-degtiarenko ivan-degtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's resolve conversation around group name and merge this PR 👍

errorsPerDeclarativeConfig: map[string]int32{},
idExtractor: idExtractor,
nameExtractor: nameExtractor,
shortCircuitSignal: concurrency.NewSignal(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it necessary to initialize signals? Don't think we did that before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the description:

the release tests for ^^ didn't fail because the signals (both stopSignal & shortCircuitSignal) were not initialized via concurrency.NewSignal(), hence stopSignal.IsDone() always yielded true

Currently, the reconciliation didn't work because we always stopped the reconciliation loop.

for _, message := range messages {
if err := typeUpdater.Upsert(m.reconciliationCtx, message); err != nil {
m.reconciliationErrorReporter.ProcessError(message, err)
for handler, protoMessagesByType := range transformedMessagesByHandler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry - didn't notice the outer loop. No questions to this part

}
log.Debugf("Error within reconciliation for %+v: %v", message, err)
} else {
currentDeclarativeConfigErrors := m.errorsPerDeclarativeConfig[messageID]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 lines can be simplified to

m.errorsPerDeclarativeConfig[messageID] = 0

}

func extractNameFromProtoMessage(message proto.Message) string {
messageWithName, ok := message.(interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we are back to good old serialized key for groups 🥴

“The more you know about the past, the better prepared you are for the future.” —Theodore Roosevelt

given that the auth provider ID will be non-telling for customers to identify the auth provider

I guess you're right, they should know what they are looking at. What do you think about terrifying
auth_provider_id:<id>:key:<key>:value:<value>:role_name:<roleName> format? A more pleasant alternative would be group <key>:<value>:<roleName> for auth provider ID <authProviderID>

@dhaus67 dhaus67 requested a review from ivan-degtiarenko March 9, 2023 19:38
@dhaus67 dhaus67 force-pushed the master-dh/rox-15088-report-errors-during-creation branch from d975301 to 6a4a3ad Compare March 9, 2023 19:38
Copy link
Contributor

@ivan-degtiarenko ivan-degtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small fixes, otherwise LGTM

// IDExtractor extracts the ID from proto messages.
type IDExtractor func(m proto.Message) string

// NameExtractor extracts the ID from proto messages.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NameExtractor extracts the ID from proto messages.
// NameExtractor extracts the name from proto messages.

@dhaus67 dhaus67 enabled auto-merge (squash) March 10, 2023 00:20
@dhaus67 dhaus67 merged commit a14a7f3 into master Mar 10, 2023
@dhaus67 dhaus67 deleted the master-dh/rox-15088-report-errors-during-creation branch March 10, 2023 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants