Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add AWX role #8226

Closed
wants to merge 16 commits into from
Closed

add AWX role #8226

wants to merge 16 commits into from

Conversation

cben
Copy link
Contributor

@cben cben commented May 2, 2018

Goal: The recently added autoheal role (#7753) needs an AWX (or Ansible Tower) to execute playbooks in response to problems.
Currently you need to bring your own and set many config vars to point to it.
Ideally we'll want a way to launch both together, with less config (but unclear if that should will be the default).

This PR is a first step — adds a role to launch an AWX but does not config autoheal to use it.

Based on upstream AWX playbooks,
those keep changing so I'm trying to not change them too much so I can keep them in sync in future.

open questions:

  • should everything here be called awx / openshift_awx / openshift_autoheal_awx ?
    I picked awx as for now it's an optional thing, not part of normal install, but not sure that makes sense.
    UPDATE: the way it's likely to get used will be deployed together with autoheal in [WIP] Add openshift_autoheal_deploy_awx var to deploy an awx for autoheal #8549, which overrides to deploy into openshift-autoheal project.

  • which AWX version do we want? I used 1.0.3 as starting point, but can move up/down.
    1.0.4 & 1.0.5 will also need rabbitmq and etcd; latest 1.0.6 needs rabbitmq 3.7 but no etcd.

  • origin vs enterprise AWX images? no idea, I didn't find any tower nor awx on RH container catalog?
    (also downstream rabbitmq image? again, no downstream image, except rhosp12/openstack-rabbitmq)

  • should I set cpu & mem requests/limits? can steal from later awx templates (they set quite high, 3 cpu total).

  • security!

    • awx user: admin, password: password randomized, stored into a secret.
    • RABBITMQ_ERLANG_COOKIE, RABBITMQ_DEFAULT_PASS — randomized, stored into a secret. IIUC, rabbitmq is not exposed outside the pod in 1.0.3 but became exposed later.
    • awx-web-svc serves insecure HTTP, tls terminated only in router :-(
      Turns out awx can't yet serve https itself (Ingress not working for AWX Kubernetes installation ansible/awx#1781 (comment), Add SSL Termination to standalone docker deployment ansible/awx#1549 and others), current practice is put a separate nginx or other proxy in front (although awx_web container already includes an nginx!).
      => Asking upstream & deferring this for now. I don't know if we need to add auth-proxy anyway (?), if we do that could cover TLS as well.

Which of these are blockers for merge? Can I iterate in later PRs?

@jhernand @zgalor @elad661 @ironcladlou please review.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cben
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: michaelgugino

Assign the PR to them by writing /assign @michaelgugino in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 2, 2018
@cben cben mentioned this pull request May 2, 2018
5 tasks
- name: Deploy and Activate Postgres
shell: >-
{{ openshift_client_binary }} project {{ awx_project }};
{{ openshift_client_binary }} process -n openshift postgresql-persistent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could oc_process module be used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried initially and to be honest I don't fully remember why I dropped it :-)
Hmm, from oc_process.py:

in conjunction with 'create: True' the process will be piped to | oc create -f

That doesn't sound idempotent. I guess I need to wrap it in a condition to create unless exists?
Then what if you upgrade openshift and the builtin postgresql-persistent template changes but you won't pick up changes. So I think | oc apply is better?
Or maybe I'm looking at this wrong, and upgrading a postgres in place is actually undesired?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be combined with @elad661's suggestion - create a new template and then use oc_apply to create new objects from the template. There should already be a postgresql template

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the oc_apply role instead of oc apply command?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, oc_apply role would make it cleaner.

Copy link
Contributor Author

@cben cben May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched from j2 to an openshift template.

  • Used one oc_configmap because the main payload is a fixed python file (getting param by env var instead of textual templating) and I felt it's much cleaner to just have it as a file.

The two existing oc_apply wrappers I found (roles/openshift_metrics/tasks/oc_apply.yaml, roles/openshift_provisioners/tasks/oc_apply.yaml) added value is reporting "changed" correctly. They seem to expect a single object, as they check resourceVersion before/after. I'd have to complicate that approach to iterate the multiple objects the template creates.
=> Instead, I'm running oc apply directly, formulated a changed_when that looks at output messages. Tested with/without change and with several errors (both in template and apply stage).

Copy link

@elad661 elad661 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments. Open questions remain open...


- name: Template Openshift AWX Deployment
template:
src: deployment.yml.j2
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current pattern in openshift-ansible is to move away from jinja templates and into OpenShift Templates

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for templating here is actually very little; most objects are fixed, and a few env vars could be converted to secretKeyRef.

I've also considered replacing the template with a series of oc_service, oc_configmap, ..., oc_obj modules.

I'm happy to go either way, just need someone to tell me...

labels:
name: awx-web-svc
spec:
type: "NodePort"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any specific reason for using NodePort here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea :) copied from https://github.com/ansible/awx/blob/1.0.3/installer/openshift/templates/deployment.yml.j2#L76

Thanks for noticing, it does sound undesired, I'll look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, works with a regular ClusterIP service just as well.

targetPort: http
tls:
insecureEdgeTerminationPolicy: Allow
termination: edge
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of components use reencrypt and expect the pod itself to server https with an automatically generated certificate

@ironcladlou
Copy link
Contributor

Haven't yet had a chance to look very closely at this yet, but one thing that stood out immediately is that you should probably be using serialized OpenShift Templates for all your assets rather than Jinja templates. Recently discussed this in the autoheal role PR. New component roles should be very loosely coupled to Ansible.

@cben cben force-pushed the awx branch 7 times, most recently from 2fed963 to e2b73c7 Compare May 15, 2018 11:14
Copy link
Contributor

@ironcladlou ironcladlou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Templating is looking good, added some comments re: security of secrets

- mountPath: /etc/tower
name: awx-application-config
env:
- name: DATABASE_PASSWORD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment is not a secure place for secrets... can awx consume the password from a file on disk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably can — I can mount the secret as volume, and modify settings.py to read the files.

But can you explain the threat model?
k8s API returns such env vars with valueFrom: secretKeyRef:, it does not expose the effective value.
Are you thinking of with oc exec access into the pod — I think then it's game over anyway?
Does the kubelet leak them in some way?

Many of the builtin templates use such secretKeyRef env vars for keys, passwords, etc.:
https://github.com/openshift/openshift-ansible/search?q=secretKeyRef
including the postgresql-persistent template this role uses:

"name": "POSTGRESQL_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"key": "database-password",
"name": "${DATABASE_SERVICE_NAME}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the issues with secrets in env were covered way back in the original design discussion, but I wouldn't be surprised if various enhancements have been made since then that I'm just unaware of to reduce the risks.

In this case I probably wouldn't go too far out of my way to introduce file support where none exists, but I would default to avoiding environment when possible (as it's generally an additional risk with no clear added value outside shim use cases).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No can do, not with current awx images.
Turns out they look at POSTGRESQL_PASSWORD, AWX_ADMIN_USER, AWX_ADMIN_PASSWORD in shell scripts (that are baked into the image) before settings.py (that we control) comes into play.
Can't even attempt equivalent logic in settings.py later, because if we omit AWX_ADMIN_PASSWORD env var, we might later create a superuser with password specified via a file, but an insecure admin/password superuser will be created as well.

I'll start a discussion upstream in awx about various ways (including this) learnt from here to improve their k8s/openshift deployment security, and see if they want PRs for any of those.

value: postgresql
- name: DATABASE_PORT
value: "5432"
- name: DATABASE_PASSWORD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment is not a secure place for secrets... can pgsql consume the password from a file on disk?

secretKeyRef:
name: admin-credentials
key: username
- name: AWX_ADMIN_PASSWORD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above re: environment/secrets

value: rabbitmq
- name: RABBITMQ_DEFAULT_USER
value: awx
- name: RABBITMQ_DEFAULT_PASS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above re: environment/secrets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for RABBITMQ_ERLANG_COOKIE & RABBITMQ_DEFAULT_PASS it's even worse — I left them hard-wired to fixed known values! (as do upstream awx playbooks)
AFAICT rabbitmq container has no ports exposed outside the pod — does that sound sane or should I randomize those just in case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if you want to use fixed defaults, I'd make the passwords required parameters and move the defaulting logic out into the installer bits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Randomized these 2 as well, stored in secrets. It seems in later AWX version rabbitmq is exposed and these will become important (?)
RABBITMQ_ERLANG_COOKIE could be mounted, not sure about RABBITMQ_DEFAULT_PASS (indirectly by putting it in rabbitmq.conf?); not gonna bother with this until I get guidance from awx where to invest effort.

- name: Deploy and Activate Postgres
# Note: NAMESPACE param is where to look for postgresql ImageStream.
shell: >-
{{ openshift_client_binary }} project {{ awx_project }};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd breakout namespace creation into a discrete task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did not create the project, only switched to it to control where oc apply instantiates (this template omits namespace:).
=> Discovered I can use the cleaner oc apply -n ..., switched to that.

@cben
Copy link
Contributor Author

cben commented May 30, 2018

/retest

@cben cben changed the title [WIP] add AWX role add AWX role May 30, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2018
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 30, 2018

@cben: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/system-containers d47d171 link /test system-containers

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

cben added 5 commits June 10, 2018 17:25
Example output when creating:
"stderr": "", "stdout": "deployment.extensions \"awx\" created\nservice \"awx-web-svc\" created\nroute.route.openshift.io \"awx-web-svc\" created",
Example output when re-applying without change:
"stderr": "", "stdout": "deployment.extensions \"awx\" unchanged\nservice \"awx-web-svc\" unchanged\nroute.route.openshift.io \"awx-web-svc\" unchanged",
Example output when re-applying with one change:
"stderr": "", "stdout": "deployment.extensions \"awx\" configured\nservice \"awx-web-svc\" unchanged\nroute.route.openshift.io \"awx-web-svc\" unchanged",

failed_when doesn't seem necessary, already works correctly.  Tested:
- invalid yaml syntax (which breaks `oc template`)
- invalid content - missing name
- trying to create in non-existent namespace (which breaks `oc apply`)
and they all result in "rc": 1 and error in stderr.
cben added 11 commits June 10, 2018 17:25
This file is not part of openshift-ansible code, it's mounted into awx
and just happens to be written in Python.
This file is not part of openshift-ansible code,
it's mounted into awx and just happens to be written in Python.
- Renamed secret from 'awx' to 'admin-credentials' for clarity
- Added 'username' key
- Renamed 'admin_password' key to 'password'
Friendlier to running awx in same namespace with autoheal.
- RABBITMQ_ERLANG_COOKIE could be mounted at
  /var/lib/rabbitmq/.erlang.cookie instead of an env var:
  https://github.com/docker-library/rabbitmq/blob/d7096266dfb047fce1eff89bd759e3ab55d779f3/3.7/docker-entrypoint.sh#L164-L176

However https://hub.docker.com/_/rabbitmq/ doc only mentions
RABBITMQ_ERLANG_COOKIE and RABBITMQ_DEFAULT_PASS env var.
@cben
Copy link
Contributor Author

cben commented Jul 1, 2018

@ironcladlou this got stuck, how do we move forward?

I believe I addressed your comments except for env vars — they can't be passed via secrets with existing awx image.
I asked @matburt if they're interested in upstream improvements (=> sure, yes) but then our team moved on to different project and I haven't had time to contribute anything :( Anyway, from the various ways to improve ansible/awx security, env vars are the least impactful.

We still want to slowly push autoheal forward. What's needed from your perspective to merge this (and then #8549)?

@ironcladlou
Copy link
Contributor

I think we need to establish who owns autoheal generally going forward (it's not me). @derekwaynecarr, does monitoring now own this? Some other new team?

@cben
Copy link
Contributor Author

cben commented Nov 4, 2018

@oourfali Will we or anyone else proceed with autoheal?
For now I'm closing this as abandoned. If it's to be resurrected, would need to rebase and update to recent AWX, a non-negligible effort...

@cben cben closed this Nov 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants