Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh goal state when certificates are missing #2562

Merged
merged 3 commits into from
Apr 25, 2022

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Apr 22, 2022

In some scenarios, certificates may be missing from the goal state (see comments in the code for a description of one of those scenarios).

When that happens, the GoalState class refreshes the goal state in an attempt to fix the inconsistency. It tries this only one time and if the issue persists it raises an exception, which would be handled and reported in the main loop of the agent. The main loop would continue trying to update the goal state every period.

The PR also includes other small changes needed to support this scenario. I added comments to the PR pointing those out.

@codecov
Copy link

codecov bot commented Apr 22, 2022

Codecov Report

Merging #2562 (0b41f6d) into release-2.8.0.0 (a3d761f) will increase coverage by 0.06%.
The diff coverage is 90.58%.

@@                 Coverage Diff                 @@
##           release-2.8.0.0    #2562      +/-   ##
===================================================
+ Coverage            72.05%   72.12%   +0.06%     
===================================================
  Files                  102      102              
  Lines                15340    15374      +34     
  Branches              2433     2442       +9     
===================================================
+ Hits                 11054    11089      +35     
+ Misses                3795     3792       -3     
- Partials               491      493       +2     
Impacted Files Coverage Δ
azurelinuxagent/common/logger.py 89.83% <33.33%> (-0.98%) ⬇️
azurelinuxagent/common/protocol/goal_state.py 95.18% <90.19%> (-0.28%) ⬇️
azurelinuxagent/ga/update.py 88.31% <95.65%> (+0.08%) ⬆️
azurelinuxagent/common/protocol/hostplugin.py 87.66% <100.00%> (ø)
azurelinuxagent/common/protocol/wire.py 78.82% <100.00%> (ø)
azurelinuxagent/common/event.py 86.03% <0.00%> (+0.67%) ⬆️
azurelinuxagent/common/utils/cryptutil.py 45.79% <0.00%> (+1.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a3d761f...0b41f6d. Read the comment docs.

@@ -0,0 +1,85 @@
<?xml version="1.0" encoding="utf-8"?>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new file contains certificates that do not match ExtensionsConfig

<ExtensionsConfig>http://168.63.129.16:80/extensionsconfiguri/</ExtensionsConfig>
<FullConfig>http://168.63.129.16:80/fullconfiguri/</FullConfig>
<ConfigName>b61f93d0-e1ed-40b2-b067-22c243233448.1.b61f93d0-e1ed-40b2-b067-22c243233448.2.MachineRole_IN_0.xml</ConfigName>
<HostingEnvironmentConfig>http://168.63.129.16:80/machine/865a6683-91d8-450f-99ae/bc8b9d47%2Db5ed%2D4704%2D85d9%2Dfd74cc967ec2.%5Fcanary?comp=config&amp;type=hostingEnvironmentConfig&amp;incarnation=1</HostingEnvironmentConfig>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated these test files with realistic URL, instead of the previous dummy values

@@ -11,6 +11,22 @@ class HttpRequestPredicates(object):
def is_goal_state_request(url):
return url.lower() == 'http://{0}/machine/?comp=goalstate'.format(restutil.KNOWN_WIRESERVER_IP)

@staticmethod
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only needed is_certificates_request, but since I was here I added the rest as well

"sharedconfiguri": 0,
"certificatesuri": 0,
"extensionsconfiguri": 0,
"hostingEnvironmentConfig": 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the name of these counters consistent with the actual URLs

self._try_update_goal_state(protocol)

# Update the Guest Agent if a new version is available
if self._goal_state is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were executing the goal state even when we fail to update it... it's a regression introduced a few releases ago

@@ -124,6 +125,9 @@ def write_log(log_appender): # pylint: disable=W0612
finally:
log_appender.appender_lock = False

if self.silent:
return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a new flag to turn on/off logging

@@ -481,13 +482,16 @@ def _try_update_goal_state(self, protocol):
Attempts to update the goal state and returns True on success or False on failure, sending telemetry events about the failures.
"""
try:
protocol.update_goal_state()
max_errors_to_log = 3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of the Fast Track changes more logging has been added to the GoalState class. I added the capability to turn off all logging done by GoalState.

We leaving logging turned on for at most 3 consecutive errors and then turn it off and output 1 single error every 6 hours. We turn on logging once fetching goal state succeeds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@narrieta Sorry, I didn't get a chance to review this. Do we turn on before the goal state succeeded or will it be for next goal state? Looks to me successful goal state may not have logging if it gets updated after 3 consecutive errors.

Ex: We had 3 consecutive failures, on 4th attempt we disable the logger but goal state updated successfully and process the same goal state. So don't we miss that process goal state log in waagent because logger disabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for taking a look @nagworld9

yes, we re-enable logging at line 492

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we reset to 0 and the condition for turning it off is "self._update_goal_state_error_count >= max_errors_to_log")

def test_it_should_refresh_the_goal_state_when_it_is_inconsistent(self):
#
# Some scenarios can produce inconsistent goal states. For example, during hibernation/resume, the Fabric goal state changes (the
# tenant certificate is re-generated when the VM is restarted) *without* the incarnation changing. If a Fast Track goal state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is necessarily the place for this much info, but it doesn't look like this text matches that above (in goal_state.py) w.r.t. the reset of incarnation to 1 (meaning the incarnation might not change) vs. the incarnation not changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is trying to describe the specific scenario that the test is emulating. In this scenario the incarnation does not change.

The code in goal_state.py describes the general case, where the incarnation may or may not change.

Does that address your comment? (not sure I understand the comment 100%)

@narrieta narrieta merged commit cd03ff2 into Azure:release-2.8.0.0 Apr 25, 2022
@narrieta narrieta deleted the refresh branch April 25, 2022 20:00
narrieta added a commit that referenced this pull request Apr 28, 2022
* Refresh goal state when certificates are missing

* Improve error reporting

* Fix assert message

Co-authored-by: narrieta <narrieta>
(cherry picked from commit cd03ff2)
nagworld9 added a commit that referenced this pull request Mar 7, 2023
* release prepare-2.4.0.0 (#2280)

* Fix bug with dependent extensions with no settings (#2285)

* update test-requirements to pin pylint. (#2288) (#2299)

Co-authored-by: Kevin Clark <kevin.clark@microsoft.com>

* Do not create placeholder status file for AKS extensions (#2298)

* Exception for Linux Patch Extension for creating placeholder status file (#2307)

* update release version (#2308)

* Dont create default status file for Single-Config extensions (#2318)

* version update (#2319)

* Update Version (#2348)

* Fix bug with dependent extensions with no settings (#2285) (#2349)

Co-authored-by: Laveesh Rohra <larohra@microsoft.com>

* Use handler status if extension status is None when computing the Ext… (#2358)

* Use handler status if extension status is None when computing the ExtensionsSummary

* Add NotReady to convergence statuses

Co-authored-by: narrieta <narrieta>

* Release preparation 2.5.0.1 (#2360)

* Define ExtensionsSummary.__eq__ (#2371)

* define ExtensionsSummary.__eq__

* Fix test name

* Fix test name

* Fix test name

Co-authored-by: narrieta <narrieta>

* Release preparation 2.5.0.2 (#2372)

* update cgroups monitoring expiry date (#2392)

* Updated Agent Version to 2.6.0.0 (#2393)

* Do not parse status blob (#2394)

Co-authored-by: narrieta <narrieta>

* Exclude dcr from setup (#2396)

* Improve error message for vmSettings errors (#2397)

* Improve error message for vmSettings errors

* Add etag and correlation id

* pylint warning

Co-authored-by: narrieta <narrieta>

* Updated agent version to 2.6.0.1 (#2400)

* Verify that extensions goal state from vmsettings has been initialized (#2404)

Co-authored-by: narrieta <narrieta>

* Updated agent version to 2.6.0.2 (#2405)

* Set Agent version to 2.7.0.0 (#2457)

* Set Agent version to 2.7.0.0

* Remove duplicate import

Co-authored-by: narrieta <narrieta>

* Simplify the logic to update the extensions goal state (#2466)

* Simplify the logic to update the extensions goal state

* Added telemetry for NotSupported

* Added comments

* Do not support old hostgaplugin

Co-authored-by: narrieta <narrieta>

* Retry update_goal_state on GoalStateMismatchError (#2470)

* Retry update_goal_state on GoalStateMismatchError

* Add sleep before retry

* Enable test

* Enable test

* Update test

* Add unit test

* Add data file

* pylint warning

* Add comment; fix typos

* fix typo

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.1 (#2473)

* Redact settings from mismatch message (#2477)

Co-authored-by: narrieta <narrieta>

* Set Agent Version to 2.7.0.2 (#2478)

Co-authored-by: narrieta <narrieta>

* Improvements in telemetry for vmSettings (#2482)

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.7.0.3 (#2483)

* Do not raise on missing status blob; reduce amount of logging for vms… (#2492)

* Do not raise on missing status blob; reduce amount of logging for vmsettings

* remove extra file; fix typo

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.4 (#2494)

Co-authored-by: narrieta <narrieta>

* Save vmSettings on parse errors; improve messages in parse errors (#2503)

* Save vmSettings on parse errors; improve messages in parse errors

* pylint warnings

* pylint warnings

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.7.0.5 (#2504)

Co-authored-by: narrieta <narrieta>

* Revert "Set agent version to 2.7.0.5 (#2504)" (#2505)

This reverts commit ae5a222.

Co-authored-by: narrieta <narrieta>

* ignore firewall packets reset error, check enable firewall config flag and extend cgroup extension monitoring expiry time (#2502)

* Set agent version to 2.7.0.5 (#2506)

Co-authored-by: narrieta <narrieta>

* Handle OOM errors by stopping the periodic log collector (#2510)

* Set agent version to 2.7.0.6 (#2511)

* Add keep_alive property to collect_logs (#2533)

* Set agent version to 2.7.1.0. (#2534)

* Set agent version to 2.8.0.0 (#2545)

Co-authored-by: narrieta <narrieta>

* Update HGAP minimum required version for FastTrack (#2549)

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.1 (#2550)

Co-authored-by: narrieta <narrieta>

* Improvements in waagent.log (#2558)

* Improvements in waagent.log

* fix function call

* update comment

* typo

Co-authored-by: narrieta <narrieta>

* Change format of history items (#2560)

* Change format of history directory

* Update message; fix typo

* py2 compat

* py2 compat

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.2 (#2561)

Co-authored-by: narrieta <narrieta>

* Refresh goal state when certificates are missing (#2562)

* Refresh goal state when certificates are missing

* Improve error reporting

* Fix assert message

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.3 (#2563)

Co-authored-by: narrieta <narrieta>

* Do not mark goal state as processed when goal state fails to update (#2569)

Co-authored-by: narrieta <narrieta>

* Update agent version to 2.8.0.4 (#2570)

Co-authored-by: narrieta <narrieta>

* Bug fix for fetching a goal state with empty certificates property (#2575)

* Move error counter reset down to end of block. (#2576)

* Bug Fix: Change fast track timestamp default from None to datetime.min (#2577)

* Update agent version to 2.8.0.5. (#2580)

* Create placeholder GoalState.*.xml file (#2594)

Co-authored-by: narrieta <narrieta>

* Update version to 2.8.0.6 (#2595)

Co-authored-by: narrieta <narrieta>

* fix network interface restart in RHEL9 (#2592) (#2612)

(cherry picked from commit b8ca432)

* set agent version to 2.7.2.0 (#2613)

* Parse missing agent manifests as empty

* Set agent version to 2.8.0.7

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status (#2621)

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status

* python 2.6 compat

Co-authored-by: narrieta <narrieta>

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status (#2621) (#2622)

* Retry HGAP's extensionsArtifact requests on BAD_REQUEST status

* python 2.6 compat

Co-authored-by: narrieta <narrieta>
(cherry picked from commit dbc82d3)

* fix if command in rhel v8.6+ (#2624)

* Set agent version to 2.7.3.0 (#2625)

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.8.0.8 (#2627)

Co-authored-by: narrieta <narrieta>

* fix network interface restart in RHEL9 (#2592) (#2629)

(cherry picked from commit b8ca432)

Co-authored-by: Nageswara Nandigam <84482346+nagworld9@users.noreply.github.com>

* fix if command in rhel v8.6+ (#2624) (#2630)

(cherry picked from commit e540728)

Co-authored-by: Nageswara Nandigam <84482346+nagworld9@users.noreply.github.com>

* Set Agent version to 2.8.0.9 (#2631)

Co-authored-by: narrieta <narrieta>

* Cleanup history directory when creating new subdirectories (#2633)

* Cleanup history directory when creating new subdirectories

* Review feedback

Co-authored-by: narrieta <narrieta>

* Set agent version to 2.8.0.10

* Save sharedconfig to disk (#2649)

* Save sharedconfig to disk

* Update tests

* pylint warnings

Co-authored-by: narrieta <narrieta>

* Set Agent version to 2.8.0.11 (#2650)

Co-authored-by: narrieta <narrieta>

* test fixes

* Microsoft mandatory file (#2737)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

---------

Co-authored-by: Laveesh Rohra <larohra@microsoft.com>
Co-authored-by: Kevin Clark <kevin.clark@microsoft.com>
Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>
Co-authored-by: Dhivya Ganesan <dhivyaganesan@users.noreply.github.com>
Co-authored-by: narrieta <narrieta>
Co-authored-by: Kevin Clark <keclar@microsoft.com>
Co-authored-by: Norberto Arrieta <narrieta@microsoft.com>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants