Expose periods of environment thread in waagent.conf (#1891)
* Expose periods of environment thread in waagent.conf

* Python 2.6 compatibility issues

* Report errors on the monitor/environment threads

* Added new parameters to README

* log changes in conf

* Document MonitorDhcpClientRestartPeriod

Co-authored-by: narrieta <narrieta>
narrieta authored May 22, 2020
1 parent d62c4c5 commit de81f1b
Showing 11 changed files with 336 additions and 112 deletions.
64 changes: 63 additions & 1 deletion README.md
@@ -183,6 +183,8 @@ A sample configuration file is shown below:

```yml
Extensions.Enabled=y
Extensions.GoalStatePeriod=6
Extensions.GoalStateHistoryCleanupPeriod=86400
Provisioning.Agent=auto
Provisioning.DeleteRootPassword=n
Provisioning.RegenerateSshHostKeyPair=y
@@ -236,6 +238,28 @@ without the agent. In order to do that, the `provisionVMAgent` flag must be set
provisioning time, via whichever API is being used. We will provide more details on
this on our wiki when it is generally available.

#### __Extensions.GoalStatePeriod__

_Type: Integer_
_Default: 6_

How often to poll for new goal states (in seconds) and report the status of the VM
and extensions. Goal states describe the desired state of the extensions on the VM.

_Note_: setting this parameter to more than a few minutes can cause the VM to be
reported as unresponsive/unavailable on the Azure portal. This setting also affects
how quickly the agent starts executing extensions.

#### __Extensions.GoalStateHistoryCleanupPeriod__

_Type: Integer_
_Default: 86400 (24 hours)_

How often to clean up the history folder of the agent (in seconds). The agent keeps
past goal states in this folder, each goal state represented by a set of small files.
The history is useful for debugging issues in the agent or extensions.
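
As a rough illustration of how the two period settings above might be read, the sketch below parses `key=value` lines in the style of the sample configuration and falls back to the documented defaults (6 and 86400 seconds). The `parse_conf` and `get_int` helpers are hypothetical names used only for this example; they are not the agent's actual parser.

```python
# Hypothetical helpers for illustration; not the agent's actual configuration parser.
DEFAULTS = {
    "Extensions.GoalStatePeriod": 6,
    "Extensions.GoalStateHistoryCleanupPeriod": 86400,
}


def parse_conf(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def get_int(values, key):
    """Return the option as an int, or the documented default if missing or invalid."""
    try:
        return int(values[key])
    except (KeyError, ValueError):
        return DEFAULTS[key]


sample = """
Extensions.Enabled=y
Extensions.GoalStatePeriod=10
"""
conf_values = parse_conf(sample)
goal_state_period = get_int(conf_values, "Extensions.GoalStatePeriod")
cleanup_period = get_int(conf_values, "Extensions.GoalStateHistoryCleanupPeriod")
```

Here the goal state period is overridden by the sample text, while the cleanup period is absent and therefore takes its default.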


#### __Provisioning.Agent__

_Type: String_
@@ -259,7 +283,22 @@ _Note_: This configuration option has been removed and has no effect. waagent
now auto-detects cloud-init as a provisioning agent (with an option to override
with `Provisioning.Agent`).

#### __Provisioning.UseCloudInit__ (*removed in 2.2.45*)
#### __Provisioning.MonitorHostName__

_Type: Boolean_
_Default: n_

Monitor host name changes and publish changes via DHCP requests.

#### __Provisioning.MonitorHostNamePeriod__

_Type: Integer_
_Default: 30_

How often to monitor host name changes (in seconds). This setting is ignored if
MonitorHostName is not set.

#### __Provisioning.UseCloudInit__

_Type: Boolean_
_Default: n_
@@ -440,6 +479,14 @@ OpenSSL commands. This signals OpenSSL to use any installed FIPS-compliant libra
Note that the agent itself has no FIPS-specific code. _If no FIPS-compliant certificates are
installed, then enabling this option will cause all OpenSSL commands to fail._

#### __OS.MonitorDhcpClientRestartPeriod__

_Type: Integer_
_Default: 30_

The agent monitors restarts of the DHCP client and restores network rules when they happen. This
setting determines how often (in seconds) to check for restarts.

#### __OS.RootDeviceScsiTimeout__

_Type: Integer_
@@ -448,6 +495,14 @@ _Default: 300_
This configures the SCSI timeout in seconds on the root device. If not set, the
system defaults are used.

#### __OS.RootDeviceScsiTimeoutPeriod__

_Type: Integer_
_Default: 30_

How often to set the SCSI timeout on the root device (in seconds). This setting is
ignored if RootDeviceScsiTimeout is not set.

#### __OS.OpensslPath__

_Type: String_
@@ -456,6 +511,13 @@ _Default: None_
This can be used to specify an alternate path for the openssl binary to use for
cryptographic operations.

#### __OS.RemovePersistentNetRulesPeriod__

_Type: Integer_
_Default: 30_

How often (in seconds) to remove the udev rules for persistent network interface
names (75-persistent-net-generator.rules and /etc/udev/rules.d/70-persistent-net.rules).

#### __OS.SshClientAliveInterval__

_Type: Integer_
51 changes: 51 additions & 0 deletions azurelinuxagent/common/conf.py
@@ -138,7 +138,14 @@ def load_conf_from_file(conf_file_path, conf=__conf__):


__INTEGER_OPTIONS__ = {
    "Extensions.GoalStatePeriod": 6,
    "Extensions.GoalStateHistoryCleanupPeriod": 86400,
    "OS.EnableFirewallPeriod": 30,
    "OS.RemovePersistentNetRulesPeriod": 30,
    "OS.RootDeviceScsiTimeoutPeriod": 30,
    "OS.MonitorDhcpClientRestartPeriod": 30,
    "OS.SshClientAliveInterval": 180,
    "Provisioning.MonitorHostNamePeriod": 30,
    "Provisioning.PasswordCryptSaltLength": 10,
    "HttpProxy.Port": None,
    "ResourceDisk.SwapSizeMB": 0,
@@ -160,10 +167,40 @@ def get_configuration(conf=__conf__):
    return options


def get_default_value(option):
    if option in __STRING_OPTIONS__:
        return __STRING_OPTIONS__[option]
    raise ValueError("{0} is not a valid configuration parameter.".format(option))


def get_int_default_value(option):
    if option in __INTEGER_OPTIONS__:
        return int(__INTEGER_OPTIONS__[option])
    raise ValueError("{0} is not a valid configuration parameter.".format(option))


def get_switch_default_value(option):
    if option in __SWITCH_OPTIONS__:
        return __SWITCH_OPTIONS__[option]
    raise ValueError("{0} is not a valid configuration parameter.".format(option))
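
The default-value helpers above share one pattern: look the option up in a table and raise `ValueError` for unknown names. The sketch below reproduces that pattern with a small stand-in table (`INTEGER_OPTIONS` is a hypothetical copy, not the real `__INTEGER_OPTIONS__`). It also shows one behavior worth keeping in mind: because the integer helper calls `int()` on the stored default, an option whose default is `None` (as `HttpProxy.Port` is in the table shown in this diff) would raise `TypeError` rather than return `None`.

```python
# Stand-in table for illustration; the real table in conf.py holds more options.
INTEGER_OPTIONS = {
    "Extensions.GoalStatePeriod": 6,
    "OS.EnableFirewallPeriod": 30,
    "HttpProxy.Port": None,
}


def get_int_default_value(option):
    """Same shape as the helper in the diff, applied to the stand-in table."""
    if option in INTEGER_OPTIONS:
        return int(INTEGER_OPTIONS[option])
    raise ValueError("{0} is not a valid configuration parameter.".format(option))


def describe_default(option):
    """Return the default, or a short description of why the lookup failed."""
    try:
        return get_int_default_value(option)
    except ValueError as e:
        return "unknown option: {0}".format(e)
    except TypeError:
        # int(None) raises TypeError, so a None default cannot be converted.
        return "no integer default"
```

A caller that needs to handle options without a numeric default could catch `TypeError` as `describe_default` does, or check for `None` before converting.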


def enable_firewall(conf=__conf__):
    return conf.get_switch("OS.EnableFirewall", False)


def get_enable_firewall_period(conf=__conf__):
    return conf.get_int("OS.EnableFirewallPeriod", 30)


def get_remove_persistent_net_rules_period(conf=__conf__):
    return conf.get_int("OS.RemovePersistentNetRulesPeriod", 30)


def get_monitor_dhcp_client_restart_period(conf=__conf__):
    return conf.get_int("OS.MonitorDhcpClientRestartPeriod", 30)


def enable_rdma(conf=__conf__):
    return conf.get_switch("OS.EnableRDMA", False) or \
           conf.get_switch("OS.UpdateRdmaDriver", False) or \
@@ -256,6 +293,10 @@ def get_root_device_scsi_timeout(conf=__conf__):
    return conf.get("OS.RootDeviceScsiTimeout", None)


def get_root_device_scsi_timeout_period(conf=__conf__):
    return conf.get_int("OS.RootDeviceScsiTimeoutPeriod", 30)


def get_ssh_host_keypair_type(conf=__conf__):
    keypair_type = conf.get("Provisioning.SshHostKeyPairType", "rsa")
    if keypair_type == "auto":
@@ -275,6 +316,14 @@ def get_extensions_enabled(conf=__conf__):
    return conf.get_switch("Extensions.Enabled", True)


def get_goal_state_period(conf=__conf__):
    return conf.get_int("Extensions.GoalStatePeriod", 6)


def get_goal_state_history_cleanup_period(conf=__conf__):
    return conf.get_int("Extensions.GoalStateHistoryCleanupPeriod", 86400)


def get_allow_reset_sys_user(conf=__conf__):
    return conf.get_switch("Provisioning.AllowResetSysUser", False)

@@ -321,6 +370,8 @@ def get_password_crypt_salt_len(conf=__conf__):
def get_monitor_hostname(conf=__conf__):
    return conf.get_switch("Provisioning.MonitorHostName", False)

def get_monitor_hostname_period(conf=__conf__):
    return conf.get_int("Provisioning.MonitorHostNamePeriod", 30)

def get_httpproxy_host(conf=__conf__):
    return conf.get("HttpProxy.Host", None)
3 changes: 2 additions & 1 deletion azurelinuxagent/common/event.py
@@ -71,9 +71,10 @@ class WALAEventOperation:
    AgentEnabled = "AgentEnabled"
    ArtifactsProfileBlob = "ArtifactsProfileBlob"
    AutoUpdate = "AutoUpdate"
    CustomData = "CustomData"
    CGroupsCleanUp = "CGroupsCleanUp"
    CGroupsLimitsCrossed = "CGroupsLimitsCrossed"
    ConfigurationChange = "ConfigurationChange"
    CustomData = "CustomData"
    Deploy = "Deploy"
    Disable = "Disable"
    Downgrade = "Downgrade"
128 changes: 64 additions & 64 deletions azurelinuxagent/ga/env.py
@@ -20,9 +20,7 @@
import re
import os
import socket
import time
import threading
import datetime

import azurelinuxagent.common.conf as conf
import azurelinuxagent.common.logger as logger
@@ -34,6 +32,7 @@
from azurelinuxagent.common.protocol.util import get_protocol_util
from azurelinuxagent.common.utils.archive import StateArchiver
from azurelinuxagent.common.version import AGENT_NAME, CURRENT_VERSION
from azurelinuxagent.ga.periodic_operation import PeriodicOperation

CACHE_PATTERNS = [
    re.compile("^(.*)\.(\d+)\.(agentsManifest)$", re.IGNORECASE),
@@ -43,9 +42,6 @@

MAXIMUM_CACHED_FILES = 50

ARCHIVE_INTERVAL = datetime.timedelta(hours=24)


def get_env_handler():
    return EnvHandler()

@@ -62,13 +58,26 @@ def __init__(self):
        self.osutil = get_osutil()
        self.dhcp_handler = get_dhcp_handler()
        self.protocol_util = None
        self._protocol = None
        self.stopped = True
        self.hostname = None
        self.dhcp_id_list = []
        self.server_thread = None
        self.dhcp_warning_enabled = True
        self.last_archive = None
        self.archiver = StateArchiver(conf.get_lib_dir())
        self._reset_firewall_rules = False

        self._periodic_operations = [
            PeriodicOperation("_remove_persistent_net_rules", self._remove_persistent_net_rules_period, conf.get_remove_persistent_net_rules_period()),
            PeriodicOperation("_monitor_dhcp_client_restart", self._monitor_dhcp_client_restart, conf.get_monitor_dhcp_client_restart_period()),
            PeriodicOperation("_cleanup_goal_state_history", self._cleanup_goal_state_history, conf.get_goal_state_history_cleanup_period())
        ]
        if conf.enable_firewall():
            self._periodic_operations.append(PeriodicOperation("_enable_firewall", self._enable_firewall, conf.get_enable_firewall_period()))
        if conf.get_root_device_scsi_timeout() is not None:
            self._periodic_operations.append(PeriodicOperation("_set_root_device_scsi_timeout", self._set_root_device_scsi_timeout, conf.get_root_device_scsi_timeout_period()))
        if conf.get_monitor_hostname():
            self._periodic_operations.append(PeriodicOperation("_monitor_hostname", self._monitor_hostname_changes, conf.get_monitor_hostname_period()))
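
The constructor above registers each task with its own period, and adds the optional tasks only when the matching setting is enabled. The `PeriodicOperation` class itself is not shown in this part of the diff, so the following is only a guess at its general shape: a wrapper that remembers when a callable is next due and skips it until then. `PeriodicOperationSketch` and `build_operations` are hypothetical names, and the explicit `now` argument is an assumption made so the behavior can be checked without waiting.

```python
import time


class PeriodicOperationSketch(object):
    """Hypothetical stand-in: runs a callable at most once per period (seconds)."""

    def __init__(self, name, operation, period):
        self.name = name
        self._operation = operation
        self._period = period
        self._next_run = 0.0  # due immediately on the first call

    def run(self, now=None):
        """Invoke the operation if it is due; return True when it was invoked."""
        now = time.time() if now is None else now
        if now < self._next_run:
            return False
        try:
            self._operation()
        finally:
            # Reschedule even if the operation raised, so a failing task
            # is not retried on every loop iteration.
            self._next_run = now + self._period
        return True


def build_operations(enable_firewall, calls):
    """Register one mandatory task and one optional task, as the constructor above does."""
    ops = [PeriodicOperationSketch("remove_rules", lambda: calls.append("remove_rules"), 30)]
    if enable_firewall:
        ops.append(PeriodicOperationSketch("firewall", lambda: calls.append("firewall"), 30))
    return ops
```

Building the list once in the constructor means a configuration change only takes effect when a new handler is created, which appears to match how the diff reads the settings.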

    def run(self):
        if not self.stopped:
@@ -92,56 +101,50 @@ def start(self):
        self.server_thread.start()

    def monitor(self):
        """
        Monitor firewall rules
        Monitor dhcp client pid and hostname.
        If dhcp client process re-start has occurred, reset routes.
        Purge unnecessary files from disk cache.
        """

        # The initialization of ProtocolUtil for the Environment thread should be done within the thread itself rather
        # than initializing it in the ExtHandler thread. This is done to avoid any concurrency issues as each
        # thread would now have its own ProtocolUtil object as per the SingletonPerThread model.
        self.protocol_util = get_protocol_util()
        protocol = self.protocol_util.get_protocol()
        reset_firewall_fules = False
        while not self.stopped:
            self.osutil.remove_rules_files()

            if conf.enable_firewall():
                # If the rules ever change we must reset all rules and start over again.
                #
                # There was a rule change at 2.2.26, which started dropping non-root traffic
                # to WireServer. The previous rules allowed traffic. Having both rules in
                # place negated the fix in 2.2.26.
                if not reset_firewall_fules:
                    self.osutil.remove_firewall(dst_ip=protocol.get_endpoint(), uid=os.getuid())
                    reset_firewall_fules = True

                success = self.osutil.enable_firewall(dst_ip=protocol.get_endpoint(), uid=os.getuid())

                add_periodic(
                    logger.EVERY_HOUR,
                    AGENT_NAME,
                    version=CURRENT_VERSION,
                    op=WALAEventOperation.Firewall,
                    is_success=success,
                    log_event=False)

            timeout = conf.get_root_device_scsi_timeout()
            if timeout is not None:
                self.osutil.set_scsi_disks_timeout(timeout)

            if conf.get_monitor_hostname():
                self.handle_hostname_update()

            self.handle_dhclient_restart()

            self.archive_history()

            time.sleep(5)

    def handle_hostname_update(self):
        try:
            # The initialization of ProtocolUtil for the Environment thread should be done within the thread itself rather
            # than initializing it in the ExtHandler thread. This is done to avoid any concurrency issues as each
            # thread would now have its own ProtocolUtil object as per the SingletonPerThread model.
            self.protocol_util = get_protocol_util()
            self._protocol = self.protocol_util.get_protocol()
            while not self.stopped:
                try:
                    for op in self._periodic_operations:
                        op.run()
                except Exception as e:
                    logger.error("An error occurred in the environment thread main loop; will skip the current iteration.\n{0}", ustr(e))
                finally:
                    PeriodicOperation.sleep_until_next_operation(self._periodic_operations)
        except Exception as e:
            logger.error("An error occurred in the environment thread; will exit the thread.\n{0}", ustr(e))
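
The new loop runs every registered operation, logs any error, and then sleeps until the next operation is due instead of sleeping a fixed 5 seconds. The sketch below imitates that control flow under stated assumptions: `TimedOp`, `seconds_until_next`, and `run_loop` are hypothetical names, and the way `sleep_until_next_operation` computes its delay is not shown in this part of the diff, so taking the earliest due time is a guess. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time


class TimedOp(object):
    """Hypothetical periodic operation that tracks when it is next due (seconds)."""

    def __init__(self, operation, period, now):
        self.operation = operation
        self.period = period
        self.next_run = now

    def run(self, now):
        if now >= self.next_run:
            # Reschedule before invoking so a failing operation is not retried immediately.
            self.next_run = now + self.period
            self.operation()


def seconds_until_next(ops, now):
    """Time to sleep until the earliest operation is due (never negative)."""
    return max(0.0, min(op.next_run for op in ops) - now)


def run_loop(ops, iterations, clock=time.time, sleep=time.sleep, log=print):
    """Run a bounded number of iterations of the try/except/finally loop."""
    for _ in range(iterations):
        try:
            for op in ops:
                op.run(clock())
        except Exception as e:
            # As in the diff, an error skips the rest of the current iteration.
            log("error in main loop; skipping the current iteration: {0}".format(e))
        finally:
            sleep(seconds_until_next(ops, clock()))
```

One consequence of placing the `for` loop inside a single `try` is that an exception in one operation prevents the operations after it from running in that iteration; they run on the next pass instead.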

    def _remove_persistent_net_rules_period(self):
        self.osutil.remove_rules_files()

    def _enable_firewall(self):
        # If the rules ever change we must reset all rules and start over again.
        #
        # There was a rule change at 2.2.26, which started dropping non-root traffic
        # to WireServer. The previous rules allowed traffic. Having both rules in
        # place negated the fix in 2.2.26.
        if not self._reset_firewall_rules:
            self.osutil.remove_firewall(dst_ip=self._protocol.get_endpoint(), uid=os.getuid())
            self._reset_firewall_rules = True

        success = self.osutil.enable_firewall(dst_ip=self._protocol.get_endpoint(), uid=os.getuid())

        add_periodic(
            logger.EVERY_HOUR,
            AGENT_NAME,
            version=CURRENT_VERSION,
            op=WALAEventOperation.Firewall,
            is_success=success,
            log_event=False)

    def _set_root_device_scsi_timeout(self):
        self.osutil.set_scsi_disks_timeout(conf.get_root_device_scsi_timeout())

    def _monitor_hostname_changes(self):
        curr_hostname = socket.gethostname()
        if curr_hostname != self.hostname:
            logger.info("EnvMonitor: Detected hostname change: {0} -> {1}",
@@ -169,6 +172,9 @@ def get_dhcp_client_pid(self):

        return pid

    def _monitor_dhcp_client_restart(self):
        self.handle_dhclient_restart()

    def handle_dhclient_restart(self):
        if len(self.dhcp_id_list) == 0:
            self.dhcp_id_list = self.get_dhcp_client_pid()
@@ -183,16 +189,10 @@ def handle_dhclient_restart(self):
            self.dhcp_handler.conf_routes()
            self.dhcp_id_list = new_pid
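
Only the first and last lines of `handle_dhclient_restart` are visible in this hunk, so the middle of the method is not known from the diff. The sketch below shows one plausible way to detect a restart by comparing process IDs; it is an assumption for illustration, not the agent's code. `detect_dhcp_restart` is a hypothetical function, and `is_alive` stands in for whatever liveness check the agent uses.

```python
def detect_dhcp_restart(known_pids, current_pids, is_alive):
    """
    Return (restarted, pids_to_remember).

    known_pids:   PIDs recorded on a previous check (may be empty).
    current_pids: PIDs of the DHCP client found now.
    is_alive:     callable taking a PID and returning True if it still runs.
    """
    if not known_pids:
        # First observation: remember the PIDs, nothing to compare against yet.
        return False, list(current_pids)
    if all(is_alive(pid) for pid in known_pids):
        # The client seen earlier is still running.
        return False, list(known_pids)
    if current_pids and list(current_pids) != list(known_pids):
        # A different client process is running: treat it as a restart.
        return True, list(current_pids)
    # The old client is gone and no new one was found yet; keep waiting.
    return False, list(known_pids)
```

When the first element of the result is true, the caller would restore its routes (the diff calls `self.dhcp_handler.conf_routes()`) and store the new PID list.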

    def archive_history(self):
    def _cleanup_goal_state_history(self):
        """
        Purge history if we have exceed the maximum count.
        Create a .zip of the history that has been preserved.
        Purge history and create a .zip of the history that has been preserved.
        """
        if self.last_archive is not None \
                and datetime.datetime.utcnow() < \
                self.last_archive + ARCHIVE_INTERVAL:
            return

        self.archiver.purge()
        self.archiver.archive()
