fix(magmad): restart options #15586

lucaaamaral · 2024-12-13T00:49:52Z

fix(magmad): restart options

Summary

Two actions under Equipment -> Actions in the NMS (Restart services and Reboot) were misbehaving:

Restart AGW host machine is currently not working
Restart AGW components is currently not working

During tests, it was found that:

The “reboot” command was not available inside the docker container, so the system did not reboot
Six out of twenty-one components were not being restarted after the button was pressed

Test Plan

Selecting the “restart” option triggers a series of messages that ends with the magmad component logging the output below:

INFO:root:Remote reboot triggered! Rebooting gateway...    
sh: 1: reboot: not found                                   
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused

This command is issued from orc8r/gateway/python/magma/magmad/rpc_servicer.py:109, which depends on the reboot command being installed in the magmad container.

By default, Docker containers cannot restart the host system or control the host machine's processes, and they do not ship a full OS.

The solution was to replace the “reboot” command with echo b > /proc/sysrq-trigger in the Python script orc8r/gateway/python/magma/magmad/rpc_servicer.py:109 and to add the lines below to the magmad section of the compose file lte/gateway/docker/docker-compose.yaml or /var/opt/magma/docker/docker-compose.yaml:

    security_opt:
      - apparmor=unconfined
      - systempaths=unconfined

*Note: the command echo b > /proc/sysrq-trigger might be too harsh on the machine, so it may be worth examining the advantages of other commands, such as 'echo _sub > /proc/sysrq-trigger'. I tried _reisub, as commonly recommended, and even _sb to make sure the disks were synchronized, but without success, so I kept only the b for reboot. Please let me know if this is enough or whether a better solution is needed.
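A gentler sequence could be sketched in Python as below. This is a minimal sketch under assumptions, not the change made in this PR: the function name `sysrq_reboot`, its parameters, and the one-second pause are hypothetical. It sends one SysRq key per write, because kernels that document only the single-character form of /proc/sysrq-trigger ignore everything after the first character. That may be why the multi-character attempts had no visible effect, although this was not verified on the test machine.

```python
import os
import time

# Assumption: the magmad container can see and write this path, as it can
# with the security_opt lines from the compose file.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def sysrq_reboot(keys="sub", trigger_path=SYSRQ_TRIGGER, delay=1.0):
    """Send SysRq keys one per write, pausing after each one.

    's' requests an emergency sync, 'u' an emergency read-only remount and
    'b' an immediate reboot. The kernel handles 's' and 'u' asynchronously,
    so the pause gives them time to finish before 'b' is sent.
    """
    os.sync()  # flush dirty pages through the regular syscall first
    for key in keys:
        # Re-open for every key so each write carries a single character.
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        time.sleep(delay)
```

Calling `sysrq_reboot()` with the defaults reboots the host, so any experiment should pass a scratch file as `trigger_path`.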

Selecting the “restart services” option triggers a series of messages that ends with the magmad component logging the output below:

INFO:root:[SyncRPC] Got heartBeat from cloud                                                                          
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
INFO:root:Checking for upgrade...                          
WARNING:root:magmad package_version config missing or set to default 0.0.0-0, skipping upgrade                        
INFO:root:Restarting following services: []                
Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd        
subscriberdb                                               
directoryd                                                 
enodebd                                                    
policydb                                                   
smsd                                                       
state                                                      
ctraced                                                    
eventd                                                     
health                                                     
ERROR:root:GetServiceInfo Error for subscriberdb! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for directoryd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for enodebd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for policydb! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for state! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for eventd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for smsd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for ctraced! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for health! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
td-agent-bit                                               
pipelined                                                  
ERROR:root:[SyncRPC] Failing to forward request, err: Socket closed                                                   
WARNING:root:[SyncRPC] Transient gRPC error, retrying: Socket closed                                                  
control_proxy                                              
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:[SyncRPC] Failing to forward request, err: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:[SyncRPC] gRPC error: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused, reconnecting to cloud.
mobilityd                                                  
sessiond                                                   
redis                                                      
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:[SyncRPC] Failing to forward request, err: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:[SyncRPC] gRPC error: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused, reconnecting to cloud.
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:GetServiceInfo Error for mobilityd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetOperationalStates Error for mobilityd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetOperationalStates Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
INFO:root:Checkin Successful! Successfully sent states to the cloud!                                                  
INFO:root:Processing config update agw-001                 
WARNING:root:Orchestrator version:  not valid              
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused

The attempt to restart services can be recognized from these lines:

INFO:root:Restarting following services: []                                                                                                                                                                                                 
Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd        
subscriberdb                                               
directoryd                                                 
enodebd                                                    
policydb                                                   
smsd                                                       
state                                                      
ctraced                                                    
eventd                                                     
health  
td-agent-bit                                               
pipelined  
control_proxy 
mobilityd                                                  
sessiond                                                   
redis

The following lines show that three of the services could not be found:

Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd
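For longer logs, the missing names can be pulled out mechanically. The helper below is an illustrative sketch: the name `missing_containers` is hypothetical, and it relies only on the "No such container: <name>" wording visible in the log above.

```python
import re

# Matches the Docker daemon message seen in the magmad log above.
NO_SUCH_CONTAINER = re.compile(r"No such container: (\S+)")


def missing_containers(log_text):
    """Return the container names reported as missing, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = NO_SUCH_CONTAINER.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = """\
INFO:root:Restarting following services: []
Error response from daemon: No such container: mme
Error response from daemon: No such container: envoy_controller
Error response from daemon: No such container: dnsd
subscriberdb
"""
print(missing_containers(sample))  # ['mme', 'envoy_controller', 'dnsd']
```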

The docker compose ps output confirms that some of the services were restarted:

connectiond     linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/conn…"    connectiond     3 days ago   Up 22 hours (healthy)              
control_proxy   linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c '/usr/local/b…"    control_proxy   3 days ago   Up 19 seconds (health: starting)   
ctraced         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    ctraced         3 days ago   Up 27 seconds (health: starting)   
directoryd      linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    directoryd      3 days ago   Up 29 seconds (health: starting)   
enodebd         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    enodebd         3 days ago   Up 29 seconds (health: starting)   
eventd          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    eventd          3 days ago   Up 27 seconds (health: starting)   
health          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    health          3 days ago   Up 27 seconds (health: starting)   
magmad          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '\n  /u…"   magmad          3 days ago   Up 22 hours                        
mobilityd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c 'sleep 5 && /…"    mobilityd       3 days ago   Up 19 seconds (health: starting)   
monitord        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    monitord        3 days ago   Up 22 hours (healthy)              
oai_mme         linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c '/usr/local/b…"    oai_mme         3 days ago   Up 22 hours (healthy)              
pipelined       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "bash -c '/usr/bin/o…"    pipelined       3 days ago   Up 23 seconds (health: starting)   
policydb        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    policydb        3 days ago   Up 28 seconds (health: starting)   
redirectd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    redirectd       3 days ago   Up 22 hours (healthy)              
redis           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    redis           3 days ago   Up 18 seconds (health: starting)   
sctpd           linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/sctpd"    sctpd           3 days ago   Up 22 hours                        
sessiond        linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c 'mkdir -p /va…"    sessiond        3 days ago   Up 19 seconds (health: starting)   
smsd            linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    smsd            3 days ago   Up 28 seconds (health: starting)   
state           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    state           3 days ago   Up 27 seconds (health: starting)   
subscriberdb    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    subscriberdb    3 days ago   Up 29 seconds (health: starting)   
td-agent-bit    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    td-agent-bit    3 days ago   Up 26 seconds (health: starting)

From that list, it is safe to assume that all containers had been restarted except for connectiond, magmad, monitord, oai_mme, redirectd and sctpd.
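The same conclusion can be reached by comparing uptimes programmatically. The sketch below is an assumption-laden illustration, not part of the PR: the function name `not_restarted` and the 120-second threshold are hypothetical, and the regular expression only models the "Up N unit" and "Up About a/an unit" status strings.

```python
import re

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}
# Matches e.g. "Up 22 hours", "Up 19 seconds", "Up About a minute".
UPTIME = re.compile(r"\bUp (?:About )?(an?|\d+) (second|minute|hour|day)s?\b")


def not_restarted(ps_output, max_uptime_seconds=120):
    """Return names of containers whose uptime exceeds the threshold."""
    names = []
    for line in ps_output.splitlines():
        match = UPTIME.search(line)
        if not match:
            continue  # header, blank line, or a status this sketch ignores
        count = 1 if match.group(1) in ("a", "an") else int(match.group(1))
        if count * UNIT_SECONDS[match.group(2)] > max_uptime_seconds:
            names.append(line.split()[0])  # first column is the container name
    return names
```

Feeding it the listing above should return the six containers named in the previous paragraph, since they are the only ones reporting an uptime in hours.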

The function that restarts the services is RestartServices, defined at orc8r/gateway/python/magma/magmad/rpc_servicer.py:115, and the service list seems to originate from a parse_args object, as in orc8r/gateway/python/scripts/magmad_cli.py:42.

On first inspection, I could not locate where the list is generated.

The solution found was to add the remaining service names to the configuration file lte/gateway/configs/magmad.yml, so that the remaining containers are restarted as well. With that change, docker compose ps shows every container freshly restarted:

NAME            IMAGE                                                            COMMAND                   SERVICE         CREATED         STATUS                             PORTS
connectiond     linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/conn…"    connectiond     3 minutes ago   Up 19 seconds (health: starting)   
control_proxy   linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c '/usr/local/b…"    control_proxy   3 minutes ago   Up 20 seconds (health: starting)   
ctraced         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    ctraced         3 minutes ago   Up 29 seconds (health: starting)   
directoryd      linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    directoryd      3 minutes ago   Up 30 seconds (healthy)            
enodebd         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    enodebd         3 minutes ago   Up 30 seconds (health: starting)   
eventd          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    eventd          3 minutes ago   Up 29 seconds (health: starting)   
health          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    health          3 minutes ago   Up 29 seconds (health: starting)   
magmad          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '\n  /u…"   magmad          3 minutes ago   Up 28 seconds                      
mobilityd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c 'sleep 5 && /…"    mobilityd       3 minutes ago   Up 20 seconds (health: starting)   
monitord        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    monitord        3 minutes ago   Up 29 seconds (health: starting)   
oai_mme         linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c '/usr/local/b…"    oai_mme         3 minutes ago   Up 20 seconds (health: starting)   
pipelined       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "bash -c '/usr/bin/o…"    pipelined       3 minutes ago   Up 29 seconds (health: starting)   
policydb        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    policydb        3 minutes ago   Up 30 seconds (health: starting)   
redirectd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    redirectd       3 minutes ago   Up 29 seconds (health: starting)   
redis           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    redis           3 minutes ago   Up 20 seconds (health: starting)   
sctpd           linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/sctpd"    sctpd           3 minutes ago   Up 29 seconds                      
sessiond        linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c 'mkdir -p /va…"    sessiond        3 minutes ago   Up 20 seconds (health: starting)   
smsd            linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    smsd            3 minutes ago   Up 29 seconds (health: starting)   
state           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    state           3 minutes ago   Up 30 seconds (health: starting)   
subscriberdb    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    subscriberdb    3 minutes ago   Up 30 seconds (healthy)            
td-agent-bit    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    td-agent-bit    3 minutes ago   Up 27 seconds (health: starting)

The “restart services” option is functional, although some services are not being targeted. A fix is to include the docker container names in the configuration file lte/gateway/configs/magmad.yml, under the magma_services section.
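The gap between what magmad tries to restart and what is actually running can be illustrated with a set comparison. This is a sketch, not code from the PR: `compare_restart_targets` is a hypothetical helper, and the two lists are copied from the log and the docker compose ps listing above.

```python
def compare_restart_targets(attempted, running):
    """Return (running but never targeted, targeted but not running)."""
    attempted_set, running_set = set(attempted), set(running)
    untargeted = sorted(running_set - attempted_set)
    unknown = sorted(attempted_set - running_set)
    return untargeted, unknown


# Names magmad tried to restart, taken from the log before the fix.
ATTEMPTED = [
    "mme", "envoy_controller", "dnsd", "subscriberdb", "directoryd",
    "enodebd", "policydb", "smsd", "state", "ctraced", "eventd", "health",
    "td-agent-bit", "pipelined", "control_proxy", "mobilityd", "sessiond",
    "redis",
]
# Container names taken from the docker compose ps listing.
RUNNING = [
    "connectiond", "control_proxy", "ctraced", "directoryd", "enodebd",
    "eventd", "health", "magmad", "mobilityd", "monitord", "oai_mme",
    "pipelined", "policydb", "redirectd", "redis", "sctpd", "sessiond",
    "smsd", "state", "subscriberdb", "td-agent-bit",
]

untargeted, unknown = compare_restart_targets(ATTEMPTED, RUNNING)
print(untargeted)  # containers the restart never reaches
print(unknown)     # names Docker does not know
```

The first list holds the six containers that were skipped. The second holds mme, envoy_controller and dnsd, which suggests that the configured names do not always match the container names (for example mme versus oai_mme).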

Additional Information

This change is backwards-breaking

Security Considerations

Restarting the machine without proper caution might corrupt disk data. It may be worth looking for a safer way to restart the host system.

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

github-actions · 2024-12-13T00:50:06Z

Thanks for opening a PR! 💯

A couple initial guidelines

All commits must be signed off. This is enforced by PR DCO check.
All PR titles must follow the semantic commits format. This is enforced by PR Check Title Or Commit Message.

Howto

Reviews. The "Reviewers" listed for this PR are the Magma maintainers who will shepherd it.
Checks. All required CI checks must pass before merge.
Merge. Once approved and passing CI checks, use the ready2merge label to indicate the maintainers can merge your PR.

More info

Please take a moment to read through the Magma project's

Contributing Conventions for norms around contributed code

If this is your first Magma PR, also consider reading

Developer Onboarding for onboarding as a new Magma developer
Development Workflow for guidance on your first PR
CI Checks for points of contact for failing or flaky CI checks
Code Review Process for information on requesting reviews and contacting maintainers

github-actions · 2024-12-13T00:50:25Z

lte/gateway/configs/magmad.yml

  - policydb
  - state
  - eventd
  - smsd
  - ctraced
  - health
+  - redirectd
+  - sctpd
+  - monitord


[misspell] _{reported by reviewdog 🐶}
"monitord" is a misspelling of "monitored"

monitord is the name of the container, so it is not misspelled.

github-actions · 2024-12-13T00:50:41Z

orc8r/gateway/python/magma/magmad/rpc_servicer.py

@@ -106,7 +106,7 @@ def Reboot(self, _, context):
        """
        async def run_reboot():
            await asyncio.sleep(1)
-            os.system('reboot')
+            os.system('echo b > /proc/sysrq-trigger')


[pep8] _{reported by reviewdog 🐶}
S605 Starting a process with a shell: Seems safe, but may be changed in the future, consider rewriting without shell

I could not find a better solution than this; please let me know if you can come up with one that does not use a shell.
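One shell-free possibility is to write to the trigger file directly from Python, which should avoid the S605 finding because no process is spawned. The sketch below assumes the coroutine shape shown in the diff. The `trigger_path` and `delay` parameters are hypothetical additions so that the function can be exercised against a scratch file.

```python
import asyncio

# Assumption: same path the shell command in the diff writes to.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


async def run_reboot(trigger_path=SYSRQ_TRIGGER, delay=1.0):
    """Shell-free variant of run_reboot: write 'b' to the SysRq trigger."""
    await asyncio.sleep(delay)  # same grace period as in the diff
    with open(trigger_path, "w") as trigger:
        trigger.write("b")  # 'b' asks the kernel for an immediate reboot
```

With the defaults this reboots the host just as the shell command does, so it needs the same security_opt settings on the magmad container.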

orc8r/gateway/python/magma/magmad/rpc_servicer.py

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

github-actions · 2024-12-13T01:00:01Z

orc8r/gateway/python/magma/magmad/rpc_servicer.py

@@ -106,7 +106,7 @@ def Reboot(self, _, context):
        """
        async def run_reboot():
            await asyncio.sleep(1)
-            os.system('reboot')
+            os.system('/usr/bin/echo b > /proc/sysrq-trigger')


[pep8] _{reported by reviewdog 🐶}
S605 Starting a process with a shell: Seems safe, but may be changed in the future, consider rewriting without shell

lucaaamaral added 3 commits December 12, 2024 21:45

fix(magmad): enable system reboot

9de8aad

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

fix(magmad): enable restart of remaining docker containers

0cdde4a

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

chore: devcontainer autoformat

8ea8d12

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

lucaaamaral requested review from a team as code owners December 13, 2024 00:49

lucaaamaral requested a review from jordanvrtanoski December 13, 2024 00:49

pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines. label Dec 13, 2024

github-actions bot added component: agw Access gateway-related issue component: ci All updates on CI (Jenkins/CircleCi/Github Action) component: orc8r Orchestrator-related issue labels Dec 13, 2024

github-actions bot reviewed Dec 13, 2024

chore(magmad): use full executable path for OS call

d2824ae

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>

github-actions bot reviewed Dec 13, 2024
