fix(magmad): restart options #15586

Open
wants to merge 4 commits into base: master

lucaaamaral

Summary

Two actions under Equipment -> Actions (Restart services / Reboot) in the NMS were misbehaving:

  • Restart AGW host machine is currently not working
  • Restart AGW components is currently not working

During tests, it was found that:

  • The “reboot” command is not available inside the Docker container, so the system never rebooted
  • Six of the twenty-one containers were not restarted after the button was pressed

Test Plan

Selecting the “restart” option (host reboot) triggers a series of messages that produces the following output in the magmad component:

INFO:root:Remote reboot triggered! Rebooting gateway...    
sh: 1: reboot: not found                                   
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused

The reboot is triggered at orc8r/gateway/python/magma/magmad/rpc_servicer.py:109, which depends on the reboot command being installed in the magmad container.

By default, Docker containers cannot restart the host system or control the host machine's processes, nor do they include a full OS.

The solution was to replace the “reboot” command with echo b > /proc/sysrq-trigger in the Python script orc8r/gateway/python/magma/magmad/rpc_servicer.py:109 and to add the lines below to the magmad section of the compose file lte/gateway/docker/docker-compose.yaml or /var/opt/magma/docker/docker-compose.yaml:

    security_opt:
      - apparmor=unconfined
      - systempaths=unconfined

*Note: the command echo b > /proc/sysrq-trigger reboots immediately, without syncing or unmounting the disks, so it might be too harsh on the machine. It may be worth examining alternatives such as 'echo _sub > /proc/sysrq-trigger'. I tried _reisub, as commonly recommended, and also _sb to make sure the disks are synchronized, but neither worked, so I kept only the b (reboot). Please let me know if this is enough or if a better solution is needed.
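A possible explanation for the _reisub result, not verified on this setup: the kernel's SysRq documentation states that only the first character written to /proc/sysrq-trigger is processed, and the underscore-prefixed multi-character form is a later addition, so an older host kernel may not accept it. A minimal sketch of a gentler sequence, assuming each command is sent as its own single-character write (sync, remount read-only, reboot). The names `write_sysrq`, `sysrq_sequence` and the `delay` value are hypothetical and not part of this PR:

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def write_sysrq(command, trigger_path=SYSRQ_TRIGGER):
    """Send one SysRq command character to the kernel trigger file."""
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


def sysrq_sequence(commands="sub", write=write_sysrq, delay=2.0):
    """Issue each command as a separate write, pausing between them.

    "s" syncs the disks, "u" remounts filesystems read-only, "b" reboots.
    The delay is an assumed grace period for the sync to complete.
    """
    issued = []
    for index, command in enumerate(commands):
        write(command)
        issued.append(command)
        if index < len(commands) - 1:
            time.sleep(delay)
    return issued
```

On a real gateway the final "b" reboots the host, so the function is not expected to return there. It still needs root and the same security_opt settings on the magmad container as above.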

Selecting the “restart services” option triggers a series of messages that produces the following output in the magmad component:

INFO:root:[SyncRPC] Got heartBeat from cloud                                                                          
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
INFO:root:Checking for upgrade...                          
WARNING:root:magmad package_version config missing or set to default 0.0.0-0, skipping upgrade                        
INFO:root:Restarting following services: []                
Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd        
subscriberdb                                               
directoryd                                                 
enodebd                                                    
policydb                                                   
smsd                                                       
state                                                      
ctraced                                                    
eventd                                                     
health                                                     
ERROR:root:GetServiceInfo Error for subscriberdb! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for directoryd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for enodebd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for policydb! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for state! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for eventd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for smsd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for ctraced! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for health! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
td-agent-bit                                               
pipelined                                                  
ERROR:root:[SyncRPC] Failing to forward request, err: Socket closed                                                   
WARNING:root:[SyncRPC] Transient gRPC error, retrying: Socket closed                                                  
control_proxy                                              
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:[SyncRPC] Failing to forward request, err: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:[SyncRPC] gRPC error: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused, reconnecting to cloud.
mobilityd                                                  
sessiond                                                   
redis                                                      
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:[SyncRPC] Failing to forward request, err: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:[SyncRPC] gRPC error: failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused, reconnecting to cloud.
INFO:root:[SyncRPC] Opening stream to cloud                
INFO:root:[SyncRPC] Waiting for requests                   
ERROR:root:GetServiceInfo Error for mobilityd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetOperationalStates Error for mobilityd! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetOperationalStates Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
INFO:root:Checkin Successful! Successfully sent states to the cloud!                                                  
INFO:root:Processing config update agw-001                 
WARNING:root:Orchestrator version:  not valid              
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused
ERROR:root:GetServiceInfo Error for envoy_controller! [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused

The attempt to restart the services is visible in these lines:

INFO:root:Restarting following services: []                                                                                                                                                                                                 
Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd        
subscriberdb                                               
directoryd                                                 
enodebd                                                    
policydb                                                   
smsd                                                       
state                                                      
ctraced                                                    
eventd                                                     
health  
td-agent-bit                                               
pipelined  
control_proxy 
mobilityd                                                  
sessiond                                                   
redis  

Three of the services could not be found, as these lines show:

Error response from daemon: No such container: mme         
Error response from daemon: No such container: envoy_controller                                                       
Error response from daemon: No such container: dnsd   
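
To pull these names out of a longer magmad log automatically, a small parser along these lines could be used. This is only a sketch: `missing_containers` is a hypothetical helper, and the pattern assumes the Docker error text appears exactly as in the output above.

```python
import re

# Matches the Docker daemon error shown in the magmad output above
MISSING_CONTAINER = re.compile(
    r"^Error response from daemon: No such container: (\S+)\s*$"
)


def missing_containers(log_text):
    """Return the container names Docker reported as missing, in log order."""
    names = []
    for line in log_text.splitlines():
        match = MISSING_CONTAINER.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = (
        "INFO:root:Restarting following services: []\n"
        "Error response from daemon: No such container: mme   \n"
        "Error response from daemon: No such container: envoy_controller\n"
        "Error response from daemon: No such container: dnsd\n"
        "subscriberdb\n"
    )
    print(missing_containers(sample))  # ['mme', 'envoy_controller', 'dnsd']
```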

The output of the docker compose ps command confirms that only some of the services have been restarted:

connectiond     linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/conn…"    connectiond     3 days ago   Up 22 hours (healthy)              
control_proxy   linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c '/usr/local/b…"    control_proxy   3 days ago   Up 19 seconds (health: starting)   
ctraced         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    ctraced         3 days ago   Up 27 seconds (health: starting)   
directoryd      linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    directoryd      3 days ago   Up 29 seconds (health: starting)   
enodebd         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    enodebd         3 days ago   Up 29 seconds (health: starting)   
eventd          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    eventd          3 days ago   Up 27 seconds (health: starting)   
health          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    health          3 days ago   Up 27 seconds (health: starting)   
magmad          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '\n  /u…"   magmad          3 days ago   Up 22 hours                        
mobilityd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c 'sleep 5 && /…"    mobilityd       3 days ago   Up 19 seconds (health: starting)   
monitord        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    monitord        3 days ago   Up 22 hours (healthy)              
oai_mme         linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c '/usr/local/b…"    oai_mme         3 days ago   Up 22 hours (healthy)              
pipelined       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "bash -c '/usr/bin/o…"    pipelined       3 days ago   Up 23 seconds (health: starting)   
policydb        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    policydb        3 days ago   Up 28 seconds (health: starting)   
redirectd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    redirectd       3 days ago   Up 22 hours (healthy)              
redis           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    redis           3 days ago   Up 18 seconds (health: starting)   
sctpd           linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/sctpd"    sctpd           3 days ago   Up 22 hours                        
sessiond        linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c 'mkdir -p /va…"    sessiond        3 days ago   Up 19 seconds (health: starting)   
smsd            linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    smsd            3 days ago   Up 28 seconds (health: starting)   
state           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    state           3 days ago   Up 27 seconds (health: starting)   
subscriberdb    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    subscriberdb    3 days ago   Up 29 seconds (health: starting)   
td-agent-bit    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    td-agent-bit    3 days ago   Up 26 seconds (health: starting)

From that list, it is safe to assume that all containers were restarted except for connectiond, magmad, monitord, oai_mme, redirectd and sctpd, which still show an uptime of 22 hours.
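
The same conclusion can be cross-checked mechanically. The sketch below compares the container names from the docker compose ps listing with the names echoed in the magmad log; the two sets are copied from the outputs above, and `not_restarted` is a hypothetical helper name.

```python
# Container names from the `docker compose ps` listing above
ALL_CONTAINERS = {
    "connectiond", "control_proxy", "ctraced", "directoryd", "enodebd",
    "eventd", "health", "magmad", "mobilityd", "monitord", "oai_mme",
    "pipelined", "policydb", "redirectd", "redis", "sctpd", "sessiond",
    "smsd", "state", "subscriberdb", "td-agent-bit",
}

# Names echoed in the magmad log after "Restarting following services"
RESTARTED = {
    "subscriberdb", "directoryd", "enodebd", "policydb", "smsd", "state",
    "ctraced", "eventd", "health", "td-agent-bit", "pipelined",
    "control_proxy", "mobilityd", "sessiond", "redis",
}


def not_restarted(all_containers, restarted):
    """Return the containers that exist but were not restarted, sorted."""
    return sorted(set(all_containers) - set(restarted))


if __name__ == "__main__":
    print(not_restarted(ALL_CONTAINERS, RESTARTED))
    # ['connectiond', 'magmad', 'monitord', 'oai_mme', 'redirectd', 'sctpd']
```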

The function that restarts the services is RestartServices, defined at orc8r/gateway/python/magma/magmad/rpc_servicer.py:115, and the service list seems to originate from a parse_args object, as in orc8r/gateway/python/scripts/magmad_cli.py:42.

On first inspection, I could not locate where the list is generated.

The solution found was to add the remaining service names to the configuration file lte/gateway/configs/magmad.yml so that the remaining containers are restarted as well. After the change, docker compose ps shows a fresh uptime for every container:

NAME            IMAGE                                                            COMMAND                   SERVICE         CREATED         STATUS                             PORTS
connectiond     linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/conn…"    connectiond     3 minutes ago   Up 19 seconds (health: starting)   
control_proxy   linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c '/usr/local/b…"    control_proxy   3 minutes ago   Up 20 seconds (health: starting)   
ctraced         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    ctraced         3 minutes ago   Up 29 seconds (health: starting)   
directoryd      linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    directoryd      3 minutes ago   Up 30 seconds (healthy)            
enodebd         linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    enodebd         3 minutes ago   Up 30 seconds (health: starting)   
eventd          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    eventd          3 minutes ago   Up 29 seconds (health: starting)   
health          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    health          3 minutes ago   Up 29 seconds (health: starting)   
magmad          linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '\n  /u…"   magmad          3 minutes ago   Up 28 seconds                      
mobilityd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "sh -c 'sleep 5 && /…"    mobilityd       3 minutes ago   Up 20 seconds (health: starting)   
monitord        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    monitord        3 minutes ago   Up 29 seconds (health: starting)   
oai_mme         linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c '/usr/local/b…"    oai_mme         3 minutes ago   Up 20 seconds (health: starting)   
pipelined       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "bash -c '/usr/bin/o…"    pipelined       3 minutes ago   Up 29 seconds (health: starting)   
policydb        linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    policydb        3 minutes ago   Up 30 seconds (health: starting)   
redirectd       linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    redirectd       3 minutes ago   Up 29 seconds (health: starting)   
redis           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    redis           3 minutes ago   Up 20 seconds (health: starting)   
sctpd           linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "/usr/local/bin/sctpd"    sctpd           3 minutes ago   Up 29 seconds                      
sessiond        linuxfoundation.jfrog.io/magma-docker/agw_gateway_c:1.8.0        "sh -c 'mkdir -p /va…"    sessiond        3 minutes ago   Up 20 seconds (health: starting)   
smsd            linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    smsd            3 minutes ago   Up 29 seconds (health: starting)   
state           linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    state           3 minutes ago   Up 30 seconds (health: starting)   
subscriberdb    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/usr/bin/env python…"    subscriberdb    3 minutes ago   Up 30 seconds (healthy)            
td-agent-bit    linuxfoundation.jfrog.io/magma-docker/agw_gateway_python:1.8.0   "/bin/bash -c '/usr/…"    td-agent-bit    3 minutes ago   Up 27 seconds (health: starting) 

The “restart services” option is functional, although some services are not targeted. The fix is to include the Docker container names in the configuration file lte/gateway/configs/magmad.yml, under the magma_services section.

Additional Information

  • This change is backwards-breaking

Security Considerations

Restarting the machine without proper caution might corrupt disk data. It may be worth looking for a safer way to restart the host system.

Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>
@lucaaamaral lucaaamaral requested review from a team as code owners December 13, 2024 00:49
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines. label Dec 13, 2024
Thanks for opening a PR! 💯

A couple initial guidelines

Howto

  • Reviews. The "Reviewers" listed for this PR are the Magma maintainers who will shepherd it.
  • Checks. All required CI checks must pass before merge.
  • Merge. Once approved and passing CI checks, use the ready2merge label to indicate the maintainers can merge your PR.

More info

Please take a moment to read through the Magma project's

If this is your first Magma PR, also consider reading

@github-actions github-actions bot added component: agw Access gateway-related issue component: ci All updates on CI (Jenkins/CircleCi/Github Action) component: orc8r Orchestrator-related issue labels Dec 13, 2024
- policydb
- state
- eventd
- smsd
- ctraced
- health
- redirectd
- sctpd
- monitord
[misspell] reported by reviewdog 🐶
"monitord" is a misspelling of "monitored"

Author

monitord is the name of the container, so it is not misspelled.

@@ -106,7 +106,7 @@ def Reboot(self, _, context):
         """
         async def run_reboot():
             await asyncio.sleep(1)
-            os.system('reboot')
+            os.system('echo b > /proc/sysrq-trigger')
[pep8] reported by reviewdog 🐶
S605 Starting a process with a shell: Seems safe, but may be changed in the future, consider rewriting without shell

Author

I could not find a better solution than this; please let me know if you can come up with one that does not use a shell.
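
One shell-free option would be to write to the trigger file directly from Python, which should avoid the S605 finding because no subprocess or shell is started. This is a minimal sketch, not tested on an AGW: `trigger_reboot` and its `trigger_path` parameter are hypothetical names, and the call still needs root and a writable /proc/sysrq-trigger inside the container.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_reboot(trigger_path=SYSRQ_TRIGGER):
    """Request an immediate reboot through SysRq without spawning a shell."""
    # Flush pending filesystem writes first, since "b" itself does not sync
    os.sync()
    with open(trigger_path, "w") as trigger:
        trigger.write("b")
```

Inside run_reboot, the os.system(...) line would then become a plain call to trigger_reboot().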

orc8r/gateway/python/magma/magmad/rpc_servicer.py (outdated)
Signed-off-by: Lucas Amaral <lucaaamaral@gmail.com>
@@ -106,7 +106,7 @@ def Reboot(self, _, context):
         """
         async def run_reboot():
             await asyncio.sleep(1)
-            os.system('reboot')
+            os.system('/usr/bin/echo b > /proc/sysrq-trigger')
[pep8] reported by reviewdog 🐶
S605 Starting a process with a shell: Seems safe, but may be changed in the future, consider rewriting without shell
