
[Core] Support aarch64 -- causing docker on M1 build and runtime errors #28103

Closed
@lundybernard

Description

What happened + What you expected to happen

We are building a docker-compose config that wires up Ray as a single node or a cluster alongside additional services (logging, S3 storage, etc.), to provide an "easy button" for our users. Many of our users are now on Apple M1 (arm64) hardware, and we need to support them.

Best Case:

Use the rayproject/ray (or ray-ml) container directly:

version: "3"

services:
  ray-head:
    image: rayproject/ray
    command: "ray start -v --head --port=6377 --redis-shard-ports=6380,6381 --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block"

docker-compose up results in a timeout error:

ray-head_1    | 2022-08-25 09:32:31,135 INFO scripts.py:715 -- Local node IP: 172.26.0.2
ray-head_1    | Traceback (most recent call last):
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 307, in __init__
ray-head_1    |     self.redis_password,
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 397, in wait_for_node
ray-head_1    |     raise TimeoutError("Timed out while waiting for node to startup.")
ray-head_1    | TimeoutError: Timed out while waiting for node to startup.
ray-head_1    | 
ray-head_1    | During handling of the above exception, another exception occurred:
ray-head_1    | 
ray-head_1    | Traceback (most recent call last):
ray-head_1    |   File "/home/ray/anaconda3/bin/ray", line 8, in <module>
ray-head_1    |     sys.exit(main())
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2339, in main
ray-head_1    |     return cli()
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
ray-head_1    |     return self.main(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
ray-head_1    |     rv = self.invoke(ctx)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
ray-head_1    |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
ray-head_1    |     return ctx.invoke(self.callback, **ctx.params)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
ray-head_1    |     return __callback(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
ray-head_1    |     return f(*args, **kwargs)
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 738, in start
ray-head_1    |     ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block
ray-head_1    |   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 311, in __init__
ray-head_1    |     "The current node has not been updated within 30 "
ray-head_1    | Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
dc_ray-head_1 exited with code 1

Running the image with amd64 emulation:

docker run --platform linux/x86_64 rayproject/ray ray start -v --head --port=6377 --redis-shard-ports=6380,6381 --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block

results in a "Some Ray subprcesses exited unexpectedly" raylet error (message quoted verbatim from the log):

2022-08-25 09:27:57,870 INFO scripts.py:715 -- Local node IP: 172.17.0.2
2022-08-25 09:28:07,345 SUCC scripts.py:757 -- --------------------
2022-08-25 09:28:07,345 SUCC scripts.py:758 -- Ray runtime started.
2022-08-25 09:28:07,345 SUCC scripts.py:759 -- --------------------
2022-08-25 09:28:07,346 INFO scripts.py:761 -- Next steps
2022-08-25 09:28:07,346 INFO scripts.py:762 -- To connect to this Ray runtime from another node, run
2022-08-25 09:28:07,346 INFO scripts.py:767 --   ray start --address='172.17.0.2:6377'
2022-08-25 09:28:07,346 INFO scripts.py:770 -- Alternatively, use the following Python code:
2022-08-25 09:28:07,347 INFO scripts.py:772 -- import ray
2022-08-25 09:28:07,347 INFO scripts.py:785 -- ray.init(address='auto')
2022-08-25 09:28:07,347 INFO scripts.py:789 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-08-25 09:28:07,347 INFO scripts.py:793 -- connect to a remote cluster from your laptop directly, use the following
2022-08-25 09:28:07,348 INFO scripts.py:796 -- Python code:
2022-08-25 09:28:07,348 INFO scripts.py:798 -- import ray
2022-08-25 09:28:07,348 INFO scripts.py:804 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-08-25 09:28:07,349 INFO scripts.py:810 -- If connection fails, check your firewall settings and network configuration.
2022-08-25 09:28:07,349 INFO scripts.py:816 -- To terminate the Ray runtime, run
2022-08-25 09:28:07,349 INFO scripts.py:817 --   ray stop
2022-08-25 09:28:07,349 INFO scripts.py:892 -- --block
2022-08-25 09:28:07,350 INFO scripts.py:894 -- This command will now block until terminated by a signal.
2022-08-25 09:28:07,350 INFO scripts.py:897 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-08-25 09:28:10,364 ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
2022-08-25 09:28:10,365 ERR scripts.py:914 -- raylet [exit code=1]
2022-08-25 09:28:10,365 ERR scripts.py:919 -- Remaining processes will be killed.
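For completeness, the same emulation can be requested from compose itself via the `platform` key (a minimal sketch, assuming a Compose version that supports per-service `platform`; this still hits the raylet error at runtime, since it is the same amd64 image running under qemu):

```yaml
version: "3.8"

services:
  ray-head:
    image: rayproject/ray
    # Force amd64 emulation on Apple M1 hosts.
    platform: linux/amd64
    command: "ray start -v --head --port=6377 --dashboard-host=0.0.0.0 --block"
```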

Workable Solution:

Build a Ray image locally by installing the package in a Dockerfile.

Native build

#FROM --platform=linux/amd64 python:3.9
#FROM --platform=linux/amd64 continuumio/miniconda3
FROM continuumio/miniconda3

#RUN conda install python=3.10
#RUN conda update conda && conda update pip

# fix grpcio for M1
#RUN pip uninstall grpcio; conda install grpcio

# Install Ray
#RUN pip install --no-cache-dir ray[default]~=1.13 ray[serve]~=1.13
RUN uname -m && \
    uname -a && \
    python --version && \
    pip --version && \
    conda install -c conda-forge ray-core
    #pip install ray

This fails to build when using pip:

 > [5/5] RUN uname -m &&     uname -a &&     python --version &&     pip --version &&     pip install setuptools &&     pip install ray:                                                                                                  
#8 0.216 aarch64                                                                                                     
#8 0.216 Linux buildkitsandbox 5.10.109-0-virt #1-Alpine SMP Mon, 28 Mar 2022 11:20:52 +0000 aarch64 GNU/Linux       
#8 0.218 Python 3.10.4                                                                                               
#8 0.373 pip 22.1.2 from /opt/conda/lib/python3.10/site-packages/pip (python 3.10)
#8 0.600 Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (63.4.1)
#8 0.643 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#8 1.244 ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
#8 1.244 ERROR: No matching distribution found for ray

and when using conda:

 > [5/5] RUN uname -m &&     uname -a &&     python --version &&     pip --version &&     pip install setuptools &&     conda install -c conda-forge ray-core #pip install --no-cache-dir ray:                                            
#8 0.343 aarch64                                                                                                     
#8 0.343 Linux buildkitsandbox 5.10.109-0-virt #1-Alpine SMP Mon, 28 Mar 2022 11:20:52 +0000 aarch64 GNU/Linux       
#8 0.350 Python 3.10.4                                                                                               
#8 0.484 pip 22.1.2 from /opt/conda/lib/python3.10/site-packages/pip (python 3.10)
#8 0.686 Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (63.4.1)
#8 0.724 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#8 4.660 Collecting package metadata (current_repodata.json): ...working... done
#8 7.763 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#8 7.764 Collecting package metadata (repodata.json): ...working... done
#8 21.87 Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
#8 21.88 
#8 21.88 PackagesNotFoundError: The following packages are not available from current channels:
#8 21.88 
#8 21.88   - ray-core
#8 21.88 
#8 21.88 Current channels:
#8 21.88 
#8 21.88   - https://conda.anaconda.org/conda-forge/linux-aarch64
#8 21.88   - https://conda.anaconda.org/conda-forge/noarch
#8 21.88   - https://repo.anaconda.com/pkgs/main/linux-aarch64
#8 21.88   - https://repo.anaconda.com/pkgs/main/noarch
#8 21.88   - https://repo.anaconda.com/pkgs/r/linux-aarch64
#8 21.88   - https://repo.anaconda.com/pkgs/r/noarch
#8 21.88 
#8 21.88 To search for alternate channels that may provide the conda package you're
#8 21.88 looking for, navigate to
#8 21.88 
#8 21.88     https://anaconda.org
#8 21.88 
#8 21.88 and use the search bar at the top of the page.
#8 21.88 
#8 21.88 

platform=linux/amd64 emulation build

Setting the image to emulate amd64 with platform=linux/amd64, we can build successfully using either pip or conda; however, at runtime we get the raylet subprocess error shown above.
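For reference, the emulated build amounts to uncommenting the platform pin in the Dockerfile above (a minimal sketch, reusing the base image and install line from that file; the raylet crash still occurs when the resulting container runs under qemu):

```dockerfile
# Force amd64 so the prebuilt x86_64 Ray wheels resolve; runs under qemu on M1.
FROM --platform=linux/amd64 python:3.9

# Succeeds under amd64 because PyPI has x86_64 wheels for ray.
RUN pip install --no-cache-dir "ray[default]~=2.0"
```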

Worst Case:

Build Ray from source in a local image. Undesirable due to build time and the extra work for users, but I am willing to test it on M1 hardware with some guidance.

raylet Logs:

> cat logs/session_2022-08-25_15-27-56_654718_1/logs/raylet.*
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  bind: Operation not permitted
[2022-08-25 15:27:59,090 E 53 68] logging.cc:414: *** Aborted at 1661441279 (unix time) try "date -d @1661441279" if you are using GNU date ***
[2022-08-25 15:27:59,092 E 53 68] logging.cc:414: PC: @                0x0 (unknown)
[2022-08-25 15:27:59,094 E 53 68] logging.cc:414: *** SIGABRT (@0x35) received by PID 53 (TID 0x40379c7700) from PID 53; stack trace: ***
[2022-08-25 15:27:59,103 E 53 68] logging.cc:414:     @       0x400062e55f google::(anonymous namespace)::FailureSignalHandler()
[2022-08-25 15:27:59,104 E 53 68] logging.cc:414:     @       0x40027c7140 (unknown)
[2022-08-25 15:27:59,105 E 53 68] logging.cc:414:     @       0x400282fce1 gsignal
[2022-08-25 15:27:59,106 E 53 68] logging.cc:414:     @       0x4002819537 abort
[2022-08-25 15:27:59,110 E 53 68] logging.cc:414:     @       0x40025a5872 __gnu_cxx::__verbose_terminate_handler()
[2022-08-25 15:27:59,111 E 53 68] logging.cc:414:     @       0x40025a3f6f __cxxabiv1::__terminate()
[2022-08-25 15:27:59,162 E 53 68] logging.cc:414:     @       0x40025a3fb1 std::terminate()
[2022-08-25 15:27:59,164 E 53 68] logging.cc:414:     @       0x40025a419a __cxa_throw
[2022-08-25 15:27:59,167 E 53 68] logging.cc:414:     @       0x400012b564 boost::throw_exception<>()
[2022-08-25 15:27:59,168 E 53 68] logging.cc:414:     @       0x4000963d8d boost::asio::detail::do_throw_error()
[2022-08-25 15:27:59,170 E 53 68] logging.cc:414:     @       0x4000290087 plasma::PlasmaStore::PlasmaStore()
[2022-08-25 15:27:59,171 E 53 68] logging.cc:414:     @       0x4000297741 plasma::PlasmaStoreRunner::Start()
[2022-08-25 15:27:59,173 E 53 68] logging.cc:414:     @       0x4000236bac std::thread::_State_impl<>::_M_run()
[2022-08-25 15:27:59,182 E 53 68] logging.cc:414:     @       0x40025c0039 execute_native_thread_routine
[2022-08-25 15:27:59,183 E 53 68] logging.cc:414:     @       0x40027bbea7 start_thread
[2022-08-25 15:27:59,184 E 53 68] logging.cc:414:     @       0x40028f1def clone
[2022-08-25 15:27:58,990 I 53 53] io_service_pool.cc:36: IOServicePool is running with 1 io_service.
[2022-08-25 15:27:59,008 I 53 53] store_runner.cc:31: Allowing the Plasma store to use up to 0.59385GB of memory.
[2022-08-25 15:27:59,009 I 53 53] store_runner.cc:44: Starting object store with directory /dev/shm and huge page support disabled

Versions / Dependencies

Macintosh M1 hardware

uname -a
Darwin MacBook-Pro.local 21.4.0 Darwin Kernel Version 21.4.0: Mon Feb 21 20:35:58 PST 2022; root:xnu-8020.101.4~2/RELEASE_ARM64_T6000 arm64

Python: 3.9, 3.10
docker image arch: linux/amd64, linux/aarch64
Ray: 1.13, 2.0

Reproduction script

Reproduction requires a working docker + docker-compose installation on Apple M1 hardware.
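The compose file from the description, restated as a self-contained repro (save as docker-compose.yml, run docker-compose up on an M1 machine, and observe the startup timeout):

```yaml
# docker-compose.yml — reproduces the node startup timeout on Apple M1
version: "3"

services:
  ray-head:
    image: rayproject/ray
    command: "ray start -v --head --port=6377 --redis-shard-ports=6380,6381 --object-manager-port=22345 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block"
```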

Issue Severity

High: It blocks me from completing my task.

Labels

P1 — Issue that should be fixed within a few weeks
bug — Something that is supposed to be working; but isn't
core — Issues that should be addressed in Ray Core
