I did not make any changes to the code, but when running the hyperparameter optimization example with the command:
chemprop hpopt --data-path tests/data/regression.csv --task-type regression --search-parameter-keywords depth ffn_num_layers message_hidden_dim --hpopt-save-dir results
I encountered the following error:
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\tune\result_grid.py", line 161, in get_best_result
raise RuntimeError(error_msg)
RuntimeError: No best trial found for the given metric: val_loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the filter_nan_and_inf arg to False.

More detailed error information is as follows:
Trial status: 9 ERROR | 1 RUNNING
Current time: 2024-12-16 12:49:33. Total running time: 1min 0s
Logical resource usage: 1.0/32 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:RTX)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status ...ig/ffn_num_layers ...essage_hidden_dim ...loop_config/depth │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_6b81e40e RUNNING 1 2400 4 │
│ TorchTrainer_a24ddccd ERROR 2 1300 3 │
│ TorchTrainer_397b0729 ERROR 2 1100 5 │
│ TorchTrainer_ca0e0541 ERROR 2 1600 3 │
│ TorchTrainer_39df964c ERROR 1 1600 2 │
│ TorchTrainer_7d9cc99c ERROR 2 1700 4 │
│ TorchTrainer_3794041d ERROR 1 2200 3 │
│ TorchTrainer_4ccd20de ERROR 2 1900 3 │
│ TorchTrainer_8a3f1661 ERROR 1 2000 3 │
│ TorchTrainer_95199caf ERROR 2 1400 5 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-16 12:49:35,189 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_6b81e40e
Traceback (most recent call last):
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\air\execution\_internal\event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=23868, ip=127.0.0.1, actor_id=dee898940af38b9810f02cfe01000000, repr=TorchTrainer)
  File "python\ray\_raylet.pyx", line 1879, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1820, in ray._raylet.execute_task.function_executor
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\function_manager.py", line 696, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\tune\trainable\trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\air\_internal\util.py", line 107, in run
    self._ret = self._target(*self._args, **self._kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\tune\trainable\function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\tune\trainable\function_trainable.py", line 250, in _trainable_func
    output = fn()
             ^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\data_parallel_trainer.py", line 459, in training_loop
    backend_executor.start()
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\_internal\backend_executor.py", line 203, in start
    self._backend.on_start(self.worker_group, self._backend_config)
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\torch\config.py", line 200, in on_start
    ray.get(setup_futures)
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute._setup_torch_process_group() (pid=16908, ip=127.0.0.1, actor_id=5c6e8ff43a38c0b7a3280bd501000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x000001441D027150>)
  File "python\ray\_raylet.pyx", line 1879, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1820, in ray._raylet.execute_task.function_executor
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\_private\function_manager.py", line 696, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\_internal\worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\_internal\worker_group.py", line 30, in __execute
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\train\torch\config.py", line 115, in _setup_torch_process_group
    dist.init_process_group(
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\torch\distributed\c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\torch\distributed\c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\torch\distributed\rendezvous.py", line 269, in _env_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Trial TorchTrainer_6b81e40e errored after 0 iterations at 2024-12-16 12:49:35. Total running time: 1min 2s
Error file: C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/6b81e40e/error.txt
2024-12-16 12:49:35,207 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to 'F:/huangjia/chemprop-main/results/ray_results/TorchTrainer_2024-12-16_12-48-32' in 0.0120s.
Trial status: 10 ERROR
Current time: 2024-12-16 12:49:35. Total running time: 1min 2s
Logical resource usage: 1.0/32 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:RTX)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status ...ig/ffn_num_layers ...essage_hidden_dim ...loop_config/depth │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_a24ddccd ERROR 2 1300 3 │
│ TorchTrainer_397b0729 ERROR 2 1100 5 │
│ TorchTrainer_ca0e0541 ERROR 2 1600 3 │
│ TorchTrainer_39df964c ERROR 1 1600 2 │
│ TorchTrainer_7d9cc99c ERROR 2 1700 4 │
│ TorchTrainer_3794041d ERROR 1 2200 3 │
│ TorchTrainer_4ccd20de ERROR 2 1900 3 │
│ TorchTrainer_8a3f1661 ERROR 1 2000 3 │
│ TorchTrainer_95199caf ERROR 2 1400 5 │
│ TorchTrainer_6b81e40e ERROR 1 2400 4 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Number of errored trials: 10
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name              # failures   error file                                                                                                                                      │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_a24ddccd            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/a24ddccd/error.txt │
│ TorchTrainer_397b0729            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/397b0729/error.txt │
│ TorchTrainer_ca0e0541            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/ca0e0541/error.txt │
│ TorchTrainer_39df964c            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/39df964c/error.txt │
│ TorchTrainer_7d9cc99c            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/7d9cc99c/error.txt │
│ TorchTrainer_3794041d            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/3794041d/error.txt │
│ TorchTrainer_4ccd20de            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/4ccd20de/error.txt │
│ TorchTrainer_8a3f1661            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/8a3f1661/error.txt │
│ TorchTrainer_95199caf            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/95199caf/error.txt │
│ TorchTrainer_6b81e40e            1   C:/Users/hp/AppData/Local/Temp/ray/session_2024-12-16_12-48-26_645095_23484/artifacts/2024-12-16_12-48-32/TorchTrainer_2024-12-16_12-48-32/driver_artifacts/6b81e40e/error.txt │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-12-16 12:49:35,212 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_a24ddccd, TorchTrainer_397b0729, To...
  File "C:\anaconda3\envs\hjpy11\Lib\site-packages\ray\tune\result_grid.py", line 161, in get_best_result
    raise RuntimeError(error_msg)
RuntimeError: No best trial found for the given metric: val_loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the filter_nan_and_inf arg to False.
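
From the traceback, the final "No best trial found for the given metric: val_loss" error appears to be a downstream symptom: every trial crashes while setting up the torch.distributed process group, so no trial ever reports val_loss to Ray Tune. The actual failure is the TCPStore error ("use_libuv was requested but PyTorch was build without libuv support"). A workaround that is often suggested for this PyTorch-on-Windows problem is to disable libuv before launching, since torch.distributed's rendezvous honors the USE_LIBUV environment variable in torch >= 2.4 (I have not verified that chemprop/Ray propagates the variable to the trial worker processes):

(hjpy11) F:\huangjia\chemprop-main>set USE_LIBUV=0
(hjpy11) F:\huangjia\chemprop-main>chemprop hpopt --data-path tests/data/regression.csv --task-type regression --search-parameter-keywords depth ffn_num_layers message_hidden_dim --hpopt-save-dir results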
This is my environment:

Windows 10

(hjpy11) F:\huangjia\chemprop-main>pip list
Package Version
aimsim_core 2.2.2
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.2
astartes 1.3.0
attrs 24.2.0
certifi 2024.12.14
charset-normalizer 3.4.0
chemprop 2.1.0
click 8.1.7
cloudpickle 3.1.0
colorama 0.4.6
ConfigArgParse 1.7
descriptastorus 2.8.0
dill 0.3.9
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.10.0
future 1.0.0
hyperopt 0.2.7
idna 3.10
Jinja2 3.1.4
joblib 1.4.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lightning 2.4.0
lightning-utilities 0.11.9
markdown-it-py 3.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
mhfp 1.9.6
mordredcommunity 2.0.6
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.17
networkx 3.4.2
numpy 1.26.4
packaging 24.2
padelpy 0.1.16
pandas 2.2.3
pandas-flavor 0.6.0
pillow 11.0.0
pip 24.2
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
py4j 0.10.9.7
pyarrow 18.1.0
Pygments 2.18.0
python-dateutil 2.9.0.post0
pytorch-lightning 2.4.0
pytz 2024.2
PyYAML 6.0.2
ray 2.40.0
rdkit 2024.3.6
referencing 0.35.1
requests 2.32.3
rich 13.9.4
rpds-py 0.22.3
scikit-learn 1.6.0
scipy 1.14.1
setuptools 75.1.0
six 1.17.0
sympy 1.13.1
tabulate 0.9.0
tensorboardX 2.6.2.2
threadpoolctl 3.5.0
torch 2.5.1
torchmetrics 1.6.0
tqdm 4.67.1
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
wheel 0.44.0
xarray 2024.11.0
yarl 1.18.3
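
To confirm the root cause independently of Ray and chemprop, a minimal sketch like the following should reproduce the failure directly (illustrative only, not chemprop code; it assumes torch.distributed.TCPStore accepts the use_libuv keyword, which it does in torch >= 2.4):

from datetime import timedelta

import torch.distributed as dist

# Try the libuv-backed TCPStore first (the default since torch 2.4),
# then fall back to the legacy (non-libuv) implementation.
try:
    dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                  timeout=timedelta(seconds=5), use_libuv=True)
    print("libuv TCPStore: OK")
except RuntimeError as err:
    print(f"libuv TCPStore failed: {err}")
    dist.TCPStore("127.0.0.1", 29501, world_size=1, is_master=True,
                  timeout=timedelta(seconds=5), use_libuv=False)
    print("legacy TCPStore: OK, so USE_LIBUV=0 should avoid the crash")

If the first call raises the same "use_libuv was requested but PyTorch was build without libuv support" error while the second succeeds, the problem lies in the torch 2.5.1 Windows build rather than in chemprop or Ray.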