
[v2 BUG]: Chemprop hopt no best trial found error #1129

Closed
shivashankarv opened this issue Dec 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

shivashankarv commented Dec 18, 2024

Describe the bug
While running chemprop hpopt I get the same error as #1123.

Command: chemprop hpopt --logfile results/chemprop/mordred_1600_hpopt/20241218/hpopt.log -v --data-path data_noix.csv -o results/chemprop/mordred_1600_hpopt/20241218/ -t regression --search-parameter-keywords all --hpopt-save-dir results/chemprop/mordred_1600_hpopt/20241218/hpopt_save/ --raytune-use-gpu --raytune-num-cpus 12

Example(s)
Command: chemprop hpopt --logfile results/chemprop/mordred_1600_hpopt/20241218/hpopt.log -v --data-path data_noix.csv -o results/chemprop/mordred_1600_hpopt/20241218/ -t regression --search-parameter-keywords all --hpopt-save-dir results/chemprop/mordred_1600_hpopt/20241218/hpopt_save/ --raytune-use-gpu --raytune-num-cpus `nproc`

Expected behavior
For hpopt to run to completion.

Error Stack Trace
Error traceback:

2024-12-18 17:39:05,484 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_b1d72306, TorchTrainer_d2bbe02f, TorchTrainer_b4f4fb34, TorchTrainer_17c3f9ad, TorchTrainer_2499b783, TorchTrainer_3fcba8ea, TorchTrainer_75727c0f, TorchTrainer_ef1bdae1, TorchTrainer_6f9eba6a, TorchTrainer_581da13f]
2024-12-18 17:39:05,491 WARNING experiment_analysis.py:558 -- Could not find best trial. Did you pass the correct `metric` parameter?
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/chemprop/bin/chemprop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/main.py", line 85, in main
    func(args)
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/hpopt.py", line 122, in func
    main(args)
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/hpopt.py", line 501, in main
    best_result = results.get_best_result()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/ray/tune/result_grid.py", line 161, in get_best_result
    raise RuntimeError(error_msg)
RuntimeError: No best trial found for the given metric: val_loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.
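For context, this RuntimeError is raised by Ray Tune's ResultGrid.get_best_result() whenever no trial ever reported the requested metric, which is exactly what happens when every trial crashes before its first validation epoch. A minimal, self-contained sketch (not Chemprop's actual code) that reproduces the same error:

# Minimal sketch: every trial errors out before reporting "val_loss",
# so ResultGrid.get_best_result() has nothing to rank and raises.
from ray import tune

def failing_trainable(config):
    # Stand-in for a trial that crashes (e.g. out of memory) before reporting any metric
    raise MemoryError("ran out of memory before reporting val_loss")

tuner = tune.Tuner(
    failing_trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-2)},
    tune_config=tune.TuneConfig(num_samples=2),
)
results = tuner.fit()  # prints "Trials did not complete: [...]" as above

# Raises RuntimeError: No best trial found for the given metric: val_loss.
best = results.get_best_result(metric="val_loss", mode="min")

So the message itself is a symptom; the real problem is whatever made every trial fail.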


Environment
OS: Amazon Linux 2023

Python package versions:

  • pytorch-cuda=12.4
  • pytorch-lightning=2.4.0
  • chemprop==2.1.0

CUDA 12.3

Checklist

  • all dependencies are satisfied: conda list or pip list shows the packages listed in the pyproject.toml
  • the unit tests are working: pytest -v reports no errors


@JacksonBurns (Member)

Can you post the full stack trace? Something is going wrong inside the trials.

@KnathanM (Member)

@shivashankarv You can find the full stack trace in the error file listed just above the stack trace you gave. It will look something like:

Number of errored trials: 1
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                # failures   error file                                                                                                                                           │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_4e0ccaf7              1   /tmp/ray/session_2024-12-18_14-55-10_780475_10000/artifacts/2024-12-18_14-55-15/TorchTrainer_2024-12-18_14-55-15/driver_artifacts/4e0ccaf7/error.txt │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@shivashankarv (Author)

Hi,

I thought it would be useful to log the entire hpopt run, so I'm attaching the full log here. It seems like there were several out-of-memory errors. Let me know if this suffices.

error.log

@JacksonBurns (Member)

The CPU is just running out of main memory; Ray says Python is taking 40+ GB at termination. hpopt is only running one concurrent trial, so I think the dataset is just too big. Try downsampling your data and re-running, or else confirm that the data actually isn't too big, in which case it might be something else.
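If it helps, here is a minimal downsampling sketch (assumptions: data_noix.csv is a plain CSV that pandas can read, and a 10% random subset is representative enough for hyperparameter search; adjust frac and the output name as needed):

# Hypothetical helper: keep a random 10% of the rows so the hpopt run fits in memory.
import pandas as pd

df = pd.read_csv("data_noix.csv")
df.sample(frac=0.1, random_state=0).to_csv("data_noix_small.csv", index=False)

Then point --data-path at the smaller file.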

@shihchengli (Contributor)

If your dataset is too large, you can choose not to use caching by adding the --no-cache flag.
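
For example (illustrative, with the non-essential flags from the original command omitted):

Command: chemprop hpopt --data-path data_noix.csv -t regression --search-parameter-keywords all --raytune-use-gpu --raytune-num-cpus 12 --no-cache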

@shivashankarv (Author)

Thank you for the quick reply and for looking through the logs. The dataset is pretty big, ~6.8 GB. I reduced the number of CPUs and it has been running since last night! Thank you so much for your help and for helping me parse the error.
