
[v2 BUG]: Chemprop hopt no best trial found error #1129

Closed
shivashankarv opened this issue Dec 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

shivashankarv commented Dec 18, 2024

Describe the bug
While running chemprop hpopt I get the same error as #1123.

Command: chemprop hpopt --logfile results/chemprop/mordred_1600_hpopt/20241218/hpopt.log -v --data-path data_noix.csv -o results/chemprop/mordred_1600_hpopt/20241218/ -t regression --search-parameter-keywords all --hpopt-save-dir results/chemprop/mordred_1600_hpopt/20241218/hpopt_save/ --raytune-use-gpu --raytune-num-cpus 12

Example(s)
Command: chemprop hpopt --logfile results/chemprop/mordred_1600_hpopt/20241218/hpopt.log -v --data-path data_noix.csv -o results/chemprop/mordred_1600_hpopt/20241218/ -t regression --search-parameter-keywords all --hpopt-save-dir results/chemprop/mordred_1600_hpopt/20241218/hpopt_save/ --raytune-use-gpu --raytune-num-cpus `nproc`

Expected behavior
For hpopt to run to completion.

Error Stack Trace
Error traceback:

2024-12-18 17:39:05,484 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_b1d72306, TorchTrainer_d2bbe02f, TorchTrainer_b4f4fb34, TorchTrainer_17c3f9ad, TorchTrainer_2499b783, TorchTrainer_3fcba8ea, TorchTrainer_75727c0f, TorchTrainer_ef1bdae1, TorchTrainer_6f9eba6a, TorchTrainer_581da13f]
2024-12-18 17:39:05,491 WARNING experiment_analysis.py:558 -- Could not find best trial. Did you pass the correct `metric` parameter?
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/envs/chemprop/bin/chemprop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/main.py", line 85, in main
    func(args)
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/hpopt.py", line 122, in func
    main(args)
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/chemprop/cli/hpopt.py", line 501, in main
    best_result = results.get_best_result()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda3/envs/chemprop/lib/python3.11/site-packages/ray/tune/result_grid.py", line 161, in get_best_result
    raise RuntimeError(error_msg)
RuntimeError: No best trial found for the given metric: val_loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.
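For context, this RuntimeError is raised by Ray Tune's ResultGrid.get_best_result() whenever no trial ever reported the requested metric, which is exactly what happens when every trial crashes before its first validation epoch. A minimal, self-contained sketch (not Chemprop's actual code) that reproduces the same error:

# Minimal sketch: every trial errors out before reporting "val_loss",
# so ResultGrid.get_best_result() has nothing to rank and raises.
from ray import tune

def failing_trainable(config):
    # Stand-in for a trial that crashes (e.g. out of memory) before reporting any metric
    raise MemoryError("ran out of memory before reporting val_loss")

tuner = tune.Tuner(
    failing_trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-2)},
    tune_config=tune.TuneConfig(num_samples=2),
)
results = tuner.fit()  # prints "Trials did not complete: [...]" as above

# Raises RuntimeError: No best trial found for the given metric: val_loss.
best = results.get_best_result(metric="val_loss", mode="min")

So the message itself is a symptom; the real problem is whatever made every trial fail.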


Environment
OS: Amazon Linux 2023

Python package versions:

  • pytorch-cuda=12.4
  • pytorch-lightning=2.4.0
  • chemprop==2.1.0

CUDA 12.3

Checklist

  • all dependencies are satisfied: conda list or pip list shows the packages listed in the pyproject.toml
  • the unit tests are working: pytest -v reports no errors


@JacksonBurns (Member)

Can you post the full stack trace? Something is going wrong inside the trials.

@KnathanM (Member)

@shivashankarv You can find the full stack trace in the error file listed just above the stack trace you gave. It will look something like:

Number of errored trials: 1
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                # failures   error file                                                                                                                                           │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_4e0ccaf7              1   /tmp/ray/session_2024-12-18_14-55-10_780475_10000/artifacts/2024-12-18_14-55-15/TorchTrainer_2024-12-18_14-55-15/driver_artifacts/4e0ccaf7/error.txt │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@shivashankarv (Author)

Hi,

I thought it would be useful to log the entire hpopt run, so I'm attaching the full log here. It seems like there were several out-of-memory errors. Let me know if this suffices.

error.log

@JacksonBurns (Member)

The CPU is just running out of main memory; Ray says Python is taking 40+ GB at termination. hpopt is only running one concurrent trial, so I think the dataset is just too big. Try downsampling your data and re-running, or else confirm that the data actually isn't too big, in which case it might be something else.
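If it helps, here is a minimal downsampling sketch (assumptions: data_noix.csv is a plain CSV that pandas can read, and a 10% random subset is representative enough for hyperparameter search; adjust frac and the output name as needed):

# Hypothetical helper: keep a random 10% of the rows so the hpopt run fits in memory.
import pandas as pd

df = pd.read_csv("data_noix.csv")
df.sample(frac=0.1, random_state=0).to_csv("data_noix_small.csv", index=False)

Then point --data-path at the smaller file.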

@shihchengli (Contributor)

If your dataset is too large, you can choose not to use caching by adding the --no-cache flag.
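
For example (illustrative, with the non-essential flags from the original command omitted):

Command: chemprop hpopt --data-path data_noix.csv -t regression --search-parameter-keywords all --raytune-use-gpu --raytune-num-cpus 12 --no-cache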

@shivashankarv (Author)

Thank you for the quick reply and for looking through the logs. The dataset is pretty big, ~6.8 GB. I reduced the number of CPUs and it has been running since last night! Thank you so much for your help and for helping me parse the error.
