Commit
Add CLI pages for hpopt and fingerprint (#914)
Co-authored-by: Hao-Wei Pang <45482070+hwpang@users.noreply.github.com>
Co-authored-by: Joel Manu <joelmanu@mit.edu>
Co-authored-by: Shih-Cheng Li <scli@mit.edu>
4 people authored Jun 21, 2024
1 parent a40cc0a commit 6982ed3
Showing 7 changed files with 138 additions and 51 deletions.
7 changes: 3 additions & 4 deletions docs/source/quickstart.rst
@@ -67,10 +67,9 @@ In the rest of this documentation, we'll go into more detail about how to:
* :ref:`Customize model architecture and task type<train>`
* :ref:`Specify training parameters: split type, learning rate, batch size, loss function, etc. <train>`
* :ref:`Use Chemprop as a Python package <notebooks>`

..
Optimize hyperparameters
* :ref:`Quantify prediction uncertainty<predict>`
* :ref:`Perform a hyperparameter optimization <hpopt>`
* :ref:`Generate a molecular fingerprint <fingerprint>`
.. * :ref:`Quantify prediction uncertainty<predict>`
Summary
-------
34 changes: 34 additions & 0 deletions docs/source/tutorial/cli/fingerprint.rst
@@ -0,0 +1,34 @@
.. _fingerprint:

Fingerprint
============================

To calculate the learned representations (encodings) of model inputs from a pretrained model, run

.. code-block::

    chemprop fingerprint --test-path <test_path> --model-path <model_path>

where :code:`<test_path>` is the path to the CSV file containing SMILES strings, and :code:`<model_path>` is the location of the checkpoint(s) or model file(s) to use for prediction. It can be a path to a single pretrained model checkpoint (.ckpt), a single pretrained model file (.pt), a directory containing such files, or a list of paths and directories. If a directory is given, Chemprop will recursively search it and predict with all of the (.pt) models it finds. By default, the encodings will be saved to the same directory as the test path. If desired, a different directory can be specified with :code:`--output <path>`. The output :code:`<path>` can end in either .csv or .npz, and the encodings will be saved in the corresponding file format.

For example:

.. code-block::

    chemprop fingerprint --test-path tests/data/smis.csv \
        --model-path tests/data/example_model_v2_regression_mol.ckpt \
        --output fps.csv

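The encodings can instead be written to a compressed NumPy file by giving the output path an ``.npz`` extension. A minimal sketch, reusing the example files above:

.. code-block::

    chemprop fingerprint --test-path tests/data/smis.csv \
        --model-path tests/data/example_model_v2_regression_mol.ckpt \
        --output fps.npz
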
Specifying FFN encoding layer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the encodings are returned from the penultimate linear layer of the model's FFN. However, the exact layer to draw encodings from can be specified using :code:`--ffn-block-index <index>`.

An index of 0 will simply return the post-aggregation representation without passing it through the FFN. An index of 1 will return the output of the first linear layer of the FFN, an index of 2 the output of the second layer, and so on.
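
For example, a minimal sketch of retrieving the post-aggregation representation directly (index 0), reusing the example files above:

.. code-block::

    chemprop fingerprint --test-path tests/data/smis.csv \
        --model-path tests/data/example_model_v2_regression_mol.ckpt \
        --ffn-block-index 0 \
        --output fps.csv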


Specifying Data to Parse
^^^^^^^^^^^^^^^^^^^^^^^^

:code:`fingerprint` shares the same arguments for specifying SMILES columns and reaction types as :code:`predict`. For more detail, see :ref:`predict`.
79 changes: 79 additions & 0 deletions docs/source/tutorial/cli/hpopt.rst
@@ -0,0 +1,79 @@
.. _hpopt:

Hyperparameter Optimization
============================

.. note::

    Chemprop relies on `Ray Tune <https://docs.ray.io/en/latest/tune/index.html>`_ for hyperparameter optimization, which is an optional install and is not yet compatible with Python 3.12. To install the required dependencies, make sure your Python version is 3.11 and run :code:`pip install -U ray[tune]` if installing from PyPI, or :code:`pip install -e .[hpopt]` if installing from source.

Searching Hyperparameter Space
--------------------------------

We include an automated hyperparameter optimization procedure through the Ray Tune package. Hyperparameter optimization can be run as follows:

.. code-block::

    chemprop hpopt --data-path <data_path> --task-type <task> --search-parameter-keywords <keywords> --hpopt-save-dir <save_dir>

For example:

.. code-block::

    chemprop hpopt --data-path tests/data/regression.csv \
        --task-type regression \
        --search-parameter-keywords depth ffn_num_layers message_hidden_dim \
        --hpopt-save-dir results

The search parameters can be any combination of hyperparameters or a predefined set. Options include :code:`basic` (default), which consists of:

* :code:`depth` The number of message passing steps
* :code:`ffn_num_layers` The number of layers in the FFN model
* :code:`dropout` The probability (from 0.0 to 1.0) of dropout in the MPNN & FFN layers
* :code:`message_hidden_dim` The hidden dimension in the message passing step
* :code:`ffn_hidden_dim` The hidden dimension in the FFN model

Another option is :code:`learning_rate` which includes:

* :code:`max_lr` The maximum learning rate
* :code:`init_lr` The initial learning rate. It is searched as a ratio relative to the max learning rate
* :code:`final_lr` The final learning rate. It is searched as a ratio relative to the max learning rate
* :code:`warmup_epochs` Number of warmup epochs, during which the learning rate linearly increases from the initial to the maximum learning rate

Other individual search parameters include:

* :code:`activation` The activation function used in the MPNN & FFN layers. Choices include ``relu``, ``leakyrelu``, ``prelu``, ``tanh``, ``selu``, and ``elu``
* :code:`aggregation` The aggregation mode used to pool atomic representations into a molecule-level representation. Choices include ``mean``, ``sum``, ``norm``
* :code:`aggregation_norm` For ``norm`` aggregation, the normalization factor by which atomic features are divided
* :code:`batch_size` Batch size for dataloader

Specifying :code:`--search-parameter-keywords all` will search over all 13 of the above parameters.
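
For example, a minimal sketch of searching only the learning-rate schedule, reusing the example dataset above:

.. code-block::

    chemprop hpopt --data-path tests/data/regression.csv \
        --task-type regression \
        --search-parameter-keywords learning_rate \
        --hpopt-save-dir results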

The following other common keywords may be used:

* :code:`--raytune-num-samples <num_samples>` The number of trials to perform
* :code:`--raytune-num-cpus <num_cpus>` The number of CPUs to use
* :code:`--raytune-num-gpus <num_gpus>` The number of GPUs to use
* :code:`--raytune-max-concurrent-trials <num_trials>` The maximum number of concurrent trials
* :code:`--raytune-search-algorithm <algorithm>` The search algorithm to use (either ``random``, ``hyperopt``, or ``optuna``). If ``hyperopt`` is specified, the arguments ``--hyperopt-n-initial-points <num_points>`` and ``--hyperopt-random-state-seed <seed>`` can also be specified.

Other keywords related to hyperparameter optimization are also available (see :ref:`cmd` for a full list).
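
Several of these keywords can be combined in a single run. An illustrative sketch (the sample count, initial-point count, and seed below are arbitrary example values):

.. code-block::

    chemprop hpopt --data-path tests/data/regression.csv \
        --task-type regression \
        --search-parameter-keywords basic \
        --hpopt-save-dir results \
        --raytune-num-samples 20 \
        --raytune-search-algorithm hyperopt \
        --hyperopt-n-initial-points 5 \
        --hyperopt-random-state-seed 42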

Splitting
----------

By default, Chemprop will split the data into train / validation / test splits. The splitting behavior can be modified using the same splitting arguments used in training; see :ref:`train_validation_test_splits`.

.. note::

    This default splitting behavior is different from Chemprop v1, wherein the hyperparameter optimization was performed on the entirety of the data provided to it.

If ``--num-folds`` is greater than one, Chemprop will only use the first split to perform hyperparameter optimization. If you need to optimize hyperparameters separately for several different cross-validation splits, you can, for example, set up a bash script that runs :code:`chemprop hpopt` separately on each split, as sketched below.
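
A rough sketch, assuming the data have already been divided into hypothetical per-split files named ``split_0.csv``, ``split_1.csv``, and ``split_2.csv``:

.. code-block::

    # split_*.csv and results_split_* are hypothetical names used only for illustration
    for i in 0 1 2; do
        chemprop hpopt --data-path split_${i}.csv \
            --task-type regression \
            --search-parameter-keywords basic \
            --hpopt-save-dir results_split_${i}
    done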


Applying Optimal Hyperparameters
---------------------------------

Once hyperparameter optimization is complete, the optimal hyperparameters can be applied during training by specifying the config path. If an argument is provided via both the command line and the config file, the command-line value takes precedence. For example:

.. code-block::

    chemprop train --data-path tests/data/regression.csv \
        --config-path results/best_config.toml

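Because command-line arguments take precedence over the config file, individual settings can still be overridden at training time. An illustrative sketch (the epoch count here is only an example value):

.. code-block::

    chemprop train --data-path tests/data/regression.csv \
        --config-path results/best_config.toml \
        --epochs 100
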
27 changes: 0 additions & 27 deletions docs/source/tutorial/cli/hyperopt.rst

This file was deleted.

13 changes: 8 additions & 5 deletions docs/source/tutorial/cli/index.rst
@@ -17,6 +17,8 @@ where ``COMMAND`` is one of the following:
* ``train``: Train a model.
* ``predict``: Make predictions with a trained model.
* ``convert``: Convert a trained Chemprop model from v1 to v2.
* ``hpopt``: Perform hyperparameter optimization.
* ``fingerprint``: Use a trained model to compute a learned representation.

and ``ARGS`` are command-specific arguments. To see the arguments for a specific command, run:

@@ -42,18 +42,19 @@ For more details on each command, see the corresponding section below:
* :ref:`train`
* :ref:`predict`
* :ref:`convert`
* :ref:`hpopt`
* :ref:`fingerprint`

The following features are not yet implemented, but will soon be included in a future release:
The following features are not yet implemented, but will be included in a future release:

* ``hyperopt``: Perform hyperparameter optimization.
* ``interpret``: Interpret model predictions.
* ``fingerprint``: Use a trained model to compute a learned representation.


.. toctree::
    :maxdepth: 1
    :hidden:

    train
    predict
    convert
    hpopt
    fingerprint
4 changes: 2 additions & 2 deletions docs/source/tutorial/cli/predict.rst
@@ -7,9 +7,9 @@ To load a trained model and make predictions, run:

.. code-block::

    chemprop predict --test-path <test_path> --model-path <model_path>

where :code:`<test_path>` is the path to the data to test on, and :code:`<model_path>` is the path to the trained model. By default, predictions will be saved to the same directory as the test path. If desired, a different directory can be specified by using :code:`--preds-path <path>`
where :code:`<test_path>` is the path to the data to test on, and :code:`<model_path>` is the location of the checkpoint(s) or model file(s) to use for prediction. It can be a path to a single pretrained model checkpoint (.ckpt), a single pretrained model file (.pt), a directory containing such files, or a list of paths and directories. If a directory is given, Chemprop will recursively search it and predict with all of the (.pt) models it finds. By default, predictions will be saved to the same directory as the test path. If desired, a different directory can be specified with :code:`--preds-path <path>`. The predictions :code:`<path>` can end in either .csv or .pkl, and the output will be saved in the corresponding file format.

For example:

25 changes: 12 additions & 13 deletions docs/source/tutorial/cli/train.rst
@@ -7,17 +7,17 @@ To train a model, run:

.. code-block::

    chemprop train --data-path <input_path> --task-type <task> --output-dir <dir>

where ``<input_path>`` is the path to a CSV file containing a dataset, ``<task>`` is the type of modeling task, and ``<dir>`` is the directory where model checkpoints will be saved.

For example:

.. code-block::

    chemprop train --data-path tests/data/regression.csv \
        --task-type regression \
        --output-dir solubility_checkpoints

The following modeling tasks are supported:

@@ -42,13 +42,14 @@ The data file must be a **CSV file with a header row**. For example:

.. code-block::

    smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
    CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
    CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
    ...

By default, it is assumed that the SMILES are in the first column and the targets are in the remaining columns. However, the specific columns containing the SMILES and targets can be specified using the :code:`--smiles-columns <column>` and :code:`--target-columns <column_1> <column_2> ...` flags, respectively. To train on multiple molecules simultaneously (such as a solute and a solvent), supply two column headers in :code:`--smiles-columns <columns>`.
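
For example, assuming the CSV shown above were saved as a hypothetical file ``tox21.csv``, a sketch of training on just two of its target columns:

.. code-block::

    # tox21.csv is a hypothetical filename for the example data shown above
    chemprop train --data-path tox21.csv \
        --task-type classification \
        --smiles-columns smiles \
        --target-columns NR-AR NR-AhR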

.. _train_validation_test_splits:

Train/Validation/Test Splits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -84,7 +85,7 @@ Model performance is often highly dependent on the hyperparameters used. Below i
* :code:`--message-hidden-dim <n>` Hidden dimension of the messages in the MPNN (default 300)
* :code:`--depth <n>` Number of message-passing steps (default 3)
* :code:`--dropout <n>` Dropout probability in the MPNN & FFN layers (default 0)
* :code:`--activation <activation_type>` The activation function used in the MPNN and FNN layers. Options include :code:`relu`, :code:`leakyrelu`, :code:`prelu`, :code:`tanh`, :code:`selu`, and :code:`elu`. (default :code:`relu``)
* :code:`--activation <activation_type>` The activation function used in the MPNN and FFN layers. Options include :code:`relu`, :code:`leakyrelu`, :code:`prelu`, :code:`tanh`, :code:`selu`, and :code:`elu`. (default :code:`relu`)
* :code:`--epochs <n>` How many epochs to train over (default 50)
* :code:`--warmup-epochs <n>`: The number of epochs during which the learning rate is linearly incremented from :code:`init_lr` to :code:`max_lr` (default 2)
* :code:`--init_lr <n>` Initial learning rate (default 0.0001)
@@ -173,8 +174,6 @@ Pretraining
It is possible to freeze the weights of a loaded model during training, such as for transfer learning applications. To do so, specify :code:`--model-frzn <path>` where :code:`<path>` refers to a model's checkpoint file that will be used to overwrite and freeze the model weights. The following flags may be used:

* :code:`--frzn-ffn-layers <n>` Overwrites weights for the first n layers of the FFN from the checkpoint (default 0)
.. * :code:`--freeze-first-only` Determines whether to use the loaded checkpoint for just the first encoder. Only relevant if the number of molecules is greater than one, i.e. two SMILES columns are provided for training (default :code:`false`)

.. _train-on-reactions:

@@ -228,7 +227,7 @@ While the model works very well on its own, especially after hyperparameter opti


Atom-Level Features/Descriptors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can provide additional atom features via :code:`--atom-features-path /path/to/atom/features.npz` as a numpy :code:`.npz` file. This option concatenates the features to each atomic feature vector before the D-MPNN, so that they are used during message-passing. The file can be saved using :code:`np.savez("atom_features.npz", *V_fs)`, where :code:`V_fs` is a list containing the atom features :code:`V_f` for each molecule, and each :code:`V_f` is a 2D array of shape (number of atoms, number of atom features), with the molecules in the exact same order as the SMILES strings in your data file.

@@ -254,7 +253,7 @@ The bond-level features are scaled by default. This can be disabled with the opt
Extra Descriptors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Additional descriptors can be concatenated to the learned representaiton after aggregation. These could be molecule features, for example. If you install from source, you can modify the code to load custom descriptors as follows:
Additional descriptors can be concatenated to the learned representation after aggregation. These could be molecule features, for example. If you install from source, you can modify the code to load custom descriptors as follows:

1. **Generate features:** If you want to generate molecule features in code, you can write a custom features generator function using the default featurizers in :code:`chemprop/featurizers/`. This also works for custom atom and bond features.
2. **Load features:** Additional descriptors can be provided using :code:`--descriptors-path /path/to/descriptors.npz` as a numpy :code:`.npz` file. This file can be saved using :code:`np.savez("/path/to/descriptors.npz", X_d)`, where :code:`X_d` is a 2D array of shape (number of datapoints, number of additional descriptors). Note that the descriptors must be in the same order as the SMILES strings in your data file. The extra descriptors are scaled by default. This can be disabled with the option :code:`--no-descriptor-scaling`.
