[RLlib] Docs do-over (new API stack): Remove all old API stack package_ref docs. #49518

Merged
wip
sven1977 committed Dec 27, 2024
commit 8557f5a542cf2035c0918460839f5b19e49a863f
164 changes: 97 additions & 67 deletions doc/source/rllib/algorithm-config.rst
@@ -7,15 +7,15 @@
AlgorithmConfig API
===================

RLlib's :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` API is
the auto-validated and type-safe gateway into configuring and building an RLlib
:py:class:`~ray.rllib.algorithms.algorithm.Algorithm`.

In essence, you first create an instance of :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`
and then call some of its methods to set various configuration options. RLlib uses the following
`black <https://github.com/psf/black>`__ compliant format in all parts of the code.

Note that you can chain together more than one method call, including the constructor:

.. testcode::

@@ -30,7 +30,6 @@ the constructor:
.learners(num_learners=2)
)
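
For reference, a complete version of such a chained configuration could look like the following sketch,
which assumes the ``CartPole-v1`` environment and purely illustrative hyperparameter values:

.. testcode::

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        # RL environment to train in.
        .environment("CartPole-v1")
        # Generic training settings.
        .training(lr=0.0003, train_batch_size_per_learner=2000)
        # Scale out sample collection and model updating.
        .env_runners(num_env_runners=2)
        .learners(num_learners=2)
    )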


.. hint::

For value checking and type-safety reasons, you should never set attributes in your
@@ -49,10 +48,15 @@ the constructor:
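Following this hint, a minimal sketch contrasting the two patterns; the ``lr`` setting is just an illustration:

.. testcode::

    # Discouraged: writing attributes directly bypasses RLlib's value checking.
    # config.lr = 0.0001

    # Recommended: go through the corresponding setter method instead.
    config.training(lr=0.0001)
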
Algorithm specific config classes
---------------------------------

You don't use ``AlgorithmConfig`` directly in practice, but rather use its algorithm-specific
subclasses such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPOConfig`. Each subclass comes
with its own set of additional arguments to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training`
method.

Normally, you should pick the specific :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`
subclass that matches the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`
you would like to run your learning experiments with. For example, if you would like to
use :ref:`IMPALA <impala>` as your algorithm, you should import its specific config class:

.. testcode::

@@ -65,25 +69,34 @@ use :ref:`IMPALA <impala>` as your algorithm, you should import its specific config class:
.training(lr=0.0004)
)
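
Spelled out, such an algorithm-specific setup could look like the following sketch, which assumes the
``IMPALAConfig`` import path shown below and the ``CartPole-v1`` environment as illustrations:

.. testcode::

    from ray.rllib.algorithms.impala import IMPALAConfig

    config = (
        IMPALAConfig()
        .environment("CartPole-v1")
        # `lr` is a generic setting; algo-specific ones go through the same method.
        .training(lr=0.0004)
    )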

To change algorithm-specific settings, here for ``IMPALA``, also use the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training`
method:

.. testcode::

# Change an IMPALA-specific setting (the entropy coefficient).
config.training(entropy_coeff=0.01)


You can build the :py:class:`~ray.rllib.algorithms.impala.IMPALA` instance directly from the
config object by calling the
:py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.build_algo` method:

.. testcode::

# Build the algorithm instance.
impala = config.build_algo()
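
Once built, the instance behaves like any other RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`;
for example, a sketch of running a single training iteration on it:

.. testcode::

    # Run one training iteration and receive a results dict back.
    results = impala.train()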


The config object stored inside any built :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` instance
is a copy of your original config. This allows you to further alter your original config object and
build another algorithm instance without affecting the previously built one:

.. testcode::

# Further alter the config without affecting the previously built IMPALA object ...
config.env_runners(num_env_runners=4)
# ... and build a new IMPALA from it.
another_impala = config.build_algo()


@@ -108,106 +121,123 @@ instance into the constructor of the :py:class:`~ray.tune.tuner.Tuner`:
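For example, a minimal sketch of handing the config over to Ray Tune; the ``"IMPALA"`` trainable name and
the omitted stopping criteria are illustrative simplifications:

.. testcode::

    from ray import tune

    # Tune builds and manages the Algorithm from the config passed as `param_space`.
    tuner = tune.Tuner(
        "IMPALA",
        param_space=config,
    )
    # tuner.fit()  # Starts the actual Tune run.
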
Generic config settings
-----------------------

Most config settings are generic and apply to all of RLlib's :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` classes.
The following sections walk you through the most important config settings to pay close attention to
before diving into the remaining options and before starting hyperparameter fine-tuning.

RL Environment
~~~~~~~~~~~~~~

To configure which RL environment your algorithm trains against, use the ``env`` argument to the
:py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.environment` method:

.. testcode::

    # Use the `CartPole-v1` Farama Gymnasium environment as an example.
    config.environment(env="CartPole-v1")

See this :ref:`RL environment guide <rllib-environments-doc>` for more details.
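
If your environment's constructor takes parameters, pass them through the ``env_config`` argument of the
same method. The following sketch uses a hypothetical, user-registered environment name and config key:

.. testcode::

    # RLlib forwards `env_config` to the environment's constructor.
    # "my_custom_env-v0" and `max_episode_steps` are placeholders here.
    config.environment(env="my_custom_env-v0", env_config={"max_episode_steps": 200})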

Learning rate `lr`
~~~~~~~~~~~~~~~~~~

Set the learning rate for updating your models through the ``lr`` argument to the
:py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training` method:

.. testcode::

config.training(lr=0.0001)


Train batch size
~~~~~~~~~~~~~~~~

Set the train batch size, per Learner actor,
through the ``train_batch_size_per_learner`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training`
method:

.. testcode::

config.training(train_batch_size_per_learner=256)
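
The effective overall batch size per update scales with the number of Learner actors. As a sketch, you can
inspect it through the config's ``total_train_batch_size`` property, which multiplies the per-Learner value
by the number of configured Learners:

.. testcode::

    # `train_batch_size_per_learner * max(1, num_learners)`
    print(config.total_train_batch_size)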


Discount factor `gamma`
~~~~~~~~~~~~~~~~~~~~~~~

Set the `RL discount factor <https://www.envisioning.io/vocab/discount-factor?utm_source=chatgpt.com>`__
through the ``gamma`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training`
method:

.. testcode::

config.training(gamma=0.995)

Scaling with `num_env_runners` and `num_learners`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Set the number of :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors used to collect training samples
through the ``num_env_runners`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners`
method:

.. testcode::

    config.env_runners(num_env_runners=4)

    # Also use `num_envs_per_env_runner` to vectorize your environment on each EnvRunner actor.
    # Note that this option is only available in single-agent setups.
    # The Ray Team is working on a solution for this restriction.
    config.env_runners(num_envs_per_env_runner=10)

Set the number of :py:class:`~ray.rllib.core.learner.learner.Learner` actors used to update your models
through the ``num_learners`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners`
method. This should correspond to the number of GPUs you have available for training.

.. testcode::

    config.learners(num_learners=2)
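
To place each Learner on its own accelerator, combine this with ``num_gpus_per_learner``. A sketch, assuming
the cluster actually provides the requested GPUs:

.. testcode::

    # Request one GPU per Learner actor. Ray grants these resources
    # once you build and run the Algorithm on a matching cluster.
    config.learners(num_learners=2, num_gpus_per_learner=1)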

Disable `explore` behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~

Switch exploratory behavior on or off through the ``explore`` argument to the
:py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` method.
To compute actions, the :py:class:`~ray.rllib.env.env_runner.EnvRunner` calls ``forward_exploration()`` on the RLModule when ``explore=True``
and ``forward_inference()`` when ``explore=False``. The default is ``explore=True``.

.. testcode::

    # Disable exploration behavior.
    # When False, the EnvRunner calls `forward_inference()` on the RLModule to compute
    # actions instead of `forward_exploration()`.
    config.env_runners(explore=False)

Rollout length
~~~~~~~~~~~~~~

Set the number of timesteps that each :py:class:`~ray.rllib.env.env_runner.EnvRunner` steps through, in each of its env copies,
through the ``rollout_fragment_length`` argument to the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners`
method:

.. testcode::

    config.env_runners(rollout_fragment_length=50)

All available methods and their settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Besides the most common settings described above, the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig`
class and its algorithm-specific subclasses come with many more configuration options.

To structure things more semantically, :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` groups
its various config settings into the following categories, each represented by its own method (see the short sketch after this list):

- :ref:`Config settings for the RL environment <rllib-config-env>`,
- :ref:`Config settings for training behavior (including algo-specific settings) <rllib-config-training>`,
- :ref:`Config settings for EnvRunners <rllib-config-env-runners>`,
- :ref:`Config settings for Learners <rllib-config-learners>`,
- :ref:`Config settings for adding callbacks <rllib-config-callbacks>`,
- :ref:`Config settings for multi-agent setups <rllib-config-multi_agent>`,
- :ref:`Config settings for offline RL <rllib-config-offline_data>`,
- :ref:`Config settings for evaluating policies <rllib-config-evaluation>`,
- :ref:`Config settings for the DL framework <rllib-config-framework>`,
- :ref:`Config settings for reporting and logging behavior <rllib-config-reporting>`,
- :ref:`Config settings for checkpointing <rllib-config-checkpointing>`,
- :ref:`Config settings for debugging <rllib-config-debugging>`,
- :ref:`Experimental config settings <rllib-config-experimental>`
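
For example, a short sketch touching a few of these categories; the specific values are only illustrations:

.. testcode::

    # Log more verbosely.
    config.debugging(log_level="INFO")
    # Evaluate the policy every training iteration on one dedicated EnvRunner.
    config.evaluation(evaluation_interval=1, evaluation_num_env_runners=1)
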
1 change: 1 addition & 0 deletions doc/source/rllib/index.rst
@@ -41,6 +41,7 @@ RLlib: Industry-Grade, Scalable Reinforcement Learning
rllib-training
key-concepts
rllib-env
algorithm-config
rllib-algorithms
user-guides
rllib-examples
43 changes: 16 additions & 27 deletions doc/source/rllib/package_ref/algorithm-config.rst
@@ -5,19 +5,12 @@
.. _algorithm-config-reference-docs:

Algorithm Configuration API
===========================

.. currentmodule:: ray.rllib.algorithms.algorithm_config

Constructor
-----------

.. autosummary::
:nosignatures:
@@ -27,7 +20,7 @@


Builder methods
---------------
.. autosummary::
:nosignatures:
:toctree: doc/
@@ -38,7 +31,7 @@


Properties
----------
.. autosummary::
:nosignatures:
:toctree: doc/
@@ -51,7 +44,7 @@ Properties
~AlgorithmConfig.total_train_batch_size

Getter methods
--------------
.. autosummary::
:nosignatures:
:toctree: doc/
@@ -65,7 +58,7 @@ Getter methods


Public methods
--------------
.. autosummary::
:nosignatures:
:toctree: doc/
@@ -75,6 +68,11 @@ Public methods
~AlgorithmConfig.freeze


.. _rllib-algorithm-config-methods:

Configuration methods
---------------------

.. _rllib-config-env:

Configuring the RL Environment
@@ -84,36 +82,27 @@ Configuring the RL Environment
:noindex:


.. _rllib-config-training:

Configuring training behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training
:noindex:


.. _rllib-config-env-runners:

Configuring `EnvRunnerGroup` and `EnvRunner` actors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners
:noindex:

.. _rllib-config-learners:

Configuring `LearnerGroup` and `Learner` actors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners
:noindex:
1 change: 0 additions & 1 deletion rllib/algorithms/marwil/marwil.py
@@ -2,7 +2,6 @@

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig, NotProvided
from ray.rllib.algorithms.marwil.marwil_catalog import MARWILCatalog
from ray.rllib.connectors.learner import (
AddObservationsFromEpisodesToBatch,
AddOneTsToEpisodesAndTruncate,