
Commit

[Doc] Fix broken links and formatting issues in doc
ghstack-source-id: 4e3f84fe436de6a6e9696894cd06318a98e4a23b
Pull Request resolved: #2574
vmoens committed Nov 18, 2024
1 parent 83a7a57 commit 5a2d9e2
Showing 49 changed files with 402 additions and 397 deletions.
8 changes: 4 additions & 4 deletions docs/source/index.rst
@@ -22,15 +22,15 @@ TorchRL provides pytorch and python-first, low and high level abstractions for RL
The code is aimed at supporting research in RL. Most of it is written in python in a highly modular way, such that researchers can easily swap components, transform them or write new ones with little effort.

This repo attempts to align with the existing pytorch ecosystem libraries in that it has a "dataset pillar"
:doc:`(environments) <reference/envs>`,
:ref:`transforms <reference/envs:Transforms>`,
:doc:`models <reference/modules>`,
:ref:`(environments) <Environment-API>`,
:ref:`transforms <transforms>`,
:ref:`models <ref_modules>`,
data utilities (e.g. collectors and containers), etc.
TorchRL aims at having as few dependencies as possible (python standard library, numpy and pytorch).
Common environment libraries (e.g. OpenAI gym) are only optional.

On the low-level end, torchrl comes with a set of highly re-usable functionals
for :doc:`cost functions <reference/objectives>`, :ref:`returns <reference/objectives:Returns>` and data processing.
for :ref:`cost functions <ref_objectives>`, :ref:`returns <ref_returns>` and data processing.

TorchRL aims at a high modularity and good runtime performance.

4 changes: 2 additions & 2 deletions docs/source/reference/data.rst
@@ -944,7 +944,7 @@ not predictable.
MultiCategorical
MultiOneHot
NonTensor
OneHotDiscrete
OneHot
Stacked
StackedComposite
Unbounded
@@ -1050,7 +1050,7 @@ and the tree can be expanded for each of these. The following figure shows how t

BinaryToDecimal
HashToInt
MCTSForeset
MCTSForest
QueryModule
RandomProjectionHash
SipHash
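For context on the spec classes listed above, a minimal sketch of how they are typically constructed and checked; the names follow the current torchrl.data API, but treat exact signatures as indicative rather than definitive:

import torch
from torchrl.data import Composite, MultiCategorical, OneHot, Unbounded

spec = Composite(
    action=OneHot(n=4),                                      # one-hot encoded discrete action
    observation=Unbounded(shape=(3,), dtype=torch.float32),
    options=MultiCategorical(nvec=[3, 2]),                   # two categorical entries with 3 and 2 choices
)
sample = spec.rand()       # draws a value consistent with every sub-spec
assert spec.is_in(sample)  # membership check used throughout TorchRL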
3 changes: 3 additions & 0 deletions docs/source/reference/objectives.rst
@@ -274,6 +274,9 @@ QMixer

Returns
-------

.. _ref_returns:

.. currentmodule:: torchrl.objectives.value

.. autosummary::
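As a pointer for the newly labelled Returns section, a minimal sketch of one of these value estimators in use; ``value_net`` (a TensorDictModule that writes a ``state_value`` entry) and ``rollout_td`` (a tensordict produced by ``env.rollout(...)``) are placeholders:

from torchrl.objectives.value import GAE

# gamma: discount factor; lmbda: bias/variance trade-off of the estimator
gae = GAE(gamma=0.99, lmbda=0.95, value_network=value_net, differentiable=False)
data = gae(rollout_td)
advantage = data["advantage"]        # written in place by the estimator
value_target = data["value_target"]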
8 changes: 4 additions & 4 deletions sota-implementations/decision_transformer/lamb.py
@@ -15,15 +15,15 @@ class Lamb(Optimizer):
LAMB was proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional): learning rate. (default: 1e-3)
lr (:obj:`float`, optional): learning rate. (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its norm. (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve
eps (:obj:`float`, optional): term added to the denominator to improve
numerical stability. (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
weight_decay (:obj:`float`, optional): weight decay (L2 penalty) (default: 0)
grad_averaging (bool, optional): whether apply (1-beta2) to grad when
calculating running averages of gradient. (default: True)
max_grad_norm (float, optional): value used to clip global grad norm (default: 1.0)
max_grad_norm (:obj:`float`, optional): value used to clip global grad norm (default: 1.0)
trust_clip (bool): enable LAMBC trust ratio clipping (default: False)
always_adapt (boolean, optional): Apply adaptive learning rate to 0.0
weight decay parameter (default: False)
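For reference, a minimal usage sketch of this optimizer, assuming the Lamb class above is importable from the local lamb.py module:

import torch
from torch import nn
from lamb import Lamb  # assumption: lamb.py is on the import path

model = nn.Linear(16, 4)
optimizer = Lamb(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                 weight_decay=0.01, max_grad_norm=1.0)

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
optimizer.step()       # trust-ratio-scaled update, as in the LAMB paper
optimizer.zero_grad()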
16 changes: 10 additions & 6 deletions torchrl/collectors/collectors.py
@@ -1384,11 +1384,13 @@ class _MultiDataCollector(DataCollectorBase):
instances) it will be wrapped in a `nn.Module` first.
Then, the collector will try to assess if these
modules require wrapping in a :class:`~tensordict.nn.TensorDictModule` or not.
- If the policy forward signature matches any of ``forward(self, tensordict)``,
``forward(self, td)`` or ``forward(self, <anything>: TensorDictBase)`` (or
any typing with a single argument typed as a subclass of ``TensorDictBase``)
then the policy won't be wrapped in a :class:`~tensordict.nn.TensorDictModule`.
- In all other cases an attempt to wrap it will be undergone as such: ``TensorDictModule(policy, in_keys=env_obs_key, out_keys=env.action_keys)``.
- In all other cases an attempt to wrap it will be undergone as such:
``TensorDictModule(policy, in_keys=env_obs_key, out_keys=env.action_keys)``.
Keyword Args:
frames_per_batch (int): A keyword-only argument representing the
@@ -1476,7 +1478,7 @@ class _MultiDataCollector(DataCollectorBase):
update_at_each_batch (bool, optional): if ``True``, :meth:`~.update_policy_weight_()`
will be called before (sync) or after (async) each data collection.
Defaults to ``False``.
preemptive_threshold (float, optional): a value between 0.0 and 1.0 that specifies the ratio of workers
preemptive_threshold (:obj:`float`, optional): a value between 0.0 and 1.0 that specifies the ratio of workers
that will be allowed to finish collecting their rollout before the rest are forced to end early.
num_threads (int, optional): number of threads for this process.
Defaults to the number of workers.
@@ -2093,11 +2095,13 @@ class MultiSyncDataCollector(_MultiDataCollector):
trajectory and the start of the next collection.
This class can be safely used with online RL sota-implementations.
.. note:: Python requires multiprocessed code to be instantiated within a main guard:
.. note::
Python requires multiprocessed code to be instantiated within a main guard:
>>> from torchrl.collectors import MultiSyncDataCollector
>>> if __name__ == "__main__":
... # Create your collector here
... collector = MultiSyncDataCollector(...)
See https://docs.python.org/3/library/multiprocessing.html for more info.
@@ -2125,8 +2129,8 @@ class MultiSyncDataCollector(_MultiDataCollector):
... if i == 2:
... print(data)
... break
... collector.shutdown()
... del collector
>>> collector.shutdown()
>>> del collector
TensorDict(
fields={
action: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.float32, is_shared=False),
@@ -2753,7 +2757,7 @@ class aSyncDataCollector(MultiaSyncDataCollector):
update_at_each_batch (bool, optional): if ``True``, :meth:`~.update_policy_weight_()`
will be called before (sync) or after (async) each data collection.
Defaults to ``False``.
preemptive_threshold (float, optional): a value between 0.0 and 1.0 that specifies the ratio of workers
preemptive_threshold (:obj:`float`, optional): a value between 0.0 and 1.0 that specifies the ratio of workers
that will be allowed to finish collecting their rollout before the rest are forced to end early.
num_threads (int, optional): number of threads for this process.
Defaults to the number of workers.
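To make the policy-wrapping rules above concrete, a minimal sketch of a policy whose forward signature is typed on TensorDictBase, so the collector should leave it unwrapped; gymnasium's Pendulum-v1 is assumed to be installed:

import torch
from torch import nn
from tensordict import TensorDictBase
from torchrl.collectors import SyncDataCollector
from torchrl.envs import GymEnv

class TensorDictPolicy(nn.Module):
    """Single TensorDictBase-typed argument: no TensorDictModule wrapping needed."""

    def __init__(self):
        super().__init__()
        self.net = nn.LazyLinear(1)

    def forward(self, tensordict: TensorDictBase) -> TensorDictBase:
        # Pendulum actions live in [-2, 2]; squash the linear output accordingly
        tensordict["action"] = 2.0 * torch.tanh(self.net(tensordict["observation"]))
        return tensordict

if __name__ == "__main__":
    collector = SyncDataCollector(
        GymEnv("Pendulum-v1"), TensorDictPolicy(),
        frames_per_batch=64, total_frames=128,
    )
    for batch in collector:
        print(batch["action"].shape)  # torch.Size([64, 1])
    collector.shutdown()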
2 changes: 1 addition & 1 deletion torchrl/data/map/tree.py
@@ -48,7 +48,7 @@ class Tree(TensorClass["nocast"]):
If there are multiple actions taken at this node, subtrees are stored in the corresponding
entry. Rollouts can be reconstructed using the :meth:`~.rollout_from_path` method.
node (TensorDict): Data defining this node (e.g., observations) before the next branching.
Entries usually matches the ``in_keys`` in ``MCTSForeset.node_map``.
Entries usually matches the ``in_keys`` in ``MCTSForest.node_map``.
subtree (Tree): A stack of subtrees produced when actions are taken.
num_children (int): The number of child nodes (read-only).
is_terminal (bool): whether the tree has children nodes (read-only).
2 changes: 1 addition & 1 deletion torchrl/data/postprocs/postprocs.py
@@ -90,7 +90,7 @@ class MultiStep(nn.Module):
It is an identity transform whenever :attr:`n_steps` is 0.
Args:
gamma (float): Discount factor for return computation
gamma (:obj:`float`): Discount factor for return computation
n_steps (integer): maximum look-ahead steps.
.. note:: This class is meant to be used within a ``DataCollector``.
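A minimal sketch of MultiStep used as a collector post-processor, as the note above suggests; gymnasium's CartPole-v1 is assumed available, and with policy=None the collector falls back to a random policy drawn from the action spec:

from torchrl.collectors import SyncDataCollector
from torchrl.data.postprocs import MultiStep
from torchrl.envs import GymEnv

collector = SyncDataCollector(
    GymEnv("CartPole-v1"),
    policy=None,                                # random policy
    frames_per_batch=64,
    total_frames=128,
    postproc=MultiStep(gamma=0.99, n_steps=3),  # discounted 3-step returns
)
for batch in collector:
    # rewards are replaced by their n-step discounted sums
    print(batch["next", "reward"].shape)
collector.shutdown()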
18 changes: 8 additions & 10 deletions torchrl/data/replay_buffers/replay_buffers.py
@@ -897,16 +897,14 @@ class PrioritizedReplayBuffer(ReplayBuffer):
All arguments are keyword-only arguments.
Presented in
"Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015.
Prioritized experience replay."
(https://arxiv.org/abs/1511.05952)
Presented in "Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015.
Prioritized experience replay." (https://arxiv.org/abs/1511.05952)
Args:
alpha (float): exponent α determines how much prioritization is used,
alpha (:obj:`float`): exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
beta (float): importance sampling negative exponent.
eps (float): delta added to the priorities to ensure that the buffer
beta (:obj:`float`): importance sampling negative exponent.
eps (:obj:`float`): delta added to the priorities to ensure that the buffer
does not contain null priorities.
storage (Storage, optional): the storage to be used. If none is provided
a default :class:`~torchrl.data.replay_buffers.ListStorage` with
@@ -1366,10 +1364,10 @@ class TensorDictPrioritizedReplayBuffer(TensorDictReplayBuffer):
tensordict to be passed to it with its new priority value.
Keyword Args:
alpha (float): exponent α determines how much prioritization is used,
alpha (:obj:`float`): exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
beta (float): importance sampling negative exponent.
eps (float): delta added to the priorities to ensure that the buffer
beta (:obj:`float`): importance sampling negative exponent.
eps (:obj:`float`): delta added to the priorities to ensure that the buffer
does not contain null priorities.
storage (Storage, optional): the storage to be used. If none is provided
a default :class:`~torchrl.data.replay_buffers.ListStorage` with
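A minimal sketch of how the alpha/beta/eps arguments above fit together in practice; signatures are indicative of the current API:

import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, TensorDictPrioritizedReplayBuffer

rb = TensorDictPrioritizedReplayBuffer(
    alpha=0.7,                        # 0.0 would recover uniform sampling
    beta=0.5,                         # importance-sampling correction exponent
    eps=1e-8,                         # keeps priorities strictly positive
    storage=LazyTensorStorage(1_000),
    priority_key="td_error",
)
data = TensorDict(
    {"obs": torch.randn(10, 4), "td_error": torch.rand(10)},
    batch_size=[10],
)
rb.extend(data)
sample = rb.sample(4)
sample["td_error"] = torch.rand(4)     # e.g. refreshed TD errors after a training step
rb.update_tensordict_priority(sample)  # write the new priorities back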
12 changes: 6 additions & 6 deletions torchrl/data/replay_buffers/samplers.py
@@ -298,10 +298,10 @@ class PrioritizedSampler(Sampler):
Args:
max_capacity (int): maximum capacity of the buffer.
alpha (float): exponent α determines how much prioritization is used,
alpha (:obj:`float`): exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
beta (float): importance sampling negative exponent.
eps (float, optional): delta added to the priorities to ensure that the buffer
beta (:obj:`float`): importance sampling negative exponent.
eps (:obj:`float`, optional): delta added to the priorities to ensure that the buffer
does not contain null priorities. Defaults to 1e-8.
reduction (str, optional): the reduction method for multidimensional
tensordicts (ie stored trajectory). Can be one of "max", "min",
@@ -1652,10 +1652,10 @@ class PrioritizedSliceSampler(SliceSampler, PrioritizedSampler):
:meth:`~.update_priority`.
Args:
alpha (float): exponent α determines how much prioritization is used,
alpha (:obj:`float`): exponent α determines how much prioritization is used,
with α = 0 corresponding to the uniform case.
beta (float): importance sampling negative exponent.
eps (float, optional): delta added to the priorities to ensure that the buffer
beta (:obj:`float`): importance sampling negative exponent.
eps (:obj:`float`, optional): delta added to the priorities to ensure that the buffer
does not contain null priorities. Defaults to 1e-8.
reduction (str, optional): the reduction method for multidimensional
tensordicts (i.e., stored trajectory). Can be one of "max", "min",
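The same arguments drive the standalone sampler; a minimal sketch of composing it with a generic buffer, assuming the usual torchrl.data.replay_buffers exports:

import torch
from torchrl.data.replay_buffers import LazyTensorStorage, PrioritizedSampler, ReplayBuffer

size = 1_000
rb = ReplayBuffer(
    storage=LazyTensorStorage(size),
    sampler=PrioritizedSampler(max_capacity=size, alpha=0.8, beta=1.1),
)
indices = rb.extend(torch.arange(20))
rb.update_priority(indices, priority=torch.rand(20) + 0.1)
data, info = rb.sample(8, return_info=True)
print(info["index"], info["_weight"])  # sampled indices and importance-sampling weights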
10 changes: 5 additions & 5 deletions torchrl/data/rlhf/utils.py
@@ -41,7 +41,7 @@ class ConstantKLController(KLControllerBase):
with.
Keyword Arguments:
kl_coef (float): The coefficient to multiply KL with when calculating the
kl_coef (:obj:`float`): The coefficient to multiply KL with when calculating the
reward.
model (nn.Module, optional): wrapped model that needs to be controlled.
Must have an attribute ``"kl_coef"``. If provided, the ``"kl_coef"`` will
@@ -73,8 +73,8 @@ class AdaptiveKLController(KLControllerBase):
"""Adaptive KL Controller as described in Ziegler et al. "Fine-Tuning Language Models from Human Preferences".
Keyword Arguments:
init_kl_coef (float): The starting value of the coefficient.
target (float): The target KL value. When the observed KL is smaller, the
init_kl_coef (:obj:`float`): The starting value of the coefficient.
target (:obj:`float`): The target KL value. When the observed KL is smaller, the
coefficient is decreased, thereby relaxing the KL penalty in the training
objective and allowing the model to stray further from the reference model.
When the observed KL is greater than the target, the KL coefficient is
@@ -146,10 +146,10 @@ class RolloutFromModel:
reward_model: (nn.Module, tensordict.nn.TensorDictModule): a model which, given
``input_ids`` and ``attention_mask``, calculates rewards for each token and
end_scores (the reward for the final token in each sequence).
kl_coef: (float, optional): initial kl coefficient.
kl_coef: (:obj:`float`, optional): initial kl coefficient.
max_new_tokens (int, optional): the maximum length of the sequence.
Defaults to 50.
score_clip (float, optional): Scores from the reward model are clipped to the
score_clip (:obj:`float`, optional): Scores from the reward model are clipped to the
range ``(-score_clip, score_clip)``. Defaults to 10.
kl_scheduler (KLControllerBase, optional): the KL coefficient scheduler.
num_steps (int, optional): number of steps between two optimization.
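The adaptive controller described above follows the proportional rule from Ziegler et al.; below is a small self-contained sketch of that rule with hypothetical names, not TorchRL's exact implementation:

def adaptive_kl_step(kl_coef: float, observed_kl: float, target: float,
                     n_steps: int, horizon: float = 10_000.0) -> float:
    """One proportional update of the KL coefficient (Ziegler et al. style)."""
    # KL above target -> positive error -> strengthen the penalty;
    # KL below target -> negative error -> relax it. Clipping keeps updates gentle.
    proportional_error = max(min(observed_kl / target - 1.0, 0.2), -0.2)
    return kl_coef * (1.0 + proportional_error * n_steps / horizon)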
20 changes: 10 additions & 10 deletions torchrl/envs/common.py
@@ -840,15 +840,15 @@ def full_action_spec(self) -> Composite:
... break
>>> env = BraxEnv(envname)
>>> env.full_action_spec
Composite(
action: BoundedContinuous(
shape=torch.Size([8]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([8]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([8]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous), device=cpu, shape=torch.Size([]))
Composite(
action: BoundedContinuous(
shape=torch.Size([8]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([8]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([8]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous), device=cpu, shape=torch.Size([]))
"""
full_action_spec = self.input_spec.get("full_action_spec", None)
@@ -1791,7 +1791,7 @@ def register_gym(
(results are tensors).
This arg can be passed during a call to :func:`~gym.make` (see
example below).
reward_threshold (float, optional): [Gym kwarg] The reward threshold
reward_threshold (:obj:`float`, optional): [Gym kwarg] The reward threshold
considered to have learnt an environment.
nondeterministic (bool, optional): [Gym kwarg] If the environment is nondeterministic
(even with knowledge of the initial seed and all actions). Defaults to
1 change: 1 addition & 0 deletions torchrl/envs/custom/tictactoeenv.py
@@ -34,6 +34,7 @@ class TicTacToeEnv(EnvBase):
output entry).
Specs:
>>> print(env.specs)
Composite(
output_spec: Composite(
full_observation_spec: Composite(
2 changes: 1 addition & 1 deletion torchrl/envs/transforms/rb_transforms.py
@@ -36,7 +36,7 @@ class MultiStepTransform(Transform):
n_steps (int): Number of steps in multi-step. The number of steps can be
dynamically changed by changing the ``n_steps`` attribute of this
transform.
gamma (float): Discount factor.
gamma (:obj:`float`): Discount factor.
Keyword Args:
reward_keys (list of NestedKey, optional): the reward keys in the input tensordict.
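Unlike the collector-side MultiStep post-processor, this transform performs the n-step aggregation when data is written to a replay buffer; a minimal sketch, assuming a tensordict-based buffer:

from torchrl.data import LazyTensorStorage, TensorDictReplayBuffer
from torchrl.envs.transforms.rb_transforms import MultiStepTransform

multi_step = MultiStepTransform(n_steps=3, gamma=0.99)
rb = TensorDictReplayBuffer(
    storage=LazyTensorStorage(10_000),
    transform=multi_step,
)
# rb.extend(rollout) now stores transitions carrying discounted 3-step rewards;
# multi_step.n_steps can be updated on the fly, as the docstring above notes.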
2 changes: 1 addition & 1 deletion torchrl/envs/transforms/rlhf.py
@@ -25,7 +25,7 @@ class KLRewardTransform(Transform):
have the following features: it must have a set of input (``in_keys``)
and output keys (``out_keys``). It must have a ``get_dist`` method
that outputs the distribution of the action.
coef (float): the coefficient of the KL term. Defaults to ``1.0``.
coef (:obj:`float`): the coefficient of the KL term. Defaults to ``1.0``.
in_keys (str or list of str/tuples of str): the input key where the
reward should be fetched. Defaults to ``"reward"``.
out_keys (str or list of str/tuples of str): the output key where the
7 changes: 4 additions & 3 deletions torchrl/envs/transforms/transforms.py
@@ -1534,7 +1534,7 @@ class TargetReturn(Transform):
reward achieved at each step or remains constant.
Args:
target_return (float): target return to be achieved by the agent.
target_return (:obj:`float`): target return to be achieved by the agent.
mode (str): mode to be used to update the target return. Can be either "reduce" or "constant". Default: "reduce".
in_keys (sequence of NestedKey, optional): keys pointing to the reward
entries. Defaults to the reward keys of the parent env.
@@ -2552,7 +2552,7 @@ class ObservationNorm(ObservationTransform):
as it is done for standardization. Default is `False`.
eps (float, optional): epsilon increment for the scale in the ``standard_normal`` case.
eps (:obj:`float`, optional): epsilon increment for the scale in the ``standard_normal`` case.
Defaults to ``1e-6`` if not recoverable directly from the scale dtype.
Examples:
@@ -2845,7 +2845,7 @@ class CatFrames(ObservationTransform):
has to be written. Defaults to the value of `in_keys`.
padding (str, optional): the padding method. One of ``"same"`` or ``"constant"``.
Defaults to ``"same"``, ie. the first value is used for padding.
padding_value (float, optional): the value to use for padding if ``padding="constant"``.
padding_value (:obj:`float`, optional): the value to use for padding if ``padding="constant"``.
Defaults to 0.
as_inverse (bool, optional): if ``True``, the transform is applied as an inverse transform. Defaults to ``False``.
reset_key (NestedKey, optional): the reset key to be used as partial
@@ -6194,6 +6194,7 @@ class SelectTransform(Transform):
keep_dones (bool, optional): if ``False``, the done keys must be provided
if they should be kept. Defaults to ``True``.
Examples:
>>> import gymnasium
>>> from torchrl.envs import GymWrapper
>>> env = TransformedEnv(
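As a concrete example of the padding arguments above, a minimal CatFrames sketch; gymnasium's Pendulum-v1 is assumed available:

from torchrl.envs import GymEnv, TransformedEnv
from torchrl.envs.transforms import CatFrames

env = TransformedEnv(
    GymEnv("Pendulum-v1"),
    CatFrames(
        N=4, dim=-1,
        padding="constant", padding_value=0.0,  # zero-fill before the first frame of an episode
        in_keys=["observation"], out_keys=["observation_cat"],
    ),
)
rollout = env.rollout(5)
print(rollout["observation_cat"].shape)  # torch.Size([5, 12]) for a 3-dim observation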
1 change: 1 addition & 0 deletions torchrl/modules/__init__.py
@@ -23,6 +23,7 @@
)
from .models import (
BatchRenorm1d,
ConsistentDropout,
ConsistentDropoutModule,
Conv3dNet,
ConvNet,

1 comment on commit 5a2d9e2

@github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'CPU Benchmark Results'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 2.

Benchmark suite | Current: 5a2d9e2 | Previous: 83a7a57 | Ratio
--- | --- | --- | ---
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 35.22574642929276 iter/sec (stddev: 0.17118524240449132) | 239.7976805863332 iter/sec (stddev: 0.0007342290470183382) | 6.81

This comment was automatically generated by workflow using github-action-benchmark.

CC: @vmoens
