[User guides] Add user guides for DeepSpeed and Accelerate (ray-project#38513)

Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
ShuN6211 · Aug 19, 2023 · e78b0ef
1 parent c429c10
commit e78b0ef
Showing 18 changed files with 1,076 additions and 23 deletions.
diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -71,6 +71,8 @@ parts:
             sections:
               - file: train/huggingface-accelerate
                 title: Hugging Face Accelerate Guide
+              - file: train/deepspeed
+                title: DeepSpeed Guide
               - file: train/distributed-tensorflow-keras
                 title: TensorFlow and Keras Guide
               - file: train/distributed-xgboost-lightgbm

diff --git a/doc/source/images/accelerate_logo.png b/doc/source/images/accelerate_logo.png
diff --git a/doc/source/images/deepspeed_logo.svg b/doc/source/images/deepspeed_logo.svg
diff --git a/doc/source/ray-overview/examples.rst b/doc/source/ray-overview/examples.rst
@@ -1402,3 +1402,17 @@ Ray Examples
         :link-type: doc
 
         Fine-tune vicuna-13b-v1.3 with DeepSpeed and LightningTrainer
+
+    .. grid-item-card:: :bdg-secondary:`Code example`
+        :class-item: gallery-item training llm pytorch nlp
+        :link: deepspeed_example
+        :link-type: ref
+
+        Distributed Training with DeepSpeed ZeRO-3 and TorchTrainer
+
+    .. grid-item-card:: :bdg-secondary:`Code example`
+        :class-item: gallery-item training llm pytorch huggingface nlp
+        :link: accelerate_example
+        :link-type: ref
+
+        Distributed Training with Hugging Face Accelerate and TorchTrainer
diff --git a/doc/source/train/deepspeed.rst b/doc/source/train/deepspeed.rst
@@ -0,0 +1,94 @@
+.. _train-deepspeed:
+
+Training with DeepSpeed
+=======================
+
+The :class:`~ray.train.torch.TorchTrainer` launches your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.
+
+Run your existing training code inside a training function that you pass to the TorchTrainer. The final code looks like this:
+
+.. code-block:: python
+
+    import deepspeed
+    from deepspeed.accelerator import get_accelerator
+
+    def train_func(config):
+        # Instantiate your model and dataset
+        model = ...
+        train_dataset = ...
+        eval_dataset = ...
+        deepspeed_config = {...}  # Your DeepSpeed config
+
+        # Prepare everything for distributed training
+        model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
+            model=model,
+            model_parameters=model.parameters(),
+            training_data=train_dataset,
+            collate_fn=collate_fn,
+            config=deepspeed_config,
+        )
+
+        # Define the GPU device for the current worker
+        device = get_accelerator().device_name(model.local_rank)
+
+        # Start training
+        ...
+    
+    from ray.train.torch import TorchTrainer
+    from ray.train import ScalingConfig
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(...),
+        ...
+    )
+    trainer.fit()
+
+
+Below is a simple example of ZeRO-3 training that uses DeepSpeed alone, without any other framework integration.
+
+.. tabs::
+
+    .. group-tab:: Example with Ray Data
+
+        .. dropdown:: Show Code
+
+            .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py
+                :language: python
+                :start-after: __deepspeed_torch_basic_example_start__
+                :end-before: __deepspeed_torch_basic_example_end__
+
+    .. group-tab:: Example with PyTorch DataLoader
+
+        .. dropdown:: Show Code
+
+            .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer_no_raydata.py
+                :language: python
+                :start-after: __deepspeed_torch_basic_example_no_raydata_start__
+                :end-before: __deepspeed_torch_basic_example_no_raydata_end__
+
+.. tip::
+
+    To run DeepSpeed with pure PyTorch, you **don't need to** call any additional Ray Train utilities
+    like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Instead,
+    keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything
+    for distributed training.
+
+Running DeepSpeed with other frameworks
+---------------------------------------
+
+Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.
+
+See the examples below for more details:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Framework
+     - Example
+   * - Accelerate (:ref:`User Guide <train-hf-accelerate>`)
+     - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train. <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_
+   * - Transformers (:ref:`User Guide <train-pytorch-transformers>`)
+     - :ref:`Fine-tune GPT-J-6b with DeepSpeed and Hugging Face Transformers <gptj_deepspeed_finetune>`
+   * - Lightning (:ref:`User Guide <train-pytorch-lightning>`)
+     - :ref:`Fine-tune vicuna-13b with DeepSpeed and PyTorch Lightning <vicuna_lightning_deepspeed_finetuning>`
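
Note on the new guide: the snippet in `deepspeed.rst` leaves `deepspeed_config = {...}` elided. The sketch below shows what a ZeRO-3 config dict might look like; it is not part of this commit. The helper name `build_zero3_config` and all default values are hypothetical, chosen only for illustration. The keys are standard DeepSpeed config fields, but the settings the actual example script uses may differ.

```python
from typing import Any, Dict


def build_zero3_config(
    micro_batch_size: int = 8,
    learning_rate: float = 2e-5,
    use_bf16: bool = True,
    offload_to_cpu: bool = False,
) -> Dict[str, Any]:
    """Build an illustrative DeepSpeed ZeRO-3 config dict (hypothetical helper)."""
    config: Dict[str, Any] = {
        # Per-GPU batch size; the global batch size scales with the worker count.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        # Declaring the optimizer here lets deepspeed.initialize() build it
        # from the model_parameters argument.
        "optimizer": {"type": "AdamW", "params": {"lr": learning_rate}},
        "bf16": {"enabled": use_bf16},
        "zero_optimization": {
            # Stage 3 partitions parameters, gradients, and optimizer states.
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }
    if offload_to_cpu:
        # Optional: trade speed for GPU memory by offloading to host memory.
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
        config["zero_optimization"]["offload_param"] = {"device": "cpu"}
    return config


deepspeed_config = build_zero3_config(offload_to_cpu=True)
```

A dict like this would be passed as `config=deepspeed_config` to `deepspeed.initialize()` inside `train_func`; tune the batch size, precision, and offload settings to your model and hardware.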
diff --git a/doc/source/train/doc_code/accelerate_trainer.py b/doc/source/train/doc_code/accelerate_trainer.py
@@ -52,7 +52,7 @@ def train_loop_per_worker():
             print(f"epoch: {epoch}, loss: {loss.item()}")
 
         train.report(
-            {},
+            metrics={"epoch": epoch, "loss": loss.item()},
             checkpoint=Checkpoint.from_dict(
                 dict(epoch=epoch, model=accelerator.unwrap_model(model).state_dict())
             ),

diff --git a/doc/source/train/examples/accelerate/accelerate_example.rst b/doc/source/train/examples/accelerate/accelerate_example.rst
@@ -0,0 +1,8 @@
+:orphan:
+
+.. _accelerate_example:
+
+Hugging Face Accelerate Distributed Training Example with Ray Train
+===================================================================
+
+.. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py
diff --git a/doc/source/train/examples/deepspeed/deepspeed_example.rst b/doc/source/train/examples/deepspeed/deepspeed_example.rst
@@ -0,0 +1,8 @@
+:orphan:
+
+.. _deepspeed_example:
+
+DeepSpeed ZeRO-3 Distributed Training Example with Ray Train
+============================================================
+
+.. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py