forked from hpcaitech/ColossalAI
Commit 85b2303: [doc] migrate the markdown files (hpcaitech#2652)
1 parent: a020eec
Showing 84 changed files with 9,729 additions and 0 deletions.
@@ -0,0 +1,23 @@
```yaml
name: Check Documentation on PR

on:
  pull_request:
    paths:
      - 'docs/**'

jobs:
  check-i18n:
    name: Check docs in diff languages
    if: |
      github.event.pull_request.draft == false &&
      github.base_ref == 'main' &&
      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v2
        with:
          python-version: '3.8.14'

      - run: python .github/workflows/scripts/check_doc_i18n.py -d docs/source
```
@@ -0,0 +1,67 @@
```python
import argparse
import os


def compare_dirs(dir1, dir2):
    # First, we need to check if the two directories exist
    if not os.path.exists(dir1) or not os.path.exists(dir2):
        return False

    # Now, we compare the list of items in each directory
    items1 = os.listdir(dir1)
    items2 = os.listdir(dir2)

    # If the number of items in each directory is different, the directories are different
    if len(items1) != len(items2):
        return False

    # For each item in the first directory, we check if there is a corresponding item in the second directory
    for item in items1:
        item_path1 = os.path.join(dir1, item)
        item_path2 = os.path.join(dir2, item)

        # If the corresponding item doesn't exist in the second directory, the directories are different
        if not os.path.exists(item_path2):
            print(f'Found mismatch: {item_path1}, {item_path2}')
            return False

        # If the corresponding item is a directory, we compare the two directories recursively
        if os.path.isdir(item_path1) and os.path.isdir(item_path2):
            if not compare_dirs(item_path1, item_path2):
                print(f'Found mismatch: {item_path1}, {item_path2}')
                return False

        # both are files
        elif os.path.isfile(item_path1) and os.path.isfile(item_path2):
            continue

        # If the corresponding item is not a file or a directory, the directories are different
        else:
            print(f'Found mismatch: {item_path1}, {item_path2}')
            return False

    # If all items are the same, the directories are the same
    return True


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--directory', help="The directory where the multi-language source files are kept.")
    args = parser.parse_args()

    i18n_folders = os.listdir(args.directory)
    i18n_folders = [os.path.join(args.directory, val) for val in i18n_folders]

    if len(i18n_folders) > 1:
        for i in range(1, len(i18n_folders)):
            dir1 = i18n_folders[0]
            dir2 = i18n_folders[i]
            print(f'comparing {dir1} vs {dir2}')
            match = compare_dirs(i18n_folders[0], i18n_folders[i])

            if not match:
                print(
                    f"{dir1} and {dir2} don't match, please ensure that your documentation is available in different languages"
                )
            else:
                print(f"{dir1} and {dir2} match")
```
Empty file.
Empty file.
Empty file.
Empty file.
@@ -0,0 +1,27 @@
# Setup

## Announcement

Our auto-parallel feature is an alpha version and still under development. We will keep updating it to make it more stable. If you encounter any problem, please feel free to raise an issue.

## Requirements

Auto-parallel needs some extra dependencies. Please install them before using auto-parallel.

### Install PyTorch

We currently only support PyTorch 1.12; other versions are not tested. We will support more versions in the future.

```bash
# conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
# pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

### Install pulp and coin-or-cbc

```bash
pip install pulp
conda install -c conda-forge coin-or-cbc
```
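After installation, a quick sanity check along the lines of the sketch below can confirm that the dependencies are importable; it assumes a recent PuLP release that provides `listSolvers`.

```python
# Quick sanity check for the auto-parallel dependencies (a sketch, not part of
# the official setup steps).
import torch
import pulp

print(torch.__version__)                      # expected to start with 1.12
print(torch.cuda.is_available())              # True if the CUDA build is usable
print(pulp.listSolvers(onlyAvailable=True))   # should list a CBC solver entry
```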
@@ -0,0 +1,47 @@
# Introduction

In recent years, the deployment of large-scale machine learning models has become increasingly important. However, distributed training systems often require **manual parallelization plans**, which can be complex and demand expert knowledge in system engineering and configuration. This is a challenge for most AI developers without such skills, and the need for manual parallelization makes deploying large-scale machine learning models difficult and expensive.

**Colossal-Auto** simplifies the process of deploying large-scale machine learning models for AI developers. Compared to other solutions that require manual configuration of complex parallel policies and model modification, Colossal-Auto only requires one line of code from the user, along with cluster information and model configurations, to enable distributed training. Technically, it seamlessly **integrates with popular AI model frameworks like Hugging Face and Timm**.

## Overview

<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_parallel.png"/>
</figure>

## Usage

```python
# wrap the model using auto_engine
model = autoparallelize(model, meta_input_samples)
# normal training loop
...
```
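As an illustration of the `meta_input_samples` argument above, here is a minimal sketch of how meta inputs can be built with plain PyTorch; the batch size, input shape, and dict format are assumptions, and the exact format expected by `autoparallelize` should be taken from the quick demo examples.

```python
import torch

# Tensors on the "meta" device carry only shape/dtype information, so the model
# can be traced without allocating real GPU memory (the shape here is illustrative).
meta_input_samples = {'x': torch.empty(16, 3, 224, 224, device='meta')}
print(meta_input_samples['x'].shape, meta_input_samples['x'].device)
```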
## Graph Tracing

Colossal-Auto is **the first auto-parallelism system** that uses static graph analysis based on the PyTorch framework. Obtaining a static execution plan for PyTorch, a dynamic graph framework, has long been an area of research in the field of machine learning systems. Colossal-Auto uses ColoTracer, a forked version of the torch.FX Tracer, to guide the search for an optimal parallelization strategy. The meta-information of each tensor, such as its shape, dimensions, and dtype, is computed and recorded during the tracing process. This approach generalizes better, as it is not tied to specific models or configurations.
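To make the tracing idea concrete, here is a small sketch that uses vanilla torch.FX (not ColoTracer) to record per-node shape and dtype meta-information; the model and input shape are illustrative.

```python
import torch
import torch.fx
from torch.fx.passes.shape_prop import ShapeProp

# Trace a tiny model and propagate a sample input so that every node records
# its tensor meta-data; ColoTracer builds on the same mechanism.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
gm = torch.fx.symbolic_trace(model)
ShapeProp(gm).propagate(torch.randn(2, 8))

for node in gm.graph.nodes:
    meta = node.meta.get('tensor_meta')
    if meta is not None:
        print(node.name, tuple(meta.shape), meta.dtype)
```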
## Fine-grained Parallelism Search

Colossal-AI's auto-parallelism searches for a strategy for each operand, with the goal of achieving the fastest runtime while meeting memory budget constraints. It ultimately determines the strategy used at actual training time, including the split strategy for each tensor, the type of communication operators to insert between different computing nodes, whether to replace operators, etc. The tensor, data, and hybrid parallelism schemes, such as the column and row splits used by NVIDIA in Megatron-LM and other parallelism systems, are all subsets of the strategies that Colossal-AI can search. Beyond these manually specified parallelisms, Colossal-AI can choose a distinct parallelism method for each operation, potentially finding a better strategy than a human expert could provide.
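As a toy, single-process illustration of two such per-operator strategies, the snippet below computes a linear projection with a column split and with a row split of its weight; in a real distributed run each shard would live on a different device and the row-split sum would be an all-reduce.

```python
import torch

# Toy illustration (not the Colossal-AI API): column-split vs. row-split of one
# weight matrix, two of the per-operator strategies the search can choose from.
x = torch.randn(4, 8)          # (batch, in_features)
w = torch.randn(8, 6)          # (in_features, out_features)

# Column parallelism: each "device" holds a slice of the output columns.
w_cols = torch.chunk(w, 2, dim=1)
y_col = torch.cat([x @ shard for shard in w_cols], dim=1)

# Row parallelism: each "device" holds a slice of the weight's input rows, and
# the partial results are summed (an all-reduce in a real distributed setting).
x_parts = torch.chunk(x, 2, dim=1)
w_rows = torch.chunk(w, 2, dim=0)
y_row = sum(xp @ wr for xp, wr in zip(x_parts, w_rows))

print(torch.allclose(y_col, x @ w), torch.allclose(y_row, x @ w, atol=1e-6))
```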
## Distributed Tensor and Shape-Consistency System

The Colossal-AI system uses a device-mesh, similar to PyTorch's latest DTensor release, to manage its cluster. Colossal-AI uses a sharding-spec to annotate the storage status of each tensor and to facilitate their distribution across the cluster. It also employs a shape-consistency manager that automatically transforms tensors between different sharding-specs, allowing tensors to be sliced and diced seamlessly. The shape-consistency manager ensures that the output of an upstream operand is consistently stored in the cluster, regardless of how the input of the downstream operand is stored. This makes Colossal-AI highly versatile and easy to use, as users do not need to worry about the storage status of tensors when performing operations on them.

<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/shape_consistency.png"/>
</figure>

Here are some key advantages of Colossal-AI compared to PyTorch DTensor:

- Colossal-AI's device-mesh uses cluster performance metrics and profiling results to estimate the time consumption of different communication operators. This helps Colossal-AI optimize communication between nodes and improve overall system efficiency.
- Colossal-AI's shape-consistency manager uses a greedy search algorithm to find relatively efficient ways to transform tensors between different sharding-specs, rather than simply transforming dimensions one by one. This can lead to more efficient and effective transformations.
- The integration of all-to-all operations in Colossal-AI increases the scalability of the system by enabling more efficient communication between nodes. This is especially useful for large-scale machine learning tasks that require the transfer of large amounts of data between nodes.
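To make the sharding-spec transform concrete, here is a toy, single-process sketch (not the Colossal-AI API) of converting a tensor from a row-sharded layout to a column-sharded one; on a real cluster the gather step would be an all-gather communication inserted by the shape-consistency manager.

```python
import torch

# Toy illustration: re-shard a tensor from "split along dim 0" to "split along dim 1".
full = torch.arange(16.0).reshape(4, 4)

row_shards = list(torch.chunk(full, 2, dim=0))        # old spec: sharded along dim 0

# Re-shard: gather the row shards (an all-gather on a real cluster), then
# re-split along the column dimension.
gathered = torch.cat(row_shards, dim=0)
col_shards = list(torch.chunk(gathered, 2, dim=1))    # new spec: sharded along dim 1

print([tuple(s.shape) for s in col_shards])           # [(4, 2), (4, 2)]
print(torch.equal(torch.cat(col_shards, dim=1), full))
```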
@@ -0,0 +1,17 @@
# Quick Demo

Colossal-Auto simplifies the process of deploying large-scale machine learning models for AI developers. Compared to other solutions that require manual configuration of complex parallel policies and model modification, Colossal-Auto only requires one line of code from the user, along with cluster information and model configurations, to enable distributed training. Quick demos showing how to use Colossal-Auto are given below.

### 1. Basic usage

Colossal-Auto can be used to find a hybrid SPMD parallel strategy that includes data and tensor parallelism (i.e., 1D, 2D, sequential) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
Detailed instructions can be found in its `README.md`.

### 2. Integration with activation checkpoint

Colossal-Auto's automatic search function for activation checkpointing finds the most efficient checkpoint within a given memory budget, rather than just aiming for maximum memory compression. To avoid a lengthy search for an optimal activation checkpoint, Colossal-Auto uses a two-stage search process. This allows the system to find a feasible distributed training solution in a reasonable amount of time while still benefiting from activation checkpointing for memory management. The integration of activation checkpointing in Colossal-AI improves the efficiency and effectiveness of large model training. You can follow the [Resnet example](TBA).
Detailed instructions can be found in its `README.md`.

<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_ckpt.jpg"/>
</figure>
@@ -0,0 +1,124 @@
# Add Your Own Parallel Mode

Author: Shenggui Li, Yongbin Li

**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)

## Introduction

To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
with less effort, we have decoupled the various components in the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.

The main components are:

1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`

**This currently requires some changes to the source code, so we recommend that you install from source with the `-e` flag.
The `-e` flag makes the installation editable, so your code changes will be reflected in your Python runtime.
We will work on this to avoid changes to the source code in future releases.**

## Process Group Initializer

Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same
process group. Different parallel algorithms require different process groups to be created. Colossal-AI provides a
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create a new distributed initialization.

1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...

    NEW_MODE = 'new_mode'  # define your mode here
```

2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
arguments, you can add them after the fixed ones, like `arg1, arg2` in the example below. Lastly, register your initializer to the
registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups
        pass

```
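For illustration, the snippet below sketches the kind of grouping logic an `init_parallel_groups` implementation computes; the group size and the consecutive-rank layout are assumptions, and a real initializer must also call `torch.distributed.new_group(ranks=...)` for each group and follow the attribute/return contract of `ProcessGroupInitializer`.

```python
# Sketch only: decide which ranks would share a process group for a new
# parallel mode, assuming groups of consecutive ranks of size group_size.
world_size, group_size = 8, 4          # illustrative values
groups = [list(range(start, start + group_size))
          for start in range(0, world_size, group_size)]
print(groups)                          # [[0, 1, 2, 3], [4, 5, 6, 7]]
```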
Then, you can insert your new initializer into the current mode-to-initializer mapping
in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert the new key-value pair dynamically.

```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```

3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
the `ParallelContext` to create your initializer and initialize your desired process groups.

```python
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
)
```

## Gradient Handler

Gradient handlers are objects that execute all-reduce operations on parameters' gradients. As different all-reduce
strategies may be executed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
uses the normal data parallel gradient handler, which all-reduces the gradients across data parallel ranks. The data
parallel gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
gradient handler as below:

```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler

@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        do_something()

```
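As a concrete, hedged example, the handler below all-reduces and averages every parameter gradient with plain `torch.distributed`; the `self._model` attribute is an assumption about what the base class stores, so check `BaseGradientHandler` before copying this.

```python
import torch.distributed as dist

from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):
    # Sketch: average gradients across all ranks (the _model attribute name is an assumption).

    def handle_gradient(self):
        world_size = dist.get_world_size()
        for param in self._model.parameters():   # assumes the base class keeps the model
            if param.grad is not None:
                dist.all_reduce(param.grad)       # sum across data-parallel ranks
                param.grad.div_(world_size)       # then average
```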
Afterwards, you can specify the gradient handler you want to use in your configuration file.

```python
gradient_handlers = [
    dict(type='YourGradientHandler'),
]
```

## Schedule

A schedule determines how the forward and backward passes are executed. Currently, Colossal-AI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_back_step` function.
36 changes: 36 additions & 0 deletions
docs/source/en/advanced_tutorials/define_your_own_parallel_model.md
@@ -0,0 +1,36 @@
# Define your own parallel model

Author: Zhengda Bian, Yongbin Li

> ⚠️ We are working on this documentation to make it more detailed. We will introduce the mechanisms of the different parallelisms
> and how to use them to write a model.

Let's say that you have a huge MLP model with billions of parameters whose extremely large hidden layer size makes it
impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI,
you can write your model the same way you are used to writing models for a single GPU, while Colossal-AI automatically
splits your model weights and fits them perfectly into a set of GPUs. We give a simple example below showing how to write a
2D parallel model in the Colossal-AI context.

## Write a simple 2D parallel model

```python
from colossalai.nn import Linear2D
import torch.nn as nn

class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
```
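Note that `Linear2D` only works when a matching tensor-parallel mode is configured at launch. Based on the configuration format shown in the tutorial on adding your own parallel mode, it would look roughly like the sketch below; the `size=4` and `mode='2d'` values are assumptions to verify against the parallelization tutorial (2D parallelism expects a square number of tensor-parallel GPUs).

```python
# A hedged sketch of the parallel configuration for 2D tensor parallelism;
# the size and mode values are assumptions, not a verified configuration.
parallel = dict(
    pipeline=dict(size=1),
    tensor=dict(size=4, mode='2d'),
)
```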
## Use pre-defined models

For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*, *MoE*,
and *GPT*. Feel free to customize them into different sizes to fit your specific needs.