Optimize pipeline schedule #94

Merged: 14 commits merged into main from feature/pipeline on Dec 30, 2021

Conversation

ver217 (Member) commented on Dec 29, 2021:

Add pipeline shared module wrapper

This feature is especially useful when training GPT/BERT, where the word embedding layer is used by both the first and the last pipeline stage.

Usage:

# Assumes PipelineGPT1D, PipelineSharedModuleWrapper and _partition_uniform are importable
# from your model/builder code, and that num_layers, num_chunks, kwargs and device are defined.
import torch.nn as nn
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
rank = gpc.get_global_rank()

# The first and the last pipeline stages share the word embedding.
wrapper = PipelineSharedModuleWrapper([0, pipeline_size - 1])

# Split num_layers uniformly over the pipeline stages; each stage may hold several chunks.
parts = _partition_uniform(num_layers, pipeline_size, num_chunks)[pipeline_rank]
models = []
for start, end in parts:
    kwargs['num_layers'] = end - start
    kwargs['first'] = start == 0
    kwargs['last'] = end == num_layers
    print(f'==> Rank{rank} build layer {start}-{end}, total {num_layers}')
    chunk = PipelineGPT1D(**kwargs).to(device)
    if start == 0:
        # On the first stage, register the word embedding as the shared module.
        wrapper.register_module(chunk.embedding.word_embeddings)
    elif end == num_layers:
        # On the last stage, register the head, which uses the word embedding weights.
        wrapper.register_module(chunk.head)
    models.append(chunk)
if len(models) == 1:
    model = models[0]
else:
    model = nn.ModuleList(models)

PipelineSharedModuleWrapper must be initialized on all ranks, and PipelineSharedModuleWrapper.register_module() should be called only on the ranks that share the module. Modules have to be moved to the corresponding device (according to your distributed backend) before register_module() is called.

Update load_batch() of schedule

We updated the batch-loading rule in the schedule to support GPT/BERT training with pipeline parallelism.

Please make sure that each item your dataset returns is a (data, label) tuple, where data and label are each a torch.Tensor or a dict. When sync_data is set to True in the schedule, the values of the dict must be torch.Tensor. Note that when your dataset returns a dict, its keys must match the argument names of your model.forward() or loss_function.forward().
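For illustration, a minimal sketch of a dataset that follows this rule, assuming a model whose forward() takes input_ids and attention_mask (the class and key names are hypothetical, not part of this PR):

import torch
from torch.utils.data import Dataset

class ToyGPTDataset(Dataset):
    """Returns (data, label) where data is a dict whose keys match model.forward()."""

    def __init__(self, num_samples=1024, seq_len=128, vocab_size=50257):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        input_ids = torch.randint(0, self.vocab_size, (self.seq_len,))
        attention_mask = torch.ones(self.seq_len, dtype=torch.long)
        data = {'input_ids': input_ids, 'attention_mask': attention_mask}
        label = input_ids.clone()  # e.g. language modelling reuses the input ids as labels
        return data, label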

When using pipeline parallelism, the input of the first layer comes from the dataloader. For the other layers, the first argument of forward() is the output of the previous pipeline stage, and the remaining arguments come from the dataloader. Note that each layer can only return one tensor from forward().
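A minimal sketch of what a non-first pipeline stage could look like under this rule (the module and argument names are illustrative only):

import torch.nn as nn

class MiddleStage(nn.Module):
    """Illustrative pipeline stage: the first forward() argument is the previous
    stage's output; the remaining arguments come from the dataloader dict."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.block = nn.Linear(hidden_size, hidden_size)  # stand-in for transformer layers

    def forward(self, hidden_states, attention_mask=None):
        # Only a single tensor may be returned to the next stage.
        return self.block(hidden_states)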

Optimize GPU memory usage of pipeline schedule

Added an argument (return_output_label) to schedule.forward_backward_step, trainer.fit and trainer.evaluate. When it is set to False, the model output and the labels are not returned, which further reduces GPU memory usage, especially when using pipeline parallelism.

Optimized loss accumulation in the pipeline schedule: the loss is now accumulated via loss.detach() to avoid unexpectedly large memory usage.

Example:

trainer.fit(
    train_dataloader=train_dataloader,
    epochs=num_epochs,
    test_interval=1,
    hooks=hook_list,
    display_progress=True,
    return_output_label=False
)
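As for the loss-accumulation change above, the point is to detach each micro-batch loss before adding it to the running total, so the accumulator does not keep the autograd graph alive. A minimal standalone sketch of the pattern (not the schedule's actual code):

import torch

model = torch.nn.Linear(4, 1)
accum_loss = torch.zeros(1)
for _ in range(8):                      # pretend each iteration is a micro-batch
    out = model(torch.randn(2, 4))
    loss = out.pow(2).mean()
    loss.backward()                     # backward still uses the graph
    accum_loss += loss.detach()         # accumulate a graph-free copy for logging
print(accum_loss / 8)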

Reduce communication of pipeline schedule

Added an argument tensor_shape to PipelineSchedule and InterleavedPipelineSchedule. You can set this argument to a Union[torch.Size, List[int], Tuple[int]] if the tensor shapes transmitted along the pipeline are the same and fixed during training. Setting it further reduces communication.

Example:

schedule = InterleavedPipelineSchedule(num_micro_batches, num_model_chunks,
                                       tensor_shape=tensor_shape)
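For a GPT-like model, the transmitted tensor is typically the hidden states of one micro-batch, so tensor_shape could be built as below (the hyper-parameter names and values are assumptions for illustration; the exact shape depends on your model):

batch_size = 8
num_micro_batches = 4
seq_len = 1024
hidden_size = 768

micro_batch_size = batch_size // num_micro_batches
tensor_shape = (micro_batch_size, seq_len, hidden_size)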

ver217 requested a review from FrankLeeeee on December 29, 2021.
FrankLeeeee (Contributor) commented:

Hi @ver217. In your demo, I do not see PipelineGPT1D in the changed files. Also, if _partition_uniform will be called by the user, it should be public, yet it starts with an underscore. I have one question as well:

colossalai/engine/gradient_handler/_pipeline_parallel_gradient_handler.py

1. On line 39, is division by the world size required before the all-reduce?
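For context on the question: dividing each rank's gradient by the world size and then summing with all-reduce yields the average, the same result (up to floating-point error) as summing first and dividing afterwards. A minimal standalone sketch of that pattern with torch.distributed (not necessarily what the handler does):

import torch
import torch.distributed as dist

def allreduce_mean(grad: torch.Tensor) -> torch.Tensor:
    """Average a gradient tensor across all ranks in the default process group."""
    grad = grad / dist.get_world_size()          # pre-divide on every rank...
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # ...then SUM produces the mean
    return grad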

FrankLeeeee merged commit 96780e6 into main on Dec 30, 2021.
ver217 deleted the feature/pipeline branch on January 4, 2022.