[pipeline] refactor pipeline #679
Conversation
Is this a good practice? In this way, users rely more heavily on the config file, which is different from PyTorch's pattern.
Actually, this PR does not add any additional requirements to the config file except setting NUM_MICRO_BATCHES in pipeline parallel mode. We could have inferred those arguments for the schedule before, but we didn't. This PR just fills that gap.
I think a good way to define the tensor shape for the pipeline is to set a static batch size, sequence length, and hidden size for the model. The schedule reads the sizes from the model. If they are None, we transfer the shape in the first step.
I agree with you, but I couldn't find a standard for defining the config file, so the rule in this PR is induced from the example configs. If they are None, we get the shape in the warmup stage.
In fact, I am pondering the necessity of setting a static tensor shape at all. In my opinion, there are several problems if we ask the user to set the static tensor shape.
I propose to solve this problem by recording the tensor shape on the fly. For the first iteration, we check the tensor shape and data type and store this meta information in an object (e.g. PipelineMetaTracker). Subsequent iterations will reuse the meta information stored in it for communication between pipeline stages. When a different tensor shape is detected, the object updates its data. In this way, we can solve problems 1-4.
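To make the proposal concrete, here is a minimal, dependency-free sketch of what a tracker like the `PipelineMetaTracker` named above could look like. The class name comes from the comment, but the API (`update`/`get`) and the idea of returning a flag when the metadata changes are assumptions for illustration, not the actual ColossalAI implementation:

```python
class PipelineMetaTracker:
    """Sketch of the proposed tracker: records tensor metadata on the fly
    so later iterations can skip the shape handshake between stages."""

    def __init__(self):
        # (shape, dtype) observed in the most recent iteration, or None
        self._meta = None

    def update(self, shape, dtype):
        """Record metadata for the current iteration.

        Returns True if the metadata changed, signalling that a new
        shape handshake is needed before the next communication.
        """
        meta = (tuple(shape), dtype)
        changed = meta != self._meta
        self._meta = meta
        return changed

    def get(self):
        """Return the cached (shape, dtype) pair, or None before warmup."""
        return self._meta


tracker = PipelineMetaTracker()
print(tracker.update((4, 128, 768), "float16"))  # first iteration: handshake needed
print(tracker.update((4, 128, 768), "float16"))  # same meta: reuse cached info
print(tracker.update((2, 128, 768), "float16"))  # smaller last batch: re-sync
```

Under this scheme only iterations whose metadata differs from the cached entry pay the handshake cost, which addresses the variable-shape cases discussed above.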
@FrankLeeeee In what cases do we need to transfer multiple tensors across the pipeline?
In cases such as dual-path networks; one example is SlowFast, one of the most popular video networks.
Yes. Since we put the schedule into the engine, not only that option but the whole initialization process of the schedule is hidden. From the user's point of view, they don't have to care about how to initialize the schedule.
Infer pipeline schedule parameters, such as tensor_shape or scatter_gather_tensors, from the user config file.
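The inference rule discussed above (static sizes when the config provides them, warmup-stage discovery otherwise) could be sketched as follows. The config keys `BATCH_SIZE`, `SEQ_LENGTH`, `HIDDEN_SIZE`, and `NUM_MICRO_BATCHES` follow the naming seen in the thread, but the exact rule here is an assumption for illustration, not the PR's actual code:

```python
def infer_tensor_shape(config):
    """Sketch: derive the pipeline communication tensor shape from a
    user config dict, or return None to fall back to warmup-stage
    shape discovery (as discussed in the thread above)."""
    required = ("BATCH_SIZE", "NUM_MICRO_BATCHES", "SEQ_LENGTH", "HIDDEN_SIZE")
    if not all(key in config for key in required):
        # Static sizes not fully specified: let warmup discover the shape.
        return None
    # Each micro-batch carries BATCH_SIZE / NUM_MICRO_BATCHES samples.
    micro_batch_size = config["BATCH_SIZE"] // config["NUM_MICRO_BATCHES"]
    return (micro_batch_size, config["SEQ_LENGTH"], config["HIDDEN_SIZE"])


full_cfg = {"BATCH_SIZE": 32, "NUM_MICRO_BATCHES": 4,
            "SEQ_LENGTH": 128, "HIDDEN_SIZE": 768}
print(infer_tensor_shape(full_cfg))                    # static shape inferred
print(infer_tensor_shape({"NUM_MICRO_BATCHES": 4}))    # None: warmup discovers it
```

Returning None rather than raising keeps static shapes optional, matching the thread's agreement that users should not be forced to specify them.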