Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pipeline transformer for wait/record node #3513

Merged
merged 13 commits into from
Apr 23, 2020

Conversation

xzhu1900
Copy link
Contributor

Description: Add pipeline transformer to insert wait/record node to pipeline training graph

Motivation and Context
pipeline training requires communication between model partitions. The communication/syncing is done through inserting related ops (send/recv/wait/record) in the graph.

This change includes:

  • pipeline transformer frame with code that inserts wait/record nodes.
  • offline graph split script update

@xzhu1900 xzhu1900 requested a review from a team as a code owner April 13, 2020 23:55
@xzhu1900 xzhu1900 added the training issues related to ONNX Runtime training; typically submitted using template label Apr 13, 2020
@xzhu1900 xzhu1900 requested a review from SherlockNoMad April 13, 2020 23:58
Node* recv_bw{nullptr};
for (auto& node : graph.Nodes()) {
if (node.OpType() == "Send") {
if (node.Description() == "Backward pass") {
Copy link
Contributor

@ke1337 ke1337 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node.Description() == "Backward pass" [](start = 10, length = 37)

Can we make it a util function to check if node is in bw graph? We might change its representation later, for example, turning the string into json to carry more info. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll change it.


In reply to: 408339164 [](ancestors = 408339164)

}

ORT_RETURN_IF_NOT(
(send_fw && recv_bw) || (!send_fw && !recv_bw),
Copy link
Contributor

@ke1337 ke1337 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(send_fw && recv_bw) || (!send_fw && !recv_bw) [](start = 10, length = 46)

send_fw == recv_bw #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do that, right? As they are two different pointers.


In reply to: 408340457 [](ancestors = 408340457)

Copy link
Contributor

@ke1337 ke1337 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, was thinking them as bool, then maybe !send_fw == !recv_bw


In reply to: 408344271 [](ancestors = 408344271,408340457)

Copy link
Contributor

@liwchang liwchang Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use more readable ones, like nullptr!=send_fw.

#Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is more like (send_fw XNOR recv_bw). I have updated the code based on Ke's suggestion.


In reply to: 408378340 [](ancestors = 408378340)


auto mp = model->ToProto();
std::ofstream ofs(filename + "_back.onnx", std::ofstream::binary);
std::cout << "output to file " << filename + "_back.onnx" << std::endl;
Copy link
Contributor Author

@xzhu1900 xzhu1900 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll delete this. #Resolved

auto load_gradient_graph = [](std::string& filename) {
auto config = MakeBasicTrainingConfig();

config.use_pipeline = true;
Copy link
Contributor

@ke1337 ke1337 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.use_pipeline = true; [](start = 4, length = 27)

how do we insert Send Recv? I thought the config is supposed to have some info about where the cut point is, then the python script pipeline_model_split.py can go away? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that will be the next step -- adding the logic of graph cut into pipeline_transformer.


In reply to: 408351407 [](ancestors = 408351407)


# new edge is being added, need to re-inference shape
if need_shape_inference:
model = onnx.shape_inference.infer_shapes(model)

# after all need-to-be-cut edges identified, split the graph
new_sends, new_receives = split_graph(model, split_edge_groups)
sub_graphs = generate_subgraph(model, new_receives)
remove_identity(model, new_identity)
Copy link
Contributor Author

@xzhu1900 xzhu1900 Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_identity(model, new_identity) [](start = 4, length = 36)

It seems identity node doesn't register its gradient op. So to work around it, after the graph is split, remove the inserted identity node. #Resolved

}

void AddRecordBackward(Graph& graph, Node* send_bw, std::vector<std::string>& new_input_names) {
// gradient graph can contain some dangling leaf nodes. Add them all to WaitEvent
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this logic implies that adding Optimizer must happen after applying pipeline-transformer. Otherwise, we will have deadlock because Optimizer does not always run when we have gradient accumulation. In other words, adding Record prevents partial graph evaluation. Please add a note here to help future devs. #Resolved

void AddRecordBackward(Graph& graph, Node* send_bw, std::vector<std::string>& new_input_names) {
// gradient graph can contain some dangling leaf nodes. Add them all to WaitEvent
// backward node's input.
auto add_leaf_node = [&](std::vector<NodeArg*>& input_args) {
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lambda function seems too long. Can we make it an isolated function? #Resolved

AddInputEvent(graph, "RecordEvent", false /* is_forward */, input_args, new_input_names);

if (send_bw) {
// if we have send op in backward pass (at the end of the graph), add the RecordEvent op after that.
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// if we have send op in backward pass (at the end of the graph), add the RecordEvent op after that.
// if we have Send in backward pass (at the end of the graph), we make sure the RecordEvent happens after that send by adding Send's outputs to RecordEvent's input list.

Also, I think you only need one output from Send. #Resolved

Copy link
Contributor Author

@xzhu1900 xzhu1900 Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Send only has one output :-)


In reply to: 409739947 [](ancestors = 409739947)

// only check backward node
continue;
}
bool find_consumer_nodes = false;
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Determining if a node is leaf can be another function. #Resolved

std::begin(send_bw->MutableOutputDefs()),
std::end(send_bw->MutableOutputDefs()));
}
add_leaf_node(input_args);
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_leaf_node is not a good name, because the following line is a line which adds a leaf node. #Resolved

ONNX_NAMESPACE::TypeProto type_proto(*(old_input->TypeAsProto()));

auto& wait_output = graph.GetOrCreateNodeArg(new_name, &type_proto);
output_args.push_back(&wait_output);
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

push_back a pointer of local variable? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_output is not a local variable. It is a reference of a node that is created and added into graph. So the lifetime of wait_output is controled by graph object and it won't get distroyed after existing this function.


In reply to: 409776283 [](ancestors = 409776283)


auto& wait_output = graph.GetOrCreateNodeArg(new_name, &type_proto);
output_args.push_back(&wait_output);
input_args.push_back(old_input);
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please push_back(old_input) earlier than push_back(new_input). Code should happen as early as possible. #Resolved

}

Status AddWaitForward(Graph& graph, Node* recv_fw, std::vector<std::string>& new_input_names) {
auto update_wait_input_output = [&](NodeArg* old_input,
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto update_wait_input_output = [&](NodeArg* old_input,
// Append old_input to input_args and return its pass-through value. Note that
// input_args and output_args are Wait's inputs and outputs, respectively.
auto update_wait_input_output = [&](NodeArg* old_input,
``` #Resolved

nullptr,
kMSDomain);
} else {
// the first stage doesn't have recv_fw. Add Wait for all inputs.
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies even if there is no Recv. Do we really need to separate recv_fw==NULL and recv_fw!=NULL? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, depending on whether recv_fw is nullptr or not, the input and output of Wait will be different, thus needs to be handled differently.


In reply to: 409781157 [](ancestors = 409781157)

Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Can't we treat Recv as a normal operator and connect Recv to outputs of WaitEvent? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in thise case, there is no Recv op in forward pass (first stage), so instead of conect Recv to otput of Wait, we inset Wait between input tensor and its consuming ops.


In reply to: 409822235 [](ancestors = 409822235)

return (node.Description() == "Backward pass");
}

void AddInputEvent(Graph& graph, const std::string& op_name,
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why it doesn't return the event (or its name) it adds? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok either way. I feel this style saves one std::move for the returning string. But if you have a strong prefernce over the other, I can change it too.


In reply to: 409789701 [](ancestors = 409789701)

AddInputEvent(graph, "WaitEvent", true /* is_forward */, input_args, new_input_names);
const std::vector<const NodeArg*>& graph_inputs = graph.GetInputs();

ORT_RETURN_IF_NOT(
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we should throw. #Resolved

NodeArg* mutable_input = graph.GetNodeArg(input_arg->Name());
auto& wait_output = update_wait_input_output(mutable_input, input_args, output_args);
std::vector<Node*> nodes = graph.GetMutableConsumerNodes(input_arg->Name());
for (auto& consumer_node : nodes) {
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire loop should be a function. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for loop is used once and the logic is pretty straight forward. Why do we need to put it into a function?


In reply to: 409795835 [](ancestors = 409795835)


AddRecordBackward(graph, send_bw, new_input_names);
ORT_RETURN_IF_ERROR(AddWaitForward(graph, recv_fw, new_input_names));
ORT_RETURN_IF_ERROR(AddRecordForwardWaitBackward(graph, send_fw, recv_bw, new_input_names));
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is AddRecordForwardWaitBackward applied to the last pipeline stage? If no, please use an if block to avoid calling AddRecordForwardWaitBackward for that case. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's apply to all the stages except the last stage :-) The check is down through checking send_fw && recv_bw. It's less ideal. But I think once we have the information of what the current stage is, we can have a better check in this case.


In reply to: 409797150 [](ancestors = 409797150)

}

Status AddRecordForwardWaitBackward(Graph& graph, Node* send_fw, Node* recv_bw, std::vector<std::string>& new_input_names) {
ORT_RETURN_IF_NOT(
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw? If there is an error, the graph is likely not runnable. #Resolved

"has send forward: ",
send_fw, " and recv backward: ", recv_bw);

auto create_node_args = [&](const NodeArg* base_arg) -> NodeArg& {
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto create_node_args = [&](const NodeArg* base_arg) -> NodeArg& {
auto create_node_arg = [&](const NodeArg* base_arg) -> NodeArg& {
``` #Resolved

auto create_node_args = [&](const NodeArg* base_arg) -> NodeArg& {
const auto& new_name = graph.GenerateNodeArgName(base_arg->Name());
ONNX_NAMESPACE::TypeProto type_proto(*(base_arg->TypeAsProto()));
return graph.GetOrCreateNodeArg(new_name, &type_proto);
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should throw if the node we want to create already exists. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I don't find a way to identify if the returned node is old or new, so added a check for existing node before creating. It would hurt the perf a little bit, but it is a one time thing on initialization so I guess it's OK.


In reply to: 409827827 [](ancestors = 409827827)

ONNX_NAMESPACE::TypeProto event_type_proto;
event_type_proto.mutable_tensor_type()->set_elem_type(ONNX_NAMESPACE::TensorProto_DataType_INT64);

auto event_id_name = graph.GenerateNodeArgName(op_name + (is_forward ? "_fw" : "_bw") + "_event_id");
Copy link
Contributor

@wschin wschin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should be passed in when calling pipeline transformer in TrainingSession code. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it can either be passed in or as a return value returned by pipeline transformer. Let's discuss this when integrating with your code.


In reply to: 409830743 [](ancestors = 409830743)

@@ -25,6 +25,7 @@ namespace test {
namespace {
constexpr auto ORIGINAL_MODEL_PATH = ORT_TSTR("testdata/test_training_model.onnx");
constexpr auto BACKWARD_MODEL_PATH = ORT_TSTR("testdata/temp_backward_model.onnx");
constexpr auto PIPELINE_MODEL_BASE = ORT_TSTR("testdata/test_training_model_");
Copy link
Contributor Author

@xzhu1900 xzhu1900 Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

("testdata/test_training_model_"); [](start = 45, length = 34)

Intending to add the split models into repo for this test. Searched test_training_model.onnx in github repo but didn't find where it originally comes from. Maybe I missed some folder? #Resolved

@xzhu1900 xzhu1900 force-pushed the xuzhu/pipeline_transformer branch from e40d796 to 87b0cfd Compare April 21, 2020 23:42
// add optimizer or gradient accumulation
if (config.optimizer_config.has_value()) {
OptimizerGraphConfig opt_graph_config{};
std::unordered_map<std::string, OptimizerNodeConfig> opt_node_configs{};
ORT_RETURN_IF_ERROR(SetupOptimizerParams(
weight_names_to_train, fp32_weight_name_to_fp16_node_arg,
weights_to_train_, fp32_weight_name_to_fp16_node_arg,
Copy link
Contributor

@wschin wschin Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
weights_to_train_, fp32_weight_name_to_fp16_node_arg,
float_weights_names_to_train, fp32_weight_name_to_fp16_node_arg,
``` #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weights_to_train_ is been used for many places, I think I'll just keep using this member variable instead of creating a new one. There are also checks to verify weights_to_train_'s value matches config value. So I'll keep the code simple by just keep one version.


In reply to: 412685534 [](ancestors = 412685534)

}

// if we have a send forward op followed by a recv backward op, insert WaitEvent and RecordEvent in between.
Node* record_node = nullptr;
Copy link
Contributor

@wschin wschin Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Node* record_node = nullptr;
Node* record_node = nullptr;
Node* wait_node = nullptr;

just for readability. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_node is not used. Having it declared here will trigger an compiler error.


In reply to: 412686589 [](ancestors = 412686589)

Copy link
Contributor

@liwchang liwchang Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an ORT macro allowing specifying unused object. #Resolved


auto mp = model->ToProto();
std::ofstream ofs(filename + "_back.onnx", std::ofstream::binary);
mp.SerializeToOstream(&ofs);
Copy link
Contributor

@wschin wschin Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if forward/backward Send/Recv are inside the model. #Resolved

wschin
wschin previously approved these changes Apr 22, 2020
Copy link
Contributor

@wschin wschin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

wschin
wschin previously approved these changes Apr 22, 2020
Copy link
Contributor

@wschin wschin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@xzhu1900 xzhu1900 requested a review from wschin April 23, 2020 03:04
@xzhu1900 xzhu1900 merged commit f1ba9aa into ort_training Apr 23, 2020
@xzhu1900 xzhu1900 deleted the xuzhu/pipeline_transformer branch April 23, 2020 06:28
edgchen1 added a commit that referenced this pull request Apr 24, 2020
* Introduce training changes.

* Enable CI for training.

* Change Tensor::[Set]ByteOffset() to use ptrdiff_t.

* Add back orttraining-linux-gpu-inference-only-ci-pipeline.yml. (#3182)

* Initial implementation of graph cut and pipeline

This is a draft of graph cut and wait/record to demonstrate cut and Wait/Record design. You may find sub models and profiling json under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"

* Merged PR 5686: fix P100/fp16 issues

1. misaligned address in atomic_add()
2. GatherNDGradKernel to use atomic_add
3. enable/add UTs for GatherNDGrad and reduction_ops using half
- __CUDA_ARCH__ won't take effect on .cc code, leverage HasCudaEnvironment() instead
4. verified convergence graph and perf test
- p100 is much slower than v100 on fp16
- fp16/128 need to reduce batch size from 66 to 64 to avoid OOM issue
5. verify convergence test on Dev3/v100

TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)

* Merged PR 5688: Upgrade ONNX submodule to the latest from github ONNX master.

We want to implement SoftmaxCrossentropy and NegativeLossLikelihoodLoss forward training ops for opset-12 but that requires ONNX submodule to point to the latest commit to have the latest and greatest ONNX spec!

- Reverse integrate changes from *.in.proto files in github ONNX repo.
- Regenerate csharp/test/Microsoft.ML.OnnxRuntime.Tests/OnnxMl.cs
- Disable ONNX tests that don't have op implementation for the latest opset.

* Revert change from RelWithDebInfo to Release in OnnxRuntime.CSharp.sln.

* Tweak the dropout calculation.

* Update bert-base convergence values

* Udpate License Header (#3212)

* Fix build issues (#3214)

* Fixed issues with Python and inference-only build.

* Handle ImportError for training imports.

* fix windows build

* fix compile error

* fix centos build

* fix windows build

* fix compile error

* Use SafeInt for allocation calculation, fix typo.

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Register ONNX Training Ops (#3252)

* Add ort_training build status file. (#3257)

* Address PR comments (#3255)

* Added comment for ntfw_remove().

* Rewrite WindowsEnv::DeleteFolder(), some other clean up.

* Address PR comments (#3256)

* comments

* fix path

* fix path

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Remove orttraining/tools/scripts/profile directory. (#3268)

* refactor frontend (#3235)

* refactor frontend

* remove training python files from inferencing build

* update according to reviewer's comments

* merge pybind_state.cc

* refactor pybind_state.cc

* code clean up

* missed a forward declaration in ort_pybind_state.cc

* passed pytest

* move training_session.py into a subfolder per reviewer's comment

* add copyright

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* unittests comments (#3278)

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* fix build break

* Make gradient clipping configurable. (#3243)

* Make gradient clipping configurable.
add control flag to c++ and python frontend

* fix pybind issue introduced by merge

* Implement pipeline event generator (#3206)

Implement pipeline event generator with OneFWOneBW schedule in timeline. Each stage of pipeline contains FW and BW of a subset of the model and are scheduled in one worker thread for each microbatch.

* Aggregated Send/Recv (#3232)

* Aggregated Send/Recv

* fix typos

* CR refine

* CR refine

* CR refine

* Add scalar check.

* typo

* reformat

* CR refine

* Forgot to swap order in the implementation after spec changed

* CR refine

* Cr refine

* add Send's input type checking

* Update ort_trainer.py with lazy onnx export (#3244)

* Delay onnx export to avoid extra info

* handle cases where onnx model is provided at initialization

* address comments

* fix rebase error

* fix python error

* Add bias correction in Adam & Lamb for C++ frontend & python frontend (#3301)

* move env to .cc file

* fix windows build

* address PR comments (#3312)

* address PR comments

* PR comments

* PR comments

* disable logging

* typo

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Expose frozen_weights in PyTorch Frontend (#3317)

* Addressing PR comments (#3334)

* PR comments

* PR comments

* PR comments

* error out bad shape

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* support Huggingface's adamw (#3318)

* add weight decay mode to support both pytorch and huggingface's adamw

* Implement WhereGrad (#3343)

* Don't cast to fp16 in LayernormGrad (#3328)

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* Address PR comments (#3352)

* PR comments

* revert code for a couple comments

* add negative test case

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Address master merge PR comments (#3348)

Address some comments from #3174.

- #3174 (comment)
- #3174 (comment)
- #3174 (comment)
- #3174 (comment)
- #3174 (comment)

* Fix code-base after breaking API changes

* add pipeline graph split script (#3275)

* pipeline graph cut

* add element type

* add input wait event and shape info

* shape inference

* support multiple cuts

* format script

* address feedback

* address feedback

* Fix InferenceSession API

* Update Op's Domain and Version (#3356)

* Update Nccl ops domain opset

* Update ZeroGradient Domain OpSet

* Update InPlaceAccumulator Domain OpSet

* Update SoftmaxGrad Domain and OpSet

* Update LayerNormalizationGrad Domain and OpSet

* Update BatchNormGrad Domain and Opset

* Update IsAllFinite Domain and Opset

* Update DivGrad Domain and Opset

* Update GatherGrad Domain and Opset

* Update IsFinite Domain and OpSet

* Update ReduceAllL2 Domain and Opset

* Update MixedPrecisionScale Doman and Opset

* Update AllOp Domain and Opset

* Update GroupOp Domain and OpSet

* Update ViewOp Domain and OpSet

* PR comments (#3374)

* PR comments

* PR comments

* PR comments

* PR comments

* PR comments

* PR comments

* PR comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Disable tests (temporary)

* Revert _SliceKernel cuda implementation

* Revert Session and InferenceSession implementation

* Disable GradientCheckerTest tests for GPU/Debug build (#3407)

* Disable GradientCheckerTest tests for GPU/Debug build (#3407)

* Revert "Addressing PR comments (#3334)" (#3412)

This reverts commit 131c65d.

* Enable loss scale input from Python frontend (#3327)

Made some fixes to enable loss scale to be wired up to ORT from the Python frontend. In particular, now addition of loss scaling is done unconditionally if mixed precision is enabled. The generated loss scale input name is passed back to the frontend.

Also fixed how inputs were added during the training graph configuration. Graph::SetInputs() was causing some issues - it seems to not be working correctly.

Also added some mixed precision Python frontend tests.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Reapply commit 131c65d; Fix memory regression issue. (#3423)

* Reapply commit 131c65d

* fix merge error

* Disable gradient clipping for E2E test.

* View Op - new unit tests and add support for tensor memcpy by offset/size (#3439)

* view ops UTs

* update per comments

* PR comments - code clean up

* code clean up per comments

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* frontend test to use random seed (#3209)

frontend test to use random seed

* safeint for region bytes in bfc arena and code clean up  (#3447)

* PR comments

* remove build issue workaround

* SafeInt for region bytes

* fix build

* fix build

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* raid rtol to unblock CI (#3457)

raise rtol to avoid expected CI test failure in onnxruntime_test_ort_trainer.py

* Address comments around bfc arena  (#3460)

* rename setting

* todo comments

* fix build

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Fix onnxruntime_unittests.cmake after merge.

* Fix dynamicslice.cc after merge.

* create pipeline for ci frontend tests (#3422)

create pipeline for nightly python front-end e2e tests

* Rename ONNX OPTIONAL to OPTIONAL_VALUE.

* Get cuda_common.h from master.

* Get onnxruntime/core/providers/cuda/tensor/slice.h from ort_training.

* Get onnxruntime/contrib_ops/cuda/bert/fast_gelu.cc from ort_training.

* Get onnxruntime/core/providers/cuda/cu from ort_training.

* Get onnxruntime/core/providers/cuda/math/matmul_integer.cc from ort_training.

* Remove FastGelu from activations.

* Fixes for Where, ConcatGrad and ReduceSumGrad (#3415)

* Fixes for Expand, Where, ConcatGrad ReduceSumGrad.

* Roll back expand, fix, add tests for reduce grad.

* Roll back CPU Expand change.

* Fix after merge.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>

* Remove orttraining/docker directory. (#3476)

The docker images are not publicly available yet.
Addressing PR comment: #3174 (comment)

* Put dropout_default, dropout_random, celu back in the list of broken tests.

* fix internal loss scale (#3483)

* Changed internal loss scale to 1-D

* added test

Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>

* Publish unit test results from Linux and Mac builds (#3480)

* Added publish test results step to Linux and Mac builds.

* Fix test result file pattern.

* Add to list of failing backend tests from master.

* Get cudnn_common.cc from master.

* Remove usage of DeviceProp (which is removed in ort_training) from cudnn_common.cc.

* Put back SubmoduleCheckoutMode parameter into mac-ci.yml.

* Update Graph SetInputs and SetOutputs for training (#3446)

Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.

* Fix fp16 type mismatch when graph output is an fp32-only node (#3411)

* verify output node before changing its type in mixed precision mode

* Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. (#3523)

* MaxBatchSize E2E Test  (#3454)

* max batch size e2e test

*update test data snapshot

* Add Python API to set random seed:  onnxruntime.seed(<seed>)

* Rename API to onnxruntime.set_seed(<seed>)

* Address PR comments and clean up. (#3536)

Address PR comments and clean up.
- #3174 (comment)
- #3174 (comment)

* Put safeint_interface include directory into onnxruntime_common interface include directories to simplify usage by other targets. (#3546)

* SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (#3465)

* Update ONNX submodule commit to the latest.

* build break.

* SoftmaxCrossEntropyLoss: Forward and backward kernel implementation.

* Revert "build break."

This reverts commit 847cb50.

* Add more tests and misc clean up.

* revert unintended changes.

* PR feedback.

* cleanup.

* PR feedback.

* Ort training README (#3404)

Added README for ORT Training

* Fix GraphTest.UnusedValueInfoSerializes.

* Add SafeInt include to WinML targets (#3558)

Fixing Windows builds on the ort_training branch in preparation for the merge to master.
SafeInt (included via onnxruntime/core/common/safeint.h) was recently made a dependency of onnxruntime/core/framework/bfc_arena.h. That requires consumers of bfc_arena to compile with the SafeInt include directory.

* Disable or update flaky tests, improve test random seed accessibility. (#3495)

- Add output of test random seed
- Allow setting of test random seed with environment variable
- Disable / relax tolerance for flaky tests

* subgraph type override handling and unit test (#3560)

* unit test for subgraph type override

* unit test - re-wire input properly to subgraph

* update args

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Clean up docs. (#3579)

* Fix orttraining/README.md formatting.

* Delete ORT_TRAINING_BUILDS.md.

* Fix typo.

* Support ONNX test version parsing from path on Windows in onnx_test_runner. (#3588)

* Add front-end MNIST test (#3231)

* add frontend minst test

* to use torch nightly with torchvision

* remove incorrect comment per reviewer's comment

* experiment torchvision import failure

* experiment install_deps.sh

* more experiment install_deps.sh

* experiment install_deps.sh with --upgrade

* Experiment with install_deps.sh.

* Experiment with install_ubuntu.sh.

* Use Ubuntu 18.04 and Python 3.6 for CI.

* Update cmake version for CI.

* Install MPI on Ubuntu 18.04 for CI.

* Increase tolerance for MNIST test.

* Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.

* Clean-up.

* Update ort_trainer.py from ort_training.

* Get default Ubuntu Python ver back to 3.5.

* Add underscore to opset_version parameter name in ORTTrainer constructor.

* Move loss/model wrap before the call for sample output.

* Update expected values for MNIST test.

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>

* Sync onnx_backend_test_series.py disabled tests (#3603)

Make the set of disabled tests consistent between ort_training and master. Fix some regex patterns.

* Fix merge issue.

* Fix GraphTransformationTests tests.

* Revert "Convert Gelu to use TryParallelFor (#3599)"

This reverts commit 2579a72.

* Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky.

* Add --enable_onnx_tests to Windows builds to allow set up of test data directory.

* Add --skip_onnx_tests to orttraining Windows builds.

* Update Optimizer Domain and Opset (#3602)

* Update Domain and Opset for SGD

* Update Adam Domain and Opset

* Update Lamb Domain and Opset

* Remove Windows CUDA 9 build definition and helper scripts. (#3615)

* Eliminate Useless Cast during Transformer. (#3606)

* Remove Useless Cast during Transformer.

* Resolve comments.

* Check if graph can remove the node.

Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Clean up OPTIONAL name conflict workarounds in ort_training. (#3622)

* Clean up OPTIONAL name conflict workarounds.

* Cleanup unnecessory header files onnx_protobuf.h

Co-authored-by: Sherlock Huang

* Add Lamb shape inference (#3634)

* Refactoring code related to WARP_SIZE. (#3623)

1. Centralize its definition in common.cuh.
2. Rename it to GPU_WARP_SIZE which can be extended to AMD GPU later.
3. Centralize warp shuffle functions.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* fixes for ort_trainer.py to resume from checkpoint (#3510)

* fixes for ort_trainer.py to resume from checkpoint

* define self.state_dict_ during init

* add comment of explanation

* add unit test for restore from checkpoint

* fix file not found

Co-authored-by: suffian khan <sukha@microsoft.com>

* Add check for nullptr in PlannerImpl::FindReusableTensor(). (#3619)

* expose training session so the training app could register custom kernel and transformers (#3642)

Co-authored-by: Cheng Tang <chenta@microsoft.com>

* Expand elimination and Expand gradient. (#3610)

* Expand elmination and Expand gradient.

* Resolve comments.

* Fix test break.

* Check if graph can remove the node.

* Resolve comment.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>

* Try not to modify base name (#3638)

* GatherElementsGrad Kernels (#3627)

* GatherElementsGrad cuda kernel & tests

* Fix comments

* Fix include path

* Add pipeline transformer for wait/record node (#3513)

* pipeline transformer

* clean up

* address feedback

* add record/wait for first stage and updated split script

* address feedback

* make recv/send signal as initializer

* merge

* address feedback

* unify input and initializer

* address feedback and bug fix

* minor fix

* windows build

* fix

* fixed mnist bug (#3569)

* fixed mnist bug

* fixed train_step param

* Simplify and clean code (#3655)

1. It is not necessary to include cudnn_common.h for kernels which are not implemented with CUDNN.
2. Minor change in layer norm kernel to simplify the code and resolve building warning.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>

* Change CentOS build to use agent pool because builds on hosted agents run out of disk space. (#3662)

* disable broken test in DML (#3666)

* temporary disable LSTM_Seq_lens_unpacked for dml test

* temporary disable LSTM_Seq_lens_unpacked for dml test

* temporary disable LSTM_Seq_lens_unpacked

Co-authored-by: Ethan Tao <ettao@microsoft.com>

* Revert "Try not to modify base name (#3638)"

This reverts commit d9641f2.

Reverting to fix onnx_test_runner test failures.

Co-authored-by: Ke Deng <kedeng@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
Co-authored-by: Zeeshan Siddiqui <mzs@microsoft.com>
Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: ytaous <4484531+ytaous@users.noreply.github.com>
Co-authored-by: liqunfu <liqun_fu@hotmail.com>
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Xueyun Zhu <xzhu1900@gmail.com>
Co-authored-by: Tixxx <tix@microsoft.com>
Co-authored-by: Li-Wen Chang <30609447+liwchang@users.noreply.github.com>
Co-authored-by: Bowen Bao <semisqg@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Xueyun Zhu <40807589+xzhu1900@users.noreply.github.com>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
Co-authored-by: liqunfu <liqfu@microsoft.com>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: harshitha <havenka@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: manashgoswami <magoswam@microsoft.com>
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: suffiank <suffiankh@gmail.com>
Co-authored-by: suffian khan <sukha@microsoft.com>
Co-authored-by: Tang, Cheng <souptc@gmail.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: XiaocenDong <63833153+XiaocenDong@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
training issues related to ONNX Runtime training; typically submitted using template
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants