
Fix PipelineEngine.eval_batch result #3316

Merged · 4 commits · May 2, 2023

Conversation

@nrailg (Contributor) commented Apr 20, 2023

With FP16 enabled, PipelineEngine.eval_batch does not broadcast the loss correctly. On the last pipeline stage, eval_batch returns the fp16 loss; on every other stage, eval_batch returns noise.

def _bcast_pipe_scalar(self, data, src_rank=None, dtype=torch.float32):
    # Default to last stage (e.g., for broadcasting loss)
    if src_rank is None:
        src_rank = self.grid.stage_to_global(self.num_stages - 1)
    assert src_rank in self.grid.pp_group

    if self.global_rank == src_rank:
        result = data.clone().detach()  # fp16 tensor when FP16 is enabled
    else:
        result = torch.Tensor([0.]).type(dtype).to(self.device)  # fp32 buffer

    # This broadcasts an fp16 tensor into fp32 buffers: the receiving ranks
    # reinterpret the raw bytes, so the result is noise.
    dist.broadcast(tensor=result, src=src_rank, group=self.mpu.get_pipe_parallel_group())

    return result
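The "noise" in the comment above comes down to raw byte reinterpretation: collectives such as NCCL's broadcast copy bytes without any dtype conversion, so bytes that encode fp16 values get read back as fp32 on the receiving ranks. A minimal standalone sketch of that reinterpretation (plain Python, no torch or distributed setup required; the specific values are illustrative):

```python
import struct

# Pack two IEEE-754 binary16 (fp16) values into 4 bytes, then read those
# same 4 bytes back as a single binary32 (fp32) value -- the same kind of
# mismatch as broadcasting an fp16 tensor into an fp32 receive buffer.
raw = struct.pack("<ee", 1.5, 2.5)           # 'e' = half precision, 4 bytes total
(reinterpreted,) = struct.unpack("<f", raw)  # same bytes read as fp32

print(reinterpreted)  # ~8.015, unrelated to either 1.5 or 2.5
```

The bytes are perfectly valid fp32 data, just meaningless as a loss value, which is why the non-source stages see plausible-looking but garbage numbers rather than an error.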

Environment:

  • torch 1.13.1
  • cuda 11.7
  • GPU A100 40GB + driver 450.80.02
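One way to make the dtypes agree (an illustrative sketch only; I have not compared it against the commits actually merged in this PR) is to cast the source-rank tensor to the broadcast dtype before the collective, so every rank passes a tensor of the same dtype:

```diff
     if self.global_rank == src_rank:
-        result = data.clone().detach()
+        result = data.clone().detach().type(dtype).to(self.device)
     else:
         result = torch.Tensor([0.]).type(dtype).to(self.device)
```

With both sides using `dtype` (fp32 by default), the broadcast copies bytes that mean the same thing on sender and receivers.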

@nrailg force-pushed the nrwu/fixppevalbatchdtype branch from 425b16c to 7a632c6 on April 21, 2023
@nrailg force-pushed the nrwu/fixppevalbatchdtype branch from 7a632c6 to d908937 on April 21, 2023
@nrailg (Contributor, Author) commented Apr 21, 2023

We have been working on LMs recently and ran into this problem; this PR is my attempt to fix it. @ShadenSmith @duli2012

@ShadenSmith (Contributor) left a review comment:

Thank you!!!

@jeffra merged commit b0d9c4d into microsoft:master on May 2, 2023