Insights: microsoft/DeepSpeed
September 21, 2024 – September 28, 2024
Overview
12 Pull requests merged by 8 people
-
Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers
#6584 merged
Sep 27, 2024 -
Fixes on the accelerate side mean we do not need to skip this test
#6583 merged
Sep 27, 2024 -
add bfloat16 to inference support dtypes
#6528 merged
Sep 27, 2024 -
[COMPILE] workflow for deepspeed + torch.compile
#6570 merged
Sep 27, 2024 -
Add APIs to offload states of model, optimizer, and engine
#6011 merged
Sep 27, 2024 -
[XPU] Support DeepNVMe new code structure
#6532 merged
Sep 27, 2024 -
fix errors when setting zero3 leaf modules with torch.compile
#6564 merged
Sep 26, 2024 -
Fix gradient accumulation for Z2+offload
#6550 merged
Sep 26, 2024 -
[Accelerator] Cambricon MLU support
#6472 merged
Sep 26, 2024 -
DeepNVMe perf tuning
#6560 merged
Sep 26, 2024 -
Use msgpack for p2p comm
#6547 merged
Sep 26, 2024 -
Skip failing newly added tests in accelerate
#6574 merged
Sep 25, 2024
7 Pull requests opened by 6 people
-
Move LR Step
#6575 opened
Sep 26, 2024 -
Add llama3.2 vision autotp
#6577 opened
Sep 27, 2024 -
[DO NOT MERGE] Debug leaked semaphore
#6578 opened
Sep 27, 2024 -
Support safetensors export
#6579 opened
Sep 27, 2024 -
[CI] add CI for Huawei Ascend NPU
#6580 opened
Sep 27, 2024 -
Add API to get devices of offload states
#6586 opened
Sep 27, 2024 -
[ROCm] Fix subprocess error
#6587 opened
Sep 27, 2024
13 Issues closed by 6 people
-
no-torch CI test failure
#6576 closed
Sep 27, 2024 -
[BUG] Training time regression with ZeRO-3 after upgrade to torch 2.3.1 and CUDA 12.1
#5844 closed
Sep 27, 2024 -
[BUG] `GatheredParameters` resulting in NCCL mismatched collectives error
#6492 closed
Sep 27, 2024 -
Possible bug with layernorm kernel
#581 closed
Sep 26, 2024 -
Inference with the MoE based GPT model trained by ds_pretrain_gpt_345M_MoE128.sh [BUG]
#5647 closed
Sep 26, 2024 -
does DeepSpeed support AMSP (a new DP shard strategy)
#5661 closed
Sep 26, 2024 -
[BUG] Training gets stuck when model starts training
#4443 closed
Sep 26, 2024 -
# 🚀 Feature request
#6561 closed
Sep 25, 2024 -
[BUG] When using Zero-Infinity, Assertion `n_completes >= min_completes' failed
#4888 closed
Sep 25, 2024 -
[BUG] Gradient accumulation causing training loss differences in Deepspeed vs FSDP
#5898 closed
Sep 25, 2024 -
[BUG] AVX2 support for AdamCPU with DeepSpeed 0.14.2
#6363 closed
Sep 25, 2024 -
Something get wrong when run “aio_” and "gds_" file(DeepNVMe)
#6566 closed
Sep 24, 2024
9 Issues opened by 9 people
-
[BUG] subprocess.CalledProcessError
#6585 opened
Sep 27, 2024 -
[BUG] DeepSpeed Ulysses zero3 compatibility
#6582 opened
Sep 27, 2024 -
LSB_AFFINITY_HOSTFILE could not be found
#6581 opened
Sep 27, 2024 -
[BUG] AttributeError: 'NoneType' object has no attribute 'set_moe'
#6572 opened
Sep 25, 2024 -
[BUG] ValueError: Tensors must be contiguous when using deepspeed.initialize
#6571 opened
Sep 25, 2024 -
[BUG] The learning rate scheduler is being ignored in the first optimization step.
#6569 opened
Sep 25, 2024 -
Something get wrong when run “aio_” and "gds_" file(DeepNVMe)
#6567 opened
Sep 24, 2024 -
[request] Install error on Windows
#6563 opened
Sep 24, 2024
29 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Adding the new feature of FPDT
#6462 commented on
Sep 27, 2024 • 4 new comments -
Enabled configurable auto Tensor Parallelism (TP) for the inference of diverse models
#6553 commented on
Sep 25, 2024 • 1 new comment -
Clean up prefetched parameters
#6557 commented on
Sep 27, 2024 • 0 new comments -
Improve consistency of zero_grad
#6554 commented on
Sep 27, 2024 • 0 new comments -
Enabled Qwen2-MoE Tensor Parallelism (TP) inference
#6551 commented on
Sep 27, 2024 • 0 new comments -
Fix expert grad scaling problem with ZeRO optimizer
#6546 commented on
Sep 27, 2024 • 0 new comments -
reduce setting global variables to reduce torch compile graph breaks
#6541 commented on
Sep 27, 2024 • 0 new comments -
Set shuffle=True by default in data_sampler
#6531 commented on
Sep 27, 2024 • 0 new comments -
Fix device selection using CUDA_VISIBLE_DEVICES
#6530 commented on
Sep 27, 2024 • 0 new comments -
Handle when `backend` is also in compile_kwargs
#6502 commented on
Sep 27, 2024 • 0 new comments -
add option to disable logger while compiling to avoid graph breaks
#6496 commented on
Sep 27, 2024 • 0 new comments -
Unpin tests that previously used a pinned version of transformers
#6387 commented on
Sep 27, 2024 • 0 new comments -
[NaN check] Add NaN check to support bfloat16.
#5879 commented on
Sep 27, 2024 • 0 new comments -
inference: remove unused _validate_args function
#5505 commented on
Sep 26, 2024 • 0 new comments -
Rearrange inference OPS and stop using builder.load
#5490 commented on
Sep 27, 2024 • 0 new comments -
Fix training of pipeline based peft's lora model
#5477 commented on
Sep 23, 2024 • 0 new comments -
[BUG] CUDA error: no kernel image is available for execution on the device
#6549 commented on
Sep 27, 2024 • 0 new comments -
[BUG] Pipeline Dataloader Sampler: `shuffle=False`
#5619 commented on
Sep 27, 2024 • 0 new comments -
CUDA Graphs + ZeRO-3 / TP+PP
#6552 commented on
Sep 27, 2024 • 0 new comments -
[BUG]RuntimeError: disagreement between rank0 and rank1: rank0:
#5799 commented on
Sep 27, 2024 • 0 new comments -
[BUG] Learning rate scheduler and optimizer logical issue
#5731 commented on
Sep 26, 2024 • 0 new comments -
how to set "training_step" during training?
#5779 commented on
Sep 26, 2024 • 0 new comments -
[BUG] Universal checkpoint conversion failed
#5822 commented on
Sep 26, 2024 • 0 new comments -
[BUG] CUDA_VISIBLE_DEVICES is not parsed correctly
#5278 commented on
Sep 25, 2024 • 0 new comments -
[BUG] Concern around mixed precision training where weights are in low precision
#5307 commented on
Sep 24, 2024 • 0 new comments -
[BUG] AttributeError: 'NoneType' object has no attribute 'swap_folder'
#4998 commented on
Sep 24, 2024 • 0 new comments -
[REQUEST] A minimal example to load universal checkpoint
#6548 commented on
Sep 23, 2024 • 0 new comments -
nv-nightly CI test failure
#6529 commented on
Sep 23, 2024 • 0 new comments -
nv-ds-chat CI test failure
#5616 commented on
Sep 23, 2024 • 0 new comments