Issues: microsoft/DeepSpeed
Issues list
[BUG] using deepspeed slower inference time
bug
Something isn't working
inference
#6818
opened Dec 4, 2024 by
williamlin0518
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6811
opened Dec 1, 2024 by
NirSonnenschein
[BUG] Enabling drop_tokens in MoE layer causes inference to hang
bug
Something isn't working
inference
#6809
opened Nov 29, 2024 by
Shamauk
[Questions] Why does Ulysses need all2all for QKV, but RingAttention just needs KV under context parallel?
#6808
opened Nov 29, 2024 by
elevenxiang
multiple runs on same machine, with ctrl+c, all runs are killed
#6807
opened Nov 29, 2024 by
ysyyork
[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU
bug
Something isn't working
training
#6806
opened Nov 28, 2024 by
rileyhun
[BUG] deepspeed inference for llama3.1 70b for 2 node, each node with 2 gpu
bug
Something isn't working
inference
#6805
opened Nov 28, 2024 by
rastinrastinii
[BUG] Unnecessary memory copy in parameter partition in ZeRO3
bug
Something isn't working
training
#6804
opened Nov 28, 2024 by
yingtongxiong
Use DS4Sci_EvoformerAttention and torch.util.checkpoint.checkpoint at the same time during training
#6802
opened Nov 28, 2024 by
cbyzju
Question about using Autotuner with ZeRO and tensor parallelism
#6796
opened Nov 27, 2024 by
rlanday
AssertionError: no sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
#6793
opened Nov 26, 2024 by
66RomanReigns
[REQUEST] Let ZeRO-offload use CPU and GPU in parallel
enhancement
New feature or request
#6778
opened Nov 23, 2024 by
fzyzcjy
[BUG] [Fix Suggestion] Uneven head sequence parallelism
bug
Something isn't working
training
#6774
opened Nov 21, 2024 by
Eugene29
[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs
bug
Something isn't working
training
#6772
opened Nov 20, 2024 by
traincheck-team
[BUG] [Fix-Suggested] Checkpoint Inconsistency When Freezing Model Parameters Before deepspeed.initialize
#6771
opened Nov 20, 2024 by
traincheck-team
[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch
#6770
opened Nov 20, 2024 by
traincheck-team
[BUG] clip_grad_norm for zero_optimization mode is not working
bug
Something isn't working
training
#6767
opened Nov 20, 2024 by
chengmengli06