Add support for TE MXFP8 recipe in accelerate #3688
Conversation
This is outside the initial scope for this PR, but there's some oddity when using DeepSpeed + FP8 + the HF Trainer. If you set …, and if you omit it, …. Interestingly, it's still possible to use FP8 with DeepSpeed currently, but it seems like a bug. This check:

```python
if (
    AcceleratorState._shared_state != {}
    and AcceleratorState().distributed_type == DistributedType.DEEPSPEED
):
```

There have been a number of "FP8 + DeepSpeed" PRs here in the past; I'm wondering if the cleanest option is to separate `mixed_precision` from `fp8`. FP8 typically uses bf16 for the model weights and for the activations passed between FP8-enabled layers anyway.
Formatting-only change; not sure why it's being changed from main.
Do I understand correctly that this only covers DeepSpeed?
No, this lets you pass …. I think ultimately the complication with DeepSpeed is that there's a single … (see accelerate/benchmarks/fp8/transformer_engine/ddp.py, lines 70 to 71 at 23cf4ef).
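As a rough illustration of that non-DeepSpeed path, here is a hypothetical sketch of wiring a TE recipe through `Accelerator` kwargs handlers. The class and argument names, including `use_mxfp8_block_scaling`, are assumptions based on current accelerate and this PR's description, and may not match the final API:

```python
# Hypothetical sketch only: exact class/argument names may differ from this PR.
import torch
from accelerate import Accelerator
from accelerate.utils import TERecipeKwargs  # older accelerate versions use FP8RecipeKwargs(backend="TE")

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters())

accelerator = Accelerator(
    mixed_precision="fp8",
    # use_mxfp8_block_scaling is the new flag described in this PR; where it lives is assumed here.
    kwargs_handlers=[TERecipeKwargs(fp8_format="HYBRID", use_mxfp8_block_scaling=True)],
)
model, optimizer = accelerator.prepare(model, optimizer)
```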
@akakakakakaa, I'm not sure I've fully fixed the FP8 + DeepSpeed bugs in this PR; this is mainly focused on trying to get MXFP8 support with TE enabled.
SunMarc left a comment
Thanks for this clean PR. Indeed, for DeepSpeed and FP8, some cleaning is still required. I think it could make sense to separate `fp8` from `mixed_precision`. As you said, those should be orthogonal.
What does this PR do?
Adds support for the MXFP8 format in TE. See the TE docs pages for more background:
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling
This adds an additional fp8_recipe argument, `use_mxfp8_block_scaling`, that switches the recipe from `DelayedScaling` to `MXFP8BlockScaling`.
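For reference, a sketch of the two Transformer Engine recipes this flag selects between, written against the public TE API described in the docs linked above rather than against this PR's code:

```python
# Sketch of selecting between the two TE recipes. MXFP8 block scaling needs a
# recent TE release and Blackwell-class hardware.
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Default path: per-tensor delayed scaling.
delayed = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID, amax_history_len=16)

# Path enabled by use_mxfp8_block_scaling: MXFP8 block scaling.
mxfp8 = recipe.MXFP8BlockScaling()

with te.fp8_autocast(enabled=True, fp8_recipe=mxfp8):
    ...  # forward pass of TE modules runs with MXFP8 block scaling
```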
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.