@NirSonnenschein
Contributor

This change is intended to enable support for using a tensor learning rate (LR) value instead of a scalar one.
We found this helpful when the optimizer is torch.compiled: in such cases, changing a scalar LR value can trigger recompilation, which degrades performance.
The implementation lets the model script determine the type of LR value used by setting the initial value.
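
To illustrate the idea, here is a minimal sketch of a type-preserving LR update. It is not the code from this PR: `set_lr` is a hypothetical helper, and the plain dicts are stand-ins for `optimizer.param_groups`. The assumption is that an LR that starts as a tensor is updated in place, so the compiled optimizer keeps seeing the same tensor object, while an LR that starts as a scalar is simply reassigned.

```python
import torch


def set_lr(param_groups, new_lrs):
    """Hypothetical helper: write new LR values, keeping each group's LR type."""
    for group, new_lr in zip(param_groups, new_lrs):
        if isinstance(group["lr"], torch.Tensor):
            # Tensor LR: update in place so the tensor object itself is reused.
            group["lr"].fill_(new_lr)
        else:
            # Scalar LR: plain reassignment, as before.
            group["lr"] = new_lr


# Stand-ins for optimizer.param_groups (assumed shape: a list of dicts with an "lr" key).
tensor_groups = [{"lr": torch.tensor(1e-3)}]
scalar_groups = [{"lr": 1e-3}]

lr_before = tensor_groups[0]["lr"]
set_lr(tensor_groups, [5e-4])
set_lr(scalar_groups, [5e-4])
```

In this sketch the model script picks the behavior through the initial value alone: starting from `torch.tensor(1e-3)` keeps the LR a tensor for the whole run, and starting from `1e-3` keeps it a Python float.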

@NirSonnenschein
Contributor Author

Thanks @sfc-gh-truwase.
Small question: the CI failure doesn't seem to be related to the commit:
FAILED tests/unit/v1/zero/test_zero.py::TestZero3RepeatForwardLoop::test[True] - AttributeError: 'int' object has no attribute 'pt_reserved_cores_perc'
Is this a known issue?

@eternalNight
Contributor

> Thanks @sfc-gh-truwase small question: the CI failure doesn't seem to be related to the commit:
> FAILED tests/unit/v1/zero/test_zero.py::TestZero3RepeatForwardLoop::test[True] - AttributeError: 'int' object has no attribute 'pt_reserved_cores_perc'
> is this a known issue?

#7634 attempts to fix that, but it is blocked because the CI does not seem to be testing the right branch (yet).

@tohtana enabled auto-merge (squash) October 20, 2025 05:08
@tohtana merged commit 407708c into deepspeedai:master Oct 20, 2025
11 of 12 checks passed