avoid setting device_id to init_process_group
#7542
Conversation
Signed-off-by: Liu, Kaixuan <[email protected]>
@delock, please take a look, thanks.
If you undo #7266 you will get a rain of warnings (per rank) from recent PyTorch versions that collectives could be doing the wrong thing and hang; see the PR I linked to. This is not a bug to fix. Some other approach needs to be used to address your need without breaking the main use case. I will defer to @tjruwase on design. Most likely some flag needs to be passed to decide what to do.
Hi @stas00, I read your PR, and it seems you hit a hang issue when adding …
You misunderstood the purpose of #7266.
Now you're requesting that …
What I'm suggesting, in order to meet everybody's needs, is to add a new config variable that will control this behavior; by default it should set …
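To make the proposal concrete, here is a minimal sketch of what such a switch could look like (the flag name and wiring are hypothetical, not the actual DeepSpeed config):

```python
import torch
import torch.distributed as dist

def init_distributed(backend: str, bind_device_id: bool = True):
    """Hypothetical helper: only bind a device when the caller asks for it."""
    kwargs = {"backend": backend}
    if bind_device_id and torch.cuda.is_available():
        # Passing device_id (PyTorch >= 2.3) silences the per-rank warnings
        # about collectives running without a bound device, which is what
        # #7266 relies on.
        kwargs["device_id"] = torch.device("cuda", torch.cuda.current_device())
    # Callers that later need CPU/gloo sub-groups would pass bind_device_id=False.
    dist.init_process_group(**kwargs)
```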
@kaixuanliu what will we encounter if we set …
@delock it will try to create a new group using the gloo backend on XPU and crash; here is the log: …
Wait a sec, are you saying the problem happens when you run on …? If so, why not check if the hw is XPU and then not set …? For example, does the problem go away if you set …?
@stas00, on CUDA this will not crash, as CUDA also supports …
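For context, the sequence being discussed is roughly the following (backend and device names are assumptions, and the usual env:// rendezvous variables are presumed to be set):

```python
import torch
import torch.distributed as dist

# Step 1: initialize with a bound accelerator device, as #7266 does.
dist.init_process_group(backend="nccl",
                        device_id=torch.device("cuda", 0))

# Step 2: later create a CPU group for metadata exchange, as vLLM does.
# With device_id bound, this is where the reported crash happens on XPU,
# while CUDA tolerates the extra gloo group.
cpu_group = dist.new_group(backend="gloo")
```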
Do you mean you wish to call …
After consideration, we think it's best to solve this bug on the PyTorch side. And here we can make a WA to add …
I think your proposal is a good start, @kaixuanliu - we can always expand the use case as we go. Does WA stand for workaround? Could you write it out in the comment as a full word, as most readers won't know what WA stands for.
You can run … The test failing in modal-torch-latest is unrelated - you can ignore it.
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: Liu, Kaixuan <[email protected]>
stas00 left a comment
Looks good now.
Signed-off-by: Liu, Kaixuan <[email protected]>
Signed-off-by: Liu, Kaixuan <[email protected]>
In some use cases such as vLLM, we need to create new distributed groups not only on GPU but also on CPU; if we set `device_id` here, it prevents us from creating a distributed group on CPU: [L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230). This PR fixes this bug.

---------

Signed-off-by: Liu, Kaixuan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
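For reference, the pattern this change is meant to unblock looks roughly like this (a sketch of the vLLM-style usage, not DeepSpeed's actual code; env:// rendezvous assumed):

```python
import torch.distributed as dist

# Without device_id bound at init time, both an accelerator group and a
# CPU (gloo) group can be created afterwards.
dist.init_process_group(backend="nccl")

ranks = list(range(dist.get_world_size()))
device_group = dist.new_group(ranks, backend="nccl")  # collectives on the accelerator
cpu_group = dist.new_group(ranks, backend="gloo")     # CPU-side group, as in vLLM's parallel_state.py
```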