
Conversation

@kaixuanliu
Contributor

In some use cases, such as vLLM, we need to create new distributed groups not only on GPU but also on CPU. If we set device_id here, it prevents us from creating a new distributed group on CPU: L230. This PR fixes this bug.
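
For reference, a minimal sketch of the failing pattern (a hypothetical standalone repro, not vLLM's or DeepSpeed's actual code; backend and device names are illustrative). Binding the default process group to an accelerator device via device_id, then requesting a CPU-side gloo sub-group, is what breaks on XPU:

```python
# Hypothetical minimal repro of the pattern discussed in this PR.
# Assumes an env:// rendezvous (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ.get("LOCAL_RANK", rank))

# What happens after #7266: the default group is bound to a device.
# On an XPU machine the bound device would be torch.device("xpu", local_rank).
dist.init_process_group(
    backend="nccl",  # accelerator backend; a ccl/xccl variant on XPU setups
    rank=rank,
    world_size=world_size,
    device_id=torch.device("cuda", local_rank),  # this binding is the issue
)

# What vLLM does at parallel_state.py L230: create a CPU-side gloo group.
# With an XPU device bound above, PyTorch cannot resolve a gloo backend for
# that device and raises
# "RuntimeError: No backend type associated with device type xpu".
cpu_group = dist.new_group(list(range(world_size)), backend="gloo")
```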

@yao-matrix
Contributor

@delock , please help take a look, thanks.

@sfc-gh-truwase sfc-gh-truwase requested review from stas00 and removed request for GuanhuaWang September 3, 2025 23:45
@stas00
Collaborator

stas00 commented Sep 4, 2025

If you undo #7266 you will get a rain of warnings (per rank) from recent PyTorch versions that collectives could be doing the wrong thing and hang - see the PR I linked to.

This is not a bug to fix. Some other approach needs to be used to address your need without breaking the main use-case. I will defer to @tjruwase on design. Most likely some flag needs to be passed to decide what to do.

@kaixuanliu
Contributor Author

kaixuanliu commented Sep 4, 2025

Hi @stas00 , I read your PR, and it seems you hit a hang issue when adding the device_id argument with PyTorch versions from 2.6.0 up to 2.7.1, right? So the hang should not be a problem here, since this PR does not pass device_id. As for the warning issue, can you give a concrete example to reproduce it? Maybe I can also take a look.

@stas00
Collaborator

stas00 commented Sep 4, 2025

You misunderstood the purpose of #7266.

  1. PyTorch started issuing a warning around v2.7.0 and wanted torch.dist users to set device_id or else... So the PR addressed that issue.

  2. In the process of testing the fix, I and others found that setting device_id actually led to hangs, and it proved to be PyTorch version-specific (the issue really comes from how PyTorch interacts with NCCL). The problem was resolved in PyTorch 2.7.1; thus the PR carefully chooses when it's safe to set device_id, so that it does not cause the hang.

Now you're requesting that device_id not be set, which is a problem for the general case, since:

  1. PyTorch's warning is alarming to the user
  2. the hang might actually happen, though I have yet to see it myself

To meet everybody's needs, I propose adding a new config variable that controls this behavior: by default it should set device_id, and in those cases where it must not be set, the user can turn it off. Does that make sense?
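
As a rough illustration of that proposal (names are hypothetical, not DeepSpeed's actual config or API), the flag would simply gate whether device_id is forwarded:

```python
# Hypothetical sketch of a config-gated device_id; names are illustrative.
import torch
import torch.distributed as dist

def init_distributed(backend: str, rank: int, world_size: int,
                     local_rank: int, bind_device_id: bool = True) -> None:
    kwargs = dict(backend=backend, rank=rank, world_size=world_size)
    if bind_device_id and torch.cuda.is_available():
        # Default: keep the post-#7266 behaviour, which silences the
        # PyTorch >= 2.7 warning about unbound process groups.
        kwargs["device_id"] = torch.device("cuda", local_rank)
    # With bind_device_id=False, callers that later need CPU/gloo groups
    # (e.g. vLLM on XPU) avoid binding the default group to an accelerator.
    dist.init_process_group(**kwargs)
```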

@delock
Collaborator

delock commented Sep 4, 2025

@kaixuanliu what will we encounter if we set device_id when calling init_process_group on CPU?

@kaixuanliu
Contributor Author

kaixuanliu commented Sep 4, 2025

@delock it will try to create a new group using the gloo backend on XPU and crash; here is the log:

[rank7]: Traceback (most recent call last):
[rank7]:   File "/root/test_vllm.py", line 12, in <module>
[rank7]:     vllm_model = LLM(
[rank7]:                  ^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 270, in __init__
[rank7]:     self.llm_engine = LLMEngine.from_engine_args(
[rank7]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 490, in from_engine_args
[rank7]:     return engine_cls.from_vllm_config(
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 127, in from_vllm_config
[rank7]:     return cls(vllm_config=vllm_config,
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 104, in __init__
[rank7]:     self.engine_core = EngineCoreClient.make_client(
[rank7]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 82, in make_client
[rank7]:     return InprocClient(vllm_config, executor_class, log_stats)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 245, in __init__
[rank7]:     self.engine_core = EngineCore(*args, **kwargs)
[rank7]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 80, in __init__
[rank7]:     self.model_executor = executor_class(vllm_config)
[rank7]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
[rank7]:     self._init_executor()
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 128, in _init_executor
[rank7]:     self.collective_rpc("init_device")
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
[rank7]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank7]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3031, in run_method
[rank7]:     return func(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 603, in init_device
[rank7]:     self.worker.init_device()  # type: ignore
[rank7]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/xpu_worker.py", line 164, in init_device
[rank7]:     init_worker_distributed_environment(self.vllm_config, self.rank,
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 609, in init_worker_distributed_environment
[rank7]:     init_distributed_environment(parallel_config.world_size, rank,
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1025, in init_distributed_environment
[rank7]:     _WORLD = init_world_group(ranks, local_rank, backend)
[rank7]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 865, in init_world_group
[rank7]:     return GroupCoordinator(
[rank7]:            ^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 230, in __init__
[rank7]:     cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank7]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
[rank7]:     func_return = func(*args, **kwargs)
[rank7]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 5254, in new_group
[rank7]:     return _new_group_with_tag(
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 5344, in _new_group_with_tag
[rank7]:     pg, pg_store = _new_process_group_helper(
[rank7]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2123, in _new_process_group_helper
[rank7]:     if device_id and pg._get_backend(device_id).supports_splitting:
[rank7]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: RuntimeError: No backend type associated with device type xpu

@stas00
Collaborator

stas00 commented Sep 4, 2025

Wait a sec, are you saying the problem happens when you run on XPU? You didn't say that in the OP.

If so, why not check whether the hardware is XPU and skip setting device_id if it doesn't work there? Or am I misunderstanding the particulars of your use case?

For example, does the problem go away if you set export CUDA_VISIBLE_DEVICES= when you don't want the GPUs?

@kaixuanliu
Contributor Author

@stas00 , on CUDA this will not crash, as CUDA also supports gloo while XPU does not. However, L230 targets creating a new group on CPU; if we set device_id in init_process_group, it will actually create the new distributed group on CUDA, and I am not sure that is a good choice. I can accept adding an XPU check at L159, although it is somewhat weird... @delock @tjruwase WDYT? In the vLLM case I mentioned, we need both GPUs and the CPU, so it is not suitable to just export CUDA_VISIBLE_DEVICES=

@delock
Collaborator

delock commented Sep 4, 2025

Do you mean you wish to call init_process_group for CPU, but if device_id is not None, PyTorch will think you want to call init_process_group for XPU, and you get an error?

@delock it will try to create a new group using the gloo backend on XPU and crash; here is the log: (same traceback as quoted above)

@kaixuanliu
Contributor Author

@delock yes, please refer to the vLLM code at L226-L230: we need to create new groups on both CPU and XPU (CUDA). If we explicitly set device_id here, it will only create the new group on XPU (CUDA), which will cause an error.

@kaixuanliu
Contributor Author

After consideration, we think it's best to solve this bug on the PyTorch side. For now we can make a WA here to add device_id explicitly only for CUDA devices. Do you think it is OK? @stas00
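
A sketch of what that workaround could look like (illustrative only, with hypothetical helper names; the merged change lives in DeepSpeed's distributed-init path):

```python
# Sketch of the workaround: only bind device_id on CUDA, where CPU/gloo
# sub-groups still resolve; skip it on XPU and other accelerators.
# Names are illustrative, not the merged DeepSpeed code.
import torch
import torch.distributed as dist

def _init_process_group(backend: str, rank: int, world_size: int,
                        local_rank: int) -> None:
    kwargs = dict(backend=backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        # CUDA: safe to bind, and it silences the PyTorch >= 2.7 warning.
        kwargs["device_id"] = torch.device("cuda", local_rank)
    # Non-CUDA accelerators (e.g. XPU): leave device_id unset so that later
    # torch.distributed.new_group(..., backend="gloo") calls keep working.
    dist.init_process_group(**kwargs)
```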

@stas00
Collaborator

stas00 commented Sep 4, 2025

I think your proposal is a good start, @kaixuanliu - we can always expand the use case as we go.

Does WA stand for workaround? Could you write it out in the comment as a full word, as most readers won't know what WA stands for.

@stas00
Collaborator

stas00 commented Sep 4, 2025

You can run make format to get the formatting fixed for you automatically.

The test failing in modal-torch-latest is unrelated - you can ignore it.

@stas00
Collaborator

stas00 left a comment

Looks good now.

@stas00 stas00 enabled auto-merge (squash) September 5, 2025 06:03
@stas00 stas00 merged commit 08879a3 into deepspeedai:master Sep 5, 2025
12 checks passed
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
