Describe the bug
Programs get blocked when using multiple nodes. By setting export LOG_LEVEL=DEBUG, I can see that it gets stuck at BaguaSingleCommunicator, since it prints

2022-11-21T12:40:23.673510Z DEBUG bagua_core_internal::communicators: creating communicator, nccl_unique_id AgCwgcCQEwkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, rank 8, nranks 16, device_id 0, stream_ptr 94639511762624

but fails to print the "communicator initialized at XXX" line. When I set --node_rank=0, the program can run smoothly.
Environment
Your operating system and version: Linux node-162 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Your python version: Python 3.8.13 (default, Mar 28 2022, 11:38:47)
Your PyTorch version: 1.12.1
NCCL version: 2.10.3
How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes
@zhaone
You need to set node_rank when using multiple nodes. If you only use one node, you can omit these parameters: node_rank/master_addr/master_port.
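For reference, a multi-node launch passes the same master_addr/master_port to every node and a distinct node_rank per node. The sketch below assumes a torch.distributed.launch-style Bagua launcher and 2 nodes with 8 GPUs each (consistent with rank 8 / nranks 16 in the debug log above); train.py, the address, and the port are placeholders, not values from this issue:

```shell
# Hypothetical 2-node x 8-GPU launch; flags mirror torch.distributed.launch.
# On node 0 (the one running at master_addr):
python3 -m bagua.distributed.launch \
    --nnodes=2 --nproc_per_node=8 \
    --node_rank=0 \
    --master_addr="10.0.0.1" --master_port=29500 \
    train.py

# On node 1, change only --node_rank:
python3 -m bagua.distributed.launch \
    --nnodes=2 --nproc_per_node=8 \
    --node_rank=1 \
    --master_addr="10.0.0.1" --master_port=29500 \
    train.py
```

If communicator creation hangs like in this report, one common cause is that the worker nodes cannot reach master_addr:master_port, so verifying network connectivity between the nodes is a reasonable first check.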
Reproducing
Please provide a minimal working example. This means the runnable code.
Please also write what exact commands are required to reproduce your results.
Additional context
Add any other context about the problem here.