Showing content from https://github.com/pytorch/pytorch/issues/150852 below:

NCCL init hits CUDA failure 'invalid argument' on 12.2 driver · Issue #150852 · pytorch/pytorch · GitHub

🐛 Describe the bug

Error seen with nightly build, e.g. torch==2.8.0.dev20250327+cu126

[2025-04-08 08:39:46] devgpu263:589012:591652 [0] transport/nvls.cc:254 NCCL WARN Cuda failure 1 'invalid argument'
devgpu263:589012:591652 [0] NCCL INFO transport/nvls.cc:409 -> 1
devgpu263:589012:591652 [0] NCCL INFO init.cc:1141 -> 1
devgpu263:589012:591652 [0] NCCL INFO init.cc:1409 -> 1
devgpu263:589012:591652 [0] NCCL INFO group.cc:75 -> 1 [Async thread]
devgpu263:589012:589012 [0] NCCL INFO group.cc:422 -> 1
devgpu263:589012:589012 [0] NCCL INFO group.cc:581 -> 1
devgpu263:589012:589012 [0] NCCL INFO init.cc:1836 -> 1
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/kw2501/local/driver_12.2/repro.py", line 12, in <module>
[rank0]:     dist.all_reduce(x)
[rank0]:   File "/home/kw2501/.conda/envs/titan/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/kw2501/.conda/envs/titan/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2868, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 1 'invalid argument'
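The NCCL lines above read as a chain: one WARN line names the call site that failed, and the following `-> 1` INFO lines show the error code propagating up through the callers. A minimal sketch of pulling that chain out of a captured log is shown here; the function name `summarize_nccl_failure` and both regular expressions are illustrative assumptions fitted to the line shapes in this report, not an NCCL-provided format specification.

```python
import re

# Fitted to lines such as:
#   "... [0] transport/nvls.cc:254 NCCL WARN Cuda failure 1 'invalid argument'"
WARN_RE = re.compile(r"(?P<file>[\w/.]+\.cc):(?P<line>\d+) NCCL WARN (?P<msg>.*)$")
# Fitted to propagation lines such as: "... NCCL INFO init.cc:1141 -> 1"
TRACE_RE = re.compile(r"NCCL INFO (?P<file>[\w/.]+\.cc):(?P<line>\d+) -> (?P<code>\d+)")


def summarize_nccl_failure(log_text):
    """Return (origin, chain): the first WARN site and the INFO sites that propagated it."""
    origin = None
    chain = []
    for raw in log_text.splitlines():
        line = raw.strip()
        warn = WARN_RE.search(line)
        if warn and origin is None:
            origin = (warn["file"], int(warn["line"]), warn["msg"])
            continue
        trace = TRACE_RE.search(line)
        if trace:
            chain.append((trace["file"], int(trace["line"])))
    return origin, chain


sample = """\
[2025-04-08 08:39:46] devgpu263:589012:591652 [0] transport/nvls.cc:254 NCCL WARN Cuda failure 1 'invalid argument'
devgpu263:589012:591652 [0] NCCL INFO transport/nvls.cc:409 -> 1
devgpu263:589012:591652 [0] NCCL INFO init.cc:1141 -> 1
"""
origin, chain = summarize_nccl_failure(sample)
print(origin)
print(chain)
```

For this log the origin is the WARN at transport/nvls.cc:254, which is the line examined further down in the report.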

Mini repro:

import os
import torch
import torch.distributed as dist


if __name__ == "__main__":
    rank = int(os.getenv("RANK"))
    world_size = int(os.getenv("WORLD_SIZE"))
    device = torch.device("cuda", rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.empty(4, device=device)
    dist.all_reduce(x)
    print(x)

Command line:

NCCL_DEBUG=INFO torchrun --standalone --nproc-per-node 4 repro.py
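Since the warning originates in transport/nvls.cc, one possible diagnostic is to rerun the same command with NVLS turned off through NCCL's `NCCL_NVLS_ENABLE` environment variable. The sketch below only builds the environment and the command; whether disabling NVLS actually avoids this failure is an assumption and was not tested in this report. The helper names `nvls_disabled_env` and `torchrun_command` are hypothetical.

```python
import os
import subprocess


def nvls_disabled_env(base_env=None):
    """Copy the environment, keeping NCCL debug output on and turning NVLS off."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_NVLS_ENABLE"] = "0"  # assumption: untested against this failure
    return env


def torchrun_command(script="repro.py", nproc=4):
    """Build the same torchrun invocation used in the report."""
    return ["torchrun", "--standalone", "--nproc-per-node", str(nproc), script]


if __name__ == "__main__":
    cmd = torchrun_command()
    print("NCCL_NVLS_ENABLE=0", " ".join(cmd))
    # Uncomment on a multi-GPU host to launch the repro:
    # subprocess.run(cmd, env=nvls_disabled_env(), check=True)
```

If the run succeeds with NVLS disabled, that would point at the NVLS transport setup rather than at the collective itself.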

Fails with 12.2 driver:
Driver Version: 535.154.05 CUDA Version: 12.2

Works with 12.4 driver:
Driver Version: 550.90.07 CUDA Version: 12.4

torch built w/   | Run w/ 12.2 driver | Run w/ 12.4 driver or higher
-----------------+--------------------+-----------------------------
12.2 toolkit     | Works              | Not tested
12.6 toolkit     | Fails              | Works
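The matrix above can be turned into a small pre-flight lookup. The sketch below parses the CUDA version from an `nvidia-smi` header line and reports what this issue observed for that toolkit/driver pair. It encodes only the four cells of the table: anything else, including a 12.3 driver, comes back as "not tested". The names `driver_cuda_version` and `reported_outcome` are hypothetical, and the toolkit string is assumed to look like the value of `torch.version.cuda` (for example "12.6").

```python
import re

SMI_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<major>\d+)\.(?P<minor>\d+)"
)


def driver_cuda_version(smi_line):
    """Extract the (major, minor) CUDA version that the driver reports."""
    m = SMI_RE.search(smi_line)
    if m is None:
        raise ValueError(f"unrecognized nvidia-smi line: {smi_line!r}")
    return int(m["major"]), int(m["minor"])


def reported_outcome(toolkit_version, smi_line):
    """Look up the outcome this report observed for a toolkit/driver pair."""
    major, minor = toolkit_version.split(".")[:2]
    toolkit = (int(major), int(minor))
    driver = driver_cuda_version(smi_line)
    if toolkit == (12, 2) and driver == (12, 2):
        return "works"
    if toolkit == (12, 6):
        if driver == (12, 2):
            return "fails"
        if driver >= (12, 4):
            return "works"
    return "not tested"  # outside the combinations covered by the report


print(reported_outcome("12.6", "Driver Version: 535.154.05 CUDA Version: 12.2"))
print(reported_outcome("12.6", "Driver Version: 550.90.07 CUDA Version: 12.4"))
```

Note that a plain "toolkit newer than driver" comparison would not reproduce the table, because the 12.6 build works on the 12.4 driver.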

Line 254 of transport/nvls.cc, the call site named in the NCCL WARN line above:

CUCHECKGOTO(cuMulticastBindMem(*mcHandle, 0/*mcOffset*/, *ucHandle, 0/*memOffset*/, ucsize, 0/*flags*/), ret, fail);

Versions

[kw2501@devgpu263.prn2 ~]$ python collect_env.py 
Collecting environment information...
PyTorch version: 2.8.0.dev20250327+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: Could not collect
CMake version: version 3.26.5
Libc version: glibc-2.34

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.3-0_fbk20_zion_2830_g3e5ab162667d-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100
GPU 4: NVIDIA H100
GPU 5: NVIDIA H100
GPU 6: NVIDIA H100
GPU 7: NVIDIA H100

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.8.0
/usr/lib64/libcudnn.so.9.6.0
/usr/lib64/libcudnn_adv.so.9.6.0
/usr/lib64/libcudnn_adv_infer.so.8.8.0
/usr/lib64/libcudnn_adv_train.so.8.8.0
/usr/lib64/libcudnn_cnn.so.9.6.0
/usr/lib64/libcudnn_cnn_infer.so.8.8.0
/usr/lib64/libcudnn_cnn_train.so.8.8.0
/usr/lib64/libcudnn_engines_precompiled.so.9.6.0
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.6.0
/usr/lib64/libcudnn_graph.so.9.6.0
/usr/lib64/libcudnn_heuristic.so.9.6.0
/usr/lib64/libcudnn_ops.so.9.6.0
/usr/lib64/libcudnn_ops_infer.so.8.8.0
/usr/lib64/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             384
On-line CPU(s) list:                0-383
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9654 96-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 96
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU(s) scaling MHz:                 84%
CPU max MHz:                        3707.8120
CPU min MHz:                        1500.0000
BogoMIPS:                           4792.81
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          6 MiB (192 instances)
L1i cache:                          6 MiB (192 instances)
L2 cache:                           192 MiB (192 instances)
L3 cache:                           768 MiB (24 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-95,192-287
NUMA node1 CPU(s):                  96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch-triton==3.3.0+git96316ce5
[pip3] torch==2.8.0.dev20250327+cu126
[pip3] torchaudio==2.6.0.dev20250329+cu126
[pip3] torchdata==0.11.0
[pip3] torchtitan==0.0.2
[pip3] torchvision==0.22.0.dev20250329+cu126
[pip3] triton==3.2.0
[conda] numpy                     2.1.2                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.6.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.6.80                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.6.77                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.6.77                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.5.1.17                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.0.4                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.7.77                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.1.2                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.4.2                 pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.3                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.26.2                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.85                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.6.77                  pypi_0    pypi
[conda] pytorch-triton            3.3.0+git96316ce5          pypi_0    pypi
[conda] torch                     2.8.0.dev20250327+cu126          pypi_0    pypi
[conda] torchaudio                2.6.0.dev20250329+cu126          pypi_0    pypi
[conda] torchdata                 0.11.0                   pypi_0    pypi
[conda] torchtitan                0.0.2                    pypi_0    pypi
[conda] torchvision               0.22.0.dev20250329+cu126          pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

