
torchvision.datasets.ImageNet() race condition when multi-process training launched with root containing only *.tar *.tar.gz downloaded files #8707

@EIFY


🐛 Describe the bug

Hi,

When I launched multi-process training (8x A100) using torchvision.datasets.ImageNet() with a freshly prepared root (i.e. containing only ILSVRC2012_devkit_t12.tar.gz, ILSVRC2012_img_train.tar, and ILSVRC2012_img_val.tar), I got errors like this:

```
$ NUMEXPR_MAX_THREADS=116 $PYTHON $MUPVIT_MAIN /data/ImageNet/ --workers $N_WORKERS --multiprocessing-distributed --batch-size 1024 --log-steps 100
Use GPU: 4 for training
Use GPU: 2 for training
Use GPU: 6 for training
Use GPU: 1 for training
Use GPU: 5 for training
Use GPU: 0 for training
Use GPU: 7 for training
Use GPU: 3 for training
[rank0]:[W1101 02:49:23.260690939 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1101 02:49:24.409000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336321 via signal SIGTERM
W1101 02:49:24.409000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336322 via signal SIGTERM
W1101 02:49:24.411000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336323 via signal SIGTERM
W1101 02:49:24.413000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336324 via signal SIGTERM
W1101 02:49:24.417000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336326 via signal SIGTERM
W1101 02:49:24.418000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336327 via signal SIGTERM
W1101 02:49:24.422000 336226 torch/multiprocessing/spawn.py:160] Terminating process 336328 via signal SIGTERM
Traceback (most recent call last):
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 713, in <module>
    main()
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 189, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args, ))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/ubuntu/Downloads/mup-vit/main.py", line 337, in main_worker
    train_dataset = datasets.ImageNet(args.data, split='train', transform=v2.Compose(transform))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 53, in __init__
    self.parse_archives()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 70, in parse_archives
    parse_train_archive(self.root)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/imagenet.py", line 183, in parse_train_archive
    extract_archive(archive, os.path.splitext(archive)[0], remove_finished=True)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 362, in extract_archive
    suffix, archive_type, compression = _detect_file_type(from_path)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 268, in _detect_file_type
    raise RuntimeError(
RuntimeError: File '/data/ImageNet/train/n02104365' has no suffixes that could be used to detect the archive type and compression.
```

The other errors all complain that certain files either already exist or don't exist yet. I am fairly sure this is an untar race condition: each spawned worker calls datasets.ImageNet(), which extracts the archives, so all eight processes race to untar into the same directories. I worked around it by deleting all the intermediate files (meta.bin and the train/ and val/ folders) and forcing a single-process training launch to extract and place the files first.
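A race-free pattern for this kind of launch is to let exactly one process do the extraction while the others wait. The sketch below simulates it with stdlib multiprocessing so it is self-contained: the marker file stands in for torchvision's archive extraction, and `worker`, `run_demo`, and `NUM_RANKS` are illustrative names. In real DDP code the barrier would be `torch.distributed.barrier()` and the rank-0 step would be the actual `datasets.ImageNet(...)` call.

```python
import multiprocessing as mp
import os
import tempfile

def worker(rank, barrier, root):
    marker = os.path.join(root, "meta.bin")  # proxy for the extracted dataset
    if rank == 0:
        with open(marker, "w") as f:  # only one process "extracts"
            f.write("done")
    barrier.wait()  # no rank proceeds until rank 0 has finished
    assert os.path.exists(marker), f"rank {rank} raced past the extraction"

def run_demo(num_ranks=4):
    with tempfile.TemporaryDirectory() as root:
        barrier = mp.Barrier(num_ranks)
        procs = [mp.Process(target=worker, args=(r, barrier, root))
                 for r in range(num_ranks)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # every rank exited cleanly, i.e. saw the extracted data
        return all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    assert run_demo()
```

Without the `barrier.wait()`, non-zero ranks can reach the dataset directory mid-extraction, which is exactly the half-extracted state the traceback above shows.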

It might be difficult to detect a distributed training launch like this, but can we at least add a warning to the documentation? A tool to extract and place the files beforehand, like python3 -m big_vision.tools.download_tfds_datasets imagenet2012, would also be helpful. This is something of a flip side of #2023.
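In the meantime, such a one-time preparation step is easy to script. A minimal sketch (the file name prepare_imagenet.py is made up; it assumes torchvision is installed and the three archives sit in the given root):

```python
"""Hypothetical prepare_imagenet.py: run once, single-process, before any
distributed launch. Constructing ImageNet for each split extracts and
arranges the archives; once meta.bin and the extracted folders exist,
later constructions on every rank skip the extraction step."""
import sys

def prepare(root):
    # imported lazily so the script fails with a clear error if torchvision
    # is missing, rather than at module import time
    from torchvision import datasets
    for split in ("train", "val"):
        datasets.ImageNet(root, split=split)

if __name__ == "__main__" and len(sys.argv) == 2:
    prepare(sys.argv[1])
```

Invoked as `python3 prepare_imagenet.py /data/ImageNet/`, after which the multi-process launch only reads the already-extracted tree.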

Versions

The output below shows PyTorch version: 2.3.1 and torchvision==0.18.1, but I am actually running

```
$ pip freeze
(...)
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
(...)
```

I don't know why python3 -mpip list --format=freeze finds the old packages.
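One generic way to narrow down that kind of mismatch (a sketch, not specific to this machine) is to print which interpreter and site-packages directory actually resolve at runtime; if `pip` and `python3` point at different installations, `pip list` can report stale versions:

```python
import sys
import sysconfig

# Print the interpreter and the site-packages directory this python3
# resolves to; comparing them against `pip -V` shows whether pip and the
# runtime are looking at the same installation.
interpreter = sys.executable
site_packages = sysconfig.get_paths()["purelib"]
print(interpreter)
print(site_packages)
```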

```
Collecting environment information...
PyTorch version: 2.3.1
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 550.90.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 124
On-line CPU(s) list: 0-123
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7542 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 124
Stepping: 0
BogoMIPS: 5800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 7.8 MiB (124 instances)
L1i cache: 7.8 MiB (124 instances)
L2 cache: 62 MiB (124 instances)
L3 cache: 1.9 GiB (124 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-123
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flake8==4.0.1
[pip3] numpy==1.21.5
[pip3] optree==0.12.1
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] triton==2.3.1
[conda] Could not collect
```
