Sometimes I get dataset errors when using the Lightning module in a distributed manner

@justusschock

Bug description

I use a LightningDataModule. In this module I initialize (according to your tutorials) a torch dataset:

class CustomImageDataset(Dataset):
    # Torch dataset to handle basic file operations
    ...

class DataModule(L.LightningDataModule):
    # Lightning DataModule to handle dataloaders and train/test split
    dset = CustomImageDataset()

In most cases it works perfectly fine, but sometimes I get an error when my training initializes, which forces me to restart it until the error no longer appears. This only happens in distributed training.

It happens when I read in my dataset in the CustomImageDataset() by using a csv reader. The error is:

train.py 74 <module>
mydata.setup(stage="fit")

dataset.py 206 setup
self.train_set = self.create_dataset("train")

dataset.py 190 create_dataset
dset = CustomImageDataset(self.data_dir,

dataset.py 50 __init__
self.data_paths, self.targets = self._load_data()

dataset.py 59 _load_data
paths, targets = get_paths(self.data_dir, "train", self.seed)

dataset.py 22 get_paths
r = list(reader)

_csv.Error:
line contains NUL

Since the list conversion seems to trigger the error, I am a bit lost on how to solve it, but maybe you have already stumbled upon it.
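One way to narrow this down might be to check, at the moment the error occurs, whether the CSV file on disk really contains NUL bytes. The sketch below is only an illustration under assumptions: `count_nul_bytes` and `read_rows_skipping_nul` are hypothetical helper names that are not part of my code or of Lightning, and stripping NUL characters is a workaround for diagnosis, not a fix for whatever puts them into the file.

```python
import csv


def count_nul_bytes(path):
    """Return how many NUL (0x00) bytes the file at `path` contains."""
    with open(path, "rb") as f:
        return f.read().count(b"\x00")


def read_rows_skipping_nul(path):
    """Read all CSV rows, removing NUL characters before csv.reader sees them."""
    with open(path, newline="", encoding="utf-8") as f:
        # csv.reader accepts any iterable of strings, so each line can be
        # cleaned lazily before it is parsed.
        cleaned = (line.replace("\x00", "") for line in f)
        return list(csv.reader(cleaned))
```

If `count_nul_bytes` returns a non-zero value only in the failing runs, the file content itself differs between runs, which would point away from the `list(reader)` call as the root cause.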

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

#- PyTorch Lightning Version (e.g., 1.5.0):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @justusschock

Sometimes I get dataset errors when using the Lightning module in a distributed manner #20088
