A framework for working with datasets
Installation:

    pip install git+https://github.com/ai-forever/DataProcessingFramework

Or install from source:
    git clone https://github.com/ai-forever/DataProcessingFramework
    cd DataProcessingFramework
    pip install -r requirements.txt

The framework supports the following operations:
- Reading datasets
- Filtering datasets with a variety of filters
- Converting datasets to other formats
- Validating datasets
Modalities:
- Texts
- Images
- Videos
Data formats:
- Shards
- ShardedFiles
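The on-disk layout of these formats is not spelled out here, so the following is only a rough sketch of what a single shard could look like. It assumes a shard is a tar archive of files paired with a CSV of the same base name, and that the CSV has `image_name` and `caption` columns (the column names used in the reading example). The helpers `write_shard` and `missing_files` are hypothetical and are not part of DPF; the sample bytes and captions are made up.

```python
import csv
import io
import tarfile
import tempfile
from pathlib import Path


def write_shard(directory: Path, shard_name: str,
                samples: dict[str, tuple[bytes, str]]) -> None:
    """Write <shard_name>.tar with the files and <shard_name>.csv with captions.

    The tar-plus-CSV pairing is an assumption made for illustration.
    """
    with tarfile.open(directory / f"{shard_name}.tar", "w") as archive:
        for image_name, (payload, _caption) in samples.items():
            info = tarfile.TarInfo(name=image_name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))

    csv_path = directory / f"{shard_name}.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image_name", "caption"])
        writer.writeheader()
        for image_name, (_payload, caption) in samples.items():
            writer.writerow({"image_name": image_name, "caption": caption})


def missing_files(directory: Path, shard_name: str) -> list[str]:
    """Return names listed in the CSV that are absent from the archive."""
    with tarfile.open(directory / f"{shard_name}.tar") as archive:
        archived = set(archive.getnames())
    csv_path = directory / f"{shard_name}.csv"
    with open(csv_path, newline="", encoding="utf-8") as handle:
        listed = [row["image_name"] for row in csv.DictReader(handle)]
    return [name for name in listed if name not in archived]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_shard(root, "0", {
        "0.jpg": (b"fake-bytes", "a cat"),   # placeholder bytes, not a real image
        "1.jpg": (b"fake-bytes", "a dog"),
    })
    shard_files = sorted(p.name for p in root.iterdir())
    absent = missing_files(root, "0")
```

Under the same assumption, a ShardedFiles dataset would keep the files in a plain folder instead of a tar archive; check the repository for the authoritative layout before relying on either.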
Reading a dataset:
    from DPF.configs import ShardsDatasetConfig
    from DPF.dataset_reader import DatasetReader

    config = ShardsDatasetConfig.from_modalities(
        'examples/example_dataset/',
        image_name_col='image_name',
        caption_col='caption'
    )

    reader = DatasetReader()
    processor = reader.from_config(config)
    processor.df

Applying a filter:
    from DPF.filters.images.base_images_info_filter import ImageInfoGatherer

    datafilter = ImageInfoGatherer(workers=8)
    processor.apply_data_filter(datafilter)
    processor.df  # new columns ['width', 'height', 'is_correct'] are added

Converting to other formats:
    processor.to_shards(
        'destination/dir/',
        filenaming="counter",  # or "uuid"
        keys_mapping={"text": "caption"},
        workers=4
    )

    processor.to_sharded_files(
        'destination/dir/',
        filenaming="counter",  # or "uuid"
        keys_mapping={"text": "caption"},
        workers=4
    )
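After `ImageInfoGatherer` has added the `width`, `height` and `is_correct` columns, they can be used to pick the rows worth converting. The sketch below shows one way to do that on plain row dictionaries. It assumes `is_correct` is a boolean and `width`/`height` are pixel counts; `select_rows` and the sample rows are hypothetical and not part of DPF.

```python
def select_rows(rows, min_width=256, min_height=256):
    """Keep rows that were read correctly and meet the minimum size.

    `is_correct` is checked first, so rows with missing dimensions are
    dropped before any numeric comparison happens.
    """
    return [
        row for row in rows
        if row["is_correct"]
        and row["width"] >= min_width
        and row["height"] >= min_height
    ]


# Made-up rows shaped like the columns the filter is said to add.
rows = [
    {"image_name": "0.jpg", "width": 512, "height": 512, "is_correct": True},
    {"image_name": "1.jpg", "width": 128, "height": 512, "is_correct": True},
    {"image_name": "2.jpg", "width": None, "height": None, "is_correct": False},
]

kept = select_rows(rows)
kept_names = [row["image_name"] for row in kept]
```

If `processor.df` is a pandas DataFrame, as its name suggests, the rows could be obtained with `processor.df.to_dict("records")`, or the same selection could be written as a boolean mask directly on the DataFrame.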