A framework for processing and filtering multimodal datasets.
Install with pip:
```bash
pip install git+https://github.com/ai-forever/DataProcessingFramework
```
Install from repository:
```bash
git clone https://github.com/ai-forever/DataProcessingFramework
cd DataProcessingFramework
pip install .
```
Extra requirements: `filters`, `dev`, `llava`, `video_llava`
To install extra requirements, run:
```bash
pip install .[filters]
```
The framework supports the following features:
- Reading datasets
- Filtering datasets and calculating metrics using different models
- Converting datasets to other storage formats
- Dataset validation
- Support for different filesystems (local, S3)
- Data filtering pipelines
DPF allows you to easily filter datasets and add new metadata. For example, the code below generates synthetic captions for images stored in shards on remote S3 storage and updates the dataset metadata without downloading the shards:
Before running the example below, install the extra requirements:
```bash
pip install DPF[filters,llava]
```
```python
from DPF import S3Connector, DatasetReader, ShardsDatasetConfig

# creating connector for S3 storage
connector = S3Connector(
    key='access_key',
    secret='secret_key',
    endpoint_url='endpoint_url'
)
reader = DatasetReader(connector)

# creating dataset config
config = ShardsDatasetConfig.from_path_and_columns(
    "s3://your-bucket/path/to/shards",
    image_name_col='image_name',
)

# reading a dataset
processor = reader.read_from_config(config, workers=16)

from DPF.filters.images.llava_captioning_filter import LLaVaCaptioningFilter

# creating LLaVA captioner filter
datafilter = LLaVaCaptioningFilter(
    workers=16, prompt='short', batch_size=16, device="cuda:0"
)
print(datafilter.result_columns)  # prints list of columns that will be added

# applying filter to dataset
processor.apply_data_filter(datafilter)  # new metadata is created
new_column_name = datafilter.result_columns[1]  # name of new column with the generated caption
print(processor.df[new_column_name])  # prints generated image captions

# adding new metadata to remote dataset
processor.update_columns([new_column_name], workers=16)
```
More examples can be found here.
The framework supports data that has any combination of the following modalities:
- Text
- Image
- Video
Datasets where a single sample contains more than one item of the same modality are not supported. For example, the following modality combinations are supported: text-video, text-image, image-video, images, etc. Combinations such as image2image or image-text-image are not supported.
The dataset should be stored in one of the following formats:
- Files
- Shards
- Sharded files
To read a dataset, you must first create a config that describes the dataset and the type of data in it. For each data format, you need to use the appropriate config.
Example for shards format:
```python
from DPF import ShardsDatasetConfig

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',   # path to shards
    image_name_col='image_name',  # name of column in csv file with image names
    text_col='caption'            # name of column in csv file with text/captions
)
```
You can read a dataset using the `DatasetReader.read_from_config` method:
```python
from DPF import ShardsDatasetConfig, DatasetReader

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)
```
Example for sharded files format:
```python
from DPF import ShardedFilesDatasetConfig, DatasetReader

config = ShardedFilesDatasetConfig.from_path_and_columns(
    'examples/example_video_dataset',
    video_name_col='video_name',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)
```
More examples of reading data in other formats are available in the repository.
Example of reading a dataset directly from S3 storage:
```python
from DPF import S3Connector, DatasetReader, ShardsDatasetConfig

connector = S3Connector(
    key='access_key',
    secret='secret_key',
    endpoint_url='endpoint_url'
)
reader = DatasetReader(connector)

config = ShardsDatasetConfig.from_path_and_columns(
    "s3://your-bucket/path/to/shards",
    image_name_col='image_name',
)
processor = reader.read_from_config(config, workers=16)
```
A dataset processor provides an interface for interacting with data and modifying it.
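For illustration, here is a minimal sketch of typical processor usage. It relies only on the attributes and methods shown elsewhere in this README (`processor.df`, `apply_data_filter`, `update_columns`) and assumes a `processor` created as in the reading examples above:

```python
# assumes `processor` was created via reader.read_from_config(...) as shown above
print(processor.df.head())     # dataset metadata exposed as a dataframe
print(processor.df.columns)    # available metadata columns

# columns added by filters (see below) can be written back to the dataset storage:
# processor.update_columns(['new_column_name'], workers=16)
```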
Filters are models or algorithms that calculate metrics for a dataset. Filters process the data and add new columns with the calculated metrics.
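As a quick illustration, here is a sketch using `ImageInfoFilter` (the same filter used in the pipeline example below); the exact names of the metric columns it adds are an assumption and can be inspected via `result_columns`:

```python
from DPF.filters.images.info_filter import ImageInfoFilter

# filter that collects basic information about images in the dataset
datafilter = ImageInfoFilter(workers=4)
print(datafilter.result_columns)  # columns the filter will add

# assumes `processor` was created as in the reading examples above
processor.apply_data_filter(datafilter)
print(processor.df[datafilter.result_columns])  # calculated metrics for every sample
```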
You can transform the data in a dataset with DPF, for example, resize videos or images. Use `DPF.transforms` for these tasks.
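A hedged sketch of what an image resize transform might look like; the class names `ImageResizeTransforms`, `Resizer`, and `ResizerModes` and the `apply_transform` method are assumptions and may differ from the actual `DPF.transforms` API:

```python
# assumption: the class and method names below may differ in the actual DPF.transforms module
from DPF.transforms import ImageResizeTransforms, Resizer, ResizerModes

# resize every image so that its smaller side is 768 pixels
transforms = ImageResizeTransforms(Resizer(ResizerModes.MIN_SIZE, size=768))
processor.apply_transform(transforms)
```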
Pipelines let you combine several filters into a single pipeline and process a dataset with it. For example:
```python
from DPF.configs import ShardsDatasetConfig
from DPF.dataset_reader import DatasetReader
from DPF.pipelines import FilterPipeline
from DPF.filters.images.info_filter import ImageInfoFilter
from DPF.filters.images.hash_filters import PHashFilter

reader = DatasetReader()
config = ShardsDatasetConfig.from_path_and_columns(
    "examples/example_dataset",
    image_name_col='image_name',
)
processor = reader.read_from_config(config, workers=4)

pipeline = FilterPipeline("pipeline_example")
pipeline.add_datafilter(
    ImageInfoFilter,
    {'workers': 4},
    processor_run_kwargs={'return_none_on_error': True},
)
pipeline.add_datafilter(PHashFilter, {'workers': 4})
pipeline.add_deduplication(["image_phash_8"])
pipeline.add_shuffle()
pipeline.run(processor)
```