
Commit cb1b3da

docs: update documentation and examples

1 parent 7c0fa28 commit cb1b3da

File tree

8 files changed, +121 -47 lines changed

README.md

Lines changed: 59 additions & 12 deletions
@@ -1,6 +1,6 @@
 # DataProcessingFramework
 
-A framework for processing and filtering multimodal datasets.
+**DPF** - a framework for processing and filtering multimodal datasets.
 
 - [Installation](#installation)
 - [Overview](#overview)
@@ -19,22 +19,70 @@ cd DataProcessingFramework
 pip install .
 ```
 
-Extra requirements: `filters`, `dev`, `llava`, `video_llava`
+Extra requirements: `filters`, `dev`, `llava`, `video_llava`, `lita`
 
 To install extra requirements run: `pip install .[filters]`
 
 ## Overview
 
 Framework supports the following features:
 1. Reading datasets
-2. Filtering datasets and calculating metrics using different models
-3. Converting datasets to other storage formats
-4. Datasets validating
-5. Supports different filesystems (local, s3)
-6. Data filtering pipelines
+2. Filtering datasets and calculating metrics using different models and algorithms. The full list of filters can be found [here](docs/filters.md)
+3. Efficiently transforming data such as videos and images
+4. Data filtering and transformation pipelines
+5. Converting datasets to other [formats](docs/formats.md)
+6. Validating datasets
+7. Support for various file systems (local, s3)
 
-DPF allows you to easily filter datasets and add new metadata.
-For example, the code below generates synthetic captions for images in shards on remote s3 storage and updates dataset metadata without downloading shards:
+DPF allows you to easily filter datasets and add new metadata. You can use various filters and transformations on your data, create pipelines from them, and run them efficiently. Basic code examples for filtering data are given below:
+
+### Basic example
+Check out [basic usage](#basic-usage) for more info about DPF's API.
+
+This is a simple example of image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into its metadata. You can then use these attributes to filter the data according to your needs.
+
+```python
+from DPF import ShardsDatasetConfig, DatasetReader
+
+# creating config for dataset
+config = ShardsDatasetConfig.from_path_and_columns(
+    'examples/example_dataset',
+    image_name_col='image_name',
+    text_col='caption'
+)
+
+# reading dataset's metadata
+reader = DatasetReader()
+processor = reader.read_from_config(config)
+
+from DPF.filters.images.hash_filters import PHashFilter
+datafilter = PHashFilter(sim_hash_size=8, workers=16)  # creating PHash filter
+# calculating PHash; a new column "image_phash_8" will be added
+processor.apply_data_filter(datafilter)
+
+print('Dataset length before deduplication:', len(processor))
+processor.filter_df(~processor.df['image_phash_8'].duplicated())
+print('Dataset length after deduplication:', len(processor))
+
+from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter
+datafilter = ImprovedAestheticFilter(
+    weights_folder='../weights',  # path to weights folder; weights will be downloaded to this folder
+    device='cuda:0',
+    workers=16
+)
+processor.apply_data_filter(datafilter)
+
+print(processor.df)  # printing the new dataset metadata
+```
+
+Run the [simple_example.py](simple_example.py) file:
+```bash
+python simple_example.py
+```
+
+### Synthetic captions example
+The code below generates synthetic captions for images in [shards](docs/formats.md) on remote S3-compatible storage and updates the dataset's metadata without downloading the shards:
 
 Before running the example below, install extra requirements: `pip install DPF[filters,llava]`
 
@@ -76,7 +124,7 @@ print(processor.df[new_column_name]) # prints generated image captions
 processor.update_columns([new_column_name], workers=16)
 ```
 
-More examples [there](examples/)
+You can find more examples [here](examples/)
 
 ### Supported data modalities
 
@@ -86,8 +134,7 @@ The framework supports data that has any combination of the following modalities
 - Video
 
 > Datasets with several data of the same modality in one sample are not supported.
-For example, datasets with following modalities are supported: text-video, text-image, image-video, images, etc.
-Modalities that are not supported: image2image, image-text-image, etc.
+For example, datasets with the following modalities are supported: text-video, text-image, image-video, images, etc. Modalities that are not supported: image2image, image-text-image, etc.
 
 ### Supported data formats

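A follow-up note on the basic example in this diff: the aesthetic filter, like every datafilter, just adds a metadata column, so thresholding on its score uses the same `filter_df` call as the deduplication step. A minimal sketch, reusing only the API shown above; the `aesthetic_score_improved` column name is an assumption, so inspect `processor.df.columns` after the filter runs:

```python
from DPF import ShardsDatasetConfig, DatasetReader
from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
processor = DatasetReader().read_from_config(config)

datafilter = ImprovedAestheticFilter(
    weights_folder='../weights',
    device='cuda:0',
    workers=16
)
processor.apply_data_filter(datafilter)

# ASSUMPTION: the exact column name written by the filter; inspect
# processor.df.columns after apply_data_filter to find the real one
score_col = 'aesthetic_score_improved'
processor.filter_df(processor.df[score_col] > 5.0)  # keep high-scoring samples
print(processor.df)
```
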
docs/filters.md

Lines changed: 10 additions & 10 deletions
@@ -1,4 +1,4 @@
-## Filters
+# Filters
 
 Filters are models or algorithms that calculate metrics for a dataset.
 Filters process the data and add new columns with the calculated metrics.
@@ -31,7 +31,7 @@ List of implemented filters:
 - [VideoLLaVAFilter](../DPF/filters/videos/video_llava_filter.py) - captioning videos using Video-LLaVA
 - [LITAFilter](../DPF/filters/videos/lita_filter.py) - captioning videos using the [LITA model](https://github.com/NVlabs/LITA)
 
-### Datafilter
+## Datafilter
 
 Datafilters are filters that calculate new metadata (scores, captions, probabilities, etc.) based on file modalities: images and videos.
 To run a datafilter, use the `processor.apply_data_filter()` method.
@@ -44,7 +44,7 @@ processor.apply_data_filter(datafilter)
 processor.df  # new columns ['width', 'height', 'is_correct'] are added
 ```
 
-### Columnfilter
+## Columnfilter
 
 Columnfilters are filters that also calculate new metadata, but based on existing metadata (texts, etc.).
 To run a columnfilter, use the `processor.apply_column_filter()` method.
@@ -58,7 +58,7 @@ processor.apply_column_filter(columnfilter)
 processor.df  # new columns ["lang", "lang_score"] are added
 ```
 
-### Running filter on several GPUs
+## Running filter on several GPUs
 
 To run a datafilter on multiple GPUs, use the `MultiGPUDataFilter` class:
 
@@ -78,20 +78,20 @@ processor.apply_multi_gpu_data_filter(multigpufilter)
 ```
 See `help(MultiGPUDataFilter)` for more information.
 
-### Examples
+## Examples
 
 You can find usage examples [here](../examples).
 - [Image filters examples](../examples/image_filters_example.ipynb)
 - [Video filters examples](../examples/video_filters_example.ipynb)
 - [Text filters examples](../examples/text_filters_example.ipynb)
 
-### Creating new filter
+## Creating new filter
 
 To add your own filter, create a new filter class.
 If your filter uses only data from columns (e.g. the _text_ modality), inherit your class from the [ColumnFilter class](../DPF/filters/column_filter.py).
 If your filter uses data from files, inherit your class from the [DataFilter class](../DPF/filters/data_filter.py).
 
-#### Creating DataFilter
+### Creating DataFilter
 
 To create a new datafilter, add a new file in the folder for the modality used by your filter.
 For example, if your filter uses the _images_ modality, create the file in the [DPF/filters/images/](../DPF/filters/images) folder.
@@ -114,7 +114,7 @@ from DPF.filters import DataFilter
 help(DataFilter)
 ```
 
-**Example of custom DataFilter:**
+Example of custom DataFilter:
 ```python
 from typing import Any
 
@@ -166,7 +166,7 @@ class PHashFilter(ImageFilter):
 This filter reads images and calculates PHash **in the dataloader**.
 The dataloader then returns PHash strings, and these strings are added to the result dataframe.
 
-#### Creating ColumnFilter
+### Creating ColumnFilter
 
 To create a new columnfilter, add a new file in the folder for the modality used by your filter.
 Inherit your class from the [ColumnFilter](../DPF/filters/column_filter.py) class.
@@ -182,7 +182,7 @@ from DPF.filters import ColumnFilter
 help(ColumnFilter)
 ```
 
-**Example of custom ColumnFilter:**
+Example of custom ColumnFilter:
 ```python
 from typing import Any
 from py3langid.langid import MODEL_FILE, LanguageIdentifier

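The multi-GPU hunk above is truncated and never shows how `multigpufilter` is built. A hedged sketch of the missing setup, assuming `MultiGPUDataFilter` lives in `DPF.filters.multigpu_filter` and takes a device list plus the datafilter class and its constructor kwargs; the import path and signature are assumptions, and the doc's own advice, `help(MultiGPUDataFilter)`, is the authoritative reference:

```python
from DPF.filters.multigpu_filter import MultiGPUDataFilter  # import path is an assumption
from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter

# processor is obtained via DatasetReader().read_from_config(config),
# as in the README example.
# ASSUMPTION: the constructor takes (devices, filter class, filter kwargs) and
# runs one filter replica per GPU; verify with help(MultiGPUDataFilter).
multigpufilter = MultiGPUDataFilter(
    ['cuda:0', 'cuda:1'],
    ImprovedAestheticFilter,
    {'weights_folder': '../weights', 'workers': 8}
)
processor.apply_multi_gpu_data_filter(multigpufilter)  # entry point documented above
```
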
docs/formats.md

Lines changed: 4 additions & 4 deletions
@@ -1,11 +1,11 @@
-## Supported data formats
+# Supported data formats
 
 The dataset should be stored in one of the following formats:
 - Files
 - Shards
 - Sharded files
 
-### Files format
+## Files format
 
 The files format is a csv file with metadata and paths to images, videos, etc. A csv file can look like this:
 ```csv
@@ -28,7 +28,7 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Shards format
+## Shards format
 
 In this format, the dataset is divided into shards of N samples each.
 The files in each shard are stored in a `tar` archive, and the metadata is stored in a `csv` file.
@@ -66,7 +66,7 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Sharded files format
+## Sharded files format
 
 This format is similar to _shards_, but instead of tar archives, files are stored in folders.

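Reading a files-format dataset presumably follows the same config-then-reader pattern as the shards examples elsewhere in this commit. A minimal sketch under that assumption; the `FilesDatasetConfig` class name and its `image_path_col` parameter are extrapolated from the shards API and are not confirmed by this diff:

```python
from DPF import DatasetReader
from DPF.configs import FilesDatasetConfig  # class name is an assumption

# ASSUMPTION: the config is built from the metadata csv plus the column that
# holds file paths, mirroring ShardsDatasetConfig.from_path_and_columns
config = FilesDatasetConfig.from_path_and_columns(
    'path/to/data.csv',
    image_path_col='image_path',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)  # same entry point as for shards
```
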
docs/pipelines.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-## Pipelines
+# Pipelines
 
 Pipelines combine several filters into one pipeline and process the dataset with it.
 You can build pipelines using [datafilters](../DPF/filters/data_filter.py), [columnfilters](../DPF/filters/column_filter.py),
@@ -12,7 +12,7 @@ Available methods for adding a pipeline stage:
 4. `add_deduplication` - Deduplicates the dataset using the specified columns
 5. `add_dataframe_filter` - Custom filter for the dataset DataFrame
 
-### Examples
+## Examples
 
 ```python
 from DPF.configs import ShardsDatasetConfig

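The pipelines example is cut off right after the config import, so here is a hedged sketch of how the two stage methods named in the diff, `add_deduplication` and `add_dataframe_filter`, might be wired together. The `FilterPipeline` class name, its constructor, and the `run` method are assumptions; only the two `add_*` names come from the document:

```python
from DPF import DatasetReader
from DPF.configs import ShardsDatasetConfig
from DPF.pipelines import FilterPipeline  # import path and class name are assumptions

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
processor = DatasetReader().read_from_config(config)

pipeline = FilterPipeline('example_pipeline')  # constructor signature is an assumption

# documented stage: deduplicate the dataset by the given columns
# (assumes an earlier stage or filter already produced image_phash_8)
pipeline.add_deduplication(['image_phash_8'])

# documented stage: custom filter over the metadata DataFrame
pipeline.add_dataframe_filter(lambda df: df[df['caption'].str.len() > 5])

pipeline.run(processor)  # run() is an assumption
```
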
docs/processor.md

Lines changed: 9 additions & 9 deletions
@@ -1,4 +1,4 @@
-## DatasetProcessor guide
+# DatasetProcessor guide
 
 Dataset processor supports the following features:
 - Update and change metadata
@@ -7,7 +7,7 @@ Dataset processor supports following features:
 - Convert dataset to other formats
 - View samples from a dataset
 
-### Example
+## Example
 ```python
 from DPF import ShardsDatasetConfig, DatasetReader
 
@@ -21,19 +21,19 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Attributes
+## Attributes
 Dataset processor has three main attributes:
 - `processor.df` - Pandas dataframe with metadata
 - `processor.connector` - A connector to the filesystem where the dataset is located. Object of type `processor.connectors.Connector`
 - `processor.config` - Dataset config
 
-### Print summary about dataset
+## Print summary about dataset
 
 ```python
 processor.print_summary()
 ```
 
-### Update and change metadata
+## Update and change metadata
 
 The methods below modify or add columns in a dataset's metadata (usually csv files).
 
@@ -50,7 +50,7 @@ Delete columns in dataset metadata:
 processor.delete_columns(['column_to_delete'])
 ```
 
-### View samples
+## View samples
 
 `processor.get_random_sample()` returns a random sample from the dataset.
 
@@ -64,15 +64,15 @@ print(metadata['caption'])
 Image.open(io.BytesIO(modality2bytes['image']))
 ```
 
-### Filters
+## Filters
 
 [Filters documentation](filters.md)
 
-### Transformation
+## Transformation
 
 [Transforms documentation](transforms.md)
 
-### Convert to other formats
+## Convert to other formats
 
 Convert to _shards_ format:

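To make the metadata-editing section above concrete, a minimal sketch of the edit-then-persist flow using only calls that appear in this commit (`processor.df`, `update_columns`, `delete_columns`); the `caption_len` column is a hypothetical example:

```python
# processor is obtained via DatasetReader().read_from_config(config),
# as in the Example section above; processor.df is a pandas DataFrame.

# derive a new column locally on the metadata dataframe
processor.df['caption_len'] = processor.df['caption'].str.len()  # hypothetical column

# persist the new column back to the dataset's metadata (usually csv files)
processor.update_columns(['caption_len'], workers=16)

# remove it again if it is no longer needed
processor.delete_columns(['caption_len'])
```
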
docs/transforms.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-## Transforms
+# Transforms
 
 You can transform the data in a dataset with DPF.
 For example, resize videos or photos in a dataset.
@@ -10,7 +10,7 @@ List of implemented transforms:
 - [ImageResizeTransforms](../DPF/transforms/image_resize_transforms.py) - transforms that resize images
 - [VideoFFMPEGTransforms](../DPF/transforms/video_ffmpeg_transforms.py) - transforms that resize and change the fps of videos using ffmpeg
 
-### Examples
+## Examples
 
 Resize all images to 768 pixels on the minimum side while maintaining the aspect ratio:
 ```python

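The resize example above ends at the opening code fence, so here is a hedged sketch of what it might contain. `ImageResizeTransforms` comes from the list above; the `Resizer` and `ResizerModes` helpers and the `apply_transform` method are assumptions about the API, not confirmed by this diff:

```python
from DPF.transforms import ImageResizeTransforms, Resizer, ResizerModes  # Resizer/ResizerModes are assumptions

# ASSUMPTION: a resizer configured for 768 px on the minimum side while
# keeping the aspect ratio, matching the behaviour described above
transforms = ImageResizeTransforms(Resizer(ResizerModes.MIN_SIZE, size=768))

# ASSUMPTION: transforms are applied through the processor, analogous to
# apply_data_filter in the filters doc
processor.apply_transform(transforms)
```
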
examples/image_filters_example.ipynb

Lines changed: 8 additions & 8 deletions
@@ -23,7 +23,7 @@
      "text": [
       "/home/user/conda/envs/dpf/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
       "  from .autonotebook import tqdm as notebook_tqdm\n",
-      "100%|██████████| 3/3 [00:00<00:00, 362.76it/s]\n"
+      "100%|██████████| 3/3 [00:00<00:00, 318.39it/s]\n"
      ]
     }
    ],
@@ -50,7 +50,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 3,
    "id": "0747f47c",
    "metadata": {},
    "outputs": [
@@ -72,7 +72,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 4,
    "id": "1b194714",
    "metadata": {
     "scrolled": true
@@ -206,7 +206,7 @@
       "[500 rows x 3 columns]"
      ]
     },
-    "execution_count": 6,
+    "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -478,15 +478,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 5,
    "id": "c803322d",
    "metadata": {},
    "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-     "100%|██████████| 500/500 [00:01<00:00, 391.57it/s]\n"
+     "100%|██████████| 500/500 [00:01<00:00, 345.42it/s]\n"
     ]
    }
   ],
@@ -499,7 +499,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 6,
    "id": "0003049b",
    "metadata": {},
    "outputs": [
@@ -520,7 +520,7 @@
       "Name: image_phash_8, Length: 500, dtype: object"
      ]
     },
-    "execution_count": 10,
+    "execution_count": 6,
    "metadata": {},
    "output_type": "execute_result"
   }
