README.md (+59 -12, 59 additions & 12 deletions)
@@ -1,6 +1,6 @@
 # DataProcessingFramework

-A framework for processing and filtering multimodal datasets.
+**DPF** - a framework for processing and filtering multimodal datasets.

 - [Installation](#installation)
 - [Overview](#overview)
@@ -19,22 +19,70 @@ cd DataProcessingFramework
 pip install .
 ```

-Extra requirements: `filters`, `dev`, `llava`, `video_llava`
+Extra requirements: `filters`, `dev`, `llava`, `video_llava`, `lita`

 To install extra requirements run: `pip install .[filters]`

 ## Overview

 Framework supports following features:
 1. Reading datasets
-2. Filtering datasets and calculating metrics using different models
-3. Converting datasets to other storage formats
-4. Datasets validating
-5. Supports different filesystems (local, s3)
-6. Data filtering pipelines
+2. Filtering datasets and calculating metrics using different models and algorithms. Full list of filters can be found [there](docs/filters.md)
+3. Effectively transforming data such as videos and images
+4. Data filtering and transformation pipelines
+5. Converting datasets to other [formats](docs/formats.md)
+6. Validating datasets
+7. Support for various file systems (local, s3)

-DPF allows you to easily filter datasets and add new metadata.
-For example, the code below generates synthetic captions for images in shards on remote s3 storage and updates dataset metadata without downloading shards:
+DPF allows you to easily filter datasets and add new metadata. You can use various filters and transformations on your data, create pipelines from them and run them efficiently and quickly. Basic code examples for filtering data are given below:
+
+### Basic example
+Check out [basic usage](#basic-usage) for more info about DPF's API.
+
+This is a simple example for image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into metadata. You can then use these attributes to filter the data according to your needs.
[added lines 43-69 not shown]
+    weights_folder='../weights', # path to weights folder, will be downloaded to this folder
+    device='cuda:0',
+    workers=16
+)
+processor.apply_data_filter(datafilter)
+
+print(processor.df) # printing new dataset's metadata
+```
+
+Run [simple_example.py](simple_example.py) file:
+```bash
+python simple_example.py
+```
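
The pattern described in the added README text (a filter writes an attribute into the metadata, then rows are filtered on that attribute) can be sketched without DPF. This is a minimal standard-library illustration, not DPF's API. The `samples` list and the helpers `add_hash_attribute` and `drop_duplicates` are hypothetical names. SHA-256 stands in for the perceptual hash used in the README example, so this sketch only catches byte-identical files.

```python
import hashlib

# Hypothetical in-memory "dataset": one metadata row per sample plus raw file bytes.
samples = [
    {"image_name": "a.jpg", "data": b"\x00\x01\x02"},
    {"image_name": "b.jpg", "data": b"\x00\x01\x02"},  # byte-identical to a.jpg
    {"image_name": "c.jpg", "data": b"\x09\x08\x07"},
]


def add_hash_attribute(rows):
    """Step 1: extract an attribute from each sample and store it in the metadata."""
    for row in rows:
        row["content_hash"] = hashlib.sha256(row["data"]).hexdigest()
    return rows


def drop_duplicates(rows, key):
    """Step 2: filter on the stored attribute, keeping the first row per value."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


rows = add_hash_attribute(samples)
deduplicated = drop_duplicates(rows, "content_hash")
print([row["image_name"] for row in deduplicated])  # ['a.jpg', 'c.jpg']
```

Keeping the attribute in the metadata, rather than filtering while hashing, means the same column can be reused later with a different filtering rule.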
+
+### Synthetic captions example
+Code below generates synthetic captions for images in [shards](docs/formats.md) on remote S3-compatible storage and updates dataset's metadata without downloading shards:

 Before running the example below, install extra requirements: `pip install DPF[filters,llava]`

@@ -76,7 +124,7 @@ print(processor.df[new_column_name]) # prints generated image captions
@@ -86,8 +134,7 @@ The framework supports data that has any combination of the following modalities
 - Video

 > Datasets with several data of the same modality in one sample are not supported.
-For example, datasets with following modalities are supported: text-video, text-image, image-video, images, etc.
-Modalities that are not supported: image2image, image-text-image, etc.
+For example, datasets with following modalities are supported: text-video, text-image, image-video, images, etc. Modalities that are not supported: image2image, image-text-image, etc.
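
The restriction quoted in this hunk (no sample may contain the same modality twice) can be expressed as a small check. This is a minimal sketch, assuming a sample is described by a list of modality names. `is_supported_combination` is a hypothetical helper and not part of DPF.

```python
from collections import Counter


def is_supported_combination(modalities):
    """Return True when no modality occurs more than once in a single sample."""
    counts = Counter(modalities)
    return all(count == 1 for count in counts.values())


# Combinations listed as supported in the README text.
print(is_supported_combination(["text", "video"]))   # True
print(is_supported_combination(["image", "video"]))  # True

# Combinations listed as unsupported: the image modality appears twice.
print(is_supported_combination(["image", "image"]))          # False (image2image)
print(is_supported_combination(["image", "text", "image"]))  # False (image-text-image)
```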
 To add your filter, you should create new filter class.
 If your filter uses only data from columns (e.g. _text_ modality), you should inherit your class from [ColumnFilter class](../DPF/filters/column_filter.py)
 If your filter uses data from files, you should inherit your class from [DataFilter class](../DPF/filters/data_filter.py)

-####Creating DataFilter
+### Creating DataFilter

 To create a new datafilter, add new file in a folder with the modality used by your filter.
 For example, if your filter uses _images_ modality, create file in [DPF/filters/images/](../DPF/filters/images) folder.
@@ -114,7 +114,7 @@ from DPF.filters import DataFilter
 help(DataFilter)
 ```

-**Example of custom DataFilter:**
+Example of custom DataFilter:
 ```python
 from typing import Any

@@ -166,7 +166,7 @@ class PHashFilter(ImageFilter):
 This filter reads images and calculates PHash **in dataloader**.
 Then dataloader returns PHash strings and these strings are added in result dataframe.

-####Creating ColumnFilter
+### Creating ColumnFilter

 To create a new columnfilter, add new file in a folder with the modality used by your filter.
 Inherit your class from [ColumnFilter](../DPF/filters/column_filter.py) class.
@@ -182,7 +182,7 @@ from DPF.filters import ColumnFilter
 help(ColumnFilter)
 ```

-**Example of custom ColumnFilter:**
+Example of custom ColumnFilter:
 ```python
 from typing import Any
 from py3langid.langid import MODEL_FILE, LanguageIdentifier
examples/image_filters_example.ipynb (+8 -8, 8 additions & 8 deletions)
@@ -23,7 +23,7 @@
 "text": [
 "/home/user/conda/envs/dpf/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
 " from .autonotebook import tqdm as notebook_tqdm\n",