
Commit cb1b3da

docs: update documentation and examples

1 parent 7c0fa28 commit cb1b3da

File tree

8 files changed, +121 -47 lines changed

README.md

Lines changed: 59 additions & 12 deletions
@@ -1,6 +1,6 @@
 # DataProcessingFramework
 
-A framework for processing and filtering multimodal datasets.
+**DPF** - a framework for processing and filtering multimodal datasets.
 
 - [Installation](#installation)
 - [Overview](#overview)
@@ -19,22 +19,70 @@ cd DataProcessingFramework
 pip install .
 ```
 
-Extra requirements: `filters`, `dev`, `llava`, `video_llava`
+Extra requirements: `filters`, `dev`, `llava`, `video_llava`, `lita`
 
 To install extra requirements run: `pip install .[filters]`
 
 ## Overview
 
 Framework supports the following features:
 1. Reading datasets
-2. Filtering datasets and calculating metrics using different models
-3. Converting datasets to other storage formats
-4. Datasets validating
-5. Supports different filesystems (local, s3)
-6. Data filtering pipelines
+2. Filtering datasets and calculating metrics using different models and algorithms. The full list of filters can be found [here](docs/filters.md)
+3. Efficiently transforming data such as videos and images
+4. Data filtering and transformation pipelines
+5. Converting datasets to other [formats](docs/formats.md)
+6. Validating datasets
+7. Support for various file systems (local, s3)
 
-DPF allows you to easily filter datasets and add new metadata.
-For example, the code below generates synthetic captions for images in shards on remote s3 storage and updates dataset metadata without downloading shards:
+DPF allows you to easily filter datasets and add new metadata. You can use various filters and transformations on your data, create pipelines from them, and run them efficiently. Basic code examples for filtering data are given below:
+
+### Basic example
+Check out [basic usage](#basic-usage) for more info about DPF's API.
+
+This is a simple example of image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into its metadata. You can then use these attributes to filter the data according to your needs.
+
+```python
+from DPF import ShardsDatasetConfig, DatasetReader
+
+# creating config for dataset
+config = ShardsDatasetConfig.from_path_and_columns(
+    'examples/example_dataset',
+    image_name_col='image_name',
+    text_col='caption'
+)
+
+# reading dataset's metadata
+reader = DatasetReader()
+processor = reader.read_from_config(config)
+
+from DPF.filters.images.hash_filters import PHashFilter
+datafilter = PHashFilter(sim_hash_size=8, workers=16)  # creating PHash filter
+# calculating PHash; a new column "image_phash_8" will be added
+processor.apply_data_filter(datafilter)
+
+print('Dataset length before deduplication:', len(processor))
+processor.filter_df(~processor.df['image_phash_8'].duplicated())
+print('Dataset length after deduplication:', len(processor))
+
+from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter
+datafilter = ImprovedAestheticFilter(
+    weights_folder='../weights',  # path to weights folder; weights will be downloaded to this folder
+    device='cuda:0',
+    workers=16
+)
+processor.apply_data_filter(datafilter)
+
+print(processor.df)  # printing the new dataset metadata
+```
+
+Run the [simple_example.py](simple_example.py) file:
+```bash
+python simple_example.py
+```
+
+### Synthetic captions example
+The code below generates synthetic captions for images in [shards](docs/formats.md) on remote S3-compatible storage and updates the dataset's metadata without downloading the shards:
 
 Before running the example below, install extra requirements: `pip install DPF[filters,llava]`
 
@@ -76,7 +124,7 @@ print(processor.df[new_column_name]) # prints generated image captions
 processor.update_columns([new_column_name], workers=16)
 ```
 
-More examples [there](examples/)
+You can find more examples [here](examples/)
 
 ### Supported data modalities
 
@@ -86,8 +134,7 @@ The framework supports data that has any combination of the following modalities
 - Video
 
 > Datasets with several data of the same modality in one sample are not supported.
-For example, datasets with following modalities are supported: text-video, text-image, image-video, images, etc.
-Modalities that are not supported: image2image, image-text-image, etc.
+For example, datasets with the following modalities are supported: text-video, text-image, image-video, images, etc. Modalities that are not supported: image2image, image-text-image, etc.
 
 ### Supported data formats

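A follow-up note on the basic example in this diff: the aesthetic filter, like every datafilter, just adds a metadata column, so thresholding on its score uses the same `filter_df` call as the deduplication step. A minimal sketch, reusing only the API shown above; the `aesthetic_score_improved` column name is an assumption, so inspect `processor.df.columns` after the filter runs:

```python
from DPF import ShardsDatasetConfig, DatasetReader
from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
processor = DatasetReader().read_from_config(config)

datafilter = ImprovedAestheticFilter(
    weights_folder='../weights',
    device='cuda:0',
    workers=16
)
processor.apply_data_filter(datafilter)

# ASSUMPTION: the exact column name written by the filter; inspect
# processor.df.columns after apply_data_filter to find the real one
score_col = 'aesthetic_score_improved'
processor.filter_df(processor.df[score_col] > 5.0)  # keep high-scoring samples
print(processor.df)
```
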
docs/filters.md

Lines changed: 10 additions & 10 deletions
@@ -1,4 +1,4 @@
-## Filters
+# Filters
 
 Filters are models or algorithms that calculate metrics for a dataset.
 Filters process the data and add new columns with the calculated metrics.
@@ -31,7 +31,7 @@ List of implemented filters:
 - [VideoLLaVAFilter](../DPF/filters/videos/video_llava_filter.py) - captioning videos using Video-LLaVA
 - [LITAFilter](../DPF/filters/videos/lita_filter.py) - captioning videos using the [LITA model](https://github.com/NVlabs/LITA)
 
-### Datafilter
+## Datafilter
 
 Datafilters are filters that calculate new metadata (scores, captions, probabilities, etc.) based on file modalities: images and videos.
 To run a datafilter, use the `processor.apply_data_filter()` method.
@@ -44,7 +44,7 @@ processor.apply_data_filter(datafilter)
 processor.df  # new columns ['width', 'height', 'is_correct'] are added
 ```
 
-### Columnfilter
+## Columnfilter
 
 Columnfilters are filters that also calculate new metadata, but based on existing metadata (texts, etc.).
 To run a columnfilter, use the `processor.apply_column_filter()` method.
@@ -58,7 +58,7 @@ processor.apply_column_filter(columnfilter)
 processor.df  # new columns ["lang", "lang_score"] are added
 ```
 
-### Running filter on several GPUs
+## Running filter on several GPUs
 
 To run a datafilter on multiple GPUs, use the `MultiGPUDataFilter` class:
 
@@ -78,20 +78,20 @@ processor.apply_multi_gpu_data_filter(multigpufilter)
 ```
 See `help(MultiGPUDataFilter)` for more information.
 
-### Examples
+## Examples
 
 You can find usage examples [here](../examples).
 - [Image filters examples](../examples/image_filters_example.ipynb)
 - [Video filters examples](../examples/video_filters_example.ipynb)
 - [Text filters examples](../examples/text_filters_example.ipynb)
 
-### Creating new filter
+## Creating new filter
 
 To add your own filter, create a new filter class.
 If your filter uses only data from columns (e.g. the _text_ modality), inherit your class from the [ColumnFilter class](../DPF/filters/column_filter.py).
 If your filter uses data from files, inherit your class from the [DataFilter class](../DPF/filters/data_filter.py).
 
-#### Creating DataFilter
+### Creating DataFilter
 
 To create a new datafilter, add a new file in the folder for the modality used by your filter.
 For example, if your filter uses the _images_ modality, create the file in the [DPF/filters/images/](../DPF/filters/images) folder.
@@ -114,7 +114,7 @@ from DPF.filters import DataFilter
 help(DataFilter)
 ```
 
-**Example of custom DataFilter:**
+Example of custom DataFilter:
 ```python
 from typing import Any
 
@@ -166,7 +166,7 @@ class PHashFilter(ImageFilter):
 This filter reads images and calculates PHash **in the dataloader**.
 The dataloader then returns PHash strings, and these strings are added to the result dataframe.
 
-#### Creating ColumnFilter
+### Creating ColumnFilter
 
 To create a new columnfilter, add a new file in the folder for the modality used by your filter.
 Inherit your class from the [ColumnFilter](../DPF/filters/column_filter.py) class.
@@ -182,7 +182,7 @@ from DPF.filters import ColumnFilter
 help(ColumnFilter)
 ```
 
-**Example of custom ColumnFilter:**
+Example of custom ColumnFilter:
 ```python
 from typing import Any
 from py3langid.langid import MODEL_FILE, LanguageIdentifier

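The multi-GPU hunk above is truncated and never shows how `multigpufilter` is built. A hedged sketch of the missing setup, assuming `MultiGPUDataFilter` lives in `DPF.filters.multigpu_filter` and takes a device list plus the datafilter class and its constructor kwargs; the import path and signature are assumptions, and the doc's own advice, `help(MultiGPUDataFilter)`, is the authoritative reference:

```python
from DPF.filters.multigpu_filter import MultiGPUDataFilter  # import path is an assumption
from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter

# processor is obtained via DatasetReader().read_from_config(config),
# as in the README example.
# ASSUMPTION: the constructor takes (devices, filter class, filter kwargs) and
# runs one filter replica per GPU; verify with help(MultiGPUDataFilter).
multigpufilter = MultiGPUDataFilter(
    ['cuda:0', 'cuda:1'],
    ImprovedAestheticFilter,
    {'weights_folder': '../weights', 'workers': 8}
)
processor.apply_multi_gpu_data_filter(multigpufilter)  # entry point documented above
```
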
docs/formats.md

Lines changed: 4 additions & 4 deletions
@@ -1,11 +1,11 @@
-## Supported data formats
+# Supported data formats
 
 The dataset should be stored in one of the following formats:
 - Files
 - Shards
 - Sharded files
 
-### Files format
+## Files format
 
 The files format is a csv file with metadata and paths to images, videos, etc. A csv file can look like this:
 ```csv
@@ -28,7 +28,7 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Shards format
+## Shards format
 
 In this format, the dataset is divided into shards of N samples each.
 The files in each shard are stored in a `tar` archive, and the metadata is stored in a `csv` file.
@@ -66,7 +66,7 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Sharded files format
+## Sharded files format
 
 This format is similar to _shards_, but instead of tar archives, files are stored in folders.

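Reading a files-format dataset presumably follows the same config-then-reader pattern as the shards examples elsewhere in this commit. A minimal sketch under that assumption; the `FilesDatasetConfig` class name and its `image_path_col` parameter are extrapolated from the shards API and are not confirmed by this diff:

```python
from DPF import DatasetReader
from DPF.configs import FilesDatasetConfig  # class name is an assumption

# ASSUMPTION: the config is built from the metadata csv plus the column that
# holds file paths, mirroring ShardsDatasetConfig.from_path_and_columns
config = FilesDatasetConfig.from_path_and_columns(
    'path/to/data.csv',
    image_path_col='image_path',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)  # same entry point as for shards
```
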
docs/pipelines.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-## Pipelines
+# Pipelines
 
 Pipelines combine several filters into one pipeline and process the dataset with it.
 You can build pipelines using [datafilters](../DPF/filters/data_filter.py), [columnfilters](../DPF/filters/column_filter.py),
@@ -12,7 +12,7 @@ Available methods for adding a pipeline stage:
 4. `add_deduplication` - Deduplicates the dataset using the specified columns
 5. `add_dataframe_filter` - Custom filter for the dataset DataFrame
 
-### Examples
+## Examples
 
 ```python
 from DPF.configs import ShardsDatasetConfig

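The pipelines example is cut off right after the config import, so here is a hedged sketch of how the two stage methods named in the diff, `add_deduplication` and `add_dataframe_filter`, might be wired together. The `FilterPipeline` class name, its constructor, and the `run` method are assumptions; only the two `add_*` names come from the document:

```python
from DPF import DatasetReader
from DPF.configs import ShardsDatasetConfig
from DPF.pipelines import FilterPipeline  # import path and class name are assumptions

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
processor = DatasetReader().read_from_config(config)

pipeline = FilterPipeline('example_pipeline')  # constructor signature is an assumption

# documented stage: deduplicate the dataset by the given columns
# (assumes an earlier stage or filter already produced image_phash_8)
pipeline.add_deduplication(['image_phash_8'])

# documented stage: custom filter over the metadata DataFrame
pipeline.add_dataframe_filter(lambda df: df[df['caption'].str.len() > 5])

pipeline.run(processor)  # run() is an assumption
```
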
docs/processor.md

Lines changed: 9 additions & 9 deletions
@@ -1,4 +1,4 @@
-## DatasetProcessor guide
+# DatasetProcessor guide
 
 Dataset processor supports the following features:
 - Update and change metadata
@@ -7,7 +7,7 @@ Dataset processor supports following features:
 - Convert dataset to other formats
 - View samples from a dataset
 
-### Example
+## Example
 ```python
 from DPF import ShardsDatasetConfig, DatasetReader
 
@@ -21,19 +21,19 @@ reader = DatasetReader()
 processor = reader.read_from_config(config)
 ```
 
-### Attributes
+## Attributes
 Dataset processor has three main attributes:
 - `processor.df` - Pandas dataframe with metadata
 - `processor.connector` - A connector to the filesystem where the dataset is located. Object of type `processor.connectors.Connector`
 - `processor.config` - Dataset config
 
-### Print summary about dataset
+## Print summary about dataset
 
 ```python
 processor.print_summary()
 ```
 
-### Update and change metadata
+## Update and change metadata
 
 The methods below modify or add columns in a dataset's metadata (usually csv files).
 
@@ -50,7 +50,7 @@ Delete columns in dataset metadata:
 processor.delete_columns(['column_to_delete'])
 ```
 
-### View samples
+## View samples
 
 `processor.get_random_sample()` returns a random sample from the dataset.
 
@@ -64,15 +64,15 @@ print(metadata['caption'])
 Image.open(io.BytesIO(modality2bytes['image']))
 ```
 
-### Filters
+## Filters
 
 [Filters documentation](filters.md)
 
-### Transformation
+## Transformation
 
 [Transforms documentation](transforms.md)
 
-### Convert to other formats
+## Convert to other formats
 
 Convert to _shards_ format:

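To make the metadata-editing section above concrete, a minimal sketch of the edit-then-persist flow using only calls that appear in this commit (`processor.df`, `update_columns`, `delete_columns`); the `caption_len` column is a hypothetical example:

```python
# processor is obtained via DatasetReader().read_from_config(config),
# as in the Example section above; processor.df is a pandas DataFrame.

# derive a new column locally on the metadata dataframe
processor.df['caption_len'] = processor.df['caption'].str.len()  # hypothetical column

# persist the new column back to the dataset's metadata (usually csv files)
processor.update_columns(['caption_len'], workers=16)

# remove it again if it is no longer needed
processor.delete_columns(['caption_len'])
```
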
docs/transforms.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-## Transforms
+# Transforms
 
 You can transform the data in a dataset with DPF.
 For example, resize videos or photos in a dataset.
@@ -10,7 +10,7 @@ List of implemented transforms:
 - [ImageResizeTransforms](../DPF/transforms/image_resize_transforms.py) - transforms that resize images
 - [VideoFFMPEGTransforms](../DPF/transforms/video_ffmpeg_transforms.py) - transforms that resize and change the fps of videos using ffmpeg
 
-### Examples
+## Examples
 
 Resize all images to 768 pixels on the minimum side while maintaining the aspect ratio:
 ```python

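The resize example above ends at the opening code fence, so here is a hedged sketch of what it might contain. `ImageResizeTransforms` comes from the list above; the `Resizer` and `ResizerModes` helpers and the `apply_transform` method are assumptions about the API, not confirmed by this diff:

```python
from DPF.transforms import ImageResizeTransforms, Resizer, ResizerModes  # Resizer/ResizerModes are assumptions

# ASSUMPTION: a resizer configured for 768 px on the minimum side while
# keeping the aspect ratio, matching the behaviour described above
transforms = ImageResizeTransforms(Resizer(ResizerModes.MIN_SIZE, size=768))

# ASSUMPTION: transforms are applied through the processor, analogous to
# apply_data_filter in the filters doc
processor.apply_transform(transforms)
```
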
examples/image_filters_example.ipynb

Lines changed: 8 additions & 8 deletions
@@ -23,7 +23,7 @@
      "text": [
       "/home/user/conda/envs/dpf/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
       "  from .autonotebook import tqdm as notebook_tqdm\n",
-      "100%|██████████| 3/3 [00:00<00:00, 362.76it/s]\n"
+      "100%|██████████| 3/3 [00:00<00:00, 318.39it/s]\n"
      ]
     }
    ],
@@ -50,7 +50,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 3,
    "id": "0747f47c",
    "metadata": {},
    "outputs": [
@@ -72,7 +72,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 4,
    "id": "1b194714",
    "metadata": {
     "scrolled": true
@@ -206,7 +206,7 @@
       "[500 rows x 3 columns]"
      ]
     },
-    "execution_count": 6,
+    "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -478,15 +478,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 5,
    "id": "c803322d",
    "metadata": {},
    "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-     "100%|██████████| 500/500 [00:01<00:00, 391.57it/s]\n"
+     "100%|██████████| 500/500 [00:01<00:00, 345.42it/s]\n"
     ]
    }
   ],
@@ -499,7 +499,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 6,
    "id": "0003049b",
    "metadata": {},
    "outputs": [
@@ -520,7 +520,7 @@
       "Name: image_phash_8, Length: 500, dtype: object"
      ]
     },
-    "execution_count": 10,
+    "execution_count": 6,
    "metadata": {},
    "output_type": "execute_result"
   }
