DEV Community

AI Viewz

Generating Synthetic RTL OCR Data for Donut with SynthDoG-RTL

Introduction

Building OCR models for right-to-left (RTL) languages like Arabic, Urdu, Persian, or Hebrew often suffers from a lack of annotated training data. SynthDoG-RTL is a synthetic document generator adapted from Donut’s SynthDoG, extended to handle RTL text rendering correctly. In this post, we’ll walk through how advanced developers can generate large-scale synthetic datasets compatible with Donut.


What is SynthDoG-RTL?

SynthDoG (Synthetic Document Generator) was introduced with Donut to create training data on the fly for document understanding. SynthDoG-RTL extends it by:

  • Supporting RTL text direction and contextual script shaping.
  • Including sample corpora, fonts, and templates for Arabic, Urdu, Persian, Hebrew, and others.
  • Allowing custom YAML configuration for layouts, distortions, and effects.

Installation and Setup

Clone the repository and install dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL
conda create -n synthdog python=3.8 -y
conda activate synthdog
pip install synthtiger

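Proper Arabic/RTL shaping relies on libraqm, which builds on FreeType and HarfBuzz. On Debian/Ubuntu, install their development packages (and libraqm itself, if your environment does not already provide it):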

sudo apt-get install libfreetype6-dev libharfbuzz-dev 
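To confirm that shaping support is actually available before generating anything, a quick check along these lines may help. This is a minimal sketch that assumes text rendering goes through Pillow; `raqm_available` is a hypothetical helper name, while `PIL.features.check` is Pillow's own feature query.

```python
def raqm_available():
    """Return True/False for Raqm support in Pillow, or None if Pillow is not installed."""
    try:
        from PIL import features  # Pillow's feature-detection module
    except ImportError:
        return None
    # features.check("raqm") reports whether Pillow was built with Raqm support.
    return bool(features.check("raqm"))


if __name__ == "__main__":
    status = raqm_available()
    if status is None:
        print("Pillow is not installed in this environment.")
    elif status:
        print("Raqm is available: complex RTL shaping should work.")
    else:
        print("Raqm is NOT available: Arabic/Urdu text may render unshaped.")
```

If the check reports that Raqm is missing, it is worth fixing the installation before generating data, since unshaped glyphs would produce misleading training images.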

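On macOS, set the following environment variable before generating; it avoids fork-safety crashes in the worker processes: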

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES 

Preparing Resources

Each language needs:

  • Corpus: UTF-8 text file under resources/corpus/ (e.g., urdu.txt, arabic.txt).
  • Fonts: Place .ttf/.otf fonts in resources/font/<lang_code>/.
  • Backgrounds: Optional textures under resources/backgrounds/.

Example structure:

resources/
├─ corpus/
│  ├─ urdu.txt
│  └─ arabic.txt
└─ font/
   ├─ ur/
   │  └─ NotoNastaliq.ttf
   └─ ar/
      └─ NotoNaskh.ttf
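Before launching a long generation run, it can be useful to verify that this layout is in place. The following is a minimal sketch, not part of SynthDoG-RTL itself; `check_resources` is a hypothetical helper that only checks what the list above describes: a readable UTF-8 corpus and at least one .ttf/.otf font for the language.

```python
from pathlib import Path


def check_resources(root, corpus_name, lang_code):
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(root)
    problems = []

    # The corpus must exist, decode as UTF-8, and contain some text.
    corpus = root / "corpus" / corpus_name
    if not corpus.is_file():
        problems.append(f"missing corpus: {corpus}")
    else:
        try:
            text = corpus.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            problems.append(f"corpus is not valid UTF-8: {corpus}")
        else:
            if not text.strip():
                problems.append(f"corpus is empty: {corpus}")

    # At least one TrueType/OpenType font is needed for the language.
    font_dir = root / "font" / lang_code
    fonts = []
    if font_dir.is_dir():
        fonts = [p for p in font_dir.iterdir() if p.suffix.lower() in {".ttf", ".otf"}]
    if not fonts:
        problems.append(f"no .ttf/.otf fonts in: {font_dir}")

    return problems


if __name__ == "__main__":
    for issue in check_resources("resources", "urdu.txt", "ur"):
        print(issue)
```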

Configuring Generation

YAML config files (e.g., config_ur.yaml) define page size, font range, distortions, and paths.

Example Urdu config:

corpus_path: "resources/corpus/urdu.txt"
font_dir: "resources/font/ur/"
page_width: 1240
page_height: 1754
min_font_size: 20
max_font_size: 40
rotate_angle: [-2, 2]
background_dir: "resources/backgrounds/paper/"

Generating Synthetic Data

Run the CLI:

synthtiger -o ./outputs/synthdog_ur -c 1000 -w 8 -v template.py SynthDoG config_ur.yaml 

This generates 1000 samples with 8 workers, outputting images and text into ./outputs/synthdog_ur/.

Repeat with config_ar.yaml, config_fa.yaml, etc. for multiple languages.
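Running several languages by hand gets tedious, so the per-language invocations can be scripted. The sketch below simply rebuilds the command shown above for each language code; the `config_<lang>.yaml` and `synthdog_<lang>` naming follows the examples in this post, and `build_command` and `generate_all` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```python
import subprocess


def build_command(lang_code, count=1000, workers=8):
    """Build the synthtiger invocation from this post for one language code."""
    return [
        "synthtiger",
        "-o", f"./outputs/synthdog_{lang_code}",
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", f"config_{lang_code}.yaml",
    ]


def generate_all(lang_codes, dry_run=True):
    """Print (dry run) or execute one generation command per language."""
    commands = [build_command(code) for code in lang_codes]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if any generation run fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    generate_all(["ur", "ar", "fa"])
```

Pass `dry_run=False` once the printed commands look right for your setup.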


Formatting for Donut

Donut expects an image + JSON pair. Structure your dataset like:

my_dataset/
├─ train/
│  ├─ metadata.jsonl
│  ├─ 00000001.png
│  └─ ...
├─ validation/
│  └─ ...
└─ test/
   └─ ...

Each line in metadata.jsonl:

{"file_name": "00000001.png", "ground_truth": "{\"gt_parse\":{\"text_sequence\":\"یہ اردو کا متن ہے\"}}"} 

Donut will tokenize this internally. Ensure that file_name matches your image and text_sequence contains the RTL ground truth text.
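Note that ground_truth is itself a JSON string nested inside the outer JSON object, which is easy to get wrong when writing the file by hand. A minimal sketch of producing such lines is shown here; `metadata_line` and `write_metadata` are hypothetical helpers, and the (file name, text) pairs are assumed to come from whatever your generation step produced.

```python
import json
from pathlib import Path


def metadata_line(file_name, text):
    """Serialize one sample in the metadata.jsonl shape shown above."""
    # Inner object is dumped to a string first, then embedded in the outer record.
    ground_truth = json.dumps(
        {"gt_parse": {"text_sequence": text}},
        ensure_ascii=False,          # keep RTL characters readable, not \uXXXX
        separators=(",", ":"),
    )
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


def write_metadata(split_dir, samples):
    """Write metadata.jsonl for one split from (file_name, text) pairs."""
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    path = split_dir / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for file_name, text in samples:
            f.write(metadata_line(file_name, text) + "\n")
    return path


if __name__ == "__main__":
    print(metadata_line("00000001.png", "یہ اردو کا متن ہے"))
```

Using `ensure_ascii=False` together with UTF-8 file encoding keeps the ground truth human-readable when you inspect the file.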


Advanced Tips

  • Layouts: Customize template.py for multi-column, headers, or tables.
  • Effects: Add noise, blur, or perspective distortion in YAML for realism.
  • Fonts: Use multiple fonts per language to avoid overfitting.
  • Mixed Scripts: Include English corpora to simulate bilingual documents.
  • Scaling: Generate 10k–100k samples to pre-train Donut effectively.
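
For the mixed-scripts tip, one simple approach is to blend English lines into the RTL corpus before generation. The sketch below assumes a line-oriented corpus and is not a feature of SynthDoG-RTL; `mix_corpora` and its `english_ratio` parameter are hypothetical names chosen for illustration.

```python
import random


def mix_corpora(rtl_lines, english_lines, english_ratio=0.2, seed=0):
    """Return a shuffled corpus where about english_ratio of the lines are English."""
    if not 0 <= english_ratio < 1:
        raise ValueError("english_ratio must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible

    # Number of English lines needed so they form english_ratio of the result.
    n_english = 0
    if english_lines:
        n_english = round(len(rtl_lines) * english_ratio / (1 - english_ratio))

    mixed = list(rtl_lines) + [rng.choice(english_lines) for _ in range(n_english)]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rtl = ["یہ اردو کا متن ہے"] * 8
    eng = ["This is English text", "Another English line"]
    for line in mix_corpora(rtl, eng):
        print(line)
```

The mixed lines can then be written to a new corpus file and referenced from the YAML config in place of the single-language corpus.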

Conclusion

With SynthDoG-RTL you can quickly bootstrap synthetic OCR datasets for RTL languages such as Arabic, Urdu, Persian, and Hebrew. Once formatted as described above, the generated data can be used directly with Donut, letting you train or fine-tune document understanding models even in low-resource settings.

