
PP-OCRv5 Universal Text Recognition


Purpose and Scope

PP-OCRv5 Universal Text Recognition is the core pipeline in PaddleOCR 3.x for general-purpose text detection and recognition tasks. This pipeline extracts text from images and outputs it in editable text format, supporting diverse text types including printed, handwritten, vertical, and text with rare characters across multiple languages (Simplified Chinese, Traditional Chinese, English, Japanese).

Compared with PP-OCRv4, PP-OCRv5 improves overall end-to-end accuracy by 13 percentage points on multi-scenario benchmarks, while keeping inference efficient. The pipeline is designed for universal scene text recognition, covering street scenes, web images, documents, and handwritten content.

Related Pipelines:

Sources: docs/version3.x/pipeline_usage/OCR.md1-23 docs/version3.x/pipeline_usage/OCR.en.md1-25


Pipeline Architecture

Core Module Composition

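PP-OCRv5 follows a modular architecture. It has two required modules (text detection and text recognition) and three optional modules (document orientation classification, text image unwarping, and text line orientation classification).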

Sources: docs/version3.x/pipeline_usage/OCR.md15-21 docs/version3.x/pipeline_usage/OCR.en.md15-21

Module Responsibilities

| Module | Purpose | Model Options | Default Enabled |
|---|---|---|---|
| Document Orientation Classification | Detects document rotation (0°, 90°, 180°, 270°) and corrects orientation | PP-LCNet_x1_0_doc_ori | Yes |
| Text Image Unwarping | Corrects geometric distortions from photography/scanning | UVDoc | Yes |
| Text Detection | Locates text regions with bounding boxes | PP-OCRv5_server_det, PP-OCRv5_mobile_det | Required |
| Text Line Orientation | Identifies inverted text lines (0° vs 180°) | PP-LCNet_x0_25_textline_ori, PP-LCNet_x1_0_textline_ori | Yes |
| Text Recognition | Recognizes characters within detected regions | PP-OCRv5_server_rec, PP-OCRv5_mobile_rec | Required |

Sources: docs/version3.x/pipeline_usage/OCR.md27-116 docs/version3.x/pipeline_usage/OCR.en.md27-115
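The execution order implied by the table can be sketched as a small function. The flag names match the PaddleOCR 3.x constructor arguments use_doc_orientation_classify, use_doc_unwarping and use_textline_orientation. The stage labels and the helper active_stages are hypothetical and exist only for this illustration.

```python
from typing import List


def active_stages(
    use_doc_orientation_classify: bool = True,
    use_doc_unwarping: bool = True,
    use_textline_orientation: bool = True,
) -> List[str]:
    """Return the order in which the PP-OCRv5 modules run for the given flags.

    Hypothetical helper: the stage labels are descriptive strings, not library names.
    """
    stages: List[str] = []
    # Optional whole-image preprocessing, applied before detection.
    if use_doc_orientation_classify:
        stages.append("doc_orientation")
    if use_doc_unwarping:
        stages.append("doc_unwarping")
    # Required: locate text regions on the (possibly corrected) image.
    stages.append("text_detection")
    # Optional: fix upside-down (180 degree) text line crops before reading them.
    if use_textline_orientation:
        stages.append("textline_orientation")
    # Required: read the characters in each cropped region.
    stages.append("text_recognition")
    return stages
```

With all three optional modules disabled, only text_detection and text_recognition remain. This is the "core only" path.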


Model Variants and Performance

Detection Models

PP-OCRv5 provides two detection model variants optimized for different deployment scenarios:

| Model | Hmean (%) | GPU Time (ms) | CPU Time (ms) | Size (MB) | Use Case |
|---|---|---|---|---|---|
| PP-OCRv5_server_det | 83.8 | 89.55 / 70.19 | 383.15 | 84.3 | High-accuracy server deployment |
| PP-OCRv5_mobile_det | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | Edge device deployment |

Sources: docs/version3.x/pipeline_usage/OCR.md133-148 docs/version3.x/pipeline_usage/OCR.en.md132-147
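The trade-off in the table can be made concrete by computing how much accuracy the server model buys relative to its extra cost. The numbers are copied from the table, using standard-mode GPU time. The DetModel class and the compare helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DetModel:
    name: str
    hmean: float    # Hmean (%) from the detection table
    gpu_ms: float   # standard-mode GPU time (ms)
    size_mb: float  # model size (MB)


SERVER = DetModel("PP-OCRv5_server_det", 83.8, 89.55, 84.3)
MOBILE = DetModel("PP-OCRv5_mobile_det", 79.0, 10.67, 4.7)


def compare(a: DetModel, b: DetModel) -> Dict[str, float]:
    """Accuracy gain of `a` over `b`, and how many times slower/larger `a` is."""
    return {
        "hmean_gain_pp": round(a.hmean - b.hmean, 2),
        "gpu_slowdown_x": round(a.gpu_ms / b.gpu_ms, 2),
        "size_ratio_x": round(a.size_mb / b.size_mb, 2),
    }


print(compare(SERVER, MOBILE))
```

With these values, the server detector scores about 4.8 Hmean points higher. It takes roughly 8.4 times the standard-mode GPU time and about 18 times the disk space.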

Recognition Models

PP-OCRv5 recognition models support multi-language and multi-scenario text:

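| Model | Avg Acc (%) | Chinese (%) | English (%) | Traditional Chinese (%) | Japanese (%) | GPU Time (ms) | CPU Time (ms) | Size (MB) |
|---|---|---|---|---|---|---|---|---|
| PP-OCRv5_server_rec | 86.38 | 86.38 | 64.70 | 93.29 | 60.35 | 8.46 / 2.36 | 31.21 | 81 |
| PP-OCRv5_mobile_rec | 81.29 | 81.29 | 66.00 | 83.55 | 54.65 | 5.43 / 1.46 | 21.20 / 5.32 | 16 |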

Key Features:

  • Single model supports 4 major languages natively
  • Handles complex scenarios: handwriting, vertical text, pinyin, rare characters (15,000+ characters)
  • Balanced inference speed and model robustness
  • Server model prioritizes accuracy; mobile model prioritizes efficiency

Sources: docs/version3.x/pipeline_usage/OCR.md184-284 docs/version3.x/pipeline_usage/OCR.en.md183-283


Supported Text Types and Scenarios

Text Type Coverage

Sources: docs/version3.x/pipeline_usage/OCR.md190-191 docs/version3.x/pipeline_usage/OCR.en.md190-191

Language Support

PP-OCRv5 models natively support:

  1. Simplified Chinese (86.38% server / 81.29% mobile)
  2. Traditional Chinese (93.29% server / 83.55% mobile)
  3. English (64.70% server / 66.00% mobile)
  4. Japanese (60.35% server / 54.65% mobile)

Additional language-specific models are available for Korean, Latin, Cyrillic, Arabic, Devanagari, Thai, Greek, and other scripts.

Sources: docs/version3.x/pipeline_usage/OCR.md246-632 docs/version3.x/pipeline_usage/OCR.en.md245-631
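The per-language figures above can be compared directly to see where the mobile recognizer gives up the most accuracy. Values are copied from the list. ACCURACY and server_minus_mobile are hypothetical names.

```python
from typing import Dict, Tuple

# Recognition accuracy (%) per language: (server, mobile), from the list above.
ACCURACY: Dict[str, Tuple[float, float]] = {
    "Simplified Chinese": (86.38, 81.29),
    "Traditional Chinese": (93.29, 83.55),
    "English": (64.70, 66.00),
    "Japanese": (60.35, 54.65),
}


def server_minus_mobile(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Accuracy gap in percentage points; negative means the mobile model is better."""
    return {lang: round(s - m, 2) for lang, (s, m) in table.items()}


gaps = server_minus_mobile(ACCURACY)
for lang, gap in gaps.items():
    print(f"{lang}: {gap:+.2f} pp")
```

English is the one language where the mobile model scores higher, by 1.3 points. The largest gap is in Traditional Chinese, at 9.74 points.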


Pipeline Integration

Code Entry Points

The PP-OCRv5 pipeline can be accessed through the paddleocr command-line interface and through the Python API.

Sources: docs/version3.x/pipeline_usage/OCR.md709-1057 docs/version3.x/pipeline_usage/OCR.en.md708-1056

Command-Line Usage

Basic invocation with default PP-OCRv5 models:

Version selection:

Sources: docs/version3.x/pipeline_usage/OCR.md711-724 docs/version3.x/pipeline_usage/OCR.en.md710-723
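The command-line examples are not reproduced here. As a hedged sketch, the helper below assembles a paddleocr ocr invocation in Python and runs it only if the executable is on PATH.

It assumes the PaddleOCR 3.x CLI accepts:
- -i for the input;
- --ocr_version for version selection;
- the boolean module switches shown.

Check paddleocr ocr --help for the options of your installed release. build_ocr_cli, as_command_line and run_if_installed are hypothetical helpers.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional

# Assumed CLI switches for the three optional modules.
AUX_FLAGS = (
    "--use_doc_orientation_classify",
    "--use_doc_unwarping",
    "--use_textline_orientation",
)


def build_ocr_cli(image: str, ocr_version: Optional[str] = None,
                  disable_aux_modules: bool = True) -> List[str]:
    """Build an argv list for the (assumed) `paddleocr ocr` subcommand."""
    argv = ["paddleocr", "ocr", "-i", image]
    if disable_aux_modules:
        # Turn off the optional modules so only detection + recognition run.
        for flag in AUX_FLAGS:
            argv += [flag, "False"]
    if ocr_version is not None:
        # e.g. "PP-OCRv4" to select the previous model generation.
        argv += ["--ocr_version", ocr_version]
    return argv


def as_command_line(argv: List[str]) -> str:
    """Render argv as a shell-safe string for copying into a terminal."""
    return shlex.join(argv)


def run_if_installed(argv: List[str]) -> Optional[int]:
    """Run the command if the executable exists; otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=True).returncode
```

For example, as_command_line(build_ocr_cli("demo.png", ocr_version="PP-OCRv4")) produces a shell-ready command that selects the PP-OCRv4 models instead of the default PP-OCRv5 ones.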

Python API Usage

Configuration Parameters:

  • text_detection_model_name: Model name (e.g., PP-OCRv5_server_det)
  • text_recognition_model_name: Model name (e.g., PP-OCRv5_server_rec)
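  • text_det_limit_side_len: Side-length limit applied when resizing the input image for detection (default: 64)
  • text_det_thresh: Probability threshold above which a pixel of the detection map counts as text (default: 0.3)
  • text_det_box_thresh: Minimum mean probability inside a detected box for the box to be kept (default: 0.6)
  • text_rec_score_thresh: Recognition results scoring below this value are discarded (default: 0.0, i.e., no filtering)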

Sources: docs/version3.x/pipeline_usage/OCR.md727-1057 docs/version3.x/pipeline_usage/OCR.en.md726-1056
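The parameters above can be passed to the PaddleOCR 3.x Python API, as in this minimal sketch. It assumes that:
- the PaddleOCR class accepts these names as keyword arguments;
- predict() yields results with print() and save_to_json() methods.

paddleocr is imported inside run_ocr so the module loads without it installed. ocr_kwargs and run_ocr are hypothetical helpers.

```python
from typing import Any, Dict


def ocr_kwargs(server: bool = True) -> Dict[str, Any]:
    """Constructor kwargs for a PP-OCRv5 pipeline (server or mobile tier)."""
    tier = "server" if server else "mobile"
    return {
        "text_detection_model_name": f"PP-OCRv5_{tier}_det",
        "text_recognition_model_name": f"PP-OCRv5_{tier}_rec",
        # The defaults listed above, spelled out explicitly for tuning.
        "text_det_limit_side_len": 64,
        "text_det_thresh": 0.3,
        "text_det_box_thresh": 0.6,
        "text_rec_score_thresh": 0.0,
    }


def run_ocr(image_path: str, out_dir: str = "output", server: bool = True,
            **overrides: Any) -> None:
    """Run OCR on one image; keyword overrides replace the defaults above."""
    from paddleocr import PaddleOCR  # lazy import: only needed when actually running

    ocr = PaddleOCR(**{**ocr_kwargs(server), **overrides})
    for res in ocr.predict(input=image_path):
        res.print()                # dump recognized text and scores to stdout
        res.save_to_json(out_dir)  # persist structured results
```

For example, run_ocr("demo.png", server=False, text_det_box_thresh=0.5) would use the mobile models with a slightly more permissive box filter.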


Model Architecture and Training

Detection Architecture (DB Algorithm)

PP-OCRv5 detection models use the Differentiable Binarization (DB) architecture:

Key Components:

  • Backbone: Extracts multi-scale features (MobileNetV3 for mobile, ResNet for server)
  • Neck: Fuses features across scales using FPN or LKPAN
  • Head: Predicts probability map (text/non-text) and threshold map for binarization
  • Post-processing: Converts probability maps to polygons with configurable unclip_ratio

Sources: docs/version3.x/pipeline_usage/OCR.md841-876
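The post-processing step can be sketched with the standard library. At inference, pixels are binarized against text_det_thresh. A candidate box is discarded if its mean probability is below text_det_box_thresh. Each kept polygon is then expanded outward by D = A · unclip_ratio / L, where A is the polygon area and L its perimeter (the DB paper's unclip formula).

This is a simplified sketch, not PaddleOCR's implementation:
- box_score averages over an axis-aligned box rather than the polygon mask;
- the actual polygon offsetting is left to a library such as pyclipper.

All function names are hypothetical. The training-time approximate step function from the DB paper is included for reference.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Grid = Sequence[Sequence[float]]


def binarize(prob: Grid, thresh: float) -> List[List[int]]:
    """Inference-time hard binarization: a pixel is text if prob > thresh."""
    return [[1 if p > thresh else 0 for p in row] for row in prob]


def approx_step(p: float, t: float, k: float = 50.0) -> float:
    """DB's differentiable binarization 1 / (1 + exp(-k * (P - T))), used in training."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def box_score(prob: Grid, x0: int, y0: int, x1: int, y1: int) -> float:
    """Mean probability inside an axis-aligned box with inclusive bounds."""
    vals = [prob[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    return sum(vals) / len(vals)


def area_and_perimeter(poly: Sequence[Point]) -> Tuple[float, float]:
    """Shoelace area and perimeter of a simple polygon."""
    twice_area = 0.0
    perimeter = 0.0
    for i, (xa, ya) in enumerate(poly):
        xb, yb = poly[(i + 1) % len(poly)]
        twice_area += xa * yb - xb * ya
        perimeter += math.hypot(xb - xa, yb - ya)
    return abs(twice_area) / 2.0, perimeter


def unclip_distance(poly: Sequence[Point], unclip_ratio: float) -> float:
    """Outward offset D = A * unclip_ratio / L applied to a shrunk text kernel."""
    area, perimeter = area_and_perimeter(poly)
    return area * unclip_ratio / perimeter
```

For a 100 × 20 kernel with unclip_ratio = 1.5, the offset is 2000 · 1.5 / 240 = 12.5 pixels. A larger unclip_ratio produces looser boxes around each text line.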

Recognition Architecture (SVTR)

PP-OCRv5 recognition models are based on SVTR, the architecture introduced in "SVTR: Scene Text Recognition with a Single Visual Model".

Character Set:

  • PP-OCRv5 supports 15,000+ characters including:
    • Common Chinese characters
    • Traditional Chinese characters
    • English letters and digits
    • Japanese hiragana, katakana, kanji
    • Pinyin with tone marks
    • Rare and ancient characters

Sources: docs/version3.x/pipeline_usage/OCR.md184-284 docs/version3.x/pipeline_usage/OCR.en.md183-283
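The recognizer maps its per-time-step outputs onto the character set above. This sketch assumes a CTC decoding head at inference, as in earlier PP-OCR recognizers. Greedy CTC decoding takes the most likely class at each time step, merges consecutive repeats, and drops the blank symbol.

A three-symbol toy alphabet stands in for the full 15,000+ character dictionary. ctc_greedy_decode is a hypothetical helper. Averaging the kept probabilities is one common way to produce a confidence comparable to text_rec_score_thresh.

```python
from typing import List, Optional, Sequence, Tuple


def ctc_greedy_decode(probs: Sequence[Sequence[float]], charset: Sequence[str],
                      blank: int = 0) -> Tuple[str, float]:
    """Greedy CTC decoding over a (time_steps x num_classes) probability matrix.

    Returns the decoded text and the mean probability of the kept characters.
    """
    # Most likely class index at every time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    chars: List[str] = []
    scores: List[float] = []
    prev: Optional[int] = None
    for t, idx in enumerate(best):
        # Keep a symbol only if it is not blank and not a repeat of the previous step.
        if idx != blank and idx != prev:
            chars.append(charset[idx])
            scores.append(probs[t][idx])
        prev = idx
    confidence = sum(scores) / len(scores) if scores else 0.0
    return "".join(chars), confidence


# Toy alphabet: index 0 is the CTC blank.
CHARSET = ["<blank>", "a", "b"]
PROBS = [
    [0.05, 0.90, 0.05],  # a
    [0.10, 0.80, 0.10],  # a (repeat, merged)
    [0.70, 0.20, 0.10],  # blank separates the two a's
    [0.30, 0.60, 0.10],  # a (new character)
    [0.02, 0.03, 0.95],  # b
]
text, conf = ctc_greedy_decode(PROBS, CHARSET)
print(text, round(conf, 3))
```

The blank at step three is what lets the decoder emit "aa". Without it, the repeated a's would collapse into a single character.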


Performance Optimization

Inference Modes

PP-OCRv5 supports two inference modes with different performance characteristics:

| Mode | Configuration | Use Case |
|---|---|---|
| Standard Mode | FP32 precision, no acceleration | Development, debugging |
| High-Performance Mode | TensorRT (GPU) / MKL-DNN (CPU), with automatically selected precision | Production deployment |

Enabling High-Performance Mode:

Sources: docs/version3.x/pipeline_usage/OCR.md670-696 docs/version3.x/pipeline_usage/OCR.en.md669-695
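High-performance mode is switched on with a constructor flag. This sketch assumes the PaddleOCR 3.x enable_hpi keyword, with device selecting the hardware. It also assumes the extra high-performance inference dependencies are already installed. hpi_kwargs and build_hpi_pipeline are hypothetical helpers.

```python
from typing import Any, Dict


def hpi_kwargs(base: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of constructor kwargs with high-performance inference enabled."""
    cfg = dict(base)          # copy so the caller's dict is not mutated
    cfg["enable_hpi"] = True  # assumed PaddleOCR 3.x switch for high-performance mode
    return cfg


def build_hpi_pipeline(**base: Any):
    """Construct a pipeline in high-performance mode (requires paddleocr)."""
    from paddleocr import PaddleOCR  # lazy import

    return PaddleOCR(**hpi_kwargs(base))
```

For example, build_hpi_pipeline(device="gpu:0") would request the TensorRT-backed path on the first GPU. The backend and precision are chosen automatically.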

Batch Processing

Recognition module supports batch processing for improved throughput:

Trade-offs:

  • Larger batch size → Higher throughput, higher memory usage
  • Smaller batch size → Lower latency, lower memory usage

Sources: docs/version3.x/pipeline_usage/OCR.md799-820
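Batching groups the cropped text regions produced by detection before they are sent to the recognizer. The sketch below shows the grouping and how many forward passes result. In PaddleOCR 3.x, the corresponding setting is assumed to be the text_recognition_batch_size constructor argument. batched and num_forward_passes are hypothetical helpers.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `batch_size` items (the last may be short)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def num_forward_passes(num_regions: int, batch_size: int) -> int:
    """Recognizer calls needed for `num_regions` crops (ceiling division)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return -(-num_regions // batch_size)


crops = [f"region_{i}" for i in range(5)]
for batch in batched(crops, 2):
    print(batch)
```

With 23 detected regions, a batch size of 1 means 23 recognizer calls, while a batch size of 6 means 4. Each larger batch holds more crops in memory at once.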

Hardware Acceleration

Supported acceleration backends:

Configuration:

Sources: docs/version3.x/pipeline_usage/OCR.md995-1050 docs/version3.x/pipeline_usage/OCR.en.md994-1049
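The backend list and configuration examples are not reproduced here. As a hedged sketch, the helper below builds device-related keyword arguments.

It assumes that:
- PaddleOCR 3.x accepts a device string such as "cpu" or "gpu:0";
- other accelerator prefixes (npu, xpu, mlu, dcu) follow the same "name:index" pattern;
- cpu_threads and enable_mkldnn are only meaningful for CPU inference.

device_kwargs and the accepted prefix set are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Assumed device-string format: "<backend>" or "<backend>:<index>".
_DEVICE_RE = re.compile(r"^(cpu|gpu|npu|xpu|mlu|dcu)(:\d+)?$")


def device_kwargs(device: str, cpu_threads: Optional[int] = None,
                  enable_mkldnn: Optional[bool] = None) -> Dict[str, Any]:
    """Validate a device string and attach CPU-only tuning options when relevant."""
    if not _DEVICE_RE.match(device):
        raise ValueError(f"unrecognized device string: {device!r}")
    cfg: Dict[str, Any] = {"device": device}
    if device.startswith("cpu"):
        # CPU-side options; ignored here for accelerator devices.
        if cpu_threads is not None:
            cfg["cpu_threads"] = cpu_threads
        if enable_mkldnn is not None:
            cfg["enable_mkldnn"] = enable_mkldnn
    return cfg
```

The result is meant to be merged into the other constructor kwargs. CPU tuning options are attached only when the device is a CPU.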


Comparison with Previous Versions

Accuracy Improvements

| Metric | PP-OCRv4 | PP-OCRv5 | Improvement (percentage points) |
|---|---|---|---|
| Detection Hmean (server) | 69.2% | 83.8% | +14.6 |
| Detection Hmean (mobile) | 63.8% | 79.0% | +15.2 |
| Recognition Accuracy (server) | 85.19% | 86.38% | +1.19 |
| Recognition Accuracy (mobile) | 78.74% | 81.29% | +2.55 |
| Overall End-to-End | Baseline | n/a | +13 |

Sources: docs/version3.x/pipeline_usage/OCR.md11 docs/version3.x/pipeline_usage/OCR.en.md11
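The Improvement column reports absolute differences in percentage points. Relative gains put the recognition rows in a different light, as this small calculation on the table values shows. ROWS and gains are hypothetical names.

```python
from typing import Dict, Tuple

# (PP-OCRv4, PP-OCRv5) values in %, copied from the table above.
ROWS: Dict[str, Tuple[float, float]] = {
    "Detection Hmean (server)": (69.2, 83.8),
    "Detection Hmean (mobile)": (63.8, 79.0),
    "Recognition Accuracy (server)": (85.19, 86.38),
    "Recognition Accuracy (mobile)": (78.74, 81.29),
}


def gains(old: float, new: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = new - old
    relative = absolute / old * 100.0
    return round(absolute, 2), round(relative, 1)


for metric, (v4, v5) in ROWS.items():
    pp, rel = gains(v4, v5)
    print(f"{metric}: +{pp} pp absolute, +{rel}% relative")
```

Server detection improves by 14.6 points, a 21.1% relative gain. Server recognition improves by 1.19 points, only about 1.4% relative, because it started from a much higher baseline.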

Model Selection Guide

Recommendations:

  • Production servers: Use PP-OCRv5_server_det + PP-OCRv5_server_rec
  • Edge devices: Use PP-OCRv5_mobile_det + PP-OCRv5_mobile_rec
  • Legacy compatibility: Use PP-OCRv4 or PP-OCRv3 versions

Sources: docs/version3.x/pipeline_usage/OCR.md701-702


Integration with Other Pipelines

Usage in PP-StructureV3

PP-StructureV3 uses PP-OCRv5 as a sub-pipeline for text extraction within document regions:

Sources: docs/version3.x/pipeline_usage/PP-StructureV3.md11-20 docs/version3.x/pipeline_usage/PP-StructureV3.en.md11-20

Usage in PP-ChatOCRv4

PP-ChatOCRv4 leverages PP-OCRv5 for text extraction before LLM processing:

Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md15-26 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md13-24


Testing and Benchmarking

Test Environment Specifications

Hardware:

  • GPU: NVIDIA Tesla T4
  • CPU: Intel Xeon Gold 6271C @ 2.60GHz

Software:

  • OS: Ubuntu 20.04
  • CUDA: 11.8
  • cuDNN: 8.9
  • TensorRT: 8.6.1.6
  • PaddlePaddle: 3.0.0
  • PaddleOCR: 3.0.3

Test Datasets:

  • Detection: Custom dataset with 500 images (street scenes, web images, documents, handwriting)
  • Recognition: Custom dataset with 11,000 images (multi-scenario Chinese text)

Sources: docs/version3.x/pipeline_usage/OCR.md637-667 docs/version3.x/pipeline_usage/OCR.en.md636-666

Performance Metrics

Inference Time Breakdown:

| Component | Server Model (ms) | Mobile Model (ms) |
|---|---|---|
| Document Orientation | 2.62 / 0.59 | 2.62 / 0.59 |
| Text Unwarping | 19.05 | 19.05 |
| Text Detection | 89.55 / 70.19 | 10.67 / 6.36 |
| Line Orientation | 2.16 / 0.41 | 2.16 / 0.41 |
| Text Recognition (per region) | 8.46 / 2.36 | 5.43 / 1.46 |

Format: Standard Mode / High-Performance Mode

Sources: docs/version3.x/pipeline_usage/OCR.md29-240 docs/version3.x/pipeline_usage/OCR.en.md28-239
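A back-of-the-envelope end-to-end estimate can be assembled from the table. The sketch makes these assumptions:
- stages run sequentially, with one text region per recognition call;
- Line Orientation is a per-region cost like Text Recognition;
- the single Text Unwarping value applies to both modes.

It ignores cropping, I/O and batching, so it is not a measured figure. STAGE_MS and estimate_ms are hypothetical names.

```python
from typing import Dict, Tuple

# Per-stage times in ms as (standard, high_performance), copied from the table above.
# Assumption: Text Unwarping lists one value, so it is reused for both modes.
STAGE_MS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "server": {
        "doc_orientation": (2.62, 0.59),
        "doc_unwarping": (19.05, 19.05),
        "text_detection": (89.55, 70.19),
        "textline_orientation": (2.16, 0.41),
        "text_recognition": (8.46, 2.36),
    },
    "mobile": {
        "doc_orientation": (2.62, 0.59),
        "doc_unwarping": (19.05, 19.05),
        "text_detection": (10.67, 6.36),
        "textline_orientation": (2.16, 0.41),
        "text_recognition": (5.43, 1.46),
    },
}

PER_IMAGE = ("doc_orientation", "doc_unwarping", "text_detection")
PER_REGION = ("textline_orientation", "text_recognition")
OPTIONAL = {"doc_orientation", "doc_unwarping", "textline_orientation"}


def estimate_ms(tier: str, num_regions: int, hpi: bool = False,
                use_aux_modules: bool = True) -> float:
    """Rough sequential latency: per-image stages once, per-region stages per region."""
    mode = 1 if hpi else 0
    times = STAGE_MS[tier]
    total = 0.0
    for stage in PER_IMAGE:
        if use_aux_modules or stage not in OPTIONAL:
            total += times[stage][mode]
    for stage in PER_REGION:
        if use_aux_modules or stage not in OPTIONAL:
            total += num_regions * times[stage][mode]
    return round(total, 2)
```

For an image with 10 text regions, this gives about 217 ms for the server models in standard mode. The mobile models in high-performance mode come to about 45 ms, of which roughly 19 ms is unwarping.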


Common Configuration Patterns

Minimal Configuration (Core Only)

For basic text extraction with minimal preprocessing:

Sources: docs/version3.x/pipeline_usage/OCR.md715-720
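The original example is not reproduced here. This minimal sketch assumes the PaddleOCR 3.x constructor flags below and disables all three optional modules, so only detection and recognition run. build_minimal_pipeline is a hypothetical helper that imports paddleocr lazily.

```python
from typing import Any, Dict

# Core-only pipeline: skip document orientation, unwarping and text line orientation.
MINIMAL_KWARGS: Dict[str, Any] = {
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "use_textline_orientation": False,
}


def build_minimal_pipeline():
    """Construct a detection + recognition only pipeline (requires paddleocr)."""
    from paddleocr import PaddleOCR  # lazy import

    return PaddleOCR(**MINIMAL_KWARGS)
```

This configuration suits upright, flat inputs such as screenshots or clean scans. For those, the skipped modules would add latency without changing the result.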

High-Accuracy Configuration

For maximum accuracy on server hardware:

Sources: docs/version3.x/pipeline_usage/OCR.md750-890
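For the high-accuracy case, one reasonable sketch selects the server models and keeps all optional modules enabled. The model names come from the tables above. The keyword names and the "gpu:0" device string are assumptions about the PaddleOCR 3.x constructor, and HIGH_ACCURACY_KWARGS is a hypothetical name.

```python
from typing import Any, Dict

HIGH_ACCURACY_KWARGS: Dict[str, Any] = {
    # Highest-accuracy PP-OCRv5 models from the detection/recognition tables.
    "text_detection_model_name": "PP-OCRv5_server_det",
    "text_recognition_model_name": "PP-OCRv5_server_rec",
    # Keep every correction step for rotated, curved or upside-down inputs.
    "use_doc_orientation_classify": True,
    "use_doc_unwarping": True,
    "use_textline_orientation": True,
    # Assumed device string for the first GPU.
    "device": "gpu:0",
}


def build_high_accuracy_pipeline():
    """Construct the server-tier pipeline (requires paddleocr)."""
    from paddleocr import PaddleOCR  # lazy import

    return PaddleOCR(**HIGH_ACCURACY_KWARGS)
```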

Mobile/Edge Configuration

For deployment on resource-constrained devices:

Sources: docs/version3.x/pipeline_usage/OCR.md142-148
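For edge deployment, a sketch along the same lines pairs the mobile models with CPU inference and skips the optional modules. Per the timing table, Text Unwarping alone costs about 19 ms per image. The keyword names are assumed from the PaddleOCR 3.x constructor. cpu_threads=4 is an illustrative value, not a recommendation from the source.

```python
from typing import Any, Dict

EDGE_KWARGS: Dict[str, Any] = {
    # Smallest PP-OCRv5 models (4.7 MB detector, 16 MB recognizer per the tables).
    "text_detection_model_name": "PP-OCRv5_mobile_det",
    "text_recognition_model_name": "PP-OCRv5_mobile_rec",
    # Skip optional modules to save per-image and per-region latency.
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "use_textline_orientation": False,
    # CPU inference; thread count is an illustrative value to tune per device.
    "device": "cpu",
    "cpu_threads": 4,
}


def build_edge_pipeline():
    """Construct the mobile-tier CPU pipeline (requires paddleocr)."""
    from paddleocr import PaddleOCR  # lazy import

    return PaddleOCR(**EDGE_KWARGS)
```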

Multi-Language Configuration

For specific language support:

Sources: docs/version3.x/pipeline_usage/OCR.md904-920 docs/version3.x/pipeline_usage/OCR.md432-632
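Language selection is typically done through the lang argument. The sketch below maps a few languages to long-standing PaddleOCR lang codes. The mapping and multilang_kwargs are hypothetical, and which model a given code resolves to depends on the installed PaddleOCR release. The default PP-OCRv5 recognizer already covers Simplified Chinese, Traditional Chinese, English and Japanese, so lang is likely to matter most for other scripts such as Korean.

```python
from typing import Any, Dict

# Assumed PaddleOCR `lang` codes for a few languages.
LANG_CODES: Dict[str, str] = {
    "Simplified Chinese": "ch",
    "Traditional Chinese": "chinese_cht",
    "English": "en",
    "Japanese": "japan",
    "Korean": "korean",
}


def multilang_kwargs(language: str) -> Dict[str, Any]:
    """Constructor kwargs selecting a language; raises for unmapped languages."""
    try:
        code = LANG_CODES[language]
    except KeyError:
        raise ValueError(f"no lang code mapped for {language!r}") from None
    return {"lang": code}


def build_multilang_pipeline(language: str):
    """Construct a pipeline for `language` (requires paddleocr)."""
    from paddleocr import PaddleOCR  # lazy import

    return PaddleOCR(**multilang_kwargs(language))
```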