Author: 陈晓雪
- 1. Datasets
- 2. Summary of Scene Text Recognition Results
- 3. Field Survey
- 4. OCR Service
- 5. References
- IIIT5K[31]:
- Introduction: There are 5000 cropped word images in total, 2000 for training and 3000 for testing. Text instances are mostly horizontal. Each image is associated with a short 50-word lexicon and a long 1000-word lexicon. (A lexicon consists of the ground-truth word plus other random words; a sketch of lexicon-constrained recognition is given after this list.)
- Link: IIIT5K-download
- SVT[32]:
- Introduction: There are 647 cropped word images for testing. Text instances are mostly horizontal. Many images are severely corrupted by noise, blur, and low resolution. SVT was collected from Google Street View, and each image is associated with a 50-word lexicon. Only word-level annotations are provided.
- Link: SVT-download
- ICDAR 2003(IC03)[33]:
- Introduction: There are 509 images in total, 258 for training and 251 for testing. After filtering, 867 cropped word images remain for testing. Text instances are mostly horizontal. Each image is associated with a 50-word lexicon and a full lexicon. (The full lexicon combines the lexicon words of all images.)
- Link: IC03-download
- ICDAR 2013(IC13)[34]:
- Introduction: There are 1015 cropped word images. Most images in IC13 are inherited from IC03. Text instances are mostly horizontal. No lexicon is provided.
- Link: IC13-download
- SVHN[45]:
- Introduction: There are 600000 cropped digits of house numbers in natural scenes. The digits are mostly horizontal. SVHN was collected from Google Street View images and is used for digit recognition.
- Link: SVHN-download
- SVT-P[35]:
- Introduction: There are 639 cropped word images for testing. Many images are heavily distorted by non-frontal view angles. SVT-P was collected from side-view images in Google Street View. Each image is associated with a 50-word lexicon and a full lexicon.
- Link: SVT-P-download (Extraction code: vnis)
- CUTE80[36]:
- Introduction: There are 80 high-resolution images taken in natural scenes. After filtering, 288 cropped word images remain for testing; the dataset focuses on curved text. No lexicon is provided.
- Link: CUTE80-download
- ICDAR 2015(IC15)[37]:
- Introduction: There are 1500 images in total, 1000 for training and 500 for testing. It contains 2077 cropped word images for testing, including more than 200 irregular text instances. No lexicon is provided.
- Link: IC15-download
- COCO-Text[38]:
- Introduction: There are 63686 images in total, containing 145859 cropped word images that include handwritten and printed, clear and blurred, English and non-English text.
- Link: COCO-Text-download
- Total-Text[39]:
- Introduction: There are 1555 images in total, containing 11459 cropped word images with three different text orientations: horizontal, multi-oriented, and curved.
- Link: Total-Text-download
- CTW-1500[43]:
- Introduction: There are 1500 images in total, 1000 for training and 500 for testing. It contains 10751 cropped word images. Text instances are multi-oriented and curved. The main languages are Chinese and English.
- Link: CTW-1500-download
- CTW-12k(RCTW competition,ICDAR17)[40]:
- Introduction: There are 12514 images in total, 11514 for training and 1000 for testing. Most images in CTW-12k were captured by camera or mobile phone; the others are born-digital. Each image contains one or more text lines. The competition tasks include text detection and end-to-end text recognition.
- Link: CTW-12K-download
- MTWI(competition)[41]:
- Introduction: There are 20000 images. Text instances are mainly Chinese or English web text. The competition tasks include web text recognition, web text detection, and end-to-end web text detection and recognition.
- Link: MTWI-download (Extraction code: gox9)
- CTW[42]:
- Introduction: There are 32285 high-resolution street view images of Chinese text, with 1018402 character instances in total. CTW contains planar text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc.
- Link: CTW-download
- Synth90k [53] :
- Introduction: There are 9 million cropped word images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects.
- Link: Synth90k-download
- SynthText [54] :
- Introduction: There are 6 million cropped word images. The generation process is similar to that of Synth90k.
- Link: SynthText-download
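Several of the datasets above provide lexicons. Under the standard lexicon-constrained evaluation protocol, the raw prediction is replaced by the lexicon word closest to it in edit distance. Below is a minimal Python sketch of that protocol; the function names and the toy lexicon are illustrative, not part of any dataset toolkit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def constrain_to_lexicon(prediction: str, lexicon: list) -> str:
    """Map a raw prediction to the closest lexicon word (case-insensitive)."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))

# A misread word is corrected to the nearest entry of a (toy) 50-word lexicon.
print(constrain_to_lexicon("hcuse", ["house", "mouse", "horse", "hose"]))  # house
```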
**Comparison of Datasets**

| Dataset | Language | Pictures | Instances | Training Pictures | Training Instances | Testing Pictures | Testing Instances | Lexicon: 50 | Lexicon: 1k | Lexicon: Full | Lexicon: None | Label: Char | Label: Word | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IIIT5K[31] | English | 5000 | 5000 | 2000 | 2000 | 3000 | 3000 | √ | √ | × | × | √ | √ | regular |
| SVT[32] | English | 350 | - | 100 | - | 250 | 647 | √ | × | × | × | × | √ | regular |
| IC03[33] | English | 509 | - | 258 | - | 251 | 867 | √ | √ | √ | × | √ | √ | regular |
| IC13[34] | English | 462 | - | 229 | - | 233 | 1015 | × | × | × | √ | √ | √ | regular |
| SVHN[45] | Digits | 600000 | 600000 | 573968 | 573968 | 26032 | 26032 | × | × | × | √ | √ | √ | regular |
| SVT-P[35] | English | 238 | 639 | - | - | 238 | 639 | √ | × | √ | × | × | √ | irregular |
| CUTE80[36] | English | 80 | 288 | - | - | 80 | 288 | × | × | × | √ | × | √ | irregular |
| IC15[37] | English | 1500 | - | 1000 | - | 500 | 2077 | × | × | × | √ | × | √ | irregular |
| COCO-Text[38] | English | 63686 | 145859 | 43686 | 118309 | 2000 | 27550 | × | × | × | √ | × | √ | regular |
| Total-Text[39] | English | 1555 | 11459 | - | - | 1555 | 11459 | × | × | × | √ | × | √ | irregular |
| CTW-1500[43] | Chinese/English | 1500 | 10751 | 1000 | - | 500 | - | × | × | × | √ | × | √ | irregular |
| CTW-12K[40] | Chinese | 12514 | - | 11514 | - | 1000 | - | × | × | × | √ | × | √ | regular |
| MTWI[41] | Chinese | 20000 | - | 10000 | - | 10000 | - | × | × | × | √ | × | √ | regular |
| CTW[42] | Chinese | 32285 | 1018402 | 25887 | 812872 | 3269 | 103519 | × | × | × | √ | √ | √ | regular |
As shown in the table "Summary of Scene Text Recognition Results" below, we summarize the main recognition algorithms in the community from 2011 to the present. The table covers sources, highlights, code availability, method types, and recognition performance. A '*' before a method name indicates the use of extra datasets. Bold numbers represent the highest recognition results; '^' marks the highest result obtained with extra datasets; '@' marks a different evaluation protocol that uses only 1811 test images. (A minimal sketch of CTC decoding, a transcription scheme used by several methods below, follows the table.)
**Summary of Scene Text Recognition Results**

| Method | Highlight | Code | Regular | Irregular | Segmentation | Extra data | CTC | Attention | IIIT5K-50 | IIIT5K-1k | IIIT5K-None | SVT-50 | SVT-None | IC03-50 | IC03-Full | IC03-50k | IC03-None | IC13-None | SVT-P-50 | SVT-P-Full | SVT-P-None | CUTE80-None | IC15-None | COCO-None | Time | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wang et al. [1]: ABBYY | Authors proposed a two-stage text recognition system consisting of a state-of-the-art text detector followed by a leading OCR engine. | √ | √ | × | √ | × | × | × | 24.3 | - | - | 35.0 | - | 56.0 | 55.0 | - | - | - | 40.5 | 26.1 | - | - | - | - | 2011 | ICCV |
| Wang et al. [1]: SYNTH+PLEX | Authors established a baseline for scene text recognition, showing that an object-recognition-based pipeline outperforms conventional OCR engines without explicit use of a text detector, which significantly simplifies the recognition pipeline. | √ | √ | × | × | × | × | × | - | - | - | 57.0 | - | 76.0 | 62.0 | - | - | - | - | - | - | - | - | - | 2011 | ICCV |
| Mishra et al. [2] | Authors presented a new framework that used a CRF with some or all of an English dictionary as priors to obtain recognition results. They also introduced the IIIT5K-word dataset. | × | √ | × | √ | × | × | × | 64.1 | 57.5 | - | 73.2 | - | 81.8 | 67.8 | - | - | - | 45.7 | 24.7 | - | - | - | - | 2012 | BMVC |
| Wang et al. [3] | Authors built a new recognition system that combined the representational power of multi-layer CNNs, NMS, and beam search in an end-to-end, lexicon-driven, scene text recognition system. | √ | √ | × | √ | × | × | × | - | - | - | 70.0 | - | 90.0 | 84.0 | - | - | - | 40.2 | 32.4 | - | - | - | - | 2012 | ICPR |
| Goel et al. [4]: wDTW | Authors presented a holistic word recognition framework that did not require explicit character segmentation: synthetic images were generated from lexicon words, and text was recognized by matching scene and synthetic image features with wDTW. | × | √ | × | √ | × | × | × | - | - | - | 77.3 | - | 89.7 | - | - | - | - | - | - | - | - | - | - | 2013 | ICDAR |
| Bissacco et al. [5]: PhotoOCR | Authors presented a two-stage text recognition system based on HOG features with recognition by a 5-layer network, plus a self-supervision mechanism to construct additional training data. | × | √ | × | √ | × | × | × | - | - | - | 90.4 | 78.0 | - | - | - | - | 87.6 | - | - | - | - | - | - | 2013 | ICCV |
| Phan et al. [6] | Authors proposed a two-stage system for recognizing scene text with perspective distortion, using MSERs to detect characters, SIFT descriptors to extract features, and SVMs to recognize words. The SVT-P dataset was also introduced. | × | × | √ | √ | × | × | × | - | - | - | 73.7 | - | 82.2 | - | - | - | - | 62.3 | 42.2 | - | - | - | - | 2013 | ICCV |
| Alsharif et al. [7]: HMM/Maxout | Authors constructed a segmentation-based, two-stage text recognition system that recognized words with convolutional Maxout networks combined with hybrid HMM models. End-to-end recognition performance was also reported. | × | √ | × | √ | × | × | × | - | - | - | 74.3 | - | 93.1 | 88.6 | 85.1 | - | - | - | - | - | - | - | - | 2014 | ICLR |
| Almazan et al. [8]: KCSR | Authors embedded word images and text strings in a common vector subspace and cast recognition as a nearest-neighbor problem. The proposed representation was fixed-length, low-dimensional, and fast to compute. | √ | √ | × | × | × | × | × | 88.6 | 75.6 | - | 87.0 | - | - | - | - | - | - | - | - | - | - | - | - | 2014 | TPAMI |
| Yao et al. [9]: Strokelets | Authors proposed 'Strokelets', a novel multi-scale representation for scene text recognition with four distinctive advantages: usability, robustness, generality, and expressivity. | × | √ | × | √ | × | × | × | 80.2 | 69.3 | - | 75.9 | - | 88.5 | 80.3 | - | - | - | - | - | - | - | - | - | 2014 | CVPR |
| R.-Serrano et al. [10]: Label embedding | Authors embedded word labels and word images into a common Euclidean space. The method was simple and effective: it could be combined with any descriptor, required no costly pre-/post-processing, and allowed recognition of never-seen-before words. | × | √ | × | × | × | × | × | 76.1 | 57.4 | - | 70.0 | - | - | - | - | - | - | - | - | - | - | - | - | 2015 | IJCV |
| Jaderberg et al. [11] | Authors proposed a novel CNN classifier enabling efficient feature sharing for text detection, case-sensitive and case-insensitive character classification, and bigram classification. They also proposed a method for automated data mining of Flickr. | √ | √ | × | √ | × | × | × | - | - | - | 86.1 | - | 96.2 | 91.5 | - | - | - | - | - | - | - | - | - | 2014 | ECCV |
| Su and Lu [12] | Authors proposed a novel method that recognized whole word images without character-level segmentation, based on HOG features with a BLSTM and CTC for transcription. | × | √ | × | × | × | √ | × | - | - | - | 83.0 | - | 92.0 | 82.0 | - | - | - | - | - | - | - | - | - | 2014 | ACCV |
| Gordo [13]: Mid-features | Authors proposed a descriptive, robust, and compact fixed-length representation based on mid-level features, which could be paired with a word-attributes framework to improve recognition performance. | × | √ | × | √ | × | × | × | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - | - | - | - | - | - | - | 2015 | CVPR |
| Jaderberg et al. [14] | Authors presented an end-to-end system for text localization and recognition in natural scene images, together with a synthetic dataset of 9 million images. The system used a CNN classifier over 90k categories, covering almost all English words, and its recognition results could be used to update the detector. | √ | √ | × | × | × | × | × | 97.1 | 92.7 | - | 95.4 | 80.7 | 98.7 | 98.6 | 93.3 | 93.1 | 90.8 | - | - | - | - | - | - | 2015 | IJCV |
| Jaderberg et al. [15] | Authors proposed a model combining a CNN and a CRF for unconstrained recognition of words in natural images. The entire model could be jointly optimized by back-propagating the structured output loss. The results set the baseline for lexicon-free recognition. | × | √ | × | × | × | × | × | 95.5 | 89.6 | - | 93.2 | 71.7 | 97.8 | 97.0 | 93.4 | 89.6 | 81.8 | - | - | - | - | - | - | 2015 | ICLR |
| Shi, Bai, and Yao [16]: CRNN | Authors modeled scene text recognition as a sequence problem, integrating the advantages of deep CNNs and RNNs in an end-to-end trainable framework where transcription was performed by CTC. | √ | √ | × | × | × | √ | × | 97.8 | ^95.0 | 81.2 | 97.5 | 82.7 | 98.7 | 98.0 | 95.7 | 91.9 | 89.6 | - | - | - | - | - | - | 2017 | TPAMI |
| Shi et al. [17]: RARE | Authors proposed RARE, a recognition model for irregular text. Input images were first rectified via an STN, features were extracted by a CNN, and the result was decoded by an attention-based recurrent network. | × | × | √ | × | × | × | √ | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 98.3 | 96.2 | 94.8 | 90.1 | 88.6 | 91.2 | 77.4 | 71.8 | 59.2 | - | - | 2016 | CVPR |
| Lee and Osindero [18]: R2AM | Authors proposed recursive CNNs for parametrically efficient and effective image feature extraction, and constructed R2AM, which implicitly learned a character-level language model. A soft-attention mechanism allowed the model to decode selectively. | × | √ | × | × | × | × | √ | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 97.9 | 97.0 | - | 88.7 | 90.0 | - | - | - | - | - | - | 2016 | CVPR |
| Liu et al. [19]: STAR-Net | Authors presented STAR-Net for irregular scene text recognition. An STN removed text distortions, reducing the difficulty for the recognition module; residue convolutional blocks extracted image features, and words were decoded by a BLSTM with CTC. | × | × | √ | × | × | √ | × | 97.7 | 94.5 | 83.3 | 95.5 | 83.6 | 96.9 | 95.3 | - | 89.9 | 89.1 | 94.3 | 83.6 | 73.5 | - | - | - | 2016 | BMVC |
| *Yang et al. [20] | Authors presented a robust end-to-end neural model that attentively recognized text in natural images. An auxiliary dense character detection task helped learn text-specific visual patterns, and an attention-based recurrent decoder generated the target sequences. They also proposed a synthetic dataset containing perspective distortion, curvature, etc. | × | × | √ | × | √ | × | √ | 97.8 | 96.1 | - | 95.2 | - | 97.7 | - | - | - | - | 93.0 | 80.2 | 75.8 | 69.3 | - | - | 2017 | IJCAI |
| Yin et al. [21] | Authors proposed a system that simultaneously detected and recognized characters by sliding character models across the text line image. A CNN extracted image features, and the final results were decoded with a CTC-based algorithm. | × | √ | × | × | × | √ | × | 98.7 | 96.1 | 78.2 | 95.1 | 72.5 | 97.6 | 96.5 | - | 81.1 | 81.4 | - | - | - | - | - | - | 2017 | ICCV |
| *Cheng et al. [22]: FAN | Authors proposed the concept of 'attention drift' and a method called FAN, in which a focusing network (FN) pulled the attention network's deviated attention back onto the target areas. | × | √ | × | × | √ | × | √ | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 99.2 | 97.3 | - | 94.2 | 93.3 | - | - | - | - | @85.3 | - | 2017 | ICCV |
| Cheng et al. [23]: AON | Authors developed the arbitrary orientation network (AON) to directly capture deep features of irregular text in four directions, together with character placement clues; a filter gate (FG) fused the four-direction features with the learned placement clues. | × | × | √ | × | × | × | √ | 99.6 | 98.1 | 87.0 | 96.0 | 82.8 | 98.5 | 97.1 | - | 91.5 | - | 94.0 | 83.7 | 73.0 | 76.8 | 68.2 | - | 2018 | CVPR |
| Gao et al. [24] | Authors incorporated residual attention modules into a small densely connected network to suppress background noise and improve the discriminability of CNN features. Removing the RNN also sped up recognition. | × | √ | × | × | × | √ | √ | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88.0 | - | - | - | - | - | - | 2017 | arXiv |
| Liu et al. [25]: Char-Net | Authors presented Char-Net for recognizing distorted scene text. The system extracted individual character feature regions, removed their distortions, and generated target sequences with an attention-based recurrent decoder. | × | × | √ | √ | × | × | √ | - | - | 83.6 | - | 84.4 | - | 93.3 | - | 91.5 | 90.8 | - | - | 73.5 | - | 60.0 | - | 2018 | AAAI |
| *Liu et al. [26]: SqueezedText | Authors proposed SqueezedText for real-time scene text recognition. The front-end B-CEDNet was trained under binary constraints with significant compression, leading to a remarkable inference speedup and reduced memory usage. | × | √ | × | × | √ | × | × | 97.0 | 94.1 | 87.0 | 95.2 | - | 98.8 | 97.9 | 93.8 | 93.1 | 92.9 | - | - | - | - | - | - | 2018 | AAAI |
| *Bai et al. [27]: EP | Authors focused on the misalignment between ground-truth strings and the attention model's output probability distributions and proposed edit probability (EP). Training with EP could focus on missing, superfluous, and unrecognized characters, so the impact of the misalignment problem could be alleviated or even overcome. | × | √ | × | × | √ | × | √ | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 98.7 | 97.9 | - | 94.6 | ^94.4 | - | - | - | - | 73.9 | - | 2018 | CVPR |
| Liu et al. [28] | Authors addressed image feature learning for scene text recognition with a novel multi-task network using an encoder-generator-discriminator-decoder architecture that guided feature learning with clean images. | × | √ | × | × | × | √ | × | 97.3 | 96.1 | 89.4 | 96.8 | 87.1 | 98.1 | 97.5 | - | 94.7 | 94.0 | - | - | 73.9 | 62.5 | - | - | 2018 | ECCV |
| Gao et al. [29] | Authors incorporated residual attention modules into a small densely connected network to suppress background noise and improve the discriminability of CNN features; a recurrent decoder based on CTC generated the recognition results. | × | √ | × | × | × | √ | √ | 99.1 | 97.2 | 83.6 | 97.7 | 83.9 | 98.6 | 96.6 | - | 91.4 | 89.5 | - | - | - | - | - | - | 2018 | ICIP |
| Shi et al. [30]: ASTER | Authors proposed ASTER, an improved version of [17], for irregular scene text recognition. Input images were first rectified via TPS, features were extracted by a ResNet, and the result was decoded by an attention-based recurrent network. | √ | × | √ | × | × | × | √ | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98.0 | - | 94.5 | 91.8 | - | - | 78.5 | 79.5 | 76.1 | - | 2018 | TPAMI |
| Luo et al. [46]: MORAN | Authors designed MORAN for rectifying images containing irregular text. It predicted the offsets of different regions in text images, decreasing recognition difficulty and enabling the attention-based sequence recognition network to read irregular text more easily. Compared with TPS, MORAN was simpler and more usable. | √ | × | √ | × | × | × | √ | 97.9 | 96.2 | 91.2 | 96.6 | 88.3 | 98.7 | 97.8 | - | 95.0 | 92.4 | 94.3 | 86.7 | 76.1 | 77.4 | 68.8 | - | 2019 | PR |
| Xie et al. [47]: CAN | Authors presented CAN for unconstrained scene text recognition. A ResNet extracted image features, and a CNN with GLUs replaced the RNN as the decoder to obtain the final recognition results. | × | √ | × | × | × | × | √ | 97.0 | 94.2 | 80.5 | 96.9 | 83.4 | 98.4 | 97.8 | - | 91.0 | 90.5 | - | - | - | - | - | - | 2019 | ACM |
| *Liao et al. [48]: CA-FCN | CA-FCN was devised for recognizing text of arbitrary shapes. Working from a two-dimensional perspective, it could simultaneously recognize the script and predict the position of each character, though it required character-level annotations. | × | × | √ | √ | √ | × | √ | ^99.8 | ^98.9 | 92.0 | ^98.8 | 82.1 | - | - | - | - | 91.4 | - | - | - | 78.1 | - | - | 2019 | AAAI |
| *Li et al. [49]: SAR | Authors proposed SAR for irregular scene text recognition. A ResNet extracted image features, and a decoder based on a 2D attention mechanism generated the target sequences. | × | × | √ | × | √ | × | √ | 99.4 | 98.2 | 95.0 | 98.5 | ^91.2 | - | - | - | - | 94.0 | ^95.8 | ^91.2 | ^86.4 | ^89.6 | ^78.8 | ^66.8 | 2019 | AAAI |
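Most methods in the table transcribe with either CTC (e.g., CRNN [16], STAR-Net [19]) or an attention decoder. As a minimal illustration of the CTC side, the sketch below performs greedy (best-path) CTC decoding: take the arg-max label per frame, collapse consecutive repeats, then drop blanks. The alphabet and the score matrix are toy values, not taken from any of the papers above.

```python
import numpy as np

ALPHABET = "-abcdefghijklmnopqrstuvwxyz"  # index 0 is the CTC blank
BLANK = 0

def ctc_greedy_decode(scores: np.ndarray) -> str:
    """Best-path CTC decoding for a (T, num_labels) score matrix:
    per-frame arg-max, collapse consecutive repeats, remove blanks."""
    path = scores.argmax(axis=1)
    decoded, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            decoded.append(ALPHABET[label])
        prev = label
    return "".join(decoded)

# Toy example: 6 frames whose arg-max path is [c, c, blank, a, t, blank].
scores = np.full((6, len(ALPHABET)), -10.0)
for t, lab in enumerate([3, 3, 0, 1, 20, 0]):
    scores[t, lab] = 0.0
print(ctc_greedy_decode(scores))  # cat
```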
**Chinese Scene Text Recognition Results**

| Method | RCTW | MTWI | CTW | Time | Source |
|---|---|---|---|---|---|
| Zheqi He, Yongtao Wang: Foo & Bar | 82.0 (end-to-end) | - | - | 2017 | RCTW competition |
| IFLYTEK: nelslip (iflytek & ustc) | - | 85.8 (AR) | - | 2018 | MTWI competition |
| Yuan et al. [42]: CTW | - | - | 80.5 (AR) | 2018 | CTW |
| Liu et al. [43]: CTW-1500 | - | - | - | 2017 | CTW-1500 |
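The AR (accuracy rate) numbers above are taken from the respective competitions and papers. A hedged sketch of one common edit-distance-based formulation of such a metric follows, assuming AR is computed as one minus the normalized edit distance averaged over samples (the exact competition protocols may differ in details such as normalization and filtering). It reuses `edit_distance` from the lexicon sketch earlier.

```python
def normalized_accuracy(preds, gts):
    """Mean of 1 - NED, where NED = edit_distance / max(len(pred), len(gt))."""
    total = 0.0
    for p, g in zip(preds, gts):
        denom = max(len(p), len(g)) or 1  # guard against two empty strings
        total += 1.0 - edit_distance(p, g) / denom
    return total / len(preds)

# One wrong character in a 4-character string scores 0.75.
print(normalized_accuracy(["中国入民"], ["中国人民"]))  # 0.75
```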
[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper
[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper
[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper
| OCR | API | Free |
|---|---|---|
| Tesseract OCR Engine | × | √ |
| Azure | √ | √ |
| ABBYY | √ | √ |
| OCR Space | √ | √ |
| SODA PDF OCR | √ | √ |
| Free Online OCR | √ | √ |
| Online OCR | √ | √ |
| Super Tools | √ | √ |
| 在线中文识别 (Online Chinese OCR) | √ | √ |
| Calamari OCR | × | √ |
| 腾讯OCR (Tencent OCR) | √ | × |
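For quick experiments with the open-source engines in the table, Tesseract can be driven from Python via the pytesseract wrapper. A minimal sketch, assuming Tesseract and pytesseract are installed and that `scene.png` is a hypothetical cropped scene text image:

```python
from PIL import Image
import pytesseract

# Run Tesseract on a (hypothetical) cropped scene text image.
# For Chinese text, pass lang="chi_sim" if the language pack is installed.
image = Image.open("scene.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text.strip())
```

Note that general-purpose OCR engines are tuned for document text; as the ABBYY baseline in the results table suggests, dedicated scene text recognizers typically perform much better on the datasets listed here.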
[1] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proceedings of International Conference on Computer Vision (ICCV), pages 1457–1464, 2011. paper
[2] [BMVC-2012] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In Proceedings of British Machine Vision Conference (BMVC), pages 1–11, 2012. paper dataset
[3] [ICPR-2012] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of International Conference on Pattern Recognition (ICPR), pages 3304–3308, 2012. paper
[4] [ICDAR-2013] V. Goel, A. Mishra, K. Alahari, and C. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 398–402, 2013. paper
[5] [ICCV-2013] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of International Conference on Computer Vision (ICCV), pages 785–792, 2013. paper
[6] [ICCV-2013] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of International Conference on Computer Vision (ICCV), pages 569–576, 2013. paper
[7] [ICLR-2014] O. Alsharif and J. Pineau, End-to-end text recognition with hybrid HMM maxout models, in: Proceedings of International Conference on Learning Representations (ICLR), 2014. paper
[8] [TPAMI-2014] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell., 36(12):2552–2566, 2014. paper code
[9] [CVPR-2014] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014. paper
[10] [IJCV-2015] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision (IJCV) , 113(3):193–207, 2015. paper
[11] [ECCV-2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528, 2014. paper code
[12] [ACCV-2014] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 35–48, 2014. paper
[13] [CVPR-2015] A. Gordo. Supervised mid-level features for word image representation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2956–2964, 2015. paper
[14] [IJCV-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision, 2015. paper code
[15] [ICLR-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Deep structured output learning for unconstrained text recognition, in: Proceedings of International Conference on Learning Representations (ICLR), 2015. paper
[16] [TPAMI-2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2017. paper code-Torch7 code-Pytorch
[17] [CVPR-2016] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016. paper
[18] [CVPR-2016] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016. paper
[19] [BMVC-2016] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. STAR-Net: A spatial attention residue network for scene text recognition. In Proceedings of British Machine Vision Conference (BMVC), page 7, 2016. paper
[20] [IJCAI-2017] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 2017. paper
[21] [ICCV-2017] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu. Scene text recognition with sliding convolutional character models. In Proceedings of International Conference on Computer Vision (ICCV), 2017. paper
[22] [ICCV-2017] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of International Conference on Computer Vision (ICCV), pages 5086–5094, 2017. paper
[23] [CVPR-2018] Cheng Z, Xu Y, Bai F, et al. AON: Towards Arbitrarily-Oriented Text Recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 5571-5579, 2018. paper
[24] [arXiv-2017] Gao Y, Chen Y, Wang J, et al. Reading Scene Text with Attention Convolutional Sequence Modeling[J]. arXiv preprint arXiv:1709.04303, 2017. paper
[25] [AAAI-2018] Liu W, Chen C, Wong K Y K. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition[C]//AAAI. 2018. paper
[26] [AAAI-2018] Liu Z, Li Y, Ren F, et al. SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network[C]//AAAI. 2018. paper
[27] [CVPR-2018] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou. Edit probability for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 1508-1516, 2018. paper
[28] [ECCV-2018] Liu Y, Wang Z, Jin H, et al. Synthetically Supervised Feature Learning for Scene Text Recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 435-451. paper
[29] [ICIP-2018] Gao Y, Chen Y, Wang J, et al. Dense Chained Attention Network for Scene Text Recognition[C]//2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 679-683. paper
[30] [TPAMI-2018] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification[J]. IEEE transactions on pattern analysis and machine intelligence, 2018. paper code
[31] [CVPR-2012] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012. paper
[32] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011. paper
[33] [IJDAR-2005] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005. paper
[34] [ICDAR-2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazán, and L. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013. paper
[35] [ICCV-2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013. paper
[36] [Expert Syst.Appl-2014] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014. paper
[37] [ICDAR-2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper
[38] [arXiv-2016] Veit A, Matera T, Neumann L, et al. Coco-text: Dataset and benchmark for text detection and recognition in natural images[J]. arXiv preprint arXiv:1601.07140, 2016. paper
[39] [ICDAR-2017] Ch'ng C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942. paper
[40] [ICDAR-2017] Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading chinese text in the wild (RCTW-17)[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 1429-1434. paper
[41] [ICPR-2018] He M, Liu Y, Yang Z, et al. ICPR2018 Contest on Robust Reading for Multi-Type Web Images[C]//2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018: 7-12. paper
[42] [arXiv-2018] Yuan T L, Zhu Z, Xu K, et al. Chinese Text in the Wild[J]. arXiv preprint arXiv:1803.00085, 2018. paper
[43] [arXiv-2017] Yuliang L, Lianwen J, Shuaitao Z, et al. Detecting curve text in the wild: New dataset and new solution[J]. arXiv preprint arXiv:1712.02170, 2017. paper
[44] [ECCV-2018] Lyu P, Liao M, Yao C, et al. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 71-88. paper
[45] [NIPS-WORKSHOP-2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. paper
[46] [PR-2019] C. Luo, L. Jin, and Z. Sun, “MORAN: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019. paper code
[47] [ACM-2019] Xie H, Fang S, Zha Z J, et al, "Convolutional Attention Networks for Scene Text Recognition," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, p. 3, 2019. paper
[48] [AAAI-2019] Liao M, Zhang J, Wan Z, et al, “Scene text recognition from two-dimensional perspective,” //AAAI. 2019. paper
[49] [AAAI-2019] Li H, Wang P, Shen C, et al, “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition,” //AAAI. 2019. paper
[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper
[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper
[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper
[53] [NIPS-WORKSHOP-2014] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Proceedings of the NIPS Deep Learning Workshop (NIPS-W), 2014. paper
[54] [CVPR-2016] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2315–2324. paper