Author: 陈晓雪
- 1. Datasets
- 2. Summary of Scene Text Recognition Results
- 3. Field Survey
- 4. OCR Service
- 5. References
- IIIT5K[31]:
- Introduction: There are 5000 cropped word images in total, 2000 for training and 3000 for testing. Text instances are mostly horizontal. Each image is associated with a short 50-word lexicon and a long 1000-word lexicon. (A lexicon consists of the ground-truth word plus other random words; a sketch of lexicon-constrained recognition is given after this list.)
- Link: IIIT5K-download
- SVT[32]:
- Introduction: There are 647 cropped word images for testing. Text instances are mostly horizontal. Many images are severely corrupted by noise, blur, and low resolution. SVT was collected from Google Street View, and each image is associated with a 50-word lexicon. Only word-level annotations are provided.
- Link: SVT-download
- ICDAR 2003(IC03)[33]:
- Introduction: There are 509 images in total, 258 for training and 251 for testing. After filtering, 867 cropped word images remain for testing. Text instances are mostly horizontal. Each image is associated with a 50-word lexicon and a full lexicon. (The full lexicon combines the lexicon words of all images.)
- Link: IC03-download
- ICDAR 2013(IC13)[34]:
- Introduction: There are 1015 cropped word images. Most images in IC13 are inherited from IC03. Text instances are mostly horizontal. No lexicon is provided.
- Link: IC13-download
- SVHN[45]:
- Introduction: There are 600000 cropped digits of house numbers in natural scenes. The digits are mostly horizontal. SVHN was collected from Google Street View images and is used for digit recognition.
- Link: SVHN-download
- SVT-P[35]:
- Introduction: There are 639 cropped word images for testing. Many images are heavily distorted by non-frontal view angles. SVT-P was collected from side-view images in Google Street View. Each image is associated with a 50-word lexicon and a full lexicon.
- Link: SVT-P-download (Extraction code: vnis)
- CUTE80[36]:
- Introduction: There are 80 high-resolution images taken in natural scenes. After filtering, 288 cropped word images remain for testing; the dataset focuses on curved text. No lexicon is provided.
- Link: CUTE80-download
- ICDAR 2015(IC15)[37]:
- Introduction: There are 1500 images in total, 1000 for training and 500 for testing. It contains 2077 cropped word images for testing, including more than 200 irregular text instances. No lexicon is provided.
- Link: IC15-download
- COCO-Text[38]:
- Introduction: There are 63686 images in total, containing 145859 cropped word images that include handwritten and printed, clear and blurred, English and non-English text.
- Link: COCO-Text-download
- Total-Text[39]:
- Introduction: There are 1555 images in total, containing 11459 cropped word images with three different text orientations: horizontal, multi-oriented, and curved.
- Link: Total-Text-download
- CTW-1500[43]:
- Introduction: There are 1500 images in total, 1000 for training and 500 for testing. It contains 10751 cropped word images. Text instances are multi-oriented and curved. The main languages are Chinese and English.
- Link: CTW-1500-download
- CTW-12k(RCTW competition,ICDAR17)[40]:
- Introduction: There are 12514 images in total, 11514 for training and 1000 for testing. Most images in CTW-12k were captured by camera or mobile phone; the others are born-digital. Each image contains one or more text lines. The competition tasks include text detection and end-to-end text recognition.
- Link: CTW-12K-download
- MTWI(competition)[41]:
- Introduction: There are 20000 images. Text instances are mainly Chinese or English web text. The competition tasks include web text recognition, web text detection, and end-to-end web text detection and recognition.
- Link: MTWI-download (Extraction code: gox9)
- CTW[42]:
- Introduction: There are 32285 high-resolution street view images of Chinese text, with 1018402 character instances in total. CTW contains planar text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc.
- Link: CTW-download
- Synth90k [53] :
- Introduction: There are 9 million cropped word images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects.
- Link: Synth90k-download
- SynthText [54] :
- Introduction: There are 6 million cropped word images. The generation process is similar to that of Synth90k.
- Link: SynthText-download
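Several of the datasets above provide lexicons. Under the standard lexicon-constrained evaluation protocol, the raw prediction is replaced by the lexicon word closest to it in edit distance. Below is a minimal Python sketch of that protocol; the function names and the toy lexicon are illustrative, not part of any dataset toolkit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def constrain_to_lexicon(prediction: str, lexicon: list) -> str:
    """Map a raw prediction to the closest lexicon word (case-insensitive)."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))

# A misread word is corrected to the nearest entry of a (toy) 50-word lexicon.
print(constrain_to_lexicon("hcuse", ["house", "mouse", "horse", "hose"]))  # house
```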
**Comparison of Datasets**

| Dataset | Language | Pictures | Instances | Training Pictures | Training Instances | Testing Pictures | Testing Instances | Lexicon: 50 | Lexicon: 1k | Lexicon: Full | Lexicon: None | Label: Char | Label: Word | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IIIT5K[31] | English | 5000 | 5000 | 2000 | 2000 | 3000 | 3000 | √ | √ | × | × | √ | √ | regular |
| SVT[32] | English | 350 | - | 100 | - | 250 | 647 | √ | × | × | × | × | √ | regular |
| IC03[33] | English | 509 | - | 258 | - | 251 | 867 | √ | √ | √ | × | √ | √ | regular |
| IC13[34] | English | 462 | - | 229 | - | 233 | 1015 | × | × | × | √ | √ | √ | regular |
| SVHN[45] | Digits | 600000 | 600000 | 573968 | 573968 | 26032 | 26032 | × | × | × | √ | √ | √ | regular |
| SVT-P[35] | English | 238 | 639 | - | - | 238 | 639 | √ | × | √ | × | × | √ | irregular |
| CUTE80[36] | English | 80 | 288 | - | - | 80 | 288 | × | × | × | √ | × | √ | irregular |
| IC15[37] | English | 1500 | - | 1000 | - | 500 | 2077 | × | × | × | √ | × | √ | irregular |
| COCO-Text[38] | English | 63686 | 145859 | 43686 | 118309 | 2000 | 27550 | × | × | × | √ | × | √ | regular |
| Total-Text[39] | English | 1555 | 11459 | - | - | 1555 | 11459 | × | × | × | √ | × | √ | irregular |
| CTW-1500[43] | Chinese/English | 1500 | 10751 | 1000 | - | 500 | - | × | × | × | √ | × | √ | irregular |
| CTW-12K[40] | Chinese | 12514 | - | 11514 | - | 1000 | - | × | × | × | √ | × | √ | regular |
| MTWI[41] | Chinese | 20000 | - | 10000 | - | 10000 | - | × | × | × | √ | × | √ | regular |
| CTW[42] | Chinese | 32285 | 1018402 | 25887 | 812872 | 3269 | 103519 | × | × | × | √ | √ | √ | regular |
As shown in the table "Summary of Scene Text Recognition Results" below, we summarize the main recognition algorithms in the community from 2011 to the present. The table covers sources, highlights, code availability, method types, and recognition performance. A '*' before a method name indicates the use of extra datasets. Bold numbers represent the highest recognition results; '^' marks the highest result obtained with extra datasets; '@' marks a different evaluation protocol that uses only 1811 test images. (A minimal sketch of CTC decoding, a transcription scheme used by several methods below, follows the table.)
**Summary of Scene Text Recognition Results**

| Method | Highlight | Code | Regular | Irregular | Segmentation | Extra data | CTC | Attention | IIIT5K-50 | IIIT5K-1k | IIIT5K-None | SVT-50 | SVT-None | IC03-50 | IC03-Full | IC03-50k | IC03-None | IC13-None | SVT-P-50 | SVT-P-Full | SVT-P-None | CUTE80-None | IC15-None | COCO-None | Time | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wang et al. [1]: ABBYY | Authors proposed a two-stage text recognition system consisting of a state-of-the-art text detector followed by a leading OCR engine. | √ | √ | × | √ | × | × | × | 24.3 | - | - | 35.0 | - | 56.0 | 55.0 | - | - | - | 40.5 | 26.1 | - | - | - | - | 2011 | ICCV |
| Wang et al. [1]: SYNTH+PLEX | Authors established a baseline for scene text recognition, showing that an object-recognition-based pipeline outperforms conventional OCR engines without explicit use of a text detector, which significantly simplifies the recognition pipeline. | √ | √ | × | × | × | × | × | - | - | - | 57.0 | - | 76.0 | 62.0 | - | - | - | - | - | - | - | - | - | 2011 | ICCV |
| Mishra et al. [2] | Authors presented a new framework that used a CRF with some or all of an English dictionary as priors to obtain recognition results. They also introduced the IIIT5K-word dataset. | × | √ | × | √ | × | × | × | 64.1 | 57.5 | - | 73.2 | - | 81.8 | 67.8 | - | - | - | 45.7 | 24.7 | - | - | - | - | 2012 | BMVC |
| Wang et al. [3] | Authors built a new recognition system that combined the representational power of multi-layer CNNs, NMS, and beam search in an end-to-end, lexicon-driven, scene text recognition system. | √ | √ | × | √ | × | × | × | - | - | - | 70.0 | - | 90.0 | 84.0 | - | - | - | 40.2 | 32.4 | - | - | - | - | 2012 | ICPR |
| Goel et al. [4]: wDTW | Authors presented a holistic word recognition framework that did not require explicit character segmentation: synthetic images were generated from lexicon words, and text was recognized by matching scene and synthetic image features with wDTW. | × | √ | × | √ | × | × | × | - | - | - | 77.3 | - | 89.7 | - | - | - | - | - | - | - | - | - | - | 2013 | ICDAR |
| Bissacco et al. [5]: PhotoOCR | Authors presented a two-stage text recognition system based on HOG features with recognition by a 5-layer network, plus a self-supervision mechanism to construct additional training data. | × | √ | × | √ | × | × | × | - | - | - | 90.4 | 78.0 | - | - | - | - | 87.6 | - | - | - | - | - | - | 2013 | ICCV |
| Phan et al. [6] | Authors proposed a two-stage system for recognizing scene text with perspective distortion, using MSERs to detect characters, SIFT descriptors to extract features, and SVMs to recognize words. The SVT-P dataset was also introduced. | × | × | √ | √ | × | × | × | - | - | - | 73.7 | - | 82.2 | - | - | - | - | 62.3 | 42.2 | - | - | - | - | 2013 | ICCV |
| Alsharif et al. [7]: HMM/Maxout | Authors constructed a segmentation-based, two-stage text recognition system that recognized words with convolutional Maxout networks combined with hybrid HMM models. End-to-end recognition performance was also reported. | × | √ | × | √ | × | × | × | - | - | - | 74.3 | - | 93.1 | 88.6 | 85.1 | - | - | - | - | - | - | - | - | 2014 | ICLR |
| Almazan et al. [8]: KCSR | Authors embedded word images and text strings in a common vector subspace and cast recognition as a nearest-neighbor problem. The proposed representation was fixed-length, low-dimensional, and fast to compute. | √ | √ | × | × | × | × | × | 88.6 | 75.6 | - | 87.0 | - | - | - | - | - | - | - | - | - | - | - | - | 2014 | TPAMI |
| Yao et al. [9]: Strokelets | Authors proposed 'Strokelets', a novel multi-scale representation for scene text recognition with four distinctive advantages: usability, robustness, generality, and expressivity. | × | √ | × | √ | × | × | × | 80.2 | 69.3 | - | 75.9 | - | 88.5 | 80.3 | - | - | - | - | - | - | - | - | - | 2014 | CVPR |
| R.-Serrano et al. [10]: Label embedding | Authors embedded word labels and word images into a common Euclidean space. The method was simple and effective: it could be combined with any descriptor, required no costly pre-/post-processing, and allowed recognition of never-seen-before words. | × | √ | × | × | × | × | × | 76.1 | 57.4 | - | 70.0 | - | - | - | - | - | - | - | - | - | - | - | - | 2015 | IJCV |
| Jaderberg et al. [11] | Authors proposed a novel CNN classifier enabling efficient feature sharing for text detection, case-sensitive and case-insensitive character classification, and bigram classification. They also proposed a method for automated data mining of Flickr. | √ | √ | × | √ | × | × | × | - | - | - | 86.1 | - | 96.2 | 91.5 | - | - | - | - | - | - | - | - | - | 2014 | ECCV |
| Su and Lu [12] | Authors proposed a novel method that recognized whole word images without character-level segmentation, based on HOG features with a BLSTM and CTC for transcription. | × | √ | × | × | × | √ | × | - | - | - | 83.0 | - | 92.0 | 82.0 | - | - | - | - | - | - | - | - | - | 2014 | ACCV |
| Gordo [13]: Mid-features | Authors proposed a descriptive, robust, and compact fixed-length representation based on mid-level features, which could be paired with a word-attributes framework to improve recognition performance. | × | √ | × | √ | × | × | × | 93.3 | 86.6 | - | 91.8 | - | - | - | - | - | - | - | - | - | - | - | - | 2015 | CVPR |
| Jaderberg et al. [14] | Authors presented an end-to-end system for text localization and recognition in natural scene images, together with a synthetic dataset of 9 million images. The system used a CNN classifier over 90k categories, covering almost all English words, and its recognition results could be used to update the detector. | √ | √ | × | × | × | × | × | 97.1 | 92.7 | - | 95.4 | 80.7 | 98.7 | 98.6 | 93.3 | 93.1 | 90.8 | - | - | - | - | - | - | 2015 | IJCV |
| Jaderberg et al. [15] | Authors proposed a model combining a CNN and a CRF for unconstrained recognition of words in natural images. The entire model could be jointly optimized by back-propagating the structured output loss. The results set the baseline for lexicon-free recognition. | × | √ | × | × | × | × | × | 95.5 | 89.6 | - | 93.2 | 71.7 | 97.8 | 97.0 | 93.4 | 89.6 | 81.8 | - | - | - | - | - | - | 2015 | ICLR |
| Shi, Bai, and Yao [16]: CRNN | Authors modeled scene text recognition as a sequence problem, integrating the advantages of deep CNNs and RNNs in an end-to-end trainable framework where transcription was performed by CTC. | √ | √ | × | × | × | √ | × | 97.8 | ^95.0 | 81.2 | 97.5 | 82.7 | 98.7 | 98.0 | 95.7 | 91.9 | 89.6 | - | - | - | - | - | - | 2017 | TPAMI |
| Shi et al. [17]: RARE | Authors proposed RARE, a recognition model for irregular text. Input images were first rectified via an STN, features were extracted by a CNN, and the result was decoded by an attention-based recurrent network. | × | × | √ | × | × | × | √ | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 98.3 | 96.2 | 94.8 | 90.1 | 88.6 | 91.2 | 77.4 | 71.8 | 59.2 | - | - | 2016 | CVPR |
| Lee and Osindero [18]: R2AM | Authors proposed recursive CNNs for parametrically efficient and effective image feature extraction, and constructed R2AM, which implicitly learned a character-level language model. A soft-attention mechanism allowed the model to decode selectively. | × | √ | × | × | × | × | √ | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | 97.9 | 97.0 | - | 88.7 | 90.0 | - | - | - | - | - | - | 2016 | CVPR |
| Liu et al. [19]: STAR-Net | Authors presented STAR-Net for irregular scene text recognition. An STN removed text distortions, reducing the difficulty for the recognition module; residue convolutional blocks extracted image features, and words were decoded by a BLSTM with CTC. | × | × | √ | × | × | √ | × | 97.7 | 94.5 | 83.3 | 95.5 | 83.6 | 96.9 | 95.3 | - | 89.9 | 89.1 | 94.3 | 83.6 | 73.5 | - | - | - | 2016 | BMVC |
| *Yang et al. [20] | Authors presented a robust end-to-end neural model that attentively recognized text in natural images. An auxiliary dense character detection task helped learn text-specific visual patterns, and an attention-based recurrent decoder generated the target sequences. They also proposed a synthetic dataset containing perspective distortion, curvature, etc. | × | × | √ | × | √ | × | √ | 97.8 | 96.1 | - | 95.2 | - | 97.7 | - | - | - | - | 93.0 | 80.2 | 75.8 | 69.3 | - | - | 2017 | IJCAI |
| Yin et al. [21] | Authors proposed a system that simultaneously detected and recognized characters by sliding character models across the text line image. A CNN extracted image features, and the final results were decoded with a CTC-based algorithm. | × | √ | × | × | × | √ | × | 98.7 | 96.1 | 78.2 | 95.1 | 72.5 | 97.6 | 96.5 | - | 81.1 | 81.4 | - | - | - | - | - | - | 2017 | ICCV |
| *Cheng et al. [22]: FAN | Authors proposed the concept of 'attention drift' and a method called FAN, in which a focusing network (FN) pulled the attention network's deviated attention back onto the target areas. | × | √ | × | × | √ | × | √ | 99.3 | 97.5 | 87.4 | 97.1 | 85.9 | 99.2 | 97.3 | - | 94.2 | 93.3 | - | - | - | - | @85.3 | - | 2017 | ICCV |
| Cheng et al. [23]: AON | Authors developed the arbitrary orientation network (AON) to directly capture deep features of irregular text in four directions, together with character placement clues; a filter gate (FG) fused the four-direction features with the learned placement clues. | × | × | √ | × | × | × | √ | 99.6 | 98.1 | 87.0 | 96.0 | 82.8 | 98.5 | 97.1 | - | 91.5 | - | 94.0 | 83.7 | 73.0 | 76.8 | 68.2 | - | 2018 | CVPR |
| Gao et al. [24] | Authors incorporated residual attention modules into a small densely connected network to suppress background noise and improve the discriminability of CNN features. Removing the RNN also sped up recognition. | × | √ | × | × | × | √ | √ | 99.1 | 97.9 | 81.8 | 97.4 | 82.7 | 98.7 | 96.7 | - | 89.2 | 88.0 | - | - | - | - | - | - | 2017 | arXiv |
| Liu et al. [25]: Char-Net | Authors presented Char-Net for recognizing distorted scene text. The system extracted individual character feature regions, removed their distortions, and generated target sequences with an attention-based recurrent decoder. | × | × | √ | √ | × | × | √ | - | - | 83.6 | - | 84.4 | - | 93.3 | - | 91.5 | 90.8 | - | - | 73.5 | - | 60.0 | - | 2018 | AAAI |
| *Liu et al. [26]: SqueezedText | Authors proposed SqueezedText for real-time scene text recognition. The front-end B-CEDNet was trained under binary constraints with significant compression, leading to a remarkable inference speedup and reduced memory usage. | × | √ | × | × | √ | × | × | 97.0 | 94.1 | 87.0 | 95.2 | - | 98.8 | 97.9 | 93.8 | 93.1 | 92.9 | - | - | - | - | - | - | 2018 | AAAI |
| *Bai et al. [27]: EP | Authors focused on the misalignment between ground-truth strings and the attention model's output probability distributions and proposed edit probability (EP). Training with EP could focus on missing, superfluous, and unrecognized characters, so the impact of the misalignment problem could be alleviated or even overcome. | × | √ | × | × | √ | × | √ | 99.5 | 97.9 | 88.3 | 96.6 | 87.5 | 98.7 | 97.9 | - | 94.6 | ^94.4 | - | - | - | - | 73.9 | - | 2018 | CVPR |
| Liu et al. [28] | Authors addressed image feature learning for scene text recognition with a novel multi-task network using an encoder-generator-discriminator-decoder architecture that guided feature learning with clean images. | × | √ | × | × | × | √ | × | 97.3 | 96.1 | 89.4 | 96.8 | 87.1 | 98.1 | 97.5 | - | 94.7 | 94.0 | - | - | 73.9 | 62.5 | - | - | 2018 | ECCV |
| Gao et al. [29] | Authors incorporated residual attention modules into a small densely connected network to suppress background noise and improve the discriminability of CNN features; a recurrent decoder based on CTC generated the recognition results. | × | √ | × | × | × | √ | √ | 99.1 | 97.2 | 83.6 | 97.7 | 83.9 | 98.6 | 96.6 | - | 91.4 | 89.5 | - | - | - | - | - | - | 2018 | ICIP |
| Shi et al. [30]: ASTER | Authors proposed ASTER, an improved version of [17], for irregular scene text recognition. Input images were first rectified via TPS, features were extracted by a ResNet, and the result was decoded by an attention-based recurrent network. | √ | × | √ | × | × | × | √ | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98.0 | - | 94.5 | 91.8 | - | - | 78.5 | 79.5 | 76.1 | - | 2018 | TPAMI |
| Luo et al. [46]: MORAN | Authors designed MORAN for rectifying images containing irregular text. It predicted the offsets of different regions in text images, decreasing recognition difficulty and enabling the attention-based sequence recognition network to read irregular text more easily. Compared with TPS, MORAN was simpler and more usable. | √ | × | √ | × | × | × | √ | 97.9 | 96.2 | 91.2 | 96.6 | 88.3 | 98.7 | 97.8 | - | 95.0 | 92.4 | 94.3 | 86.7 | 76.1 | 77.4 | 68.8 | - | 2019 | PR |
| Xie et al. [47]: CAN | Authors presented CAN for unconstrained scene text recognition. A ResNet extracted image features, and a CNN with GLUs replaced the RNN as the decoder to obtain the final recognition results. | × | √ | × | × | × | × | √ | 97.0 | 94.2 | 80.5 | 96.9 | 83.4 | 98.4 | 97.8 | - | 91.0 | 90.5 | - | - | - | - | - | - | 2019 | ACM |
| *Liao et al. [48]: CA-FCN | CA-FCN was devised for recognizing text of arbitrary shapes. Working from a two-dimensional perspective, it could simultaneously recognize the script and predict the position of each character, though it required character-level annotations. | × | × | √ | √ | √ | × | √ | ^99.8 | ^98.9 | 92.0 | ^98.8 | 82.1 | - | - | - | - | 91.4 | - | - | - | 78.1 | - | - | 2019 | AAAI |
| *Li et al. [49]: SAR | Authors proposed SAR for irregular scene text recognition. A ResNet extracted image features, and a decoder based on a 2D attention mechanism generated the target sequences. | × | × | √ | × | √ | × | √ | 99.4 | 98.2 | 95.0 | 98.5 | ^91.2 | - | - | - | - | 94.0 | ^95.8 | ^91.2 | ^86.4 | ^89.6 | ^78.8 | ^66.8 | 2019 | AAAI |
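Most methods in the table transcribe with either CTC (e.g., CRNN [16], STAR-Net [19]) or an attention decoder. As a minimal illustration of the CTC side, the sketch below performs greedy (best-path) CTC decoding: take the arg-max label per frame, collapse consecutive repeats, then drop blanks. The alphabet and the score matrix are toy values, not taken from any of the papers above.

```python
import numpy as np

ALPHABET = "-abcdefghijklmnopqrstuvwxyz"  # index 0 is the CTC blank
BLANK = 0

def ctc_greedy_decode(scores: np.ndarray) -> str:
    """Best-path CTC decoding for a (T, num_labels) score matrix:
    per-frame arg-max, collapse consecutive repeats, remove blanks."""
    path = scores.argmax(axis=1)
    decoded, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            decoded.append(ALPHABET[label])
        prev = label
    return "".join(decoded)

# Toy example: 6 frames whose arg-max path is [c, c, blank, a, t, blank].
scores = np.full((6, len(ALPHABET)), -10.0)
for t, lab in enumerate([3, 3, 0, 1, 20, 0]):
    scores[t, lab] = 0.0
print(ctc_greedy_decode(scores))  # cat
```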
**Chinese Scene Text Recognition Results**

| Method | RCTW | MTWI | CTW | Time | Source |
|---|---|---|---|---|---|
| Zheqi He, Yongtao Wang: Foo & Bar | 82.0 (end-to-end) | - | - | 2017 | RCTW competition |
| IFLYTEK: nelslip (iflytek & ustc) | - | 85.8 (AR) | - | 2018 | MTWI competition |
| Yuan et al. [42]: CTW | - | - | 80.5 (AR) | 2018 | CTW |
| Liu et al. [43]: CTW-1500 | - | - | - | 2017 | CTW-1500 |
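The AR (accuracy rate) numbers above are taken from the respective competitions and papers. A hedged sketch of one common edit-distance-based formulation of such a metric follows, assuming AR is computed as one minus the normalized edit distance averaged over samples (the exact competition protocols may differ in details such as normalization and filtering). It reuses `edit_distance` from the lexicon sketch earlier.

```python
def normalized_accuracy(preds, gts):
    """Mean of 1 - NED, where NED = edit_distance / max(len(pred), len(gt))."""
    total = 0.0
    for p, g in zip(preds, gts):
        denom = max(len(p), len(g)) or 1  # guard against two empty strings
        total += 1.0 - edit_distance(p, g) / denom
    return total / len(preds)

# One wrong character in a 4-character string scores 0.75.
print(normalized_accuracy(["中国入民"], ["中国人民"]))  # 0.75
```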
[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper
[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper
[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper
| OCR | API | Free |
|---|---|---|
| Tesseract OCR Engine | × | √ |
| Azure | √ | √ |
| ABBYY | √ | √ |
| OCR Space | √ | √ |
| SODA PDF OCR | √ | √ |
| Free Online OCR | √ | √ |
| Online OCR | √ | √ |
| Super Tools | √ | √ |
| 在线中文识别 (Online Chinese OCR) | √ | √ |
| Calamari OCR | × | √ |
| 腾讯OCR (Tencent OCR) | √ | × |
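For quick experiments with the open-source engines in the table, Tesseract can be driven from Python via the pytesseract wrapper. A minimal sketch, assuming Tesseract and pytesseract are installed and that `scene.png` is a hypothetical cropped scene text image:

```python
from PIL import Image
import pytesseract

# Run Tesseract on a (hypothetical) cropped scene text image.
# For Chinese text, pass lang="chi_sim" if the language pack is installed.
image = Image.open("scene.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text.strip())
```

Note that general-purpose OCR engines are tuned for document text; as the ABBYY baseline in the results table suggests, dedicated scene text recognizers typically perform much better on the datasets listed here.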
[1] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proceedings of International Conference on Computer Vision (ICCV), pages 1457–1464, 2011. paper
[2] [BMVC-2012] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In Proceedings of British Machine Vision Conference (BMVC), pages 1–11, 2012. paper dataset
[3] [ICPR-2012] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of International Conference on Pattern Recognition (ICPR), pages 3304–3308, 2012. paper
[4] [ICDAR-2013] V. Goel, A. Mishra, K. Alahari, and C. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 398–402, 2013. paper
[5] [ICCV-2013] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of International Conference on Computer Vision (ICCV), pages 785–792, 2013. paper
[6] [ICCV-2013] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of International Conference on Computer Vision (ICCV), pages 569–576, 2013. paper
[7] [ICLR-2014] O. Alsharif and J. Pineau, End-to-end text recognition with hybrid HMM maxout models, in: Proceedings of International Conference on Learning Representations (ICLR), 2014. paper
[8] [TPAMI-2014] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell., 36(12):2552–2566, 2014. paper code
[9] [CVPR-2014] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014. paper
[10] [IJCV-2015] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision (IJCV) , 113(3):193–207, 2015. paper
[11] [ECCV-2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528, 2014. paper code
[12] [ACCV-2014] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In Proceedings of Asian Conference on Computer Vision (ACCV), pages 35–48, 2014. paper
[13] [CVPR-2015] A. Gordo. Supervised mid-level features for word image representation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2956–2964, 2015. paper
[14] [IJCV-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision, 2015. paper code
[15] [ICLR-2015] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Deep structured output learning for unconstrained text recognition, in: Proceedings of International Conference on Learning Representations (ICLR), 2015. paper
[16] [TPAMI-2017] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2017. paper code-Torch7 code-Pytorch
[17] [CVPR-2016] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016. paper
[18] [CVPR-2016] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016. paper
[19] [BMVC-2016] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han. STAR-Net: A spatial attention residue network for scene text recognition. In Proceedings of British Machine Vision Conference (BMVC), page 7, 2016. paper
[20] [IJCAI-2017] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 2017. paper
[21] [ICCV-2017] F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu. Scene text recognition with sliding convolutional character models. In Proceedings of International Conference on Computer Vision (ICCV), 2017. paper
[22] [ICCV-2017] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In Proceedings of International Conference on Computer Vision (ICCV), pages 5086–5094, 2017. paper
[23] [CVPR-2018] Cheng Z, Xu Y, Bai F, et al. AON: Towards Arbitrarily-Oriented Text Recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 5571-5579, 2018. paper
[24] [arXiv-2017] Gao Y, Chen Y, Wang J, et al. Reading Scene Text with Attention Convolutional Sequence Modeling[J]. arXiv preprint arXiv:1709.04303, 2017. paper
[25] [AAAI-2018] Liu W, Chen C, Wong K Y K. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition[C]//AAAI. 2018. paper
[26] [AAAI-2018] Liu Z, Li Y, Ren F, et al. SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network[C]//AAAI. 2018. paper
[27] [CVPR-2018] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou. Edit probability for scene text recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 1508-1516, 2018. paper
[28] [ECCV-2018] Liu Y, Wang Z, Jin H, et al. Synthetically Supervised Feature Learning for Scene Text Recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 435-451. paper
[29] [ICIP-2018] Gao Y, Chen Y, Wang J, et al. Dense Chained Attention Network for Scene Text Recognition[C]//2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018: 679-683. paper
[30] [TPAMI-2018] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification[J]. IEEE transactions on pattern analysis and machine intelligence, 2018. paper code
[31] [CVPR-2012] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012. paper
[32] [ICCV-2011] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011. paper
[33] [IJDAR-2005] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005. paper
[34] [ICDAR-2013] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazán, and L. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013. paper
[35] [ICCV-2013] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013. paper
[36] [Expert Syst.Appl-2014] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014. paper
[37] [ICDAR-2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper
[38] [arXiv-2016] Veit A, Matera T, Neumann L, et al. Coco-text: Dataset and benchmark for text detection and recognition in natural images[J]. arXiv preprint arXiv:1601.07140, 2016. paper
[39] [ICDAR-2017] Ch'ng C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942. paper
[40] [ICDAR-2017] Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading chinese text in the wild (RCTW-17)[C]//Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 1429-1434. paper
[41] [ICPR-2018] He M, Liu Y, Yang Z, et al. ICPR2018 Contest on Robust Reading for Multi-Type Web Images[C]//2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018: 7-12. paper
[42] [arXiv-2018] Yuan T L, Zhu Z, Xu K, et al. Chinese Text in the Wild[J]. arXiv preprint arXiv:1803.00085, 2018. paper
[43] [arXiv-2017] Yuliang L, Lianwen J, Shuaitao Z, et al. Detecting curve text in the wild: New dataset and new solution[J]. arXiv preprint arXiv:1712.02170, 2017. paper
[44] [ECCV-2018] Lyu P, Liao M, Yao C, et al. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 71-88. paper
[45] [NIPS-WORKSHOP-2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011. paper
[46] [PR-2019] C. Luo, L. Jin, and Z. Sun, “MORAN: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019. paper code
[47] [ACM-2019] Xie H, Fang S, Zha Z J, et al, "Convolutional Attention Networks for Scene Text Recognition," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, p. 3, 2019. paper
[48] [AAAI-2019] Liao M, Zhang J, Wan Z, et al, “Scene text recognition from two-dimensional perspective,” //AAAI. 2019. paper
[49] [AAAI-2019] Li H, Wang P, Shen C, et al, “Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition,” //AAAI. 2019. paper
[50] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper
[51] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper
[52] [arXiv-2018] Long S, He X, Yao C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper
[53] [NIPS-WORKSHOP-2014] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Proceedings of the NIPS Deep Learning Workshop (NIPS-W), 2014. paper
[54] [CVPR-2016] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: Proceedings of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2315–2324. paper