Recognize text in images with ML Kit on iOS

  • ML Kit's Text Recognition API can recognize text within images and videos, supporting various scripts like Latin, Chinese, Devanagari, Japanese, and Korean.

  • To use this API, you need to include the necessary ML Kit pods in your Podfile, initialize a TextRecognizer instance, and process the image by passing a UIImage or CMSampleBufferRef.

  • After processing, you can extract recognized text from blocks, lines, and elements, accessing information like text content and bounding coordinates.

  • For optimal accuracy, ensure input images have sufficient pixel data for text (ideally 16x16 pixels per character) and are in focus.

  • Improve performance by processing video frames synchronously using results(in:), rendering images and overlays in a single step, capturing images at lower resolutions, and avoiding concurrent use of multiple TextRecognizer instances with different scripts.

You can use ML Kit to recognize text in images or video, such as the text of a street sign. The main characteristics of this feature are:

Text Recognition v2 API

  • Description: Recognize text in images or videos; supports Latin, Chinese, Devanagari, Japanese, and Korean scripts and a wide range of languages.
  • SDK names: GoogleMLKit/TextRecognition, GoogleMLKit/TextRecognitionChinese, GoogleMLKit/TextRecognitionDevanagari, GoogleMLKit/TextRecognitionJapanese, GoogleMLKit/TextRecognitionKorean
  • Implementation: Assets are statically linked to your app at build time.
  • App size impact: About 38 MB per script SDK.
  • Performance: Real-time on most devices for the Latin script SDK; slower for the other scripts.

Try it out

  • Play around with the sample app to see an example usage of this API.
  • Try the code yourself with the codelab.

Before you begin

  1. Include the following ML Kit pods in your Podfile:
     # To recognize Latin script
     pod 'GoogleMLKit/TextRecognition', '8.0.0'
     # To recognize Chinese script
     pod 'GoogleMLKit/TextRecognitionChinese', '8.0.0'
     # To recognize Devanagari script
     pod 'GoogleMLKit/TextRecognitionDevanagari', '8.0.0'
     # To recognize Japanese script
     pod 'GoogleMLKit/TextRecognitionJapanese', '8.0.0'
     # To recognize Korean script
     pod 'GoogleMLKit/TextRecognitionKorean', '8.0.0'
  2. After you install or update your project's Pods, open your Xcode project using its .xcworkspace. ML Kit is supported in Xcode version 12.4 or greater.
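
Once the pods are installed, import the ML Kit modules in the source files that use them. The following is a minimal sketch for the Latin script SDK; it assumes the standard ML Kit module names (one text-recognition module per script pod, plus MLKitVision for VisionImage):

Swift

import MLKitVision            // VisionImage
import MLKitTextRecognition   // TextRecognizer and TextRecognizerOptions (Latin script)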

1. Create an instance of TextRecognizer

Create an instance of TextRecognizer by calling +textRecognizer(options:), passing the options that correspond to the script SDK you declared as a dependency above:

Swift

// When using Latin script recognition SDK
let latinOptions = TextRecognizerOptions()
let latinTextRecognizer = TextRecognizer.textRecognizer(options: latinOptions)

// When using Chinese script recognition SDK
let chineseOptions = ChineseTextRecognizerOptions()
let chineseTextRecognizer = TextRecognizer.textRecognizer(options: chineseOptions)

// When using Devanagari script recognition SDK
let devanagariOptions = DevanagariTextRecognizerOptions()
let devanagariTextRecognizer = TextRecognizer.textRecognizer(options: devanagariOptions)

// When using Japanese script recognition SDK
let japaneseOptions = JapaneseTextRecognizerOptions()
let japaneseTextRecognizer = TextRecognizer.textRecognizer(options: japaneseOptions)

// When using Korean script recognition SDK
let koreanOptions = KoreanTextRecognizerOptions()
let koreanTextRecognizer = TextRecognizer.textRecognizer(options: koreanOptions)

Objective-C

// When using Latin script recognition SDK
MLKTextRecognizerOptions *latinOptions = [[MLKTextRecognizerOptions alloc] init];
MLKTextRecognizer *latinTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:latinOptions];

// When using Chinese script recognition SDK
MLKChineseTextRecognizerOptions *chineseOptions = [[MLKChineseTextRecognizerOptions alloc] init];
MLKTextRecognizer *chineseTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:chineseOptions];

// When using Devanagari script recognition SDK
MLKDevanagariTextRecognizerOptions *devanagariOptions = [[MLKDevanagariTextRecognizerOptions alloc] init];
MLKTextRecognizer *devanagariTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:devanagariOptions];

// When using Japanese script recognition SDK
MLKJapaneseTextRecognizerOptions *japaneseOptions = [[MLKJapaneseTextRecognizerOptions alloc] init];
MLKTextRecognizer *japaneseTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:japaneseOptions];

// When using Korean script recognition SDK
MLKKoreanTextRecognizerOptions *koreanOptions = [[MLKKoreanTextRecognizerOptions alloc] init];
MLKTextRecognizer *koreanTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:koreanOptions];

2. Prepare the input image

Create a VisionImage object using a UIImage or a CMSampleBuffer; in the next step you pass this object to the TextRecognizer's process(_:completion:) method.

If you use a UIImage, follow these steps:

  • Create a VisionImage object with the UIImage. Make sure to specify the correct .orientation.

    Swift

     let visionImage = VisionImage(image: image)
     visionImage.orientation = image.imageOrientation

    Objective-C

     MLKVisionImage *visionImage = [[MLKVisionImage alloc] initWithImage:image];
     visionImage.orientation = image.imageOrientation;

If you use a CMSampleBuffer, follow these steps:

  • Specify the orientation of the image data contained in the CMSampleBuffer.

    To get the image orientation:

    Swift

     func imageOrientation(
       deviceOrientation: UIDeviceOrientation,
       cameraPosition: AVCaptureDevice.Position
     ) -> UIImage.Orientation {
       switch deviceOrientation {
       case .portrait:
         return cameraPosition == .front ? .leftMirrored : .right
       case .landscapeLeft:
         return cameraPosition == .front ? .downMirrored : .up
       case .portraitUpsideDown:
         return cameraPosition == .front ? .rightMirrored : .left
       case .landscapeRight:
         return cameraPosition == .front ? .upMirrored : .down
       case .faceDown, .faceUp, .unknown:
         return .up
       }
     }

    Objective-C

     - (UIImageOrientation)
       imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                              cameraPosition:(AVCaptureDevicePosition)cameraPosition {
       switch (deviceOrientation) {
         case UIDeviceOrientationPortrait:
           return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationLeftMirrored
                                                                 : UIImageOrientationRight;
         case UIDeviceOrientationLandscapeLeft:
           return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationDownMirrored
                                                                 : UIImageOrientationUp;
         case UIDeviceOrientationPortraitUpsideDown:
           return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationRightMirrored
                                                                 : UIImageOrientationLeft;
         case UIDeviceOrientationLandscapeRight:
           return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationUpMirrored
                                                                 : UIImageOrientationDown;
         case UIDeviceOrientationUnknown:
         case UIDeviceOrientationFaceUp:
         case UIDeviceOrientationFaceDown:
           return UIImageOrientationUp;
       }
     }
  • Create a VisionImage object using the CMSampleBuffer object and orientation:

    Swift

     let image = VisionImage(buffer: sampleBuffer)
     image.orientation = imageOrientation(
       deviceOrientation: UIDevice.current.orientation,
       cameraPosition: cameraPosition)

    Objective-C

     MLKVisionImage *image = [[MLKVisionImage alloc] initWithBuffer:sampleBuffer];
     image.orientation =
         [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                      cameraPosition:cameraPosition];

3. Process the image

Then, pass the image to the process(_:completion:) method:

Swift

textRecognizer.process(visionImage) { result, error in
  guard error == nil, let result = result else {
    // Error handling
    return
  }
  // Recognized text
}

Objective-C

[textRecognizer processImage:image
                  completion:^(MLKText *_Nullable result,
                               NSError *_Nullable error) {
  if (error != nil || result == nil) {
    // Error handling
    return;
  }
  // Recognized text
}];
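
The completion handler delivers either a Text result or an error. If your app uses Swift concurrency, one convenient pattern is to wrap the completion-based call in a checked continuation. The extension below is a sketch for illustration, not part of the ML Kit API:

Swift

import MLKitTextRecognition
import MLKitVision

// Hypothetical convenience wrapper so async/await callers can use the recognizer.
extension TextRecognizer {
  func recognizedText(in visionImage: VisionImage) async throws -> Text {
    try await withCheckedThrowingContinuation { continuation in
      self.process(visionImage) { result, error in
        if let error = error {
          continuation.resume(throwing: error)
        } else if let result = result {
          continuation.resume(returning: result)
        } else {
          // Neither a result nor an error was returned; surface a generic failure.
          continuation.resume(throwing: NSError(domain: "TextRecognizerWrapper", code: -1))
        }
      }
    }
  }
}

With this helper you could write, for example, let text = try await textRecognizer.recognizedText(in: visionImage).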

4. Extract text from blocks of recognized text

If the text recognition operation succeeds, it returns a Text object. A Text object contains the full text recognized in the image and zero or more TextBlock objects.

Each TextBlock represents a rectangular block of text, which contains zero or more TextLine objects. Each TextLine object contains zero or more TextElement objects, which represent words and word-like entities such as dates and numbers.

For each TextBlock, TextLine, and TextElement object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

let resultText = result.text
for block in result.blocks {
  let blockText = block.text
  let blockLanguages = block.recognizedLanguages
  let blockCornerPoints = block.cornerPoints
  let blockFrame = block.frame
  for line in block.lines {
    let lineText = line.text
    let lineLanguages = line.recognizedLanguages
    let lineCornerPoints = line.cornerPoints
    let lineFrame = line.frame
    for element in line.elements {
      let elementText = element.text
      let elementCornerPoints = element.cornerPoints
      let elementFrame = element.frame
    }
  }
}

Objective-C

NSString *resultText = result.text;
for (MLKTextBlock *block in result.blocks) {
  NSString *blockText = block.text;
  NSArray<MLKTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
  NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
  CGRect blockFrame = block.frame;
  for (MLKTextLine *line in block.lines) {
    NSString *lineText = line.text;
    NSArray<MLKTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
    NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
    CGRect lineFrame = line.frame;
    for (MLKTextElement *element in line.elements) {
      NSString *elementText = element.text;
      NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
      CGRect elementFrame = element.frame;
    }
  }
}
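
The frame and cornerPoints values are in the coordinate space of the input image, so they typically need to be mapped into your view's coordinate space before you can draw them. The helper below is a sketch under the assumption that the image is displayed with a uniform scale factor and an offset supplied by your layout code:

Swift

import UIKit
import MLKitTextRecognition

// Maps a rectangle from image coordinates to view coordinates, assuming the
// displayed image is uniformly scaled and offset within the overlay view.
func viewRect(for imageRect: CGRect, scale: CGFloat, offset: CGPoint) -> CGRect {
  return CGRect(x: imageRect.origin.x * scale + offset.x,
                y: imageRect.origin.y * scale + offset.y,
                width: imageRect.size.width * scale,
                height: imageRect.size.height * scale)
}

// Outlines every recognized element on a dedicated overlay view.
func drawElementBoxes(for result: Text, on overlayView: UIView, scale: CGFloat, offset: CGPoint) {
  overlayView.layer.sublayers?.forEach { $0.removeFromSuperlayer() }
  for block in result.blocks {
    for line in block.lines {
      for element in line.elements {
        let box = CAShapeLayer()
        box.path = UIBezierPath(rect: viewRect(for: element.frame, scale: scale, offset: offset)).cgPath
        box.strokeColor = UIColor.systemGreen.cgColor
        box.fillColor = UIColor.clear.cgColor
        overlayView.layer.addSublayer(box)
      }
    }
  }
}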

Input image guidelines

  • For ML Kit to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, each character should be at least 16x16 pixels. There is generally no accuracy benefit for characters to be larger than 24x24 pixels.

    So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required.

  • Poor image focus can affect text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.

  • If you are recognizing text in a real-time application, you should consider the overall dimensions of the input images. Smaller images can be processed faster. To reduce latency, ensure that the text occupies as much of the image as possible, and capture images at lower resolutions (keeping in mind the accuracy requirements mentioned above); a resizing sketch follows this list. For more information, see Tips to improve performance.
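
For still images, one way to follow these guidelines is to downscale oversized captures before recognition so each image carries just enough pixel data for the text. The helper below is a sketch; the 1280-pixel cap is an assumption you would tune so characters stay at roughly 16x16 pixels or larger:

Swift

import UIKit

// Downscales a UIImage so its longer side is at most maxPixelDimension pixels,
// preserving the aspect ratio. Returns the original image if it is already small enough.
func downscaled(_ image: UIImage, maxPixelDimension: CGFloat = 1280) -> UIImage {
  let pixelWidth = image.size.width * image.scale
  let pixelHeight = image.size.height * image.scale
  let longerSide = max(pixelWidth, pixelHeight)
  guard longerSide > maxPixelDimension else { return image }

  let ratio = maxPixelDimension / longerSide
  let newSize = CGSize(width: pixelWidth * ratio, height: pixelHeight * ratio)
  let format = UIGraphicsImageRendererFormat.default()
  format.scale = 1  // one point per pixel in the rendered output
  return UIGraphicsImageRenderer(size: newSize, format: format).image { _ in
    image.draw(in: CGRect(origin: .zero, size: newSize))
  }
}

You could then build the VisionImage from the downscaled copy, for example VisionImage(image: downscaled(capturedImage)).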

Tips to improve performance

  • For processing video frames, use the results(in:) synchronous API of the detector. Call this method from the AVCaptureVideoDataOutputSampleBufferDelegate's captureOutput(_, didOutput:from:) function to synchronously get results from the given video frame (a sketch follows this list). Keep AVCaptureVideoDataOutput's alwaysDiscardsLateVideoFrames as true to throttle calls to the detector. If a new video frame becomes available while the detector is running, it will be dropped.
  • If you use the output of the detector to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each processed input frame. See the updatePreviewOverlayViewWithLastFrame in the ML Kit quickstart sample for an example.
  • Consider capturing images at a lower resolution. However, also keep in mind this API's image dimension requirements.
  • To avoid potential performance degradation, do not run multiple TextRecognizer instances with different script options concurrently.
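
For example, a real-time pipeline built around the synchronous API could look like the sketch below. It assumes a hypothetical VideoTextViewController that owns a textRecognizer, exposes a cameraPosition, reuses the imageOrientation(deviceOrientation:cameraPosition:) helper from step 2, and provides an updateOverlay(with:) rendering method:

Swift

import AVFoundation
import MLKitTextRecognition
import MLKitVision
import UIKit

extension VideoTextViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    // Wrap the camera frame and attach its orientation.
    let visionImage = VisionImage(buffer: sampleBuffer)
    visionImage.orientation = imageOrientation(
      deviceOrientation: UIDevice.current.orientation,
      cameraPosition: cameraPosition)
    do {
      // Blocks this delegate queue until the frame is processed; late frames are
      // dropped because alwaysDiscardsLateVideoFrames stays enabled on the output.
      let recognizedText = try textRecognizer.results(in: visionImage)
      DispatchQueue.main.async {
        // Render the preview frame and the text overlay together in one pass.
        self.updateOverlay(with: recognizedText)
      }
    } catch {
      // Error handling
    }
  }
}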