Python | OCR on All the Images present in a Folder Simultaneously
Last Updated : 11 Nov, 2019
If you have a folder full of images containing text that needs to be extracted, either into a separate folder with one text file per image or into a single file, this article shows how to do it. It covers the basics of OCR (Optical Character Recognition) and creates an output .txt file for every image inside the main folder, saving it in a predetermined directory.

Libraries Needed -
pip3 install pillow

The os module used in the code ships with Python's standard library, so it does not need to be installed with pip.
You will also need tesseract-ocr and the pytesseract library. tesseract-ocr is the OCR engine itself and has to be downloaded and installed separately; pytesseract is its Python wrapper and can be installed using

pip3 install pytesseract

Below is the Python implementation -
```python
# Python program to extract text from all the images in a folder
# storing the text in corresponding files in a different folder
from PIL import Image
import pytesseract as pt
import os


def main():
    # path for the folder for getting the raw images
    path = "E:\\GeeksforGeeks\\images"

    # path for the folder for getting the output
    tempPath = "E:\\GeeksforGeeks\\textFiles"

    # iterating the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)

        # applying ocr using pytesseract for python
        text = pt.image_to_string(img, lang="eng")

        # removing the extension (e.g. .jpg) from the image name
        baseName = os.path.splitext(imageName)[0]
        fullTempPath = os.path.join(tempPath, 'time_' + baseName + ".txt")
        print(text)

        # saving the text for every image in a separate .txt file
        with open(fullTempPath, "w") as file1:
            file1.write(text)


if __name__ == '__main__':
    main()
```
Input Image :
image_sample1

Output :
geeksforgeeks geeksforgeeks
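The loop above hands every entry returned by os.listdir to Image.open, so a stray non-image file or a sub-folder inside the input folder would typically raise an error. As a minimal sketch (the helper names `list_images` and `text_path_for` and the extension list are assumptions for illustration, not part of the original script), the entries can be filtered by extension and the output name built with os.path.splitext:

```python
import os

# Hypothetical list of extensions to treat as images; adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def list_images(folder):
    """Return the sorted names of files in `folder` that look like images."""
    names = []
    for name in os.listdir(folder):
        full = os.path.join(folder, name)
        # skip sub-folders and files with other extensions
        if os.path.isfile(full) and name.lower().endswith(IMAGE_EXTENSIONS):
            names.append(name)
    return sorted(names)


def text_path_for(image_name, out_folder):
    """Build the .txt path for an image, dropping its extension."""
    base = os.path.splitext(image_name)[0]
    return os.path.join(out_folder, base + ".txt")
```

In the script, looping over `list_images(path)` instead of `os.listdir(path)` would skip anything that is not an image, and splitext handles extensions of any length, such as .jpeg or .tiff.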
If you want to store all the text from the images in a single output file, the code is slightly different. The main difference is that the file is opened in "a+" mode, which appends the text and creates the output file if it does not already exist.
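The effect of this mode can be checked in isolation. The sketch below uses a temporary folder and a file called demo.txt (both assumptions chosen for illustration) to show that opening with "a+" creates a missing file and that a second open adds to the end rather than overwriting:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")  # hypothetical file name

    # first open: the file does not exist yet, so "a+" creates it
    with open(demo_path, "a+") as f:
        f.write("first\n")

    # second open: existing content is kept, new text goes to the end
    with open(demo_path, "a+") as f:
        f.write("second\n")

    with open(demo_path, "r") as f:
        content = f.read()

print(content)
```

Because the mode appends, running the OCR script twice adds the same text to the output file a second time. Delete the file first if a fresh result is wanted.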
```python
# extract text from all the images in a folder
# storing the text in a single file
from PIL import Image
import pytesseract as pt
import os


def main():
    # path for the folder for getting the raw images
    path = "E:\\GeeksforGeeks\\images"

    # link to the file in which output needs to be kept
    fullTempPath = "E:\\GeeksforGeeks\\output\\outputFile.txt"

    # iterating the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)

        # applying ocr using pytesseract for python
        text = pt.image_to_string(img, lang="eng")

        # saving the text for appending it to the output file
        # "a+" creates the file if not present
        # and if present then appends the text content
        with open(fullTempPath, "a+") as file1:
            # providing the name of the image
            file1.write(imageName + "\n")
            # providing the content in the image
            file1.write(text + "\n")

    # for printing the output file
    with open(fullTempPath, 'r') as file2:
        print(file2.read())


if __name__ == '__main__':
    main()
```
Input Images :
image_sample1
image_sample2

Output :
The script prints the contents of the single file created after extracting the text from every image inside the folder. The format of the file is -

Name of the image
Content of the image
Name of the next image
and so on .....
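The same layout can be produced without reopening the file for every image. The following is a minimal sketch, not the original script: `write_combined` is a hypothetical helper that takes already-extracted (image name, text) pairs, creates the output folder if it is missing (an assumption about what is wanted), and writes each name followed by its text.

```python
import os


def write_combined(records, output_path):
    """Append (image_name, text) pairs to one file, name first, then text."""
    # create the output folder if it is missing (assumed to be desirable)
    folder = os.path.dirname(output_path)
    if folder:
        os.makedirs(folder, exist_ok=True)

    # open once in append mode instead of once per image
    with open(output_path, "a", encoding="utf-8") as out:
        for image_name, text in records:
            out.write(image_name + "\n")
            out.write(text + "\n")
```

In the OCR script, each pair would be an image name from the input folder together with the string returned by `pt.image_to_string` for that image.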