Imaging and OCR
K.T.Anuradha
National Centre for Science Information
Indian Institute of Science
Bangalore – 560 012
(E-Mail: anu@ncsi.iisc.ernet.in)
15-20 April 2002 Imaging and OCR PI-3 1
Goals of This Presentation
To give an overview of the imaging and
Optical Character Recognition (OCR) process
What Will You Learn?
You will get an overview of the imaging and
OCR process
What you need to do in the lab:
Scan some specific documents and, using a few of the
installed OCR software packages, convert the scanned
images to text
Historical Perspective
M. Sheppard's invention, GISMO, a Robot Reader-Writer,
in 1951
J. Rainbow developed a prototype machine in 1954
able to read uppercase typewritten output at the
“fantastic” speed of one character per minute
IBM, Recognition Equipment, Inc., Farrington,
Control Data, and Optical Scanning Corp marketed
OCR systems by 1967
NASA used imaging systems to enhance and
manipulate satellite images
Historical Perspective
Several standards were developed:
Character Set for Optical Character Recognition (OCR-A). ANSI X3.17-81
Character Set for Optical Character Recognition (OCR-B). ANSI X3.49-75
Paper Used in Optical Character Recognition Systems. ANSI X3.62-87
Optical Character Recognition (OCR) Inks. ANSI X3.86-80
Optical Character Recognition (OCR) Character Position. ANSI X3.93-81
Applications
Industries and Institutions in which control of
large amounts of paper work is critical
Banking, Credit cards, Insurance industries
The medical community
To capture, store and transmit radiology images
Libraries and archives
For conservation and preservation of vulnerable
documents and for the provision of access to
source documents
Glossary
Glyph – the image of a character rendered in pixels.
Raster – the pattern of scan lines that forms an image, as
traced on a kinescope (a CRT, Cathode Ray Tube, such as that
used in computer displays)
Text image – the content of a text record, often the
contents of a page of text.
Pixels (PICture ELements), or pels – image sample areas
that are almost always square. Arranged in a grid, pixels
form a raster image. Scanning a page of a paper or microform
document creates a digital image that is a raster of pixels.
More about Pixels
All pixels are identical in size and arrangement.
All pixels are processed the same way.
All pixels are scanned, displayed, and printed the
same way.
Each pixel has a location and a colour.
Both are given as numbers.
Location: row and column, like latitude and longitude
Colour: amounts of red, green and blue
Maximum on all three is white; minimum on all three is black
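The idea that a pixel is just a location plus three colour numbers can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the helper name `make_raster` and the 0-255 range per channel (8 bits each, as in 24-bit colour) are assumptions made for the example.

```python
# A raster is a grid of pixels. Each pixel is addressed by (row, column)
# and holds three numbers: the amounts of red, green and blue.
# Assumption: each amount ranges from 0 to 255 (8 bits per channel).
WHITE = (255, 255, 255)  # maximum on all three channels
BLACK = (0, 0, 0)        # minimum on all three channels


def make_raster(rows, cols, colour=WHITE):
    """Return a rows x cols grid in which every pixel has the same colour."""
    return [[colour for _ in range(cols)] for _ in range(rows)]


raster = make_raster(3, 4)   # 3 rows, 4 columns, all white
raster[1][2] = BLACK         # set the pixel at row 1, column 2 to black

print(raster[1][2])
print(raster[0][0])
```

Every pixel is stored and addressed the same way, which is why all pixels can be scanned, displayed and printed uniformly.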
Bit-Mapped Images
A bit-mapped image is a raster of
pixels.
Printed as a raster.
Can be created by raster scanning.
Can be created by a RIP (Raster
Image Processor) in a printer.
How many shades
Main types of image shades
1-bit black and white or bi-tonal: no shades
between black and white
4-bit gray scale: 16 shades of gray
8-bit gray scale: 256 shades of gray
8-bit colour: each pixel can be one of 256 colours
24-bit colour: 16.8 million colours
32- and 42-bit colour: not used much; preferred by
photographers
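The shade counts above follow from the bit depth: n bits per pixel can encode 2 to the power n distinct values. A small sketch (the function name `shades` is chosen for this example only):

```python
def shades(bits_per_pixel):
    """Number of distinct values a pixel can take at a given bit depth."""
    return 2 ** bits_per_pixel


for bits in (1, 4, 8, 24):
    print(bits, "bit:", shades(bits))
```

For 24 bits this gives 16,777,216, which the slide rounds to 16.8 million.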
Resolution
The number of dots per inch (dpi) determines the
resolution
The higher the dpi, the larger the file size
A 1-bit black and white image at 100 dpi
requires 10 Kb of storage, and a 24-bit colour
image at 400 dpi requires 475 Kb of storage
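How resolution and bit depth drive storage can be worked out directly: pixels = (width x dpi) x (height x dpi), and uncompressed size = pixels x bits per pixel / 8 bytes. The sketch below is illustrative only. The slide does not state the image dimensions, so the one-square-inch area and the 8.5 x 11 inch page used here are assumptions, and the function name is hypothetical.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed raster size in bytes for a scan of the given dimensions."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel / 8


# Assumed area: one square inch
print(uncompressed_size_bytes(1, 1, 100, 1))    # 1-bit at 100 dpi
print(uncompressed_size_bytes(1, 1, 400, 24))   # 24-bit colour at 400 dpi

# Assumed area: a full 8.5 x 11 inch page, 1-bit at 300 dpi
print(uncompressed_size_bytes(8.5, 11, 300, 1))
```

Doubling the dpi quadruples the pixel count, so file size grows with the square of the resolution.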
Image Transmission and Access
On the Net via a standard protocol such as TCP/IP
Transferring a single archival image over a 56 Kbps
line requires about 18 minutes; a thumbnail transfers within
seconds. The LAN should support 10 Mbps to 100 Mbps
A 19-inch colour monitor that supports 1024 by
768 resolution is ideal.
Printers range from desktop monochrome laser printers with
300 to 600 dpi to the more expensive gray scale and colour
laser printers
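The transfer times quoted for a 56 Kbps line can be checked with simple arithmetic: time = size in bits / line speed in bits per second. This is a rough sketch that ignores protocol overhead and compression. The 7.56 MB archival size and 35 KB thumbnail size are assumed values picked to match the "several Mb" and "10-35 Kb" ranges given elsewhere in the slides.

```python
def transfer_time_seconds(size_bytes, link_bits_per_second):
    """Idealised transfer time, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


archival = transfer_time_seconds(7_560_000, 56_000)   # assumed 7.56 MB image
thumbnail = transfer_time_seconds(35_000, 56_000)     # assumed 35 KB thumbnail

print(archival / 60, "minutes")
print(thumbnail, "seconds")
```

Under these assumptions the archival image takes 18 minutes and the thumbnail 5 seconds, in line with the figures on the slide.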
Types of images
Thumbnail
Lets the viewer judge whether the full image is worth viewing;
requires about 10-35 Kb of storage space for each image
Service
Designed to convey information; typically
compressed, requiring up to 300 Kb for each image
Archival
Uncompressed image free of the artifacts resulting from
compression; these highest-quality images require several Mb
each
Indexing of Images
Images are indexed so that they can be identified and
retrieved
E.g. purchase order number, policy number,
account number, profile number, ISSN
The MARC format for bibliographic records has
some limitations in indexing images
Two alternatives to MARC are Dublin Core
and EAD (Encoded Archival Description)
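As a rough illustration of indexing, the sketch below describes each image with a few Dublin Core element names (`identifier`, `title`, `format`) and builds a lookup table keyed on the identifier. The record values, the extra `file` field and the function name `build_index` are hypothetical and only serve the example; a real system would follow its own metadata schema.

```python
# Each record describes one image. "identifier", "title" and "format"
# are Dublin Core element names; "file" is a hypothetical extra field
# pointing at the stored image.
records = [
    {"identifier": "PO-1001", "title": "Purchase order 1001",
     "format": "image/tiff", "file": "po_1001.tif"},
    {"identifier": "POL-2002", "title": "Insurance policy 2002",
     "format": "image/tiff", "file": "pol_2002.tif"},
]


def build_index(records, key="identifier"):
    """Map each record's key value to the record for direct retrieval."""
    return {record[key]: record for record in records}


index = build_index(records)
print(index["PO-1001"]["file"])
```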
Image formats
Raster
Bit-mapped graphics, composed of coloured dots.
Common formats include .tiff (Tagged Image File Format:
basis for all image files), .jpg (Joint Photographic Experts
Group, for continuous-tone images), .gif (for colour images),
.mpg (Moving Picture Experts Group), .bmp, .pdf
Images are edited in paint and photo programs like Adobe
PhotoShop and Metacreations Painter
Vector
Mathematically defined, with coded instructions that
define the angles and relationships between every
line in the image.
Common vector formats include .wmf and .cgm
Images are edited in drawing programs like Adobe
Illustrator and CorelDraw.
Image formats: uses and advantages
Raster
Uses: continuous-tone images, e.g. photographs; on the web,
where no vector formats are currently supported
Advantages: only format that will show the smooth gradients
and subtle detail necessary in photographic images; allows
colour correction much more easily than vector images
Vector
Uses: logos with a few solid colours that need to be
shown at a variety of sizes; creating specialized text
effects; 3D and CAD programs
Advantages: resolution independent; smooth curves; small
file sizes
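Resolution independence can be illustrated with a toy example: a vector image stores a line as two end points, and the same description can be rendered onto a pixel grid of any size. The sketch below is a simplified sampling approach written for this handout, not the method used by any particular drawing program; the function name and the unit-square coordinates are assumptions.

```python
def rasterize_line(start, end, size):
    """Render a line given in unit coordinates (0..1) onto a size x size grid."""
    grid = [[0] * size for _ in range(size)]
    steps = 2 * size  # sample densely enough to avoid gaps
    for i in range(steps + 1):
        t = i / steps
        x = start[0] + t * (end[0] - start[0])
        y = start[1] + t * (end[1] - start[1])
        col = min(int(x * size), size - 1)  # clamp the end point into the grid
        row = min(int(y * size), size - 1)
        grid[row][col] = 1
    return grid


# The same vector description rendered at two resolutions
for size in (4, 8):
    for row in rasterize_line((0, 0), (1, 1), size):
        print("".join("#" if pixel else "." for pixel in row))
    print()
```

The vector description stays two points long at any output size, while the raster grows with the square of the grid size.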
Image capture interfaces
IDE
Widely used, low cost, poorest seek time
SCSI
Faster seek time, costs more, 40-160 MB/sec
USB (Universal Serial Bus)
Ease of setup, 1.5 MB/sec (12 Mbit/sec, USB 1.1)
IEEE 1394
Initially developed by Apple, up to 3.2 Gbit/sec, not
supported by all PCs
Image Drivers
An image driver is required for an image capture
device to communicate with software applications.
Two standards are available
ISIS
Proprietary product developed by Pixel Translations
TWAIN
Developed and designed by the TWAIN Working Group, which in
1999 adopted TWAIN 1.7 as the current standard
Selecting Imaging System
Imaging system selection depends on the type of
application
Workflow or transaction processing systems: focus on
processing documents and automating the process;
capture and store images without alteration. E.g.
purchase orders, invoices, credit card charges and
insurance policies
Storage and retrieval systems: store and retrieve large
numbers of documents in a variety of types and formats;
capture and enhance them to facilitate readability. E.g.
medical and library communities
Types of Imaging System
Drum Scanners: High-end scanners
Use photomultiplier tubes
Expensive and sensitive devices
Flatbed Scanners
Ideal for odd-sized images
Sheetfed Scanners
Can scan only loose sheets
Compact in size and easy to install
Handheld scanners
Provide portability and functionality at low cost
What, Why and When of OCR
Allows printed, typewritten or handwritten
text (numerals, letters or symbols) to be scanned
and/or a scanned image to be converted to a
computer-processable format, such as
plain text, a Word document or
an Excel spreadsheet, which can be edited,
used or reused in other documents
It uses raster images
What, Why and When of OCR
OCR is used when recreating a document in
electronic form by retyping would take more time
The converted text files take less space than
the original image file and can be indexed
Bridges the gap between the paperless and
the papered
How of OCR
It has three components:
Image scanner, OCR hardware/software, Output
interface
How of OCR
A scanner has four components:
A detector, an illumination source, a scan lens
and a document transport
OCR hardware/software performs three
operational steps:
Document analysis, character recognition,
contextual processing
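Of the three steps, contextual processing is the easiest to illustrate: after character recognition, a dictionary is used to repair words that were misread. The sketch below is a toy version using Python's standard `difflib` module. The word list, the 0.7 similarity cutoff and the function name `correct_word` are assumptions for the example; commercial OCR products use their own, more elaborate techniques.

```python
import difflib

# Hypothetical dictionary; a real system would use a full word list
DICTIONARY = ["document", "scanner", "image", "character", "recognition"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.7):
    """Replace a recognised word with its closest dictionary entry, if any."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# "rn" misread in place of "m" is a typical recognition error
print(correct_word("docurnent"))
print(correct_word("scanner"))
```

Words with no sufficiently close dictionary entry are left unchanged, so proper nouns and codes are not silently altered.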
How of OCR
Output Interface
Allows character recognition results to be
electronically transferred into the domain that
uses the results
Types of OCRs
Two types of OCRs
Task specific readers
General purpose readers
Task specific readers
Read only specific documents: bank cheques, mail
addresses
Used primarily for high-volume applications which
require high system throughput: assigning ZIP Codes to
letter mail, reading data entered in forms (e.g., tax
forms), automatic accounting procedures used in
processing utility bills
Types of OCRs
General purpose page readers
High end OCR (usually for offices)
Speed and Accuracy are important
Format preservation
Good proof reading solutions
Low end OCR (usually for home use)
Speed is not required
Proof reading is done manually
Factors affecting OCR quality
Scanner quality
Scan resolution
Type of printed documents, whether laser printer
outputs or photocopied
Paper quality
Fonts used in the text
Linguistic complexities
Dictionary used
Evaluating OCRs
Neat interface
Easy-to-use wizards
Accurate recognition
Scan resolution setting (600 dpi is advisable)
Time taken from scanning to deliver the final
product
Enhanced usability of the product
Ability to modify the scan setting
Summarizing
We learnt the basics of imaging systems and images
Different steps involved in the OCR technique and
scanning
Conversion of raster images to text using OCR
techniques
Types of imaging systems and OCR software
Evaluation of imaging systems and OCR software
Questions?
Comments?
Discussions?
(Please fill in the feedback form)
Thank You!