Of course, almost all PDFs 'contain text' in the sense of having text that you can read, but I'm talking here about the difference between those in which that's just a bitmap that only gets interpreted as text by the brain of the human looking at the screen, versus those which also contain text as far as the computer is concerned.

This can be nonobvious for a PDF scanned from paper. Sometimes what you see on the screen looks like a blurry, imperfect picture of text straight from the scan, yet the PDF has been through OCR. You are shown the original bitmap and gain no benefit from the OCR while reading normally, but the recognized text is still there, hidden in the file. Two ways to reveal it in a PDF reader:

  1. Try to select text with the mouse.

  2. Try searching for a word.

Of course it can also happen that some of the text has been OCR'd but not all.

In cases where both the above tests come up negative, is it then possible to say "no, this PDF does not contain embedded text," or can embedded text still be hidden in the file?

For example:

https://pdf.datasheetcatalog.com/datasheets/2300/45014_DS.pdf

As far as I can tell, the above PDF is all bitmaps, no embedded text. Is that correct, or am I still missing something?

  • am I still missing something? ....... Pretty much the obvious. Acrobat has the ability to create editable text. That is roughly how it is done. Commented May 18, 2024 at 11:28
  • @John Right, but I am mainly dealing with PDFs that were scanned from paper. Commented May 18, 2024 at 12:40

1 Answer

One approach is to extract the text and test whether the result is empty. For example, in bash:

    # Create a PDF containing text.
    $ echo Text | pandoc -o t.pdf
    # Extract text and do the required test.
    $ mutool draw -F text t.pdf | sed -n '/[[:graph:]]/q1' && echo NoGraph || echo Graph
    Graph
    #
    # Create a PDF that contains no text.
    $ echo NoText | magick text:- nt.pdf
    # The same test
    $ mutool draw -F text nt.pdf | sed -n '/[[:graph:]]/q1' && echo NoGraph || echo Graph
    NoGraph

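The regular expression [[:graph:]] matches visible characters only, that is, any character except spaces, control characters, and so on. If output consisting only of spaces should also count as text, use the less restrictive [[:print:]] (visible characters plus the space character) instead. Note that q1 (quit with exit status 1) is a GNU sed extension.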
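To make the difference between the two character classes concrete, here is a small sketch. It uses grep -q in place of sed (my substitution, chosen because it is portable), and the helper name classify is made up for illustration.

```shell
# classify CLASS STRING: print "match" if STRING contains a character
# in the given POSIX bracket expression, otherwise "no match".
# (classify is a hypothetical helper, not part of any tool.)
classify() {
    if printf '%s\n' "$2" | grep -q "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

classify '[[:graph:]]' '   '   # spaces only   -> no match
classify '[[:print:]]' '   '   # spaces only   -> match
classify '[[:graph:]]' ' a '   # a visible "a" -> match
```

So a page whose extracted text is nothing but spaces counts as empty under [[:graph:]] and as non-empty under [[:print:]].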

mutool is part of MuPDF. Of course, you can use pdftotext file.pdf - (from poppler-tools) instead, or any other text extraction tool.
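Since the question mentions that sometimes only part of a document has been OCR'd, the same test can be run page by page. The following is a minimal sketch, not a definitive tool. It assumes pdftotext and pdfinfo (both from Poppler) are on PATH, and the function names has_visible_text, page_count and report_pages are my own.

```shell
#!/bin/sh
# Sketch: report, page by page, whether pdftotext extracts any visible text.
# Assumes pdftotext and pdfinfo (Poppler) are installed.

# Succeeds if standard input contains at least one visible character.
has_visible_text() {
    grep -q '[[:graph:]]'
}

# Print the page count that pdfinfo reports for the PDF in $1.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ { print $2 }'
}

# Print one line per page: "<file> page <n>: text" or "<file> page <n>: no text".
# Returns 2 if the page count cannot be determined.
report_pages() {
    pdf=$1
    pages=$(page_count "$pdf")
    if [ -z "$pages" ]; then
        echo "$pdf: cannot read page count" >&2
        return 2
    fi
    n=1
    while [ "$n" -le "$pages" ]; do
        # -f and -l restrict extraction to the single page n.
        if pdftotext -f "$n" -l "$n" "$pdf" - 2>/dev/null | has_visible_text; then
            echo "$pdf page $n: text"
        else
            echo "$pdf page $n: no text"
        fi
        n=$((n + 1))
    done
}

# Check every PDF named on the command line.
for f in "$@"; do
    report_pages "$f"
done
```

Saved under a name of your choosing, it would be run as sh script.sh file.pdf. A "no text" result only means that this extractor found nothing on that page, and a failed extraction is reported the same way, so it is worth cross-checking a surprising result with a second tool such as mutool.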
