Sometimes when I do pdftotext it results in perfect text. I assume this is because the actual unicode text data is embedded directly in the PDF itself, and simply read out.

But other times (around half or more of documents that aren't just straight up scanned images) it results in ~strange glyphs~ in place of things like diacritics and accent marks, or sometimes even what seem to be blurry letters.

For example, this Yoruba dictionary PDF has these problems. If you run this:

pdftotext yoruba.pdf yoruba.txt 

You end up with these words scattered about:

expected     actual
--------     ------
lairotẹle    lairot4ille
ikọsilẹ      ikljlsil4il
logó         logb

Notice the accented ó became the letter b. But it's not as if every ó becomes a b in the doc. Many do, but not all. Same with the ẹ becoming 4il: many of them end up like this, probably all of them. Most of the time (going by my sense of it) the more obscure accent marks / diacritics, like the one on ẹ, get converted into stranger characters or character sequences.
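One rough way to see how consistent these substitutions are is to diff each expected word against what pdftotext produced and count the replacements. The sketch below uses Python's standard difflib; `tally_substitutions` is a hypothetical helper name, and the word pairs are the three from the table above. Treat it as a heuristic, not proof: a mapping that is the same every time hints at a font-encoding problem, while one that varies from word to word looks more like OCR guesswork.

```python
import difflib
import unicodedata
from collections import Counter


def tally_substitutions(pairs):
    """Count which expected substrings were replaced by which extracted ones."""
    counts = Counter()
    for expected, actual in pairs:
        # Normalize so precomposed and combining forms of a letter compare equal.
        expected = unicodedata.normalize("NFC", expected)
        actual = unicodedata.normalize("NFC", actual)
        matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(expected[i1:i2], actual[j1:j2])] += 1
    return counts


# Word pairs copied from the table in the question.
pairs = [
    ("lairotẹle", "lairot4ille"),
    ("ikọsilẹ", "ikljlsil4il"),
    ("logó", "logb"),
]

for (want, got), n in tally_substitutions(pairs).most_common():
    print(f"{want!r} -> {got!r}: {n}")
```

On these three pairs it reports ẹ -> 4il twice, and ọ -> ljl and ó -> b once each. Running it over a larger sample of words would show whether ó really maps to b only some of the time.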

Why is this? Is it an OCR thing? Or does the PDF actually have the plain text embedded in it (i.e. it's not just a document scanned to an image), and yet it's somehow not being decoded properly? I would like to know the answer, so that at least I know whether it's an OCR problem or an encoding/decoding problem.

If it's an encoding problem, that would be interesting. Then my question is, can I tell pdftotext to use some obscure decoding technique? Or what.

I bring this up partially because I've discovered some webpages recently that are encoded in either ucs2 or latin1, some even in some strange windows2255 or some encoding. So I've had to tinker with the encoding/decoding to properly extract the text in HTML documents. I'm wondering if the same thing applies to PDFs in this case.

Another document that suffers from this problem is the Navajo dictionary. I don't know if it's an OCR thing or an encoding thing. Another strange document is "Zulu-English Dictionary by Forgotten Books" (which I would link to, but it downloads directly instead of being rendered in the browser). If you copy/paste the text, the letters are separated by 1 or 2 spaces in seemingly random fashion. I have no idea why, and would like to have a better sense of it.

1 Answer

The short answer is that there is no error in pdftotext. It is faithfully reporting the embedded OCR text, and it was the Google scanner that corrupted the initial scanned input.

This is a natural failing of OCR: it has to guess the language's diacritics and accents from context, and since the page mixes multiple languages, it will with very high probability make uneducated, stupid mistakes.

The fewer languages mixed on a page, the better the chance of recognising English word combinations rather than mashing together a guesstimate.
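To test the encoding hypothesis from the question separately from the OCR explanation, one option is to look at the fonts the PDF declares. The sketch below assumes poppler's `pdffonts` tool (shipped alongside pdftotext) is installed and that its table keeps the usual `emb sub uni object ID` columns on the right. `parse_pdffonts` and `fonts_missing_tounicode` are hypothetical helper names. A font whose `uni` column says `no` has no ToUnicode map, which is one common reason extracted text comes out as the wrong characters.

```python
import subprocess


def parse_pdffonts(output):
    """Parse the table printed by `pdffonts` into one dict per font."""
    fonts = []
    seen_rule = False
    for line in output.splitlines():
        if not seen_rule:
            # Skip everything up to and including the dashed rule under the header.
            seen_rule = line.startswith("---")
            continue
        parts = line.split()
        if len(parts) < 8:
            continue
        # The type column may contain spaces ("CID TrueType"), so read the
        # yes/no columns by counting from the right-hand side instead.
        emb, sub, uni = parts[-5], parts[-4], parts[-3]
        fonts.append({
            "name": parts[0],  # assumes the font name itself has no spaces
            "embedded": emb == "yes",
            "subset": sub == "yes",
            "to_unicode": uni == "yes",
        })
    return fonts


def fonts_missing_tounicode(pdf_path):
    """Run pdffonts on a file and return the names of fonts lacking a ToUnicode map."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [f["name"] for f in parse_pdffonts(result.stdout) if not f["to_unicode"]]
```

Calling `fonts_missing_tounicode("yoruba.pdf")` would list the suspect fonts. If every font reports a ToUnicode map and the text is still wrong, the wrong characters are most likely stored in the file itself, which fits the OCR explanation above.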
