page.get_text("blocks") on the attached PDFs returns empty array of blocks, could you please help me? #4213

krish-tech02 · 2025-01-08T12:28:10Z

krish-tech02
Jan 8, 2025

@JorjMcKie I am trying to read the text of the attached PDFs using the code below:

    # Extract text blocks and process them
    output = []
    for page in doc:
        output += page.get_text("blocks")

But the output is empty. Could you please help?
Alcohol Withdrawal 1-5-2025.pdf
Back Exercises, Lumbar 1-3-2025.pdf

Answered by JorjMcKie


JorjMcKie · 2025-01-08T12:51:31Z

JorjMcKie
Jan 8, 2025
Maintainer

What looks like text is not text! The pages contain small vector graphics, one for each character. You can picture the approach like this:
To draw the capital letter "A", draw the lines "/", "-", "\" to get "/-\". The same goes for characters with curved lines ... you get the idea.
The only way to access the text is OCR.
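One way to check whether a page falls into this category is to compare what text extraction returns with the number of vector drawing paths on the page. The following is only a rough sketch: the helper name `looks_like_vector_drawn_text` and the `min_drawings` threshold are hypothetical, and `page` is assumed to be a PyMuPDF `Page` object.

```python
def looks_like_vector_drawn_text(page, min_drawings=50):
    """Heuristic: the page has no extractable text but many vector paths.

    `page` is assumed to be a pymupdf.Page. `min_drawings` is an arbitrary
    illustrative threshold, not a PyMuPDF constant.
    """
    # Normal text extraction; empty when glyphs are drawn as line art.
    if page.get_text("text").strip():
        return False
    # get_drawings() returns one dict per vector path on the page.
    return len(page.get_drawings()) >= min_drawings
```

A page that returns `True` here is a likely candidate for OCR. A scanned page (one big image) would return `False` because it has few drawing paths, even though it also needs OCR.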

    import pymupdf

    doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
    page = doc[0]
    # OCR the full page at 150 dpi, then extract text from the OCR'd textpage
    tp = page.get_textpage_ocr(dpi=150, full=True)
    print(page.get_text(textpage=tp, sort=True))

eis Advocate Health Care | © Aurora Health Care Understanding Alcohol Withdrawal Alcohol affects your brain and body. When you stop drinking alcohol after regular or heavy drinking, changes happen in your body. This can lead to withdrawal symptoms. Quitting alcohol may be tough. There is supportto help you. ...
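The same idea could be applied to the original `"blocks"` loop: try normal extraction first and fall back to OCR only when a page yields no text blocks. This is a minimal sketch, not an official recipe. The helper name `get_blocks_with_ocr_fallback` is hypothetical, `page` is assumed to be a PyMuPDF `Page`, and the OCR call assumes Tesseract is installed and discoverable by PyMuPDF.

```python
def get_blocks_with_ocr_fallback(page, dpi=150):
    """Return text blocks for `page`, using OCR only if none are found.

    Each block is a tuple whose last item is the block type
    (0 = text, 1 = image); only text blocks are kept.
    """
    blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    if blocks:
        return blocks
    # No extractable text: OCR the full page and extract from that textpage.
    tp = page.get_textpage_ocr(dpi=dpi, full=True)
    return [b for b in page.get_text("blocks", textpage=tp) if b[6] == 0]
```

With a document opened via `pymupdf.open(...)`, the loop from the question would then become `output += get_blocks_with_ocr_fallback(page)` for each page. OCR is much slower than plain extraction, which is why the sketch only uses it as a fallback.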

1 reply

krish-tech02 Jan 8, 2025
Author

Thanks @JorjMcKie
