page.get_text("blocks") on the attached PDFs returns empty array of blocks, could you please help me? #4213
Answered by JorjMcKie
krish-tech02 asked this question in Looking for help
@JorjMcKie I am trying to read the text of the attached PDFs with code that extracts the text blocks and processes them, but the output is empty. Could you please help?
Answered by JorjMcKie Jan 8, 2025
Replies: 1 comment 1 reply
What looks like text is no text! The pages show little vector graphics, one for each character. Running OCR on the page does recover the text:

```python
import pymupdf

doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
page = doc[0]
# OCR the full page at 150 dpi (requires a Tesseract installation)
tp = page.get_textpage_ocr(dpi=150, full=True)
# extract text from the OCR-based text page, sorted top-to-bottom, left-to-right
print(page.get_text(textpage=tp, sort=True))
```

Output:

```
eis Advocate Health Care | © Aurora Health Care Understanding Alcohol Withdrawal Alcohol affects your brain and body. When you stop drinking alcohol after regular or heavy drinking, changes happen in your body. This can lead to withdrawal symptoms. Quitting alcohol may be tough. There is supportto help you. ...
```
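The answer's approach can be folded into a small fallback helper. This is a minimal sketch, not part of the original answer: it tries normal extraction first and runs OCR only when the page yields no text. The function name `text_or_ocr` and the `dpi` default are assumptions for illustration; `page` is expected to be a `pymupdf.Page`, and OCR needs Tesseract to be available to PyMuPDF.

```python
def text_or_ocr(page, dpi=150):
    """Return page text, falling back to OCR when none is extractable.

    `page` is assumed to be a pymupdf.Page. The function name and the
    dpi default are illustrative choices, not part of PyMuPDF.
    """
    text = page.get_text(sort=True)
    if text.strip():
        return text  # the page has a real text layer
    # No text layer: OCR the whole page, then extract from the OCR result
    tp = page.get_textpage_ocr(dpi=dpi, full=True)
    return page.get_text(textpage=tp, sort=True)
```

With a real file this would be called as `text_or_ocr(pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")[0])`. Checking for native text first avoids the cost of OCR on pages that do not need it.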
Answer selected by krish-tech02
You can picture how such a page is built like this: to draw the capital letter "A", draw the lines "/", "-", "\" to get "/-\". The same goes for any character with curved lines ... you get the argument. The only way to access the text is using OCR.
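A page like this can be spotted before resorting to OCR: it has no text blocks but a large number of vector drawings. The sketch below is a heuristic under stated assumptions, not a PyMuPDF feature: the function name `looks_like_drawn_text` and the threshold `min_drawings` are hypothetical, and `page` is expected to be a `pymupdf.Page`.

```python
def looks_like_drawn_text(page, min_drawings=50):
    """Heuristic: True if the page has no text but many vector paths.

    `min_drawings` is an arbitrary illustrative threshold; tune it
    for your own documents.
    """
    blocks = page.get_text("blocks")
    # Each block is (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 means text, 1 means image.
    has_text = any(b[6] == 0 and b[4].strip() for b in blocks)
    if has_text:
        return False
    # Characters drawn as line art show up as vector paths instead
    return len(page.get_drawings()) >= min_drawings
```

A `True` result only suggests that the visible characters are line art; a page of charts or diagrams without any text would trigger it as well.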