How to load all pages of the pdf file to the program

alicenguyen · Jun-28-2022, 06:51 AM

I'm working on code to summarize text using BERT. I am stuck at the step "loading all pages of the PDF file into the program". My code below loads only one page. Please help with instructions. I am a new coder.

    f = open('/content/Example.pdf', 'rb')
    pdf = PdfFileReader(f)
    page = pdf.getPage(6)
    text = page.extractText()

***snippsat*** · Jun-28-2022, 09:31 AM

You should say which library you use; by the look of it, it's PyPDF2.
It's a common task, so if you search you will find different solutions.
A quick test shows that this works.

    import PyPDF2

    file_pdf = 'sample.pdf'
    with open(file_pdf, mode='rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for page in range(reader.numPages):
            p = reader.getPage(page)
            print(p.extract_text())
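For a summarizer, the page texts usually need to end up in one string rather than being printed. Here is a minimal sketch of that step. The helper name `join_page_text` is hypothetical, and it assumes only that each page object has an `extract_text()` method, as the pages in the loop above do.

```python
def join_page_text(pages, separator="\n"):
    """Return the text of every page as one string.

    `pages` is any iterable of objects that have an extract_text()
    method. The function name and signature are illustrative only.
    """
    parts = []
    for page in pages:
        # Assumption: an extractor may return None for a page with
        # no extractable text, so fall back to an empty string.
        text = page.extract_text() or ""
        parts.append(text)
    return separator.join(parts)


# Possible use with the reader from the post above (not run here):
#
# with open('sample.pdf', mode='rb') as f:
#     reader = PyPDF2.PdfFileReader(f)
#     pages = [reader.getPage(i) for i in range(reader.numPages)]
#     full_text = join_page_text(pages)
```

The call stays inside the `with` block because the pages are read from the open file.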

alicenguyen · Jun-28-2022, 09:41 AM

Yes. I use PyPDF2.
I will try your recommended code. Thanks.

alicenguyen

It works now, but the output does not look right. Could you review it and give me your advice?

Output:
Streaming output truncated to the last 5000 lines. o i n t t ..... ect
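Output like this, where almost every token is a single character, suggests the extractor did not manage to group the characters into words. A rough sketch of a check for that symptom is below. The function name `looks_fragmented` and the 0.5 threshold are illustrative assumptions, not part of PyPDF2 or any other library.

```python
def looks_fragmented(text, threshold=0.5):
    """Return True if most whitespace-separated tokens are one character long.

    A heuristic only: the threshold is an arbitrary starting point and
    may need tuning for a given document or language.
    """
    tokens = text.split()
    if not tokens:
        # Nothing was extracted, so there is nothing to judge.
        return False
    single_chars = sum(1 for token in tokens if len(token) == 1)
    return single_chars / len(tokens) > threshold


print(looks_fragmented("o i n t t"))
print(looks_fragmented("This is a normal sentence"))
```

If the check fires for a file, trying a different extraction library on that file is one option.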

***snippsat*** · (This post was last modified: Jun-28-2022, 10:09 AM by snippsat.)

Don't post the whole output if it's hundreds of lines long.
I have shortened it.
Are you using my code with a single PDF file?
Here is the file sample.pdf I tested with.

alicenguyen · Jun-29-2022, 02:49 AM

I used that code but with another file. Your file is OK.
See the attached file.

test.pdf (Size: 272.01 KB / Downloads: 207)

alicenguyen · Jul-01-2022, 02:43 AM

(Jun-29-2022, 02:49 AM)alicenguyen Wrote: I used that code but with another file. Your file is OK.
See the attached file.

Could you help with the answer?

Pedroski55 · Jul-01-2022, 06:01 AM

I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't get some text, for reasons I don't understand.

    # importing required modules
    import pdfplumber

    path2pdf = '/home/pedro/pdfExtractedPages/The_Knights_Tale_Modern_English.pdf'
    path2text = '/home/pedro/temp/The_Knights_Tale_Middle_English.txt'

    """
    >>> test = enumerate(pages, 1)  # 1 starts counting at 1, set 0 to count from zero
    >>> test
    <enumerate object at 0x7f586adb5880>
    >>> for t in test:
            print(t)

    (1, <Page:1>)
    (2, <Page:2>)
    (3, <Page:3>)
    """

    text_pages = []
    with pdfplumber.open(path2pdf) as pdf:
        # collect the text of every page, in order
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            text = page.extract_text() or ''
            text_pages.append(text)

    print('text_pages is now', len(text_pages), 'long. Now joining the list to a string ... ')
    textstring = ''.join(text_pages)

    # write the joined text to the output file
    with open(path2text, 'w') as tf:
        tf.write(textstring)

Gets the bits other modules can't reach! Worked for me, thank you snippsat!

**buran** · Jul-01-2022, 07:06 AM

You can try using https://pypi.org/project/camelot-py/

alicenguyen · Jul-04-2022, 03:14 AM

(Jul-01-2022, 06:01 AM)Pedroski55 Wrote: I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't get some text, for reasons I don't understand.

Thanks for supporting the code. The result is better now.
