pdfminer vs pdfplumber

pprod · Jan-30-2021, 09:39 AM

Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. pdfminer does a better job of extracting text from an unstructured PDF, but it doesn't seem easy to use. It looks like it takes a lot more code to open a PDF on a per-page basis with pdfminer than with pdfplumber.

Does anyone know of a more concise way to do that in pdfminer than the one shown below?

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice

    fp = open('file', 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = PDFDevice(rsrcmgr)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

Thanks!

**Larz60+** · Jan-30-2021, 12:17 PM

Most people use pdfminer.six now.

I'm not familiar with pdfplumber, but it looks interesting. Let us know your experience with it.

Please keep in mind that a PDF file is a very complicated object and can take many forms.
For example, its contents can be any combination of:

images
pure text
tables
text as images (which can only be extracted using some form of OCR)

And I probably missed some.
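One rough way to see which of these a given page contains is to count the layout objects pdfminer.six reports for it. This is only a minimal sketch, assuming pdfminer.six is installed; `count_element_types` and `describe_pdf` are hypothetical helper names, and the path in the usage comment is a placeholder.

```python
from collections import Counter


def count_element_types(layout):
    """Count the top-level layout objects on one page by class name."""
    return Counter(type(element).__name__ for element in layout)


def describe_pdf(path):
    """Yield (page_number, Counter) for every page of the PDF at `path`."""
    # Imported lazily so the counting helper above works on its own.
    from pdfminer.high_level import extract_pages

    for page_number, layout in enumerate(extract_pages(path), start=1):
        yield page_number, count_element_types(layout)


# Usage (placeholder path):
# for page_number, counts in describe_pdf("example.pdf"):
#     print(page_number, dict(counts))
```

A page that reports text boxes (for example `LTTextBoxHorizontal`) has extractable text. A page that reports only `LTFigure` objects, which is where embedded images usually end up, is quite possibly a scan and would need OCR. Treat that as a heuristic, not a rule.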

The documentation for pdfminer.six shows some rather simple methods: https://pdfminersix.readthedocs.io/en/la...level.html
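For reference, the high-level API boils the whole thing down to a single call. A minimal sketch, assuming pdfminer.six is installed: `pdf_text` and `to_zero_based` are hypothetical helper names, while `extract_text` and its `page_numbers` parameter (zero-based page indices) come from `pdfminer.high_level`.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return [page - 1 for page in pages]


def pdf_text(path, pages=None):
    """Return the text of the PDF at `path`.

    If `pages` is None or empty, the whole document is extracted;
    otherwise only the given 1-based pages are.
    """
    # Imported lazily so the helper above can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    page_numbers = to_zero_based(pages) if pages else None
    return extract_text(path, page_numbers=page_numbers)


# Usage (placeholder path):
# print(pdf_text("example.pdf"))          # whole document
# print(pdf_text("example.pdf", [1, 2]))  # first two pages only
```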

pprod · Jan-30-2021, 01:35 PM

I'll try to get my head around pdfminer.six, but I'm struggling to understand how to make it extract text page by page instead of the whole document at once. For that purpose I recommend pdfplumber:

    import pdfplumber

    with pdfplumber.open(r'...\file.pdf') as pdf:
        # Extract and print the text of the first two pages, one page at a time
        for page_nr in range(2):
            page = pdf.pages[page_nr]
            text = page.extract_text()
            print(text)
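For comparison, pdfminer.six can also go page by page through `extract_pages` from `pdfminer.high_level`, which yields one layout object per page. A minimal sketch, assuming pdfminer.six is installed; `page_text` and `iter_page_texts` are hypothetical helper names, and the path in the usage comment is a placeholder.

```python
def page_text(layout, text_types):
    """Join the text of the layout elements that are instances of `text_types`."""
    return "".join(
        element.get_text() for element in layout if isinstance(element, text_types)
    )


def iter_page_texts(path, max_pages=0):
    """Yield (page_number, text) for each page; max_pages=0 means no limit."""
    # Imported lazily so page_text() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = extract_pages(path, maxpages=max_pages)
    for page_number, layout in enumerate(pages, start=1):
        yield page_number, page_text(layout, LTTextContainer)


# Usage, mirroring a loop over the first two pages (placeholder path):
# for page_number, text in iter_page_texts("file.pdf", max_pages=2):
#     print(text)
```

Filtering on `LTTextContainer` skips figures, lines, and rectangles, so only text boxes contribute to the result. The text may still differ from what pdfplumber returns for the same page, since the two libraries group characters into lines differently.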
