Python Forum
pdfminer vs pdfplumber
#1
Hi,
I have been testing pdfplumber and pdfminer, and at this stage I am not sure which one I prefer. pdfminer does a better job of extracting text from an unstructured PDF, but it doesn't seem easy to use. It looks like it takes a lot more code to open a PDF on a per-page basis with pdfminer than with pdfplumber.

Does anyone know of a more concise way to do that in pdfminer than shown below:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

fp = open('file', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
device = PDFDevice(rsrcmgr)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)

Thanks!
#2
Most use pdfminer.six now.

I'm not familiar with pdfplumber, but it looks interesting. Let us know your experience with it.

Please keep in mind that a PDF file is a very complicated object and can take many forms. For example, its contents can be any combination of:
  • images
  • pure text
  • tables
  • text as images (which can only be extracted using some form of OCR)
And I probably missed some.

The documentation for pdfminer.six shows some rather simple methods: https://pdfminersix.readthedocs.io/en/la...level.html
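
For page-by-page text, a minimal sketch along these lines may help. It assumes a pdfminer.six version that provides `extract_pages` in `pdfminer.high_level`; the helper names `page_text` and `iter_page_texts` are made up for illustration. The pdfminer.six documentation filters elements with `isinstance(element, LTTextContainer)`; the sketch checks for a `get_text()` method instead, so that the helper has no import-time dependency.

```python
from typing import Iterable, Iterator


def page_text(layout_elements: Iterable[object]) -> str:
    """Concatenate the text of every element on one page that has get_text()."""
    parts = []
    for element in layout_elements:
        # Text boxes expose get_text(); figures, lines and images do not.
        get_text = getattr(element, "get_text", None)
        if callable(get_text):
            parts.append(get_text())
    return "".join(parts)


def iter_page_texts(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page in turn (requires pdfminer.six)."""
    # Imported lazily so page_text() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(pdf_path):
        yield page_text(page_layout)


# Usage (hypothetical file name):
# for number, text in enumerate(iter_page_texts("file.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```

Because `iter_page_texts` is a generator, pages are processed one at a time rather than the whole document at once.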
#3
(Jan-30-2021, 12:17 PM)Larz60+ Wrote: Most use pdfminer.six now.

I'll try to get my head around pdfminer.six, but I'm struggling to understand how I can make it extract text page by page instead of the whole document at once. For that purpose I recommend pdfplumber:

import pdfplumber

with pdfplumber.open(r'...\file.pdf') as pdf:
    # Extract and print the first two pages, one page at a time
    for page_nr in range(2):
        page = pdf.pages[page_nr]
        text = page.extract_text()
        print(text)
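
The same idea can be extended to every page instead of a fixed `range(2)`. This is only a sketch: the helper names `texts_from_pages` and `extract_all_pages` are hypothetical, and the `or ""` guard is an assumption to cover pages where `extract_text()` returns nothing (for example, scanned pages that would need OCR).

```python
from typing import Iterable, List


def texts_from_pages(pages: Iterable[object]) -> List[str]:
    """Return one string per page, using "" when a page yields no text."""
    return [(page.extract_text() or "") for page in pages]


def extract_all_pages(pdf_path: str) -> List[str]:
    """Open a PDF with pdfplumber and collect the text of each page."""
    # Imported lazily so texts_from_pages() works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return texts_from_pages(pdf.pages)


# Usage (hypothetical file name):
# for number, text in enumerate(extract_all_pages("file.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```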