Python Forum
pdfminer to csv
#1
Dear Python users,
I am currently learning Python and am using Python 3.
I am trying to convert several pdf files into one csv file.
pdfminer seems to be the best package for converting pdfs. Here is the code that I have written so far:
import io
import os

from IPython.core.display import display
from pdfminer3.converter import TextConverter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfpage import PDFPage
import pandas as pd
# possibly necessary to convert into csv
import csv

# Setting up the pdf file for processing and extracting the text in it into a string
resource_manager = PDFResourceManager()
out_text = io.StringIO()
# The converter writes into out_text (the original passed an undefined name, filehandler)
converter = TextConverter(resource_manager, out_text, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

def searchpdf():
    pathextension = r'/where I have the pdfs saved'  # Folder where all the files are stored
    for path in os.listdir(pathextension):
        full_path = os.path.join(pathextension, path)
        # Checks the folder and then the extension of the file
        if os.path.isfile(full_path) and os.path.splitext(path)[1] == ".pdf":
            # Opens each path and associated pdf file
            with open(full_path, 'rb') as searchfullpdf:
                # Running scans over each file
                for page in PDFPage.get_pages(searchfullpdf, caching=True, check_extractable=True):
                    page_interpreter.process_page(page)
                textfound = out_text.getvalue()  # Returns the values found in each file
My question is how to continue from here so that the results are saved to a csv file. Adding:
input_file = csv.DictReader(open("pdfdata.csv"))
does not work and seems too trivial.
My objective is to obtain a csv file that looks like:
file; text;
file-xxx; Here is some information;
file-yyy; Here is more information;
...

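To write the names of the files into the csv, I have this code:
# 'w' creates output.csv if it does not exist; newline='' is what the csv module expects
with open("C:/Users/mydirectory/output.csv", 'w', newline='') as f:
    w = csv.writer(f)
    for path, dirs, files in os.walk("C:/Users/mydirectory"):
        for filename in files:
            w.writerow([filename])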
#2
csv.DictReader does work, very well actually, but it takes its field names from a header record (the first record in the file) unless you pass fieldnames explicitly. You can add that record yourself.
Simply make sure all columns are included, and use the same format (with the same delimiter) as the rest of the file,
like:
"id","code","local_code","name","continent","iso_country","wikipedia_link","keywords"
Of course, replace these with your own column names, and replace the comma with your delimiter (if different).

So the first entry would be something like:
myfile.write(f'"id","code","local_code","name","continent","iso_country","wikipedia_link","keywords"\n')
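A minimal sketch of how csv.DictReader uses that header record, assuming the two columns (file, text) and the semicolon delimiter from the first post. The io.StringIO object only stands in for an opened csv file so the example runs on its own. Note that DictReader reads an existing file; csv.DictWriter is its counterpart for writing one.

```python
import csv
import io

# Stand-in for an opened csv file; the first line is the header record.
sample = io.StringIO(
    "file;text\n"
    "file-xxx;Here is some information\n"
    "file-yyy;Here is more information\n"
)

# DictReader takes the column names from the first record
# and returns every following record as a dict keyed by those names.
reader = csv.DictReader(sample, delimiter=";")
rows = list(reader)

print(reader.fieldnames)
for row in rows:
    print(row["file"], "->", row["text"])
```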
#3
Larz60+, thank you for your comment. I apologize, but I still couldn't figure out a complete solution for my code. The problem is that I am at a very early beginner stage.
I took a second look at csv.DictWriter and wrote the following (to transfer the pdf results into the csv):
name_of_output_file = "/where output is saved/output.csv"
with open(name_of_output_file, 'w') as csvfile:
    fieldnames = ['text']
    writer = csv.DictWriter(textfound, fieldnames=fieldnames)
    writer.writeheader()
But my output.csv file is still empty.
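Two things stand out in the snippet above: csv.DictWriter expects the opened file object (csvfile) as its first argument, not the extracted text, and only the header is written, never a row. The following is a minimal sketch of one way to produce the file;text layout from the first post, not a definitive solution. The helper names pdf_to_text, collect_rows and write_rows are hypothetical, and pdf_to_text assumes the pdfminer3 classes behave as they are used in the first post. A fresh StringIO is created per file so the text of one pdf does not run into the next.

```python
import csv
import io
import os


def pdf_to_text(pdf_path):
    """Return the text of one pdf (hypothetical helper, assumes pdfminer3)."""
    # Imported here so the csv helpers below also work without pdfminer3.
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    out_text = io.StringIO()  # fresh buffer for every file
    converter = TextConverter(resource_manager, out_text, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(pdf_path, "rb") as pdf_file:
        for page in PDFPage.get_pages(pdf_file, caching=True, check_extractable=True):
            interpreter.process_page(page)
    converter.close()
    return out_text.getvalue()


def collect_rows(folder, extract=pdf_to_text):
    """Build one {'file': ..., 'text': ...} dict per pdf found in folder."""
    rows = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path) and os.path.splitext(name)[1].lower() == ".pdf":
            # Collapse line breaks so each pdf stays on a single csv line.
            text = " ".join(extract(full_path).split())
            rows.append({"file": os.path.splitext(name)[0], "text": text})
    return rows


def write_rows(rows, csv_path):
    """Write the header and all rows, separated by semicolons."""
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["file", "text"], delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
```

Usage would then be something like write_rows(collect_rows(r'/where I have the pdfs saved'), "/where output is saved/output.csv"). Passing the extractor as an argument is a design choice that lets the csv part be tried out with a dummy function before pdfminer3 is involved.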