Python Forum
Extracting Data into Columns using pdfplumber
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extracting Data into Columns using pdfplumber
#1
I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data using pdfplumber, all the data comes in one cell only and I want in separate cells.

How could I do that?

For your reference:
Quote:
15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance
Reply
#2
Looks like each line of your data is separated by a space.

If that is so, just use .split() on each line to get a list of each item of data in the line.

The list can then be written to Excel or, I remember Menator mentioned a module called tabulate which formats text data nicely!
Reply
#3
I thought of the same thing, but some lines contain more data and some lines contain less.
Reply
#4
Which method do you use to extract the data from a page?
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#5
Quote:table = pdf.pages[0].extract_table()
df = pd.DataFrame(table[1:])
Reply
#6
What exactly is table? Actual text? Bytes? Or some kind of pdfplumber object?

Are column headers included in table?
Reply
#7
[Image: 1.png]
Reply
#8
I'm starting to understand your problem!

Using reportlab I made a very simple pdf with 1 table of 4 rows and 5 columns. The table had borders.

The data for the table looks like this:

Quote:data= [['00', '', '02', '', '04'],
['10', '', '12', '', '14'],
['20', '', '22', '', '24'],
['30', '', '32', '', '34']]

As you can see, the columns 2 and 4 contain no data, just ''

Using this, I got the data from the table:

import pdfplumber # Parse pdf with pdfplumber.open(path2pdf + savename2) as pdf: # Get the first page of the object page = pdf.pages[0] # Get the text data of the page text = page.extract_text() # Get all the tabular data of this page tables = page.extract_tables() # Traversing table for t_index in range(len(tables)): table = tables[t_index] # Traversing each row of data for data in table: print(data)
Which gives this as output, you can see the empty values, so they should be in your pandas df as NaN, I suppose.

Output:
['00', '', '02', '', '04'] ['10', '', '12', '', '14'] ['20', '', '22', '', '24'] ['30', '', '32', '', '34']
But when I make a table with no borders, there is no table, just text:

Quote:>>> tables
[]
>>> text
'00 02 04\n11 13\n20 22 23 24\n30 31 32 34'
>>>

Now how to know which columns in which rows have nothing in them??

Think think!
Reply
#9
I also tried Tabula, but I had no luck.
Reply
#10
Try this then look at table!

import pdfplumber import re path2pdf = '/home/pedro/myPython/reportlab_python/my_pdfs/' savename1 = "simple_table1.pdf" # no borders pdf = pdfplumber.open(path2pdf + savename1) for page in pdf.pages: print(page.extract_text()) for pdf_table in page.extract_tables(): table = [] cells = [] for row in pdf_table: if not any(row): # If a line is all empty, it is considered to be the end of a record if any(cells): table.append(cells) cells = [] elif all(row): # If the line is not empty, then this line is a new line, the last one ends if any(cells): table.append(cells) cells = [] table.append(row) else: if len(cells) == 0: cells = row else: for i in range(len(row)): if row[i] is not None: cells[i] = row[i] if cells[i] is None else cells[i] + row[i] for row in table: print([re.sub(' +', ' ', cell) if cell is not None else None for cell in row]) #print([re.sub(' +', ' ', cell) if cell is not None else ' ' for cell in row]) print('---------- Dividing line ----------') pdf.close()
Output:
>>> table [['00', '', '02', '', '04'], ['10', '', '12', '', '14'], ['20', '', '22', '', '24'], ['30', '', '32', '', '34']] >>>
Then you can put table in pandas, I hope!
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  extracting data from a user-completed fillable pdf Perry 2 1,541 Sep-25-2025, 01:49 PM
Last Post: DeaD_EyE
  Extracting data from bank statement PDFs (Accountant) a4avinash 4 15,378 Feb-27-2025, 01:53 PM
Last Post: griffinhenry
  Confused by the different ways of extracting data in DataFrame leea2024 1 1,304 Aug-17-2024, 01:34 PM
Last Post: deanhystad
  Extracting the correct data from a CSV file S2G 6 2,948 Jun-03-2024, 04:50 PM
Last Post: snippsat
  How to properly format rows and columns in excel data from parsed .txt blocks jh67 7 5,263 Dec-12-2022, 08:22 PM
Last Post: jh67
  How to keep columns header on excel without change after export data to excel file? ahmedbarbary 0 2,373 May-03-2022, 05:46 PM
Last Post: ahmedbarbary
  Extracting Data from tables DataExtrator 0 2,120 Nov-02-2021, 12:24 PM
Last Post: DataExtrator
  Merging spreadsheets with the same columns and extracting rows with matching entries johnbernard 3 14,875 Aug-19-2021, 03:08 PM
Last Post: johnbernard
  extracting data ajitnayak1987 1 2,550 Jul-29-2021, 06:13 AM
Last Post: bowlofred
  Extracting and printing data ajitnayak1987 0 2,224 Jul-28-2021, 09:30 AM
Last Post: ajitnayak1987

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020
This forum uses Lukasz Tkacz MyBB addons.
Forum use Krzysztof "Supryk" Supryczynski addons.