Posts: 22 Threads: 5 Joined: Dec 2022 I have a pdf which has data in tabular format and has 6 columns but the columns are not separated by boundaries so when I extract the data using pdfplumber, all the data comes in one cell only and I want in separate cells. How could I do that? For your reference: Quote: 15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr 11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr 11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr Thanks in advance Posts: 1,210 Threads: 146 Joined: Jul 2017 Looks like each line of your data is separated by a space. If that is so, just use .split() on each line to get a list of each item of data in the line. The list can then be written to Excel or, I remember Menator mentioned a module called tabulate which formats text data nicely! Posts: 22 Threads: 5 Joined: Dec 2022 I thought of the same thing, but some lines contain more data and some lines contain less. Posts: 741 Threads: 122 Joined: Dec 2017 Which method do you use to extract the data from a page? Paul It is more important to do the right thing, than to do the thing right.(P.Drucker) Better is the enemy of good. (Montesquieu) = French version for 'kiss'. Posts: 22 Threads: 5 Joined: Dec 2022 Quote:table = pdf.pages[0].extract_table() df = pd.DataFrame(table[1:]) Posts: 1,210 Threads: 146 Joined: Jul 2017 What exactly is table? Actual text? Bytes? Or some kind of pdfplumber object? Are column headers included in table? Posts: 22 Threads: 5 Joined: Dec 2022 Posts: 1,210 Threads: 146 Joined: Jul 2017 I'm starting to understand your problem! Using reportlab I made a very simple pdf with 1 table of 4 rows and 5 columns. The table had borders. The data for the table looks like this: Quote:data= [['00', '', '02', '', '04'], ['10', '', '12', '', '14'], ['20', '', '22', '', '24'], ['30', '', '32', '', '34']] As you can see, the columns 2 and 4 contain no data, just '' Using this, I got the data from the table: import pdfplumber # Parse pdf with pdfplumber.open(path2pdf + savename2) as pdf: # Get the first page of the object page = pdf.pages[0] # Get the text data of the page text = page.extract_text() # Get all the tabular data of this page tables = page.extract_tables() # Traversing table for t_index in range(len(tables)): table = tables[t_index] # Traversing each row of data for data in table: print(data) Which gives this as output, you can see the empty values, so they should be in your pandas df as NaN, I suppose. Output: ['00', '', '02', '', '04'] ['10', '', '12', '', '14'] ['20', '', '22', '', '24'] ['30', '', '32', '', '34']
But when I make a table with no borders, there is no table, just text: Quote:>>> tables [] >>> text '00 02 04\n11 13\n20 22 23 24\n30 31 32 34' >>> Now how to know which columns in which rows have nothing in them?? Think think! Posts: 22 Threads: 5 Joined: Dec 2022 I also tried Tabula, but I had no luck. Posts: 1,210 Threads: 146 Joined: Jul 2017 Try this then look at table! import pdfplumber import re path2pdf = '/home/pedro/myPython/reportlab_python/my_pdfs/' savename1 = "simple_table1.pdf" # no borders pdf = pdfplumber.open(path2pdf + savename1) for page in pdf.pages: print(page.extract_text()) for pdf_table in page.extract_tables(): table = [] cells = [] for row in pdf_table: if not any(row): # If a line is all empty, it is considered to be the end of a record if any(cells): table.append(cells) cells = [] elif all(row): # If the line is not empty, then this line is a new line, the last one ends if any(cells): table.append(cells) cells = [] table.append(row) else: if len(cells) == 0: cells = row else: for i in range(len(row)): if row[i] is not None: cells[i] = row[i] if cells[i] is None else cells[i] + row[i] for row in table: print([re.sub(' +', ' ', cell) if cell is not None else None for cell in row]) #print([re.sub(' +', ' ', cell) if cell is not None else ' ' for cell in row]) print('---------- Dividing line ----------') pdf.close()Output: >>> table [['00', '', '02', '', '04'], ['10', '', '12', '', '14'], ['20', '', '22', '', '24'], ['30', '', '32', '', '34']] >>>
Then you can put table in pandas, I hope! |