Python Forum
Extracting data from bank statement PDFs (Accountant)
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Extracting data from bank statement PDFs (Accountant)
#1
Hi everyone,

I'm a complete beginner to Python and I'm facing challenges extracting data from bank statement PDFs. I'm trying to identify transaction details like dates, descriptions, withdrawals, and deposits.

I've been using AI tools to generate Python scripts, but they've been inconsistent in accurately determining withdrawal and deposit amounts. I'm struggling with ambiguous cases where the transaction type isn't clearly stated.

I'm hoping to get some guidance on how to refine the AI-generated scripts or explore alternative Python approaches. I understand that many forums discourage AI-generated content, but I'm hoping for some guidance since I'm struggling with this task. I'm not looking to become a Python expert, but I do need to automate this process to be more efficient.

Any suggestions or resources would be greatly appreciated! Even if you can't provide a complete solution, any pointers or general advice would be tremendously helpful!

Attached Sample Files

Attached Files

.xlsx   Book1 (Output How I want).xlsx (Size: 9.66 KB / Downloads: 299)
.pdf   SAMPLE.pdf (Size: 168.25 KB / Downloads: 595)
.csv   ExtractedData.csv (Size: 199.77 KB / Downloads: 268)
.py   pdf-to-csv-extractor-v4.py (Size: 3.17 KB / Downloads: 364)
Reply
#2
Get the text like this:

import pymupdf import re path2pdf = '/home/pedro/Downloads/BANK_SAMPLE.pdf' savepath = '/home/pedro/Downloads/BANK_SAMPLE.text' # there are no tables in the pdf # below does not return anything ##for page in doc: ## tabs = page.find_tables() # the pdf is text and has no images # image blocks are type 1 # below returns an empty list ##d = page.get_text("dict") # big because contains images ##blocks = d["blocks"] ##imgblocks = [b for b in blocks if b["type"] == 1] # read all text in the document with pymupdf.open(path2pdf) as doc: # open document text = chr(12).join([page.get_text() for page in doc]) # write as a binary file to support non-ASCII characters with open(savepath, 'wb') as outfile: outfile.write(text.encode()) with open(savepath, 'r') as infile: data = infile.readlines() # copy and paste from the extracted text to get the headers headers = 'DATE PARTICULARS CHQ.NO. WITHDRAWALS DEPOSITS BALANCE'.split() # make a headers row for the output csv headers_string = ','.join(headers) # find all lines that start with a date in format xx-xx-xxspace e = re.compile(r'(\d\d-\d\d-\d\d\s)') for i in range(len(data)): res = e.match(data[i]) if res: print(f'line number is {i}, date is {res.group(1)}') 
You can now get everything from the line number where a date starts to the line above where the next data starts.

The data is left aligned in columns, but it there is no actual table. You can get the position of the columns from one complete line, or from one line with WITHDRAWALS and another line with DEPOSITS.

Something along these lines should do the trick!
a4avinash likes this post
Reply
#3
The main issue with the provided output CSV file is that the CREDIT and DEBIT columns are not being populated correctly. The values in these columns should reflect the actual credit and debit amounts for each transaction, but the current output shows all zeros.
This indicates that the script is not able to correctly identify and assign the credit and debit values based on the information in the "Particulars" (Transaction Reference) column of the bank statement.


import pandas as pd import PyPDF2 import re import os from datetime import datetime print("Script started") def extract_text_from_pdf(pdf_path): print(f"Attempting to read PDF: {pdf_path}") try: with open(pdf_path, 'rb') as file: reader = PyPDF2.PdfReader(file) text = '' for page in reader.pages: text += page.extract_text() print(f"Successfully extracted {len(text)} characters from PDF") return text except Exception as e: print(f"Error reading PDF: {e}") return None def extract_transaction_data(pdf_content): print("Extracting transaction data...") if not pdf_content: print("No PDF content to process") return None pattern = r'(\d{2}-\d{2}-\d{2})\s+(.*?)\s+(\d+\.\d{2}|\-)\s+(\d+\.\d{2}|\-)' matches = re.findall(pattern, pdf_content) print(f"Found {len(matches)} transaction entries") data = [] for match in matches: date, particulars, credit, debit = match credit = 0.0 if credit == '-' else float(credit) debit = 0.0 if debit == '-' else float(debit) # Identify the transaction type and assign credit/debit accordingly if particulars.startswith('UPI/DR/'): credit = 0.0 debit = debit elif particulars.startswith('CR_DR'): credit = credit debit = 0.0 else: credit = credit debit = debit data.append([date, particulars, credit, debit]) df = pd.DataFrame(data, columns=['Date', 'Particulars', 'CREDIT', 'DEBIT']) df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y') print(f"Created DataFrame with {len(df)} rows") return df def process_bank_statement(input_path): print(f"Processing PDF: {input_path}") pdf_content = extract_text_from_pdf(input_path) if not pdf_content: return print("First 500 characters of PDF content:") print(pdf_content[:500]) df = extract_transaction_data(pdf_content) if df is None or df.empty: print("No transaction data found in the PDF.") return output_folder = os.path.dirname(input_path) output_path = os.path.join(output_folder, 'processed_bank_statement.csv') df.to_csv(output_path, index=False) print(f"Processed data has been saved to: {output_path}") print("\nFirst few rows of the processed data:") print(df.head().to_string(index=False)) def main(): print("Bank Statement Processor") print("========================") default_path = r"C:\Users\csc\Desktop\Ashok SBI\apr23.pdf" if os.path.exists(default_path): print(f"Found default PDF file: {default_path}") input_path = default_path else: print("Default PDF file not found.") input_path = input("Enter the full path to your PDF file: ").strip('"') if not os.path.exists(input_path): print(f"Error: File not found - {input_path}") input("Press Enter to exit...") return process_bank_statement(input_path) print("Processing complete.") input("Press Enter to exit...") if __name__ == "__main__": try: main() except Exception as e: print(f"An error occurred: {e}") import traceback print(traceback.format_exc()) input("Press Enter to exit...") print("Script ended")
buran write Sep-10-2024, 06:41 AM:
New thread merged into old thread

Attached Files

.pdf   apr23 (pdf.io).pdf (Size: 123.2 KB / Downloads: 227)
.csv   processed_bank_statement.csv (Size: 5.52 KB / Downloads: 148)
Reply
#4
(Sep-08-2024, 08:33 AM)Pedroski55 Wrote: Get the text like this:

import pymupdf import re path2pdf = '/home/pedro/Downloads/BANK_SAMPLE.pdf' savepath = '/home/pedro/Downloads/BANK_SAMPLE.text' # there are no tables in the pdf # below does not return anything ##for page in doc: ## tabs = page.find_tables() # the pdf is text and has no images # image blocks are type 1 # below returns an empty list ##d = page.get_text("dict") # big because contains images ##blocks = d["blocks"] ##imgblocks = [b for b in blocks if b["type"] == 1] # read all text in the document with pymupdf.open(path2pdf) as doc: # open document text = chr(12).join([page.get_text() for page in doc]) # write as a binary file to support non-ASCII characters with open(savepath, 'wb') as outfile: outfile.write(text.encode()) with open(savepath, 'r') as infile: data = infile.readlines() # copy and paste from the extracted text to get the headers headers = 'DATE PARTICULARS CHQ.NO. WITHDRAWALS DEPOSITS BALANCE'.split() # make a headers row for the output csv headers_string = ','.join(headers) # find all lines that start with a date in format xx-xx-xxspace e = re.compile(r'(\d\d-\d\d-\d\d\s)') for i in range(len(data)): res = e.match(data[i]) if res: print(f'line number is {i}, date is {res.group(1)}') 
You can now get everything from the line number where a date starts to the line above where the next data starts.

The data is left aligned in columns, but it there is no actual table. You can get the position of the columns from one complete line, or from one line with WITHDRAWALS and another line with DEPOSITS.

Something along these lines should do the trick!

Thank you very much. The issue is resolved.
Reply
#5
import pymupdf # PyMuPDF import re # Define file paths path2pdf = '/home/pedro/Downloads/BANK_SAMPLE.pdf' savepath = '/home/pedro/Downloads/BANK_SAMPLE.text' # Open the PDF and extract text with pymupdf.open(path2pdf) as doc: text = chr(12).join([page.get_text() for page in doc]) # Save extracted text to a file with open(savepath, 'wb') as outfile: outfile.write(text.encode()) # Read the saved text file with open(savepath, 'r') as infile: data = infile.readlines() # Define headers for CSV output headers = 'DATE PARTICULARS CHQ.NO. WITHDRAWALS DEPOSITS BALANCE'.split() headers_string = ','.join(headers) # Regex pattern to find lines starting with a date (xx-xx-xx format) date_pattern = re.compile(r'^(\d{2}-\d{2}-\d{2})\s') # Process and print matching lines for i, line in enumerate(data): match = date_pattern.match(line) if match: print(f'Line number: {i}, Date: {match.group(1)}')
buran write Feb-27-2025, 05:27 PM:
Please, use proper tags when post code, traceback, output, etc. This time I have added tags for you.
See BBcode help for more info.
buran write Feb-27-2025, 05:26 PM:
Spam link removed
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  extracting data from a user-completed fillable pdf Perry 2 1,541 Sep-25-2025, 01:49 PM
Last Post: DeaD_EyE
  Confused by the different ways of extracting data in DataFrame leea2024 1 1,304 Aug-17-2024, 01:34 PM
Last Post: deanhystad
  Extracting the correct data from a CSV file S2G 6 2,948 Jun-03-2024, 04:50 PM
Last Post: snippsat
  Comparing PDFs CaseCRS 5 5,870 Apr-01-2023, 05:46 AM
Last Post: DPaul
  Extracting Data into Columns using pdfplumber arvin 17 36,823 Dec-17-2022, 11:59 AM
Last Post: arvin
  Extracting Data from tables DataExtrator 0 2,120 Nov-02-2021, 12:24 PM
Last Post: DataExtrator
  extracting data ajitnayak1987 1 2,552 Jul-29-2021, 06:13 AM
Last Post: bowlofred
  Extracting and printing data ajitnayak1987 0 2,224 Jul-28-2021, 09:30 AM
Last Post: ajitnayak1987
  Extracting unique pairs from a data set based on another value rybina 2 3,613 Feb-12-2021, 08:36 AM
Last Post: rybina
Thumbs Down extracting data/strings from Word doc mikkelibsen 1 3,066 Feb-10-2021, 11:06 AM
Last Post: Larz60+

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020
This forum uses Lukasz Tkacz MyBB addons.
Forum use Krzysztof "Supryk" Supryczynski addons.