Extract hyperlinks from PDF in Python

Extracting hyperlinks from a PDF file can be a bit tricky, but it's possible with the help of libraries such as PyMuPDF (also known as fitz), pdfplumber, or PyPDF2. Among these, PyMuPDF is quite powerful for working with PDF files, including extracting text, images, and links.

Here's an example of how you can extract hyperlinks from a PDF using PyMuPDF:

First, install the PyMuPDF library if you haven't already:

pip install pymupdf 

Then, you can use the following Python script to extract the URLs (note that the package is installed as pymupdf but imported under the name fitz):

import fitz  # PyMuPDF

def extract_hyperlinks(pdf_path):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    links = []

    # Iterate over each page
    for page_num in range(len(pdf_document)):
        # Get the page
        page = pdf_document[page_num]
        # Get the list of link dictionaries
        link_dict = page.get_links()
        for link in link_dict:
            uri = link.get("uri")
            if uri:
                links.append(uri)

    pdf_document.close()
    return links

# Specify the path to your PDF
pdf_path = 'your_pdf_file.pdf'
hyperlinks = extract_hyperlinks(pdf_path)

# Print the list of hyperlinks
for url in hyperlinks:
    print(url)

Replace 'your_pdf_file.pdf' with the path to your actual PDF file. This script opens the PDF, iterates through each page, and collects all hyperlinks into a list.
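If you would rather not use PyMuPDF, the same information can be read from the raw annotation dictionaries with PyPDF2 (or its successor, pypdf), one of the alternatives mentioned above. The following is only a minimal sketch, assuming a recent release that provides PdfReader; extract_hyperlinks_pypdf is just an illustrative name:

from PyPDF2 import PdfReader  # with pypdf: from pypdf import PdfReader

def extract_hyperlinks_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    links = []
    for page in reader.pages:
        # "/Annots" lists the page's annotations; it may be missing entirely
        for annot in page.get("/Annots") or []:
            obj = annot.get_object()
            action = obj.get("/A")
            # Link annotations store the target URL under /A -> /URI
            if action and "/URI" in action:
                links.append(action["/URI"])
    return links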

Note that in a PDF, hyperlinks are themselves stored as link annotations. PyMuPDF exposes these separately from other annotations: get_links() returns the hyperlinks (as above), while the page object's annots() method iterates over the remaining annotation types (comments, highlights, and so on) in much the same way, as sketched below.
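Here is a minimal sketch of walking those annotations with PyMuPDF; list_annotations is an illustrative name, and the fields worth reading will depend on the annotation types in your file:

import fitz  # PyMuPDF

def list_annotations(pdf_path):
    pdf_document = fitz.open(pdf_path)
    for page in pdf_document:
        for annot in page.annots():
            # annot.type identifies the annotation kind; annot.info holds
            # metadata such as title and content
            print(page.number, annot.type, annot.info)
    pdf_document.close()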

Please note that the structure of PDFs can be complex, and not every hyperlink can be extracted by automated tools, especially if it is embedded in an image or stored in a non-standard way. One common case is a URL that appears only as plain text rather than as a link annotation; such URLs can often be recovered by scanning the extracted page text, as sketched below.
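This is a rough sketch only, again using PyMuPDF's get_text() for text extraction; the regular expression is deliberately simple and may miss or over-match unusual URLs:

import re
import fitz  # PyMuPDF

URL_PATTERN = re.compile(r'https?://[^\s)>\]"]+')

def find_text_urls(pdf_path):
    pdf_document = fitz.open(pdf_path)
    urls = []
    for page in pdf_document:
        # Scan the page's plain text for URL-like strings
        urls.extend(URL_PATTERN.findall(page.get_text()))
    pdf_document.close()
    return urls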

