Top 23 Python PDF Projects

markitdown

Rank 1 | Mentions 18 | Stars 84,406 | Activity 8.7 | Python

Python tool for converting files and office documents to Markdown.

Project mention: Show HN: MarkdownConverters – Convert any file format to clean Markdown | news.ycombinator.com | 2025-10-20
MinerU

Rank 2 | Mentions 6 | Stars 50,750 | Activity 9.9 | Python

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

Project mention: Show HN: OCR Arena – A playground for OCR models | news.ycombinator.com | 2025-11-24

cool UI and lets anyone upload a doc. but lacks https://github.com/opendatalab/mineru
docling

Rank 3 | Mentions 48 | Stars 47,414 | Activity 9.7 | Python

Get your documents ready for gen AI

Project mention: Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction | news.ycombinator.com | 2025-12-18
paperless-ngx

Rank 4 | Mentions 217 | Stars 35,122 | Activity 9.9 | Python

A community-supported supercharged document management system: scan, index and archive all your documents

Project mention: Review for Synology DiskStation DS925+: A feature-packed NAS | dev.to | 2025-10-30

Borg Backup - I use it to automatically back up my main hosted Docker services. I have publicly hosted instances of Immich and Paperless-NGX running in Docker containers. I periodically back up their data folders using Borg and store the backups in a Borg repo. The advantage of a Borg repo is that Borg is a deduplicating archival program: no matter how many backups you make, they take no more space than the first backup, provided nothing has changed. If something changes, only the changed chunk is backed up, much like git. You can also easily encrypt and/or compress while backing up. Restoring a backup is as easy as running a single Borg command.
OCRmyPDF

Rank 5 | Mentions 87 | Stars 32,049 | Activity 8.7 | Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Project mention: Llama-Scan: Convert PDFs to Text W Local LLMs | news.ycombinator.com | 2025-08-17
h2ogpt

Rank 6 | Mentions 31 | Stars 11,976 | Activity 6.9 | Python

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
PyPDF2

Rank 7 | Mentions 32 | Stars 9,680 | Activity 9.5 | Python

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Project mention: AI-Powered Cover Letter Generator | dev.to | 2025-10-24

The following ResumeService extracts the content from a PDF using pypdf
pdfplumber

Rank 8 | Mentions 31 | Stars 9,347 | Activity 7.7 | Python

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
PyMuPDF

Rank 9 | Mentions 8 | Stars 8,710 | Activity 9.7 | Python

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
WeasyPrint

Rank 10 | Mentions 56 | Stars 8,459 | Activity 9.7 | Python

The awesome document factory

Project mention: WeasyPrint | news.ycombinator.com | 2025-10-12
MegaParse

Rank 11 | Mentions 5 | Stars 7,249 | Activity 9.4 | Python

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

Project mention: MegaParse: Your One-Stop Solution for Effortless Document Parsing | dev.to | 2025-02-23
pdfminer.six

Rank 12 | Mentions 14 | Stars 6,832 | Activity 7.2 | Python

Community maintained fork of pdfminer - we fathom PDF
pdfarranger

Rank 13 | Mentions 93 | Stars 4,977 | Activity 8.2 | Python

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.
pdf-craft

Rank 14 | Mentions 3 | Stars 4,101 | Activity 9.0 | Python

PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.

Project mention: PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR | news.ycombinator.com | 2025-12-21
malicious-pdf

Rank 15 | Mentions 13 | Stars 3,559 | Activity 5.3 | Python

💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
borb

Rank 16 | Mentions 66 | Stars 3,550 | Activity 8.9 | Python

borb is a library for reading, creating and manipulating PDF files in python.
Camelot

Rank 17 | Mentions 10 | Stars 3,550 | Activity 7.9 | Python

A Python library to extract tabular data from PDFs
text-extract-api

Rank 18 | Mentions 2 | Stars 2,961 | Activity 7.7 | Python

Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown

Project mention: PDF Extract API Using Ollama with Anonymization and PII Removal | news.ycombinator.com | 2025-01-07
Papermerge

Rank 19 | Mentions 33 | Stars 2,837 | Activity 4.2 | Python

Open Source Document Management System for Digital Archives (Scanned Documents)
pikepdf

Rank 20 | Mentions 4 | Stars 2,554 | Activity 9.3 | Python

A Python library for reading and writing PDF, powered by QPDF
xhtml2pdf

Rank 21 | Mentions 2 | Stars 2,363 | Activity 5.4 | Python

A library for converting HTML into PDFs using ReportLab
tabula-py

Rank 22 | Mentions 4 | Stars 2,302 | Activity 6.8 | Python

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
pdftabextract

Rank 23 | Mentions 1 | Stars 2,253 | Activity 0.0 | Python

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PDF related posts

PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR

2 projects | news.ycombinator.com | 21 Dec 2025
Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction

1 project | news.ycombinator.com | 18 Dec 2025
Docling

1 project | news.ycombinator.com | 14 Dec 2025
Show HN: OCR Arena – A playground for OCR models

2 projects | news.ycombinator.com | 24 Nov 2025
Testing the Unofficial Docling Hierarchical PDF Processor

2 projects | dev.to | 21 Nov 2025
Stop Fighting PDF Forms: Automate Everything with PyPDFForm

1 project | dev.to | 16 Nov 2025
Show HN: My open-source project PdfDing is receiving a grant

1 project | news.ycombinator.com | 27 Oct 2025

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#	Project	Stars
1	markitdown	84,406
2	MinerU	50,750
3	docling	47,414
4	paperless-ngx	35,122
5	OCRmyPDF	32,049
6	h2ogpt	11,976
7	PyPDF2	9,680
8	pdfplumber	9,347
9	PyMuPDF	8,710
10	WeasyPrint	8,459
11	MegaParse	7,249
12	pdfminer.six	6,832
13	pdfarranger	4,977
14	pdf-craft	4,101
15	malicious-pdf	3,559
16	borb	3,550
17	Camelot	3,550
18	text-extract-api	2,961
19	Papermerge	2,837
20	pikepdf	2,554
21	xhtml2pdf	2,363
22	tabula-py	2,302
23	pdftabextract	2,253
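
For readers who want to work with the index programmatically, here is a minimal sketch, assuming the table is available as tab-separated text. It parses a few rows copied from the index and checks that they follow the star ordering described in the note. The names `INDEX_TSV`, `parse_index`, and `is_ordered_by_stars` are hypothetical and are not part of any project listed here.

```python
import csv
import io

# A few rows copied from the index above (tab-separated: rank, project, stars).
# Only a sample is included to keep the sketch short.
INDEX_TSV = (
    "#\tProject\tStars\n"
    "1\tmarkitdown\t84,406\n"
    "2\tMinerU\t50,750\n"
    "3\tdocling\t47,414\n"
    "16\tborb\t3,550\n"
    "17\tCamelot\t3,550\n"
    "23\tpdftabextract\t2,253\n"
)


def parse_index(tsv_text):
    """Return a list of (rank, project, stars) tuples from the tab-separated index."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        stars = int(row["Stars"].replace(",", ""))  # "84,406" -> 84406
        rows.append((int(row["#"]), row["Project"], stars))
    return rows


def is_ordered_by_stars(rows):
    """True if star counts never increase as rank increases (ties are allowed)."""
    stars = [s for _, _, s in sorted(rows)]  # sort by rank first
    return all(a >= b for a, b in zip(stars, stars[1:]))


rows = parse_index(INDEX_TSV)
print(is_ordered_by_stars(rows))
```

Ties are treated as valid because two entries in the index (borb and Camelot) share the same star count.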
