Top 23 Python PDF Projects
- Project mention: Show HN: MarkdownConverters – Convert any file format to clean Markdown | news.ycombinator.com | 2025-10-20
-
MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
Project mention: Show HN: OCR Arena – A playground for OCR models | news.ycombinator.com | 2025-11-24
cool UI and lets anyone upload a doc. but lacks https://github.com/opendatalab/mineru
- Project mention: Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction | news.ycombinator.com | 2025-12-18
-
paperless-ngx
A community-supported supercharged document management system: scan, index and archive all your documents
Borg Backup - I use it to automatically back up my main hosted Docker services. I have publicly hosted instances of Immich and Paperless-NGX running in Docker containers. I periodically back up their data folders using Borg and store the backups in a Borg repo. The advantage of a Borg repo is that Borg is a deduplicating archival program: no matter how many backups you make, they take no more space than the first backup, provided nothing has changed. If something has changed, only the changed chunk is backed up, much like git. You can also easily encrypt and/or compress while backing up. Restoring a backup is as easy as running a single Borg command.
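The deduplication idea in the comment above can be illustrated with a toy content-addressed chunk store. This is only a sketch under simplifying assumptions: `ChunkStore`, `CHUNK_SIZE`, and the fixed-size chunking are hypothetical and are not Borg's format or API (Borg itself uses variable-size, content-defined chunks).

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow


class ChunkStore:
    """Toy content-addressed store; NOT Borg's actual format or chunker."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def backup(self, data: bytes) -> List[str]:
        """Store data and return the chunk ids needed to rebuild it."""
        ids = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already stored costs no additional space.
            self.chunks.setdefault(chunk_id, chunk)
            ids.append(chunk_id)
        return ids

    def restore(self, ids: List[str]) -> bytes:
        """Rebuild the original bytes from a list of chunk ids."""
        return b"".join(self.chunks[i] for i in ids)

    def stored_bytes(self) -> int:
        """Total size of the unique chunks held by the store."""
        return sum(len(c) for c in self.chunks.values())
```

Backing up the same bytes twice leaves `stored_bytes()` unchanged, and changing one chunk adds only that chunk, which is the behaviour the comment describes.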
-
h2ogpt
Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
-
PyPDF2
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
The following ResumeService extracts the content from a PDF using pypdf
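The `ResumeService` from the linked post is not reproduced on this page, so the following is only a minimal sketch of what such a service might look like. It assumes pypdf's `PdfReader` and its per-page `extract_text()`; the class layout and the `open_pdf` parameter are hypothetical and exist so the service can be tried without a real PDF.

```python
from typing import Any, Callable, Optional


def _open_with_pypdf(path: str) -> Any:
    # Imported lazily so the class can be exercised without pypdf installed.
    from pypdf import PdfReader

    return PdfReader(path)


class ResumeService:
    """Hypothetical service that pulls plain text out of a PDF resume."""

    def __init__(self, open_pdf: Optional[Callable[[str], Any]] = None) -> None:
        # open_pdf must return an object with a `pages` iterable whose items
        # offer extract_text(), which is what pypdf's PdfReader provides.
        self._open_pdf = open_pdf or _open_with_pypdf

    def extract_text(self, path: str) -> str:
        reader = self._open_pdf(path)
        # Treat a page with no extractable text as an empty string.
        parts = [(page.extract_text() or "").strip() for page in reader.pages]
        # Join the non-empty pages with a blank line between them.
        return "\n\n".join(part for part in parts if part)
```

With pypdf installed, `ResumeService().extract_text("resume.pdf")` would read a real file (the file name is a placeholder). Scanned, image-only pages generally yield no text without a separate OCR step.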
-
pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
Project mention: MegaParse: Your One-Stop Solution for Effortless Document Parsing | dev.to | 2025-02-23
-
pdfarranger
Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.
-
pdf-craft
PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.
Project mention: PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR | news.ycombinator.com | 2025-12-21
-
malicious-pdf
💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
-
text-extract-api
Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown
Project mention: PDF Extract API Using Ollama with Anonymization and PII Removal | news.ycombinator.com | 2025-01-07
-
pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
Python PDF discussion
Python PDF related posts
-
PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR
-
Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction
-
Docling
-
Show HN: OCR Arena – A playground for OCR models
-
Testing the Unofficial Docling Hierarchical PDF Processor
-
Stop Fighting PDF Forms: Automate Everything with PyPDFForm
-
Show HN: My open-source project PdfDing is receiving a grant
Index
What are some of the best open-source PDF projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | markitdown | 84,406 |
| 2 | MinerU | 50,750 |
| 3 | docling | 47,414 |
| 4 | paperless-ngx | 35,122 |
| 5 | OCRmyPDF | 32,049 |
| 6 | h2ogpt | 11,976 |
| 7 | PyPDF2 | 9,680 |
| 8 | pdfplumber | 9,347 |
| 9 | PyMuPDF | 8,710 |
| 10 | WeasyPrint | 8,459 |
| 11 | MegaParse | 7,249 |
| 12 | pdfminer.six | 6,832 |
| 13 | pdfarranger | 4,977 |
| 14 | pdf-craft | 4,101 |
| 15 | malicious-pdf | 3,559 |
| 16 | borb | 3,550 |
| 17 | Camelot | 3,550 |
| 18 | text-extract-api | 2,961 |
| 19 | Papermerge | 2,837 |
| 20 | pikepdf | 2,554 |
| 21 | xhtml2pdf | 2,363 |
| 22 | tabula-py | 2,302 |
| 23 | pdftabextract | 2,253 |