Python PDF

Open-source Python projects categorized as PDF

Top 23 Python PDF Projects

  1. markitdown

    Python tool for converting files and office documents to Markdown.

    Project mention: Show HN: MarkdownConverters – Convert any file format to clean Markdown | news.ycombinator.com | 2025-10-20
  3. MinerU

    Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

    Project mention: Show HN: OCR Arena – A playground for OCR models | news.ycombinator.com | 2025-11-24

    cool UI and lets anyone upload a doc. but lacks https://github.com/opendatalab/mineru

  4. docling

    Get your documents ready for gen AI

    Project mention: Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction | news.ycombinator.com | 2025-12-18
  5. paperless-ngx

    A community-supported supercharged document management system: scan, index and archive all your documents

    Project mention: Review for Synology DiskStation DS925+: A feature-packed NAS | dev.to | 2025-10-30

    Borg Backup - I use it to automatically back up my main hosted Docker services. I have publicly hosted instances of Immich and Paperless-NGX using Docker containers. I periodically make a backup of their data folder using Borg and store it in a Borg repo. The advantage of storing the backups in a Borg repo is that it is a deduplicating archival program. So no matter how many backups you make, they will not take any more space than the first backup, provided nothing has changed. If there is a change, only the changed chunk is backed up, just like git. Also, you can easily encrypt and/or compress while backing up. Restoring a backup is also as easy as running a single Borg command.

  6. OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: Llama-Scan: Convert PDFs to Text W Local LLMs | news.ycombinator.com | 2025-08-17
  7. h2ogpt

    Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

  8. PyPDF2

    A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

    Project mention: AI-Powered Cover Letter Generator | dev.to | 2025-10-24

    The following ResumeService extracts the content from a PDF using pypdf

  10. pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  11. PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  12. WeasyPrint

    The awesome document factory

    Project mention: WeasyPrint | news.ycombinator.com | 2025-10-12
  13. MegaParse

    File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

    Project mention: MegaParse: Your One-Stop Solution for Effortless Document Parsing | dev.to | 2025-02-23

  14. pdfminer.six

    Community maintained fork of pdfminer - we fathom PDF

  15. pdfarranger

    Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

  16. pdf-craft

    PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.

    Project mention: PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR | news.ycombinator.com | 2025-12-21
  17. malicious-pdf

    💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

  18. borb

    borb is a library for reading, creating and manipulating PDF files in python.

  19. Camelot

    A Python library to extract tabular data from PDFs

  20. text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown

    Project mention: PDF Extract API Using Ollama with Anonymization and PII Removal | news.ycombinator.com | 2025-01-07
  21. Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

  22. pikepdf

    A Python library for reading and writing PDF, powered by QPDF

  23. xhtml2pdf

    A library for converting HTML into PDFs using ReportLab

  24. tabula-py

    Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

  25. pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PDF related posts

  • PDF Craft – Open-Source PDF to eBook Converter Powered by DeepSeek-OCR

    2 projects | news.ycombinator.com | 21 Dec 2025
  • Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction

    1 project | news.ycombinator.com | 18 Dec 2025
  • Docling

    1 project | news.ycombinator.com | 14 Dec 2025
  • Show HN: OCR Arena – A playground for OCR models

    2 projects | news.ycombinator.com | 24 Nov 2025
  • Testing the Unofficial Docling Hierarchical PDF Processor

    2 projects | dev.to | 21 Nov 2025
  • Stop Fighting PDF Forms: Automate Everything with PyPDFForm

    1 project | dev.to | 16 Nov 2025
  • Show HN: My open-source project PdfDing is receiving a grant

    1 project | news.ycombinator.com | 27 Oct 2025

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#   Project           Stars
1   markitdown        84,406
2   MinerU            50,750
3   docling           47,414
4   paperless-ngx     35,122
5   OCRmyPDF          32,049
6   h2ogpt            11,976
7   PyPDF2             9,680
8   pdfplumber         9,347
9   PyMuPDF            8,710
10  WeasyPrint         8,459
11  MegaParse          7,249
12  pdfminer.six       6,832
13  pdfarranger        4,977
14  pdf-craft          4,101
15  malicious-pdf      3,559
16  borb               3,550
17  Camelot            3,550
18  text-extract-api   2,961
19  Papermerge         2,837
20  pikepdf            2,554
21  xhtml2pdf          2,363
22  tabula-py          2,302
23  pdftabextract      2,253

