# pdf

Here are 2,840 public repositories matching this topic...

microsoft / markitdown

Python tool for converting files and office documents to Markdown.

markdown pdf openai microsoft-office autogen langchain autogen-extension

Updated Sep 8, 2025
Python

opendatalab / MinerU

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Sep 29, 2025
Python

docling-project / docling

Get your documents ready for gen AI

html markdown pdf ai convert xlsx pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Oct 7, 2025
Python

paperless-ngx / paperless-ngx

A community-supported supercharged document management system: scan, index and archive all your documents

pdf machine-learning django angular ocr archiving dms document-management optical-character-recognition document-management-system

Updated Oct 7, 2025
Python

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

python pdf ocr image-processing tesseract

Updated Sep 23, 2025
Python

Byaidu / PDFMathTranslate

PDF scientific paper translation with preserved formats - AI-based full-text bilingual translation of PDF documents with the layout fully preserved; supports services such as Google/DeepL/Ollama/OpenAI and provides CLI/GUI/MCP/Docker/Zotero

python pdf latex translation math mcp japanese english openai translate document chinese edit modify russian korean zotero obsidian pdf2zh

Updated Oct 6, 2025
Python

h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

pdf ai embeddings private gpt generative llm chatgpt gpt4all vectorstore privategpt llama2 mixtral

Updated May 25, 2025
Python

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

python pdf help-wanted pdf-documents pypdf2 pdf-manipulation pdf-parsing pdf-parser

Updated Oct 6, 2025
Python

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

pdf pdf-parsing table-extraction

Updated Jul 20, 2025
Python

Kozea / WeasyPrint

The awesome document factory

css python html pdf converter weasyprint

Updated Oct 1, 2025
Python

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

python pdf font data-science ocr tesseract epub mupdf text-processing pdf-documents extract-data table-extraction text-shaping xps pymupdf

Updated Oct 7, 2025
Python

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

python pdf parser ocr pdf-converter document-analysis pdf-parser layout-analysis vlm-ocr

Updated Sep 30, 2025
Python

QuivrHQ / MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

pdf parser powerpoint docx llm

Updated Feb 21, 2025
Python

pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

python pdf parser

Updated May 6, 2025
Python

pdfarranger / pdfarranger

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

linux pdf gtk python3 gtk3

Updated Oct 5, 2025
Python

atlanhq / camelot

Camelot: PDF Table Extraction for Humans

pdf table extract for-humans

Updated Jan 5, 2023
Python

borb-pdf / borb

borb is a library for reading, creating and manipulating PDF files in python.

python pdf library sdk typesetting pdf-converter python3 pdf-conversion pdf-generation pdf-library

Updated Oct 4, 2025
Python

jonaslejon / malicious-pdf

💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

python pdf scanner penetration-testing pentesting bugbounty pdf-generation redteaming redteam penetration-test pentesting-tools bugbounty-tool penetrationtesting

Updated Jul 3, 2025
Python

oomol-lab / pdf-craft

PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.

pdf ocr ai document

Updated Sep 26, 2025
Python

caj2pdf / caj2pdf

Convert CAJ (China Academic Journals, CNKI) files to PDF. Conversion is best-effort; whether it succeeds is a matter of luck.

python pdf python3 cnki caj

Updated Mar 20, 2024
Python
