Python tool for converting files and office documents to Markdown.
- Updated
Sep 8, 2025 - Python
Python tool for converting files and office documents to Markdown.
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
Get your documents ready for gen AI
A community-supported supercharged document management system: scan, index and archive all your documents
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
PDF scientific paper translation with preserved formats - 基于 AI 完整保留排版的 PDF 文档全文双语翻译,支持 Google/DeepL/Ollama/OpenAI 等服务,提供 CLI/GUI/MCP/Docker/Zotero
Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
borb is a library for reading, creating and manipulating PDF files in python.
💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
Add a description, image, and links to the pdf topic page so that developers can more easily learn about it.
To associate your repository with the pdf topic, visit your repo's landing page and select "manage topics."