Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
- Updated Sep 29, 2025 - Python
PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Web Crawler/Spider for NodeJS + server-side jQuery ;-)
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Open-source platform for extracting structured data from documents using AI.
Crawly, a high-level web crawling & scraping framework for Elixir.
Extract structured data from websites. Website scraping.
A simple resume parser used for extracting information from resumes
Receipt scanner extracts information from your PDF or image receipts - built in NodeJS
Extract data from .trace documents generated by Instruments
Turns webpages into LLM-friendly input text. Similar to Firecrawl and Jina Reader API. Makes RAG, AI web scraping, and image & webpage link extraction easy.
An R package for acquisition and processing of NASA SMAP data
Extract data from HTML tables
Library and CLI for extracting data from HTML via CSS selectors
Extract colors from an image. Colors are grouped based on visual similarities using the CIE76 formula.
FBLYZE is a Facebook scraping and analysis system.
Get lyrics for any song in less than 2 seconds by passing in the song name (spelled correctly or misspelled) using this Python library.
Extract and parse structured data with jQuery Selector, XPath, or JsonPath from common web formats like HTML, XML, and JSON.
This program extracts insider trading data from the SEC website and stores it in an Excel file for the specified time frame.