Description
The main motivation is to make the output more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure compared to raw HTML, which is usually full of layout noise, scripts, and temporary attributes.
Basic behavior
- <h1> → #, <h2> → ##, <h3> → ###
- <p> → simple text line
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- Tables converted to Markdown, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored

How to handle sidebars, navigation blocks, and asides is not yet decided. Options: drop them entirely, append them at the bottom as “Notes,” or let the user configure them with include/exclude rules. Needs discussion.
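To make the proposed mappings concrete, here is a minimal sketch built only on Python's standard-library html.parser. It is not Pydoll's API and not the planned exporter: the names MarkdownSketch, html_to_markdown, and include_images are hypothetical, chosen for illustration. It covers headings, paragraphs, flat (non-nested) lists, links, optional images, and fenced code blocks, and it skips scripts and styles. Tables, nested lists, and sidebar handling are left out.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
SKIPPED = {"script", "style"}
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


class MarkdownSketch(HTMLParser):
    """Hypothetical converter for illustration; not part of Pydoll."""

    def __init__(self, include_images=True):
        super().__init__(convert_charrefs=True)
        self.include_images = include_images
        self.blocks = []       # finished Markdown blocks
        self.buf = []          # text of the block being built
        self.items = []        # list items of the currently open list
        self.list_kind = None  # "ul", "ol", or None (flat lists only)
        self.href = None       # URL of the currently open <a>
        self.link_text = []
        self.skip = 0          # depth inside <script>/<style>

    def _emit(self, text):
        # Text inside a link is collected separately until </a>.
        (self.link_text if self.href is not None else self.buf).append(text)

    def _take(self):
        # Return the buffered text with whitespace collapsed, and reset.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _close_block(self, prefix=""):
        text = self._take()
        if text:
            self.blocks.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip += 1
            return
        if self.skip:
            return
        attrs = dict(attrs)
        if tag in HEADINGS or tag in ("p", "pre"):
            self._close_block()  # flush loose text before a new block
        elif tag in ("ul", "ol"):
            self._close_block()
            self.list_kind, self.items = tag, []
        elif tag == "a":
            self.href, self.link_text = attrs.get("href") or "", []
        elif tag == "img" and self.include_images:
            self._emit(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip = max(0, self.skip - 1)
            return
        if self.skip:
            return
        if tag in HEADINGS:
            self._close_block(HEADINGS[tag] + " ")
        elif tag == "p":
            self._close_block()
        elif tag == "li" and self.list_kind:
            text = self._take()
            if text:
                n = len(self.items) + 1
                marker = f"{n}. " if self.list_kind == "ol" else "- "
                self.items.append(marker + text)
        elif tag in ("ul", "ol"):
            self._take()  # drop stray whitespace after the last item
            if self.items:
                self.blocks.append("\n".join(self.items))
            self.list_kind, self.items = None, []
        elif tag == "a" and self.href is not None:
            label = " ".join("".join(self.link_text).split())
            href, self.href = self.href, None
            self.buf.append(f"[{label}]({href})")
        elif tag == "pre":
            # Keep whitespace inside <pre> verbatim, trimming outer newlines.
            raw = "".join(self.buf).strip("\n")
            self.buf = []
            self.blocks.append(f"{FENCE}\n{raw}\n{FENCE}")

    def handle_data(self, data):
        if not self.skip:
            self._emit(data)


def html_to_markdown(html, include_images=True):
    parser = MarkdownSketch(include_images)
    parser.feed(html)
    parser.close()
    parser._close_block()  # flush any trailing loose text
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1><p>See <a href='https://example.com'>docs</a>.</p>"))
```

Tags with no handler, such as <span>, contribute only their text, which matches the "discard non-semantic inline styling" rule. A real exporter would more likely walk the live DOM than re-parse serialized HTML, so this shows the mapping rules rather than an architecture.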
This will likely be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight, with Pydoll calling the library internally. A minimal prototype will be released first, covering the essential mappings; that should already be useful for RAG/LLM scenarios.