Takahiro Sato

Efficient HTML to Markdown Conversion for LLM Input with mq-crawler

LLM applications require clean, structured input data. Web content often contains HTML markup, navigation elements, and irrelevant content that degrades LLM performance. Converting HTML to clean Markdown while extracting only relevant content remains a common challenge.

mq-crawler addresses this by combining web crawling with HTML-to-Markdown conversion and content filtering through mq queries.

Single Binary Deployment

mq-crawler (distributed as mqcr) is a standalone binary. The single executable includes the complete web crawler, HTML parser, Markdown converter, and mq-lang query processor. This eliminates dependency management and environment setup issues common with multi-component solutions.

    # Download and run immediately - no installation required
    curl -L https://github.com/harehare/mq/releases/latest/download/mqcr-linux-x86_64 -o mqcr
    chmod +x mqcr
    ./mqcr https://docs.example.com

Core Functionality

mq-crawler crawls websites, converts HTML to Markdown, and processes content with mq-lang queries. The tool respects robots.txt, implements rate limiting, and supports concurrent processing.

    # Basic crawling with markdown output
    mqcr https://docs.example.com

    # Extract specific content sections
    mqcr -q '.h2 | select(contains("API"))' https://docs.example.com

    # Parallel crawling with 4 workers
    mqcr -c 4 -o ./output https://docs.example.com

HTML to Markdown Conversion

The conversion process handles complex HTML structures:

Table Processing

HTML tables convert to properly formatted Markdown tables with column alignment detection:

    # Input HTML
    <table>
      <tr><th>Method</th><th>Endpoint</th><th>Description</th></tr>
      <tr><td>GET</td><td>/api/users</td><td>List users</td></tr>
    </table>

    # Output Markdown
    | Method | Endpoint   | Description |
    | ------ | ---------- | ----------- |
    | GET    | /api/users | List users  |

Concurrent Processing

The crawler supports parallel processing for efficiency:

    # Process multiple pages concurrently
    mqcr -c 8 --crawl-delay 1.0 https://docs.example.com

    # Configure timeouts for different operations
    mqcr --page-load-timeout 30 --script-timeout 10 --implicit-timeout 5 https://example.com

Each timeout option governs a different stage of page processing:

  • --page-load-timeout: Full page loading (default: 30s)
  • --script-timeout: JavaScript execution (default: 10s)
  • --implicit-timeout: Element finding (default: 5s)
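
The three timeout flags can be collected in a small wrapper so that every crawl uses the same settings. This is a minimal sketch, assuming `mqcr` is on your `PATH`. The function name `crawl_with_timeouts` and the `MQCR_*` environment variables are hypothetical names chosen for this example, not part of mq-crawler; the fallback values mirror the defaults listed above.

```shell
# Hypothetical wrapper: the function name and MQCR_* variables are
# illustrative only. Fallback values mirror the defaults listed above.
crawl_with_timeouts() {
  url="$1"
  mqcr \
    --page-load-timeout "${MQCR_PAGE_LOAD_TIMEOUT:-30}" \
    --script-timeout "${MQCR_SCRIPT_TIMEOUT:-10}" \
    --implicit-timeout "${MQCR_IMPLICIT_TIMEOUT:-5}" \
    "$url"
}

# Usage (give JavaScript-heavy pages more time to run scripts):
#   MQCR_SCRIPT_TIMEOUT=20 crawl_with_timeouts https://example.com
```

Keeping the values in environment variables lets a slow site get a longer limit without editing the wrapper itself.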

Content Filtering with mq Queries

mq-crawler processes content through mq-lang queries for targeted extraction:

Code Example Collection

    # Extract all code examples with context
    mqcr -q '
      .code
      | {
          "language": attr("lang"),
          "code": to_text(),
        }
    ' https://tutorial.example.com
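
Multi-line queries are easier to maintain in a file than inside shell quotes. The sketch below reads a query from a file and passes it to the `-q` option shown above. The function name `crawl_with_query_file` and the file name `query.mq` are hypothetical; the only mq-crawler feature assumed is that `-q` accepts the query text as a single argument.

```shell
# Hypothetical helper: read an mq query from a file and pass it to -q.
# The function and file names are illustrative, not part of mq-crawler.
crawl_with_query_file() {
  query_file="$1"
  url="$2"
  # $(cat ...) strips the trailing newline; the double quotes keep the
  # whole query as one argument even if it contains spaces or quotes.
  mqcr -q "$(cat "$query_file")" "$url"
}

# Usage:
#   printf '%s\n' '.h2 | select(contains("API"))' > query.mq
#   crawl_with_query_file query.mq https://docs.example.com
```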

Ethical Crawling Features

robots.txt Compliance

    # Respect robots.txt automatically
    mqcr https://example.com

    # Use custom robots.txt
    mqcr --robots-path ./custom-robots.txt https://example.com
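
To see which rules apply before a crawl starts, one option is to download the site's robots.txt, review its `Disallow` lines, and then pass that reviewed copy through `--robots-path`. This is a sketch under assumptions: the function name `crawl_with_reviewed_robots` is hypothetical, and it assumes `--robots-path` accepts any local file, as in the example above.

```shell
# Hypothetical helper: fetch a site's robots.txt, print its Disallow rules
# for review, then crawl using that local copy.
crawl_with_reviewed_robots() {
  base_url="$1"                      # site origin without a trailing slash
  robots_file="${2:-./robots.txt}"   # where to save the downloaded copy
  # -f makes curl fail on HTTP errors such as 404 instead of saving the error page
  curl -fsSL "$base_url/robots.txt" -o "$robots_file" || return 1
  # Show the disallowed paths; a file with no Disallow lines is not an error
  grep -i '^disallow:' "$robots_file" || true
  mqcr --robots-path "$robots_file" "$base_url"
}

# Usage:
#   crawl_with_reviewed_robots https://example.com ./example-robots.txt
```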

Rate Limiting

    # Configure crawl delays
    mqcr --crawl-delay 2.0 https://example.com

    # Respectful concurrent crawling
    mqcr -c 3 --crawl-delay 1.5 https://example.com
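
When several sites need crawling, the same conservative settings can be applied to each in turn. The sketch below reads URLs from a text file and crawls them one site at a time. The function name `crawl_sites_politely`, the list-file format, and the `site-N` directory naming are assumptions made for this example; only `-c`, `--crawl-delay`, and `-o` come from the commands shown earlier.

```shell
# Hypothetical helper: crawl each URL listed in a file, one site at a time,
# using a small worker count and a fixed crawl delay.
crawl_sites_politely() {
  list_file="$1"   # one URL per line; blank lines and '#' comments are skipped
  out_root="$2"    # each site gets its own subdirectory under this path
  n=0
  while IFS= read -r url; do
    case "$url" in
      ''|'#'*) continue ;;
    esac
    n=$((n + 1))
    mkdir -p "$out_root/site-$n"
    # Redirect stdin so the crawler cannot consume the rest of the URL list
    mqcr -c 3 --crawl-delay 1.5 -o "$out_root/site-$n" "$url" < /dev/null
  done < "$list_file"
}

# Usage:
#   crawl_sites_politely sites.txt ./output
```

Running sites sequentially keeps the total request rate predictable; concurrency is applied only within each crawl.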

Installation and Setup

Package Installation

    # Install via Homebrew
    brew install harehare/tap/mqcr

    # Download pre-built binary directly
    curl -L https://github.com/harehare/mq/releases/latest/download/mqcr-linux-x86_64 -o mqcr
    chmod +x mqcr

    # Or build from source
    cargo install --git https://github.com/harehare/mq.git mq-crawler

Results and Benefits

The mq-crawler approach provides:

  1. Single Binary Deployment: Immediate execution with no dependencies to install
  2. Clean Markdown Output: Structured content without HTML noise
  3. Targeted Extraction: Query-based filtering for relevant content
  4. Ethical Compliance: Automatic robots.txt compliance and rate limiting
  5. Scalable Processing: Concurrent crawling with configurable limits
  6. LLM-Ready Format: Properly formatted Markdown that needs little manual preprocessing

This combination reduces manual preprocessing overhead while maintaining content quality for LLM applications.
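
As one possible last step before handing content to an LLM, the crawled files can be merged into a single context file. This sketch assumes that a crawl run with `-o ./output` leaves Markdown files ending in `.md` somewhere under that directory, which is an assumption about the output layout rather than documented behavior. The function name `merge_markdown` is hypothetical.

```shell
# Hypothetical helper: merge crawled Markdown files into one context file,
# with an HTML comment marking where each source file begins.
# Assumes the crawl output consists of *.md files under src_dir.
merge_markdown() {
  src_dir="$1"
  dest_file="$2"   # keep this outside src_dir so it is not merged into itself
  : > "$dest_file" # start from an empty file
  find "$src_dir" -type f -name '*.md' | sort | while IFS= read -r f; do
    printf '<!-- source: %s -->\n\n' "$f" >> "$dest_file"
    cat "$f" >> "$dest_file"
    printf '\n' >> "$dest_file"
  done
}

# Usage:
#   mqcr -c 4 -o ./output https://docs.example.com
#   merge_markdown ./output ./llm-context.md
```

The source markers make it possible to trace a passage in the merged file back to the page it came from.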

Installation

    # Using Homebrew
    brew install harehare/tap/mqcr

    # Direct binary download (no dependencies)
    curl -L https://github.com/harehare/mq/releases/latest/download/mqcr-linux-x86_64 -o mqcr
    chmod +x mqcr

For other installation methods, including Docker and pre-built binaries, check the official installation guide.
