A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.
- 🚀 Modern API Design - Clean functional and class-based APIs
- 🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
- 📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
- 🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
- 🎯 Smart Presets - Optimized configurations for different content types
- 🔌 Plugin System - Extensible plugin architecture
- 📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more
```bash
npm install html-content-processor
```

```typescript
import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';

// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');

// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');

// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');
```

The library can automatically detect page types and apply optimal filtering strategies:
```typescript
import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';

// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');

// Clean HTML with automatic page type detection
const cleanHtml = await cleanHtmlAuto(html, 'https://news.example.com/article');

// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);
```

```typescript
import { HtmlProcessor } from 'html-content-processor';

// Method chaining
const result = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter()
  .toMarkdown();

// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();

const markdown = await processor.toMarkdown();
```

```typescript
import { htmlToArticleMarkdown, htmlToBlogMarkdown, htmlToNewsMarkdown } from 'html-content-processor';

// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);
```

| Function | Description | Return Type |
|---|---|---|
| `htmlToMarkdown(html, options?)` | Convert HTML to Markdown | `Promise<string>` |
| `htmlToMarkdownWithCitations(html, baseUrl?, options?)` | Convert HTML to Markdown with citations | `Promise<string>` |
| `htmlToText(html, options?)` | Convert HTML to plain text | `Promise<string>` |
| `cleanHtml(html, options?)` | Clean HTML content | `Promise<string>` |
| `extractContent(html, options?)` | Extract content fragments | `Promise<string[]>` |

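
The table above does not say what the citation output of `htmlToMarkdownWithCitations` looks like. One common convention is to replace inline links with numbered markers and append a reference list. The sketch below illustrates that convention only; `toCitations` is a hypothetical helper and its output format is an assumption, not the library's documented behavior.

```typescript
// Hypothetical sketch: rewrites inline Markdown links as numbered citations.
// The output format is an assumption, not the output of htmlToMarkdownWithCitations.
function toCitations(markdown: string): string {
  const urls: string[] = [];

  // Match [text](url), capturing an optional leading "!" so images can be skipped.
  const body = markdown.replace(
    /(!?)\[([^\]]+)\]\(([^)\s]+)\)/g,
    (match: string, bang: string, text: string, url: string) => {
      if (bang === '!') return match; // leave images untouched

      // Reuse the existing number when the same URL appears again.
      let index = urls.indexOf(url);
      if (index === -1) {
        urls.push(url);
        index = urls.length - 1;
      }
      return `${text}[${index + 1}]`;
    },
  );

  if (urls.length === 0) return body;

  // Append one numbered reference line per distinct URL.
  const references = urls.map((url, i) => `[${i + 1}]: ${url}`).join('\n');
  return `${body}\n\n${references}`;
}
```

Repeated URLs share one number, which keeps the reference list short on pages that link to the same target several times.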
| Function | Description | Benefits |
|---|---|---|
| `htmlToMarkdownAuto(html, url?, options?)` | Auto-detect page type and convert to Markdown | Optimal filtering for each page type |
| `cleanHtmlAuto(html, url?, options?)` | Auto-detect page type and clean HTML | Smart noise removal |
| `extractContentAuto(html, url?, options?)` | Auto-detect and extract with detailed results | Comprehensive page analysis |

```typescript
// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering

// News article detection
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering

// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering

// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering
```

| Function | Optimized For |
|---|---|
| `htmlToArticleMarkdown()` | Long-form articles |
| `htmlToBlogMarkdown()` | Blog posts |
| `htmlToNewsMarkdown()` | News articles |
| `strictCleanHtml()` | Aggressive cleaning |
| `gentleCleanHtml()` | Conservative cleaning |

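
The presets above differ in how aggressively they discard markup, and the filter options in this README include a text density `ratio` and a `minWords` count. The sketch below shows one way such options could separate content from navigation noise. It is a simplified illustration, not the library's algorithm: `DensityOptions`, `visibleText`, `textDensity`, `keepBlock`, and the numeric values are all hypothetical.

```typescript
// Hypothetical sketch: a simplified text-density filter, not the library's algorithm.
interface DensityOptions {
  ratio: number;    // minimum share of characters that must be visible text
  minWords: number; // minimum number of words a block must contain
}

// Strip tags with a naive regex; a real implementation would walk the DOM instead.
function visibleText(fragment: string): string {
  return fragment.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Visible characters divided by total characters, markup included.
function textDensity(fragment: string): number {
  const total = fragment.trim().length;
  return total === 0 ? 0 : visibleText(fragment).length / total;
}

// A block is kept only if it passes both the word-count and the density check.
function keepBlock(fragment: string, options: DensityOptions): boolean {
  const text = visibleText(fragment);
  const words = text === '' ? 0 : text.split(' ').length;
  return words >= options.minWords && textDensity(fragment) >= options.ratio;
}

// Illustrative values only; they are not the presets' real settings.
const strict: DensityOptions = { ratio: 0.6, minWords: 5 };
const gentle: DensityOptions = { ratio: 0.2, minWords: 0 };

const paragraph = '<p>Plain paragraphs are mostly text, so they score high.</p>';
const byline = '<div class="meta"><span>By Jane Doe</span></div>';
const nav = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>';

const keptByStrict = [paragraph, byline, nav].filter((block) => keepBlock(block, strict));
const keptByGentle = [paragraph, byline, nav].filter((block) => keepBlock(block, gentle));
```

With these values the strict settings keep only the paragraph, while the gentle settings also keep the byline. The link-heavy navigation list is dropped in both cases because almost all of its characters are markup.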
```typescript
// Create processor
const processor = HtmlProcessor.from(html, options);

// Configuration methods
processor.withBaseUrl(url)           // Set base URL
processor.withOptions(options)       // Update options
processor.withAutoDetection(url?)    // Enable auto-detection
processor.withPageType(type)         // Manually set page type

// Processing methods
await processor.filter(options?)     // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText()             // Convert to plain text
await processor.toArray()            // Convert to fragment array
processor.toString()                 // Get cleaned HTML

// Information methods
processor.getOptions()               // Get current options
processor.isProcessed()              // Check if processed
processor.getPageTypeResult()        // Get page type detection result
```

Filter options:

```typescript
{
  threshold?: number;              // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic';  // Filtering strategy (default: 'dynamic')
  ratio?: number;                  // Text density ratio (default: 0.48)
  minWords?: number;               // Minimum word count (default: 0)
  preserveStructure?: boolean;     // Preserve structure (default: false)
  keepElements?: string[];         // Elements to keep
  removeElements?: string[];       // Elements to remove
}
```

Markdown options:

```typescript
{
  citations?: boolean;             // Generate citations (default: true)
  ignoreLinks?: boolean;           // Ignore links (default: false)
  ignoreImages?: boolean;          // Ignore images (default: false)
  baseUrl?: string;                // Base URL
  threshold?: number;              // Filter threshold
  strategy?: 'fixed' | 'dynamic';  // Filter strategy
  ratio?: number;                  // Text density ratio
}
```

The library automatically detects and optimizes for these page types:
- `search-engine` - Search engine result pages
- `blog` - Blog posts and personal articles
- `news` - News articles and journalism
- `documentation` - Technical documentation
- `e-commerce` - E-commerce and product pages
- `social-media` - Social media content
- `forum` - Forum discussions and Q&A
- `article` - General articles and content
- `landing-page` - Marketing and landing pages

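
To make the idea of detection with a confidence score concrete, the sketch below classifies a page from its URL alone. It is a hypothetical illustration: the page type names come from the list above, but `detectFromUrl`, the rules, and the weights are invented here, and the library's detector may rely on quite different signals, such as the HTML itself.

```typescript
// Hypothetical sketch: NOT the library's detection logic.
// Page type names follow the list above; rules and weights are invented.
type PageType =
  | 'search-engine' | 'blog' | 'news' | 'documentation'
  | 'e-commerce' | 'social-media' | 'forum' | 'article' | 'landing-page';

interface Detection {
  type: PageType;
  confidence: number; // between 0 and 1
  reasons: string[];
}

interface UrlRule {
  type: PageType;
  pattern: RegExp;
  weight: number;
  reason: string;
}

const URL_RULES: UrlRule[] = [
  { type: 'search-engine', pattern: /[?&]q=/, weight: 0.6, reason: 'query parameter "q"' },
  { type: 'search-engine', pattern: /\/search\b/, weight: 0.3, reason: 'path contains /search' },
  { type: 'documentation', pattern: /^https?:\/\/docs\./, weight: 0.7, reason: 'docs. subdomain' },
  { type: 'news', pattern: /^https?:\/\/news\./, weight: 0.7, reason: 'news. subdomain' },
  { type: 'blog', pattern: /\/blog\b|^https?:\/\/blog\./, weight: 0.7, reason: 'blog in host or path' },
  { type: 'forum', pattern: /\/(forum|questions)\b/, weight: 0.6, reason: 'forum-style path' },
];

function detectFromUrl(url: string): Detection {
  // Sum the weights of all matching rules per page type.
  const scores = new Map<PageType, { score: number; reasons: string[] }>();
  for (const rule of URL_RULES) {
    if (!rule.pattern.test(url)) continue;
    const entry = scores.get(rule.type) ?? { score: 0, reasons: [] };
    entry.score += rule.weight;
    entry.reasons.push(rule.reason);
    scores.set(rule.type, entry);
  }

  // Fall back to the generic 'article' type when no rule matches.
  let best: Detection = { type: 'article', confidence: 0, reasons: ['no URL rule matched'] };
  scores.forEach((entry, type) => {
    const confidence = Math.min(entry.score, 1); // cap at 1
    if (confidence > best.confidence) {
      best = { type, confidence, reasons: entry.reasons };
    }
  });
  return best;
}

// example.type is 'documentation' and example.reasons is ['docs. subdomain'].
const example = detectFromUrl('https://docs.example.com/guide');
```

Keeping the matched reasons alongside the score mirrors the `reasons` field that `extractContentAuto` exposes on `result.pageType`, which makes a surprising classification easier to debug.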
```typescript
import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(html, url);
console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
```

Node.js:

```bash
npm install jsdom  # Recommended for best performance
```

Browser: direct support, no additional dependencies required.
```html
<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  htmlFilter.htmlToMarkdown(html).then(console.log);

  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>
```

```typescript
import { htmlToMarkdownAuto } from 'html-content-processor';

// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering
```

```typescript
import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}
```

```typescript
import { htmlToMarkdownAuto } from 'html-content-processor';

// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure
```

- ⚡ Fast Processing: Optimized algorithms for quick content extraction
- 💾 Memory Efficient: Minimal memory footprint
- 🔄 Batch Processing: Handle multiple documents efficiently
- 📊 Smart Caching: Automatic page type detection caching
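
The list above mentions batch processing and detection caching without describing how they work. As a rough illustration of the general pattern, the sketch below wraps any async detector in a per-URL cache and runs a batch concurrently. `Detector`, `withCache`, and `processBatch` are hypothetical names, and keying the cache on the URL alone is an assumption; none of this is taken from the library's internals.

```typescript
// Hypothetical sketch: a caching wrapper for any async detector.
// `Detector` stands in for a function such as extractContentAuto; this is not the library's cache.
type Detector<T> = (html: string, url: string) => Promise<T>;

function withCache<T>(detect: Detector<T>): Detector<T> {
  // Assumption: the URL alone identifies a page, so it serves as the cache key.
  const cache = new Map<string, Promise<T>>();
  return (html, url) => {
    const cached = cache.get(url);
    if (cached !== undefined) return cached;

    // Store the promise itself so concurrent calls for one URL share a single run.
    const pending = detect(html, url);
    cache.set(url, pending);

    // Drop failed entries so a later call can retry.
    pending.catch(() => cache.delete(url));
    return pending;
  };
}

// Run the detector over many pages concurrently, preserving input order.
async function processBatch<T>(
  pages: { html: string; url: string }[],
  detect: Detector<T>,
): Promise<T[]> {
  return Promise.all(pages.map((page) => detect(page.html, page.url)));
}
```

Passing a cached detector to `processBatch` means duplicate URLs within one batch trigger only a single detection run.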
MIT License