HTML Content Processor

A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.

Features

🚀 Modern API Design - Clean functional and class-based APIs
🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
🎯 Smart Presets - Optimized configurations for different content types
🔌 Plugin System - Extensible plugin architecture
📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more

Installation

npm install html-content-processor

Quick Start

Basic Usage

import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';
// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');
// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');
// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');

Automatic Page Type Detection (Recommended)

The library can automatically detect page types and apply optimal filtering strategies:

import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';
// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');
// Clean HTML with automatic page type detection
// (named cleanedHtml to avoid confusion with the cleanHtml function)
const cleanedHtml = await cleanHtmlAuto(html, 'https://news.example.com/article');
// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);

HtmlProcessor Class (Advanced Usage)

import { HtmlProcessor } from 'html-content-processor';
// Method chaining
const result = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter()
  .toMarkdown();
// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();
const markdown = await processor.toMarkdown();

Content-Specific Presets

import { htmlToArticleMarkdown, htmlToBlogMarkdown, htmlToNewsMarkdown } from 'html-content-processor';
// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);

API Reference

Core Functions

Function	Description	Return Type
`htmlToMarkdown(html, options?)`	Convert HTML to Markdown	`Promise<string>`
`htmlToMarkdownWithCitations(html, baseUrl?, options?)`	Convert HTML to Markdown with citations	`Promise<string>`
`htmlToText(html, options?)`	Convert HTML to plain text	`Promise<string>`
`cleanHtml(html, options?)`	Clean HTML content	`Promise<string>`
`extractContent(html, options?)`	Extract content fragments	`Promise<string[]>`

Automatic Detection Functions

Function	Description	Benefits
`htmlToMarkdownAuto(html, url?, options?)`	Auto-detect page type and convert to Markdown	Optimal filtering for each page type
`cleanHtmlAuto(html, url?, options?)`	Auto-detect page type and clean HTML	Smart noise removal
`extractContentAuto(html, url?, options?)`	Auto-detect and extract with detailed results	Comprehensive page analysis

Example: Using Auto-Detection

// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering
// News article detection
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering
// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering
// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering

Content-Specific Presets

Function	Optimized For
`htmlToArticleMarkdown()`	Long-form articles
`htmlToBlogMarkdown()`	Blog posts
`htmlToNewsMarkdown()`	News articles
`strictCleanHtml()`	Aggressive cleaning
`gentleCleanHtml()`	Conservative cleaning

HtmlProcessor Class

// Create processor
const processor = HtmlProcessor.from(html, options);
// Configuration methods
processor.withBaseUrl(url)            // Set base URL
processor.withOptions(options)        // Update options
processor.withAutoDetection(url?)     // Enable auto-detection
processor.withPageType(type)          // Manually set page type
// Processing methods
await processor.filter(options?)      // Apply filtering
await processor.toMarkdown(options?)  // Convert to Markdown
await processor.toText()              // Convert to plain text
await processor.toArray()             // Convert to fragment array
processor.toString()                  // Get cleaned HTML
// Information methods
processor.getOptions()                // Get current options
processor.isProcessed()               // Check if processed
processor.getPageTypeResult()         // Get page type detection result

Configuration Options

Filter Options (FilterOptions)

{
  threshold?: number;              // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic';  // Filtering strategy (default: 'dynamic')
  ratio?: number;                  // Text density ratio (default: 0.48)
  minWords?: number;               // Minimum word count (default: 0)
  preserveStructure?: boolean;     // Preserve structure (default: false)
  keepElements?: string[];         // Elements to keep
  removeElements?: string[];       // Elements to remove
}
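
To make `ratio` and `minWords` more concrete, here is a minimal sketch of a text-density check. It is not the library's filtering algorithm: the names `stripTags`, `textDensity`, and `passesFilter` are hypothetical, and defining density as visible text length divided by total markup length is an assumption made for illustration.

```typescript
// Hypothetical illustration only: not the library's actual filtering algorithm.
interface DensityOptions {
  ratio: number;    // minimum share of characters that must be visible text
  minWords: number; // minimum number of words in the fragment
}

// Strip tags with a simple regular expression (adequate for a sketch, not for real HTML parsing).
function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Density = visible text length / total fragment length, a value between 0 and 1.
function textDensity(fragment: string): number {
  if (fragment.length === 0) return 0;
  return stripTags(fragment).length / fragment.length;
}

// Keep a fragment only if it has enough words and enough text relative to markup.
function passesFilter(fragment: string, options: DensityOptions): boolean {
  const text = stripTags(fragment);
  const words = text === '' ? 0 : text.split(' ').length;
  return words >= options.minWords && textDensity(fragment) >= options.ratio;
}

const sampleOptions: DensityOptions = { ratio: 0.48, minWords: 0 };
console.log(passesFilter('<p>This is a long paragraph of real article text.</p>', sampleOptions));
console.log(passesFilter('<a href="/home"><span class="icon"></span>Home</a>', sampleOptions));
```

Under this definition a paragraph that is mostly text passes, while a navigation link that is mostly markup does not. A higher `ratio` filters more aggressively.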

Convert Options (ConvertOptions)

{
  citations?: boolean;             // Generate citations (default: true)
  ignoreLinks?: boolean;           // Ignore links (default: false)
  ignoreImages?: boolean;          // Ignore images (default: false)
  baseUrl?: string;                // Base URL
  threshold?: number;              // Filter threshold
  strategy?: 'fixed' | 'dynamic';  // Filter strategy
  ratio?: number;                  // Text density ratio
}
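
The `citations` option suggests that links can be emitted as numbered references rather than inline. The exact output format is not specified here, so the following is only a sketch of the general technique. `toCitations` is a hypothetical helper and the reference layout is an assumption.

```typescript
// Hypothetical sketch of citation-style links; the library's real output format may differ.
function toCitations(markdown: string): string {
  const urls: string[] = [];
  const body = markdown.replace(
    /(!?)\[([^\]]+)\]\(([^)\s]+)\)/g,
    (match: string, bang: string, label: string, url: string): string => {
      if (bang === '!') return match; // leave images untouched
      let index = urls.indexOf(url);
      if (index === -1) {
        urls.push(url);
        index = urls.length - 1;
      }
      // Repeated URLs reuse the number assigned on first appearance.
      return `${label} [${index + 1}]`;
    }
  );
  if (urls.length === 0) return body;
  const references = urls.map((url, i) => `[${i + 1}]: ${url}`).join('\n');
  return `${body}\n\n${references}`;
}

console.log(toCitations('See [docs](https://a.example) and [more](https://b.example).'));
```

Moving URLs into a reference list keeps the body text readable and deduplicates repeated links.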

Automatic Page Type Detection

The library automatically detects and optimizes for these page types:

search-engine - Search engine result pages
blog - Blog posts and personal articles
news - News articles and journalism
documentation - Technical documentation
e-commerce - E-commerce and product pages
social-media - Social media content
forum - Forum discussions and Q&A
article - General articles and content
landing-page - Marketing and landing pages
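
As a rough illustration of how a page type could be inferred from a URL alone, the sketch below scores a few URL patterns and returns the best match. The rules, weights, and the names `urlRules` and `detectPageType` are assumptions for this example. The library's detector may also inspect the HTML itself and will use its own signals.

```typescript
// Hypothetical sketch: rule patterns and weights are illustrative assumptions.
type PageType = 'search-engine' | 'blog' | 'news' | 'documentation' | 'article';

interface Detection {
  type: PageType;
  confidence: number; // sum of matched rule weights, capped at 1
  reasons: string[];
}

interface Rule {
  type: PageType;
  pattern: RegExp;
  weight: number;
  reason: string;
}

const urlRules: Rule[] = [
  { type: 'search-engine', pattern: /[?&]q=/, weight: 0.6, reason: 'URL has a q= query parameter' },
  { type: 'search-engine', pattern: /\/search(\?|\/|$)/, weight: 0.3, reason: 'path contains /search' },
  { type: 'documentation', pattern: /\/\/docs\./, weight: 0.6, reason: 'hostname starts with docs.' },
  { type: 'documentation', pattern: /\/(docs|guide|api)(\/|$)/, weight: 0.3, reason: 'path looks like documentation' },
  { type: 'news', pattern: /\/\/news\./, weight: 0.6, reason: 'hostname starts with news.' },
  { type: 'news', pattern: /\/(news|article)(\/|$)/, weight: 0.3, reason: 'path looks like a news article' },
  { type: 'blog', pattern: /\/\/blog\./, weight: 0.6, reason: 'hostname starts with blog.' },
  { type: 'blog', pattern: /\/(blog|posts?)(\/|-|$)/, weight: 0.3, reason: 'path looks like a blog post' },
];

function detectPageType(url: string): Detection {
  const target = url.toLowerCase();
  const totals: Partial<Record<PageType, Detection>> = {};
  for (const rule of urlRules) {
    if (!rule.pattern.test(target)) continue;
    const current: Detection = totals[rule.type] ?? { type: rule.type, confidence: 0, reasons: [] };
    current.confidence = Math.min(1, current.confidence + rule.weight);
    current.reasons.push(rule.reason);
    totals[rule.type] = current;
  }
  // Fall back to the generic 'article' type when nothing matches.
  let best: Detection = { type: 'article', confidence: 0, reasons: ['no rule matched'] };
  for (const key of Object.keys(totals) as PageType[]) {
    const candidate = totals[key];
    if (candidate && candidate.confidence > best.confidence) best = candidate;
  }
  return best;
}

console.log(detectPageType('https://docs.example.com/guide'));
```

Returning the matched reasons alongside the type makes a detection easy to inspect when it looks wrong.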

How Auto-Detection Works

import { extractContentAuto } from 'html-content-processor';
const result = await extractContentAuto(html, url);
console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
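
Because detection reports a confidence, callers may want to ignore low-confidence results. The helper below is a small sketch of that idea. `PageTypeInfo` mirrors the fields logged above, but its exact shape is an assumption, and `effectivePageType`, the 0.6 floor, and the `'article'` fallback are hypothetical choices.

```typescript
// Shape mirrors the fields logged above; treat it as an assumption about the result object.
interface PageTypeInfo {
  type: string;
  confidence: number; // expected range: 0..1
  reasons: string[];
}

// Hypothetical helper: trust the detected type only at or above a confidence floor.
function effectivePageType(
  info: PageTypeInfo,
  minConfidence: number = 0.6,
  fallback: string = 'article'
): string {
  return info.confidence >= minConfidence ? info.type : fallback;
}

const detected: PageTypeInfo = { type: 'news', confidence: 0.42, reasons: ['hostname starts with news.'] };
console.log(effectivePageType(detected)); // falls back because 0.42 is below the 0.6 floor
```

The returned value could then be passed to `withPageType(...)` when a manual override is preferable to a weak detection.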

Environment Support

Node.js

npm install jsdom # Recommended for best performance

Browser

Supported directly; no additional dependencies are required.
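
Cross-environment code typically decides at runtime which DOM implementation is available. The sketch below shows one common way to make that decision. `detectRuntime` is a hypothetical function and says nothing about how this library actually chooses between the native DOM and jsdom.

```typescript
// Hypothetical sketch of runtime detection; not taken from the library's source.
type Runtime = 'browser' | 'node' | 'unknown';

function detectRuntime(
  scope: Record<string, unknown> = globalThis as unknown as Record<string, unknown>
): Runtime {
  // Browsers expose a native DOMParser constructor.
  if (typeof scope['DOMParser'] === 'function') return 'browser';
  // Node.js exposes its version string at process.versions.node.
  const proc = scope['process'] as { versions?: { node?: string } } | undefined;
  if (proc && proc.versions && typeof proc.versions.node === 'string') return 'node';
  return 'unknown';
}

console.log(detectRuntime());
```

Accepting the global scope as a parameter keeps the check easy to test with fake objects.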

CDN

<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  // Use the current page's markup as input
  const html = document.documentElement.outerHTML;
  htmlFilter.htmlToMarkdown(html).then(console.log);
  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>

Real-World Examples

Web Scraping with Auto-Detection

import { htmlToMarkdownAuto } from 'html-content-processor';
// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
// Stop early on HTTP errors instead of converting an error page
if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering

News Article Processing

import { extractContentAuto } from 'html-content-processor';
const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}

Documentation Conversion

import { htmlToMarkdownAuto } from 'html-content-processor';
// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure

Performance

⚡ Fast Processing: Optimized algorithms for quick content extraction
💾 Memory Efficient: Minimal memory footprint
🔄 Batch Processing: Handle multiple documents efficiently
📊 Smart Caching: Automatic page type detection caching
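
The caching idea can be illustrated with a small memoization wrapper keyed by URL. This is a generic sketch rather than the library's cache: `memoizeByUrl` and the `maxEntries` bound are hypothetical, and the eviction policy (drop the oldest entry) is an assumption.

```typescript
// Hypothetical sketch of result caching keyed by URL; not the library's cache.
function memoizeByUrl<T>(detect: (url: string) => T, maxEntries: number = 100): (url: string) => T {
  const cache = new Map<string, T>();
  return (url: string): T => {
    if (cache.has(url)) return cache.get(url) as T;
    const result = detect(url);
    if (cache.size >= maxEntries) {
      // Evict the oldest entry; Map iterates keys in insertion order.
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    cache.set(url, result);
    return result;
  };
}

// Usage: wrap any pure, URL-based detection function.
const cachedLength = memoizeByUrl((url: string) => url.length);
console.log(cachedLength('https://example.com'), cachedLength('https://example.com'));
```

Bounding the cache keeps memory use predictable when many distinct URLs are processed in a batch.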

License

MIT License

kamjin3086/html-content-processor
