Building a high-quality text-to-speech service that's completely free seemed impossible until I discovered Microsoft's Edge-TTS. Here's how I architected TTS-Free.Online using modern web technologies and why the technical decisions matter.
The Problem with Existing TTS Solutions
Most TTS APIs are expensive or have quality limitations:
- Google Cloud TTS: $4-16 per million characters
- Amazon Polly: $4 per million characters
- Azure Cognitive Services: $15 per million characters
- Free alternatives often have robotic voices
Discovering Edge-TTS
Edge-TTS is the engine behind Microsoft Edge's "Read Aloud" feature. Key advantages:
- Neural voices with natural prosody
- 40+ languages with regional variants
- SSML support for advanced control
- Completely free (though not officially documented for external use)
The challenge was making it accessible through a web interface.
Technical Architecture
Frontend: Next.js 14 with App Router
I chose Next.js for its full-stack capabilities and excellent developer experience:
```tsx
// app/page.tsx - Main TTS interface
'use client'

import { useState } from 'react'
import { Voice } from '@/types/tts'
import VoiceSelector from '@/components/VoiceSelector'

export default function TTSGenerator() {
  const [text, setText] = useState('')
  const [selectedVoice, setSelectedVoice] = useState<Voice>()
  const [isGenerating, setIsGenerating] = useState(false)
  const [audioUrl, setAudioUrl] = useState<string>()

  const handleGenerate = async () => {
    setIsGenerating(true)
    try {
      const response = await fetch('/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text,
          voice: selectedVoice?.value,
          options: { rate: '0%', pitch: '0%' }
        })
      })

      if (response.ok) {
        const blob = await response.blob()
        setAudioUrl(URL.createObjectURL(blob))
      }
    } finally {
      setIsGenerating(false)
    }
  }

  return (
    <div className="max-w-4xl mx-auto p-6">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter your text here..."
        className="w-full h-40 p-4 border rounded-lg"
      />

      <VoiceSelector
        onVoiceSelect={setSelectedVoice}
        selectedVoice={selectedVoice}
      />

      <button
        onClick={handleGenerate}
        disabled={!text || !selectedVoice || isGenerating}
        className="bg-blue-500 text-white px-6 py-2 rounded-lg disabled:opacity-50"
      >
        {isGenerating ? 'Generating...' : 'Generate Speech'}
      </button>

      {audioUrl && (
        <audio controls className="w-full mt-4">
          <source src={audioUrl} type="audio/mpeg" />
        </audio>
      )}
    </div>
  )
}
```

Backend API: Edge Runtime with Streaming
The core TTS generation happens in a Cloudflare Pages function:
```ts
// app/api/generate/route.ts
import { NextRequest } from 'next/server'

export const runtime = 'edge'

interface TTSRequest {
  text: string
  voice: string
  options?: {
    rate?: string
    pitch?: string
    volume?: string
  }
}

export async function POST(request: NextRequest) {
  try {
    const { text, voice, options }: TTSRequest = await request.json()

    // Validate input
    if (!text || text.length > 10000) {
      return new Response('Invalid text length', { status: 400 })
    }
    if (!voice) {
      return new Response('Voice selection required', { status: 400 })
    }

    // Generate TTS using Edge-TTS
    const audioBuffer = await generateTTS(text, voice, options)

    return new Response(audioBuffer, {
      headers: {
        'Content-Type': 'audio/mpeg',
        'Content-Disposition': 'attachment; filename="speech.mp3"',
        'Cache-Control': 'public, max-age=3600'
      }
    })
  } catch (error) {
    console.error('TTS generation failed:', error)
    return new Response('Generation failed', { status: 500 })
  }
}

async function generateTTS(
  text: string,
  voice: string,
  options: TTSRequest['options'] = {}
): Promise<ArrayBuffer> {
  // Edge-TTS implementation (Node's 'stream' module is unavailable in the
  // edge runtime, so everything works with async iterables instead)
  const EdgeTTS = await import('edge-tts')
  const tts = new EdgeTTS.default()

  // Configure voice and output format
  await tts.setMetadata(voice, EdgeTTS.OUTPUT_FORMAT.AUDIO_24KHZ_48KBITRATE_MONO_MP3)

  // Wrap the text in SSML when prosody options are provided
  let ssmlText = text
  if (options.rate || options.pitch || options.volume) {
    ssmlText = `<speak><prosody${
      options.rate ? ` rate="${options.rate}"` : ''
    }${
      options.pitch ? ` pitch="${options.pitch}"` : ''
    }${
      options.volume ? ` volume="${options.volume}"` : ''
    }>${text}</prosody></speak>`
  }

  const stream = tts.generateSpeech(ssmlText)

  // Collect the streamed chunks into a single ArrayBuffer
  const chunks: Uint8Array[] = []
  for await (const chunk of stream) {
    chunks.push(chunk)
  }

  const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0)
  const result = new Uint8Array(totalLength)
  let offset = 0
  for (const chunk of chunks) {
    result.set(chunk, offset)
    offset += chunk.length
  }

  return result.buffer
}
```

Voice Management System
Dynamic voice loading with language categorization:
```ts
// lib/voices.ts
export interface Voice {
  value: string
  label: string
  language: string
  gender: 'Male' | 'Female'
  locale: string
}

export const VOICE_CATEGORIES: Record<string, string[]> = {
  'English': ['en-US', 'en-GB', 'en-AU', 'en-CA', 'en-IN'],
  'Spanish': ['es-ES', 'es-MX', 'es-AR', 'es-CO'],
  'French': ['fr-FR', 'fr-CA'],
  'German': ['de-DE', 'de-AT', 'de-CH'],
  'Chinese': ['zh-CN', 'zh-HK', 'zh-TW'],
  'Japanese': ['ja-JP'],
  'Korean': ['ko-KR']
  // ... more languages
}

export async function getAvailableVoices(): Promise<Voice[]> {
  // In production, this would call Edge-TTS voice discovery
  // For now, return a static list of known high-quality voices
  return [
    {
      value: 'en-US-AriaNeural',
      label: 'Aria (US English, Female)',
      language: 'English',
      gender: 'Female',
      locale: 'en-US'
    },
    {
      value: 'en-US-GuyNeural',
      label: 'Guy (US English, Male)',
      language: 'English',
      gender: 'Male',
      locale: 'en-US'
    }
    // ... more voices
  ]
}
```

```tsx
// components/VoiceSelector.tsx
import { useState, useEffect } from 'react'
import { Voice, VOICE_CATEGORIES, getAvailableVoices } from '@/lib/voices'

interface VoiceSelectorProps {
  onVoiceSelect: (voice: Voice) => void
  selectedVoice?: Voice
}

export default function VoiceSelector({ onVoiceSelect, selectedVoice }: VoiceSelectorProps) {
  const [voices, setVoices] = useState<Voice[]>([])
  const [selectedLanguage, setSelectedLanguage] = useState('English')

  useEffect(() => {
    getAvailableVoices().then(setVoices)
  }, [])

  const filteredVoices = voices.filter(voice =>
    VOICE_CATEGORIES[selectedLanguage]?.includes(voice.locale)
  )

  return (
    <div className="space-y-4">
      <div>
        <label className="block text-sm font-medium mb-2">Language</label>
        <select
          value={selectedLanguage}
          onChange={(e) => setSelectedLanguage(e.target.value)}
          className="w-full p-2 border rounded-lg"
        >
          {Object.keys(VOICE_CATEGORIES).map(lang => (
            <option key={lang} value={lang}>{lang}</option>
          ))}
        </select>
      </div>

      <div>
        <label className="block text-sm font-medium mb-2">Voice</label>
        <select
          value={selectedVoice?.value || ''}
          onChange={(e) => {
            const voice = voices.find(v => v.value === e.target.value)
            if (voice) onVoiceSelect(voice)
          }}
          className="w-full p-2 border rounded-lg"
        >
          <option value="">Select a voice...</option>
          {filteredVoices.map(voice => (
            <option key={voice.value} value={voice.value}>
              {voice.label}
            </option>
          ))}
        </select>
      </div>
    </div>
  )
}
```

Deployment: Cloudflare Pages
The entire application runs on Cloudflare's edge network:
```js
// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  // The edge runtime is opted into per-route via `export const runtime = 'edge'`,
  // so no global runtime flag is needed in Next.js 14
  images: {
    unoptimized: true
  }
}

module.exports = nextConfig
```

The package.json scripts wire up @cloudflare/next-on-pages for builds and Wrangler for previews and deploys:

```json
{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "pages:build": "npx @cloudflare/next-on-pages",
    "preview": "wrangler pages dev .vercel/output/static",
    "deploy": "wrangler pages deploy .vercel/output/static"
  }
}
```

Advanced Features Implementation
1. SSML Support for Voice Control
```ts
// lib/ssml.ts
export function generateSSML(text: string, options: {
  rate?: string
  pitch?: string
  volume?: string
  emphasis?: 'strong' | 'moderate' | 'reduced'
  pauseAfter?: string
}): string {
  let ssml = text

  // Wrap in prosody for voice modifications
  if (options.rate || options.pitch || options.volume) {
    const prosodyAttrs = [
      options.rate && `rate="${options.rate}"`,
      options.pitch && `pitch="${options.pitch}"`,
      options.volume && `volume="${options.volume}"`
    ].filter(Boolean).join(' ')

    ssml = `<prosody ${prosodyAttrs}>${ssml}</prosody>`
  }

  // Add emphasis
  if (options.emphasis) {
    ssml = `<emphasis level="${options.emphasis}">${ssml}</emphasis>`
  }

  // Add pause
  if (options.pauseAfter) {
    ssml += `<break time="${options.pauseAfter}"/>`
  }

  return `<speak>${ssml}</speak>`
}
```
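Since the tags nest from the inside out, combining options produces emphasis wrapping prosody, with the break appended at the end. A quick illustration of the function above (not extra production code):

```ts
import { generateSSML } from '@/lib/ssml'

const ssml = generateSSML('Welcome back', {
  rate: '-10%',
  emphasis: 'moderate',
  pauseAfter: '300ms'
})
// => <speak><emphasis level="moderate"><prosody rate="-10%">Welcome back</prosody></emphasis><break time="300ms"/></speak>
```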
2. Batch Processing for Long Texts

```ts
// lib/textProcessor.ts
export function chunkText(text: string, maxLength: number = 3000): string[] {
  if (text.length <= maxLength) return [text]

  const chunks: string[] = []
  const sentences = text.split(/[.!?]+/)
  let currentChunk = ''

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxLength && currentChunk) {
      chunks.push(currentChunk.trim())
      currentChunk = sentence
    } else {
      currentChunk += sentence + '. '
    }
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim())
  }

  return chunks
}
```

```ts
// app/api/generate-long/route.ts
import { NextRequest } from 'next/server'
import { chunkText } from '@/lib/textProcessor'
// generateTTS is the helper from the generate route, assumed here to be
// extracted into a shared module so both routes can import it
import { generateTTS } from '@/lib/tts'

export const runtime = 'edge'

export async function POST(request: NextRequest) {
  const { text, voice, options } = await request.json()

  // Split long text into chunks
  const chunks = chunkText(text, 3000)
  const audioChunks: ArrayBuffer[] = []

  for (const chunk of chunks) {
    const audio = await generateTTS(chunk, voice, options)
    audioChunks.push(audio)
  }

  // Combine audio chunks (simplified - would need proper audio concatenation)
  const totalLength = audioChunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const combined = new Uint8Array(totalLength)
  let offset = 0

  for (const chunk of audioChunks) {
    combined.set(new Uint8Array(chunk), offset)
    offset += chunk.byteLength
  }

  return new Response(combined.buffer, {
    headers: {
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': 'attachment; filename="long-speech.mp3"'
    }
  })
}
```

3. Real-time Audio Controls
```tsx
// components/AudioControls.tsx
import { useState, useRef, useEffect } from 'react'

interface AudioControlsProps {
  audioUrl: string
}

export default function AudioControls({ audioUrl }: AudioControlsProps) {
  const audioRef = useRef<HTMLAudioElement>(null)
  const [isPlaying, setIsPlaying] = useState(false)
  const [currentTime, setCurrentTime] = useState(0)
  const [duration, setDuration] = useState(0)
  const [volume, setVolume] = useState(1)
  const [playbackRate, setPlaybackRate] = useState(1)

  useEffect(() => {
    const audio = audioRef.current
    if (!audio) return

    const updateTime = () => setCurrentTime(audio.currentTime)
    const updateDuration = () => setDuration(audio.duration)
    const handleEnd = () => setIsPlaying(false)

    audio.addEventListener('timeupdate', updateTime)
    audio.addEventListener('loadedmetadata', updateDuration)
    audio.addEventListener('ended', handleEnd)

    return () => {
      audio.removeEventListener('timeupdate', updateTime)
      audio.removeEventListener('loadedmetadata', updateDuration)
      audio.removeEventListener('ended', handleEnd)
    }
  }, [audioUrl])

  const togglePlay = () => {
    const audio = audioRef.current
    if (!audio) return

    if (isPlaying) {
      audio.pause()
    } else {
      audio.play()
    }
    setIsPlaying(!isPlaying)
  }

  const handleSeek = (e: React.ChangeEvent<HTMLInputElement>) => {
    const audio = audioRef.current
    if (!audio) return

    const newTime = parseFloat(e.target.value)
    audio.currentTime = newTime
    setCurrentTime(newTime)
  }

  const handleVolumeChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    const newVolume = parseFloat(e.target.value)
    setVolume(newVolume)
    if (audioRef.current) {
      audioRef.current.volume = newVolume
    }
  }

  const handleRateChange = (e: React.ChangeEvent<HTMLSelectElement>) => {
    const newRate = parseFloat(e.target.value)
    setPlaybackRate(newRate)
    if (audioRef.current) {
      audioRef.current.playbackRate = newRate
    }
  }

  return (
    <div className="bg-gray-100 p-4 rounded-lg space-y-4">
      <audio ref={audioRef} src={audioUrl} preload="metadata" />

      <div className="flex items-center space-x-4">
        <button
          onClick={togglePlay}
          className="bg-blue-500 text-white p-2 rounded-full"
        >
          {isPlaying ? '⏸️' : '▶️'}
        </button>

        <div className="flex-1">
          <input
            type="range"
            min="0"
            max={duration || 0}
            value={currentTime}
            onChange={handleSeek}
            className="w-full"
          />
          <div className="flex justify-between text-sm text-gray-600">
            <span>{formatTime(currentTime)}</span>
            <span>{formatTime(duration)}</span>
          </div>
        </div>
      </div>

      <div className="flex items-center space-x-4">
        <label className="flex items-center space-x-2">
          <span>Volume:</span>
          <input
            type="range"
            min="0"
            max="1"
            step="0.1"
            value={volume}
            onChange={handleVolumeChange}
            className="w-20"
          />
        </label>

        <label className="flex items-center space-x-2">
          <span>Speed:</span>
          <select value={playbackRate} onChange={handleRateChange}>
            <option value="0.5">0.5x</option>
            <option value="0.75">0.75x</option>
            <option value="1">1x</option>
            <option value="1.25">1.25x</option>
            <option value="1.5">1.5x</option>
            <option value="2">2x</option>
          </select>
        </label>
      </div>
    </div>
  )
}

function formatTime(seconds: number): string {
  const mins = Math.floor(seconds / 60)
  const secs = Math.floor(seconds % 60)
  return `${mins}:${secs.toString().padStart(2, '0')}`
}
```

Content Strategy with MDX
The site includes comprehensive educational content using MDX:
```tsx
// mdx-components.tsx
export function useMDXComponents(components: any) {
  return {
    h1: ({ children }: any) => (
      <h1 className="text-4xl font-bold mb-6 text-gray-900">{children}</h1>
    ),
    h2: ({ children }: any) => (
      <h2 className="text-3xl font-semibold mb-4 mt-8 text-gray-800">{children}</h2>
    ),
    p: ({ children }: any) => (
      <p className="mb-4 leading-relaxed text-gray-700">{children}</p>
    ),
    code: ({ children }: any) => (
      <code className="bg-gray-100 px-2 py-1 rounded text-sm font-mono">{children}</code>
    ),
    ...components,
  }
}
```

Performance Optimizations
1. Edge Caching Strategy
```ts
// middleware.ts
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

export function middleware(request: NextRequest) {
  const response = NextResponse.next()

  // Cache the voice list for 24 hours
  if (request.nextUrl.pathname.startsWith('/api/voices')) {
    response.headers.set('Cache-Control', 'public, max-age=86400')
  }

  // Cache generated audio for 1 hour
  if (request.nextUrl.pathname.startsWith('/api/generate')) {
    response.headers.set('Cache-Control', 'public, max-age=3600')
  }

  return response
}
```

2. Client-Side Optimization
```ts
// hooks/useTTSCache.ts
import { useState, useCallback } from 'react'

interface CacheEntry {
  audioUrl: string
  timestamp: number
}

const CACHE_DURATION = 1000 * 60 * 30 // 30 minutes

export function useTTSCache() {
  const [cache, setCache] = useState<Map<string, CacheEntry>>(new Map())

  const getCacheKey = (text: string, voice: string, options: any) => {
    return `${text}-${voice}-${JSON.stringify(options)}`
  }

  const getCachedAudio = useCallback((text: string, voice: string, options: any) => {
    const key = getCacheKey(text, voice, options)
    const entry = cache.get(key)

    if (entry && Date.now() - entry.timestamp < CACHE_DURATION) {
      return entry.audioUrl
    }
    return null
  }, [cache])

  const setCachedAudio = useCallback((text: string, voice: string, options: any, audioUrl: string) => {
    const key = getCacheKey(text, voice, options)
    setCache(prev => new Map(prev).set(key, {
      audioUrl,
      timestamp: Date.now()
    }))
  }, [])

  return { getCachedAudio, setCachedAudio }
}
```
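Plugging the cache into the generate flow looks roughly like this. This is a sketch built around a hypothetical useGenerateWithCache wrapper, not the exact production wiring:

```tsx
// Sketch: check the in-memory cache before calling the API.
// useGenerateWithCache is a hypothetical wrapper around the hook above.
import { useState } from 'react'
import { useTTSCache } from '@/hooks/useTTSCache'

export function useGenerateWithCache() {
  const [audioUrl, setAudioUrl] = useState<string>()
  const { getCachedAudio, setCachedAudio } = useTTSCache()

  const generate = async (text: string, voice: string, options: { rate?: string; pitch?: string }) => {
    // Reuse a recent identical request instead of hitting the API again
    const cached = getCachedAudio(text, voice, options)
    if (cached) {
      setAudioUrl(cached)
      return
    }

    const response = await fetch('/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice, options })
    })

    if (response.ok) {
      const url = URL.createObjectURL(await response.blob())
      setCachedAudio(text, voice, options, url)
      setAudioUrl(url)
    }
  }

  return { audioUrl, generate }
}
```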
Monitoring and Analytics

```ts
// lib/analytics.ts

// Minimal typing for the gtag global injected by the GA snippet
declare global {
  interface Window {
    gtag?: (...args: any[]) => void
  }
}

export function trackTTSGeneration(voice: string, textLength: number, success: boolean) {
  // Analytics implementation
  if (typeof window !== 'undefined' && window.gtag) {
    window.gtag('event', 'tts_generation', {
      voice_used: voice,
      text_length_category: getTextLengthCategory(textLength),
      success: success
    })
  }
}

function getTextLengthCategory(length: number): string {
  if (length < 100) return 'short'
  if (length < 500) return 'medium'
  if (length < 2000) return 'long'
  return 'very_long'
}
```

Key Technical Learnings
- Edge Runtime Limitations: Not all Node.js APIs are available in Cloudflare's edge runtime
- Audio Streaming: Implementing proper audio streaming for large files requires careful buffer management (see the sketch after this list)
- Voice Quality: Different voices perform better with different content types
- Caching Strategy: Balancing cache duration with storage costs and user experience
- Error Handling: Graceful fallbacks when Edge-TTS services are unavailable
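On buffer management: the generate route above holds the entire MP3 in memory before responding. A minimal sketch of the streaming alternative, assuming a hypothetical ttsStream() helper that yields MP3 chunks from Edge-TTS as they are produced:

```ts
// Sketch: stream audio to the client as it is generated, instead of
// buffering the full file first. ttsStream() is a hypothetical helper
// exposing Edge-TTS output as an async iterable of MP3 chunks.
export const runtime = 'edge'

declare function ttsStream(text: string, voice: string): AsyncIterable<Uint8Array>

export async function POST(request: Request) {
  const { text, voice } = await request.json()

  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      // Forward each chunk as soon as it arrives; the client can begin
      // playback before generation finishes
      for await (const chunk of ttsStream(text, voice)) {
        controller.enqueue(chunk)
      }
      controller.close()
    }
  })

  return new Response(body, {
    headers: { 'Content-Type': 'audio/mpeg' }
  })
}
```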
Deployment Configuration
```yaml
# .github/workflows/deploy.yml
name: Deploy to Cloudflare Pages

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run pages:build

      - name: Deploy to Cloudflare Pages
        uses: cloudflare/pages-action@v1
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          projectName: tts-free-online
          directory: .vercel/output/static
```

Results and Impact
After 6 months of operation:
- Zero infrastructure costs (Cloudflare Pages free tier)
- Global edge deployment with <100ms response times
- 50,000+ monthly active users
- 500,000+ audio generations
- 99.5% uptime
Future Technical Improvements
- WebAssembly Integration: Moving Edge-TTS processing to client-side WASM
- Real-time Streaming: Implementing Server-Sent Events for progressive audio generation (sketched after this list)
- Voice Cloning: Adding custom voice training capabilities
- API Access: Public API with rate limiting and authentication
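For the SSE idea, the endpoint might take the shape below. This is only a sketch of the planned design, reusing the same hypothetical ttsStream() helper; audio chunks are base64-encoded because SSE is a text protocol:

```ts
// Sketch: progressive generation delivered over Server-Sent Events.
export const runtime = 'edge'

declare function ttsStream(text: string, voice: string): AsyncIterable<Uint8Array>

// Encode a binary chunk as base64 without relying on Node's Buffer
function toBase64(bytes: Uint8Array): string {
  let binary = ''
  for (const b of bytes) binary += String.fromCharCode(b)
  return btoa(binary)
}

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url)
  const text = searchParams.get('text') ?? ''
  const voice = searchParams.get('voice') ?? 'en-US-AriaNeural'
  const encoder = new TextEncoder()

  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of ttsStream(text, voice)) {
        // One SSE event per audio chunk
        controller.enqueue(encoder.encode(`data: ${toBase64(chunk)}\n\n`))
      }
      controller.enqueue(encoder.encode('event: done\ndata: end\n\n'))
      controller.close()
    }
  })

  return new Response(body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache'
    }
  })
}
```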
The combination of Edge-TTS, Next.js 14, and Cloudflare Pages created a powerful, scalable, and cost-effective solution that democratizes access to high-quality text-to-speech technology.
Try the Implementation
The complete source code demonstrates how modern web technologies can create powerful, free alternatives to expensive commercial services. Visit TTS-Free.Online to experience the result, or check out the implementation patterns for your own projects.
Building useful, accessible technology doesn't require massive infrastructure investments—sometimes the best solutions come from creative combinations of existing tools.