"75% - 85% of people watch videos on mute," several studies suggest. Silent viewing isn't the only reason to provide textual alternatives for a diverse audience.
Subtitles, closed captions, and transcripts are alternative channels for spoken content that benefit users and marketing all the same.
Social media and video hosting services not only take care of cloud hosting and responsive delivery, but may also generate text transcripts and subtitles automatically. However, hosting on platforms like Instagram, TikTok, YouTube, Dailymotion, or even PeerTube risks losing our viewers to unrelated content or compromising their data in violation of privacy legislation.
We can use HTML 5 video and VTT closed caption files to implement web video accessibly, and there are some services that might help us to optimize, host, and label our content.
Closable Captions for Diverse Audiences
The most versatile and accessible way to provide alternative text for audio and video content (beyond a title attribute on a placeholder picture) is "closed captions" (CC), so called because they are closable, and often initially closed, as opposed to "open captions", which are always visible. Closed captions are stored in their own files, independent from video and audio tracks, so we can always add subtitles in additional languages without making any changes to our existing video files.
Subtitles vs. Captions
We can offer text files for different languages and add extended versions that describe sounds and other important aspects of our audiovisual content. Those enhanced versions are also known as "captions" or "subtitles for the hard of hearing", but they could also be read by a screen reader or by a person who happens to have their speakers muted.
While the words subtitles and captions are often used interchangeably in everyday language, HTML clearly defines them as distinct values of the <track> element's kind attribute:
As quoted by Wikipedia, HTML5 defines subtitles as a "transcription or translation of the dialogue when sound is available but not understood" by the viewer (for example, dialogue in a foreign language) and captions as a "transcription or translation of the dialogue, sound effects, relevant musical cues, and other relevant audio information when sound is unavailable or not clearly audible".
- kind: subtitles (transcript or translation)
- kind: captions (ditto, plus accessible sound description)
Many videos offer one text version per supported language, focused on what is said in a video or podcast. In situations like a moderated interview or professional marketing content, this might be enough to communicate the crucial content to most users and machines, which will also benefit search-engine optimization and marketing with AI assistants in mind.
I will focus on general-purpose closed captions and transcripts, technically called "subtitles" in HTML, but usually known as "captions" otherwise.
VTT Caption File Format
The most important file format for web video, WebVTT (Web Video Text Tracks), resembles Markdown in that its main content is text, while optional tags add technical information and synchronize text to timestamps.
WEBVTT

00:00:00.500 --> 00:00:02.000
The Web<00:00:01.000> is always<00:00:01.500> changing

00:00:02.500 --> 00:00:04.300
and the way we access it is changing
We could (mis)use fine-grained inline timestamps and visual styling to create a karaoke-style paint-on effect, but every accessibility expert I talked to recommended against it. We can, however, use CSS to define colors and fonts that match our website style without sacrificing accessibility, using the ::cue pseudo-element and various vendor-prefixed pseudo-elements. You can see an example in the code snippets at the end of this post.
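To make the cue structure more tangible, here is a minimal Python sketch that reads a WebVTT string and returns start time, end time, and plain text per cue. The function names parse_cues and to_seconds are made up for this illustration, and the parser deliberately ignores cue settings, NOTE blocks, and STYLE blocks, so treat it as a starting point rather than a complete WebVTT implementation.

```python
import re

# A WebVTT timestamp: "hh:mm:ss.ttt", where the hours part is optional.
TIMESTAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
TIMING_LINE = re.compile(rf"^({TIMESTAMP})\s+-->\s+({TIMESTAMP})")
INLINE_TIMESTAMP = re.compile(rf"<{TIMESTAMP}>")


def to_seconds(timestamp: str) -> float:
    """Convert "hh:mm:ss.ttt" or "mm:ss.ttt" to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_cues(vtt: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for every cue, with inline timestamps removed."""
    cues = []
    # Blocks (header, cues) are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING_LINE.match(line.strip())
            if match:
                # Everything after the timing line is the cue text.
                text = INLINE_TIMESTAMP.sub("", "\n".join(lines[index + 1:]))
                cues.append((to_seconds(match.group(1)), to_seconds(match.group(2)), text))
                break
    return cues
```

Running parse_cues on a hand-written file before uploading it is a cheap way to notice a mistyped timestamp, because a malformed timing line simply produces no cue.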
Captioning Services
There are plenty of captioning services offering free plans for new customers, but their doubtful quality requires careful proofreading, adjusting, and verification. For a paid project, I recommend hiring a human team that uses automated tools with expertise and experience.
Here are some captioning providers that I found. I did not evaluate all of them thoroughly, so please do your own research based on your specific requirements!
Speechpad is an online service operated and controlled by human transcribers, which probably costs more time and money than fully automated AI services, but promises to provide a much higher level of accuracy. Rev follows a hybrid approach offering both human and automated services.
"Vibe Coding" VTT Files
Fully automated solutions claim to caption on the fly within seconds, providing a quick preview or even a basis for further review and refinement. But can they live up to their promise?
Professional and complex tools often have a steeper learning curve and pay off for frequent users or ambitious developers. Kushal Magar used the Shotstack API in a 2022 DEV post. I will focus on some seemingly easy end-user solutions here. Seemingly, because I often find it surprisingly hard to get a single simple task done, trying to make sense of confusing and distracting user interfaces (and loads of errors in the browser console). Some tools might not even have an option to create a caption file.
Describe got my text mostly right, but didn't care for correct German punctuation and capitalization and didn't add timestamps. Veed, a top search result, also seems to focus on video editing and marketing, and its AI agent stated that it "can't generate a VTT captions file directly. However, it can add subtitles to the video. Would you like me to do that?" No, I don't!
Scriptme's button "Create VTT file" looked promising and unambiguous, but after uploading my video, I had to choose an action: transcription or subtitles? Maybe I shouldn't have skipped the tour for new users. "Subtitles" is the correct action for creating VTT files with Scriptme, and they even have a dropdown to select the language. We can change subtitle timing by adjusting the blocks in the timeline and export the subtitles to different file formats: .srt for free evaluation users, .vtt and .stl with a pro plan.
TurboScribe is another automated service that did a useful transcription with timestamps on the fly. Its default export formats are PDF, DOCX, TXT, and SRT, but VTT is an "advanced" option. We'll compare popular file formats later.
Lost in Transcription
Zubtitle didn't ask which language was spoken and failed to detect it, making up a funny fantasy language that we can see in the screenshot below.
Not a single service was flawless.
Marketing videos mixing standard language with technical terms and product names are likely to cause more or less subtle transcription errors, like a "Wardrobe Deep Dive" mistakenly captioned as "Workshop Deep Dive" or "Vortrag Deep Dive". Zubtitle embedded the correct "Wardrobe" into its fantasy language:
We'll do a shopping tour in Kanserbe Munichung with my wardrobe deep dive,then a Skipnitzner Heitigeras at creative. Fiberated.
That doesn't make sense at all in any language.
File Format Conversion
So what about the different formats?
- VTT (short for WebVTT) is the standard format for adding timed text tracks to web video, natively supported by modern browsers.
- SRT is another popular plain-text subtitle format widely supported across many video players, editing software, and social media platforms.
- STL is a professional, binary subtitle format specifically designed for broadcast television and film industries, particularly in Europe.
VTT and SRT look quite similar at first glance. Even the timestamp format differs only in the delimiter for milliseconds: a comma in SRT versus a decimal point in VTT.
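Because the two formats are so close, a conversion can be sketched in a few lines of Python. The function name srt_to_vtt and its keep_counters parameter are assumptions for this illustration, not part of any service mentioned here. The sketch adds the WEBVTT header, swaps the millisecond comma for a dot, and drops the numeric SRT counters by default. It does not translate SRT formatting tags or positioning hints.

```python
import re

# SRT timing line, e.g. "00:00:00,500 --> 00:00:02,000"
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt: str, keep_counters: bool = False) -> str:
    """Convert SRT text to WebVTT text (simplified sketch)."""
    # Normalize line endings and remove a possible byte order mark.
    srt = srt.lstrip("\ufeff").replace("\r\n", "\n").strip()
    output = ["WEBVTT"]
    for block in re.split(r"\n\s*\n", srt):
        lines = block.split("\n")
        # SRT blocks start with a numeric counter that WebVTT does not need.
        if not keep_counters and lines[0].strip().isdigit():
            lines = lines[1:]
        # Only timing lines contain "-->", so cue text stays untouched.
        lines = [
            SRT_TIMING.sub(r"\1.\2 --> \3.\4", line) if "-->" in line else line
            for line in lines
        ]
        output.append("\n".join(lines))
    return "\n\n".join(output) + "\n"
```

With keep_counters=True, the counters survive as WebVTT cue identifiers, which is valid but rarely useful.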
Happyscribe converts an STL file to VTT format.
VTT Variations
Several services produced VTT files with text and timestamps. None mentioned the instrumental music. Some inserted line numbers, and each cut the text differently. In the diff below, we can tell which VTT was converted from SRT without removing the optional numbers that don't add any value in VTT, although descriptive identifiers would.
Splitting Text into Readable Chunks
Readability is probably the most obvious consideration to decide where to split text: make it fit two lines in a large-enough font size, and prevent cutting sentences or related groups of words.
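As a rough illustration of that splitting rule, the following Python sketch breaks a transcript into chunks of at most two lines and never lets a chunk span two sentences. The function name split_into_cues and the default of 42 characters per line are assumptions for this example, not values taken from any of the services tested. Timing is left out on purpose, because it has to come from the audio.

```python
import re
import textwrap


def split_into_cues(transcript: str, max_chars: int = 42, max_lines: int = 2) -> list[str]:
    """Split text into caption chunks that respect sentence boundaries."""
    # Naive sentence split: break after ".", "!" or "?" followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    cues = []
    for sentence in sentences:
        if not sentence:
            continue
        # Wrap at word boundaries so that no line exceeds max_chars.
        lines = textwrap.wrap(sentence, width=max_chars)
        # Group the wrapped lines into chunks of at most max_lines lines.
        for start in range(0, len(lines), max_lines):
            cues.append("\n".join(lines[start:start + max_lines]))
    return cues
```

Long sentences are still cut at arbitrary word boundaries, so keeping related groups of words together remains a manual review step.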
It seems that we can't tune automated services like Scriptme or TurboScribe to improve their unfavorable splitting strategies. Their AI-assisted output is less than ideal for readability and flow.
So we're back to square one (as so often when using AI) unless we are willing to decide that those autogenerated caption files are good enough for our purpose.
Uploading Video and Caption Files
Here is a screenshot of a classic example video uploaded with subtitles in a custom WordPress theme:
"Big Buck Bunny" is a computer-generated short film by the Blender Institute, famous not only for its cute cartoon characters, but also for the fact that it's open source, licensed under the Creative Commons Attribution license.
Implementing HTML Web Video
We can host a short video on our regular web server or its default CDN/cloud using HTML 5 video markup with several source elements that offer alternative formats like webm or mp4. Adjusting the size (width and height) and frame rate ensures decent quality without wasting bandwidth and loading time.
We can use web services, command-line tools like ffmpeg, or graphical ones like HandBrake to produce alternative files for the browser to choose from. Always use the original top-quality input file, not a compressed version, to prevent losing quality during subsequent encoding steps.
Optimizing Video Files
Resolution and frame rate considerations: 720 × 576 (DVD video) might sound outdated, but can be perfectly sufficient inside a fixed container within a website, although it's not nice to watch full-screen on a laptop or a large monitor optimized for 3840 × 2160 (4K TV). Likewise, 30 fps might sound like a flickering frame rate, but it is both a classic TV broadcast rate and a typical iPhone video recording setting. Formats and codecs also matter. I have reduced a 130 MB mov video (1:30 minutes, 1920x1080, 24fps) to an equivalent 54 MB mp4 version using common web settings like -c:a aac -c:v libx264 -pix_fmt yuv420p -profile:v baseline -level 3.0 -crf 22 output.mp4, still with perceived top quality.
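For repeated encodings, the settings above can be wrapped in a small Python helper. This is only a sketch: the function name build_ffmpeg_command, the file names, and the -i input argument are assumptions, since the post shows only the output-side options. The helper builds the argument list without running anything, so it can be inspected first.

```python
import shlex


def build_ffmpeg_command(source: str, target: str, crf: int = 22) -> list[str]:
    """Build an ffmpeg argument list using the web settings quoted above."""
    return [
        "ffmpeg",
        "-i", source,             # input file (assumed, not shown in the post)
        "-c:a", "aac",            # AAC audio
        "-c:v", "libx264",        # H.264 video
        "-pix_fmt", "yuv420p",    # widely supported pixel format
        "-profile:v", "baseline",
        "-level", "3.0",
        "-crf", str(crf),         # lower values mean higher quality and larger files
        target,
    ]


if __name__ == "__main__":
    command = build_ffmpeg_command("input.mov", "output.mp4")
    # Print the command for review; pass the list to subprocess.run(command, check=True)
    # to actually encode, which requires ffmpeg to be installed.
    print(shlex.join(command))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.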
Markup, Download, and Structured Data
HTML video elements can contain multiple source elements and optional track elements as child elements. Each track can provide a label, a language attribute (srclang), a file URL (src), a default attribute to mark it as the default track, and a kind attribute to distinguish subtitles (transcript or translation) from captions (with additional information).
MDN: Chrome and Opera ignore the default attribute on the <track> element and will instead try to match the browser's language to the subtitle's language.
Together with download links for each video file, performance and display attributes on the video element like preload, muted, or autoplay (the latter only works together with muted in modern browsers), and structured metadata in ld+json format, useful for automatic parsing and indexing, our video section markup looks like this:
<section class="video__wrapper">
  <video controls preload="metadata" class="video__player" width="640" height="360" poster="preview.jpg">
    <source src="video.webm" type="video/webm">
    <source src="video.mp4" type="video/mp4">
    <track label="English subtitles" kind="subtitles" srclang="en" src="subtitles-english.vtt">
    <track label="Deutsche Untertitel" kind="subtitles" srclang="de" src="subtitles-german.vtt">
    <track label="English captions" kind="captions" srclang="en" src="captions-english.vtt">
    <track label="Deutsche Captions" kind="captions" srclang="de" src="captions-german.vtt">
    <p>
      The above video shows a giant cartoon rabbit climbing out of a hole in the ground, walking, and getting involved in a fight.
    </p>
    Download the video in <a href="video.webm">webm</a> or <a href="video.mp4">mp4</a> format.
  </video>
  <script type="application/ld+json">
  {
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "name": "big-buck-bunny_trailer",
    "description": "",
    "thumbnailUrl": "https://example.com/preview.jpg",
    "uploadDate": "2025-07-31T09:24:41+01:00",
    "duration": "PT0M32S",
    "publisher": {
      "@type": "Organization",
      "name": "openmindculture"
    },
    "contentUrl": "https://example.com/video.webm"
  }
  </script>
</section>
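If the structured metadata comes from a build script or CMS export instead of being typed by hand, it could be generated along these lines. The Python sketch below is an assumption about how one might do that: the function names iso8601_duration and video_json_ld are invented for this example, and the field selection simply mirrors the markup above.

```python
import json


def iso8601_duration(total_seconds: int) -> str:
    """Format a length in seconds as an ISO 8601 duration such as "PT0M32S"."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"PT{hours}H{minutes}M{seconds}S"
    return f"PT{minutes}M{seconds}S"


def video_json_ld(name: str, description: str, content_url: str, thumbnail_url: str,
                  upload_date: str, duration_seconds: int, publisher: str) -> str:
    """Return a schema.org VideoObject as a JSON string for a ld+json script element."""
    data = {
        "@context": "http://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "duration": iso8601_duration(duration_seconds),
        "publisher": {"@type": "Organization", "name": publisher},
        "contentUrl": content_url,
    }
    # json.dumps takes care of escaping quotes and special characters.
    return json.dumps(data, indent=2, ensure_ascii=False)
```

Generating the JSON with a serializer avoids the syntax errors that easily creep into hand-edited metadata, such as a missing comma or an unescaped quote in the description.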
Styling
Check MDN and CanIUse for the latest CSS styling possibilities to avoid unnecessary vendor prefixes and !important declarations!
video {
  width: 62rem;
  max-width: calc(100% - 2.5rem);
  height: auto;
  margin: 0 auto;
  border-radius: 1rem;
}

/* Chrome */
video::cue {
  background-color: var(--color-inverted-background) !important;
  color: var(--color-inverted-foreground) !important;
  font-family: var(--font-family-default);
  padding-left: 1rem;
  padding-right: 1rem;
}

/* Safari */
video::-webkit-media-text-track-display-backdrop,
video::-webkit-media-text-track-background {
  background-color: var(--color-inverted-background) !important;
}

video::-webkit-media-text-track-display {
  color: var(--color-inverted-foreground) !important;
  font-family: var(--font-family-default);
}

video::-webkit-media-text-track-container {
  padding-left: 1rem;
  padding-right: 1rem;
}
Conclusions
Closed captions are stored in separate files, independent from video and audio tracks. Changing captions or adding a new language version as text doesn't require uploading a new video. We can use CSS to style web caption fonts, colors, and padding.
Automated tools can help you as a newbie/solopreneur with limited knowledge and budget, or for getting a preview of possibilities and challenges, but beware of flaws and errors! You can write caption files on your own, if you have more time than money and don't mind a tedious task that doesn't tolerate typos. But to ensure getting a correct text that displays in a readable layout at the right time, you should definitely hire a professional expert!
In this post, I have covered the basics of accessible HTML 5 video and VTT closed caption files, and tried different online services to compare their accuracy and usability for captioning a short image film.
Here are some sources and further reading recommendations.
Top comments (3)
Automated transcriptions don't seem to handle existing "open" captions (visible subtitles in a video). In my case, there is one single line of German text, that gets obscured by automatically transcribed captions. Adjusting manually, I'd probably try to have no closed captions in German at that moment, but add one for every translated version.
I can't believe nobody yet mentioned a best-practice open-source solution for captioning that I overlooked?
If you turn on the sound, it's just random phonk or weird candy music which adds no value to the existing facebook or instagram video.