January 15, 2026 · 6 min read

Audio Accessibility (WCAG) — Complete Compliance Guide

Everything you need to know about WCAG audio accessibility requirements. Covers captions, transcripts, audio description, and practical implementation for 2026 compliance.

accessibility, WCAG, captions, audio

Adi Founder, Hearably

Audio accessibility is not optional. In the EU, the European Accessibility Act (EAA) has applied since June 28, 2025. Its harmonized standard, EN 301 549, incorporates WCAG 2.1 AA, which makes that level the practical legal baseline for most digital products and services the Act covers. In the US, many courts have ruled that the ADA applies to websites, and WCAG is the de facto standard used to judge compliance.

Beyond legal requirements, the numbers are straightforward: approximately 1.5 billion people worldwide experience some degree of hearing loss. Another 20% of internet users routinely watch video with captions turned on by choice — in noisy environments, while multitasking, or because they are non-native speakers.

This guide covers every WCAG success criterion related to audio, explains what each one requires in practical terms, and shows how to implement compliance.

WCAG Audio Requirements at a Glance

WCAG organizes requirements into three conformance levels: A (minimum), AA (standard), and AAA (enhanced). Most legal requirements mandate AA compliance.

Criterion	Level	Requirement
1.2.1	A	Transcript (text alternative) for prerecorded audio-only content
1.2.2	A	Captions for prerecorded video with audio
1.2.3	A	Audio description OR full text alternative for prerecorded video
1.2.4	AA	Captions for live audio in synchronized media
1.2.5	AA	Audio description for prerecorded video
1.2.6	AAA	Sign language interpretation for prerecorded audio
1.2.7	AAA	Extended audio description for prerecorded video
1.2.8	AAA	Full text alternative for prerecorded synchronized media
1.2.9	AAA	Transcript for live audio
1.4.2	A	Audio control (pause, stop, or volume control for auto-playing audio)
1.4.7	AAA	Low or no background audio in speech recordings

Let us break down each one.

Captions (1.2.2, 1.2.4)

What Counts as “Captions”

Captions are synchronized text alternatives for audio content. They must include:

All spoken dialogue — verbatim, not paraphrased
Speaker identification — when multiple speakers are present, label who is speaking
Relevant sound effects — [door slams], [phone rings], [audience laughing]
Music descriptions — [upbeat jazz music], [ominous orchestral score]
Timing — captions must appear and disappear in sync with the audio. WCAG sets no numeric tolerance, so treat roughly 100 ms as a working target rather than a requirement

Captions vs Subtitles

These terms are often used interchangeably, but they are technically different:

Captions include non-speech audio information (sound effects, music descriptions) and are designed for deaf and hard-of-hearing users.
Subtitles contain only dialogue and are designed for users who can hear the audio but need text translation or reinforcement.

WCAG requires captions, not subtitles. If your “captions” only contain dialogue, they do not meet the standard.

Caption File Formats

The two standard formats for web captions are:

WebVTT (.vtt) — the native format for the HTML5 <track> element. Supports styling via CSS.

WEBVTT

00:00:01.000 --> 00:00:04.500
[SARAH] Welcome to episode forty-three.

00:00:05.000 --> 00:00:08.200
[SARAH] Today we're talking about
audio accessibility requirements.

00:00:09.000 --> 00:00:10.500
[upbeat intro music]

SRT (.srt) — widely supported by video platforms (YouTube, Vimeo). Simpler format, no styling.

1
00:00:01,000 --> 00:00:04,500
[SARAH] Welcome to episode forty-three.

2
00:00:05,000 --> 00:00:08,200
[SARAH] Today we're talking about
audio accessibility requirements.
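
If a platform hands you SRT but your player expects WebVTT, the conversion is mostly mechanical: add the WEBVTT header, drop the numeric cue counters, and swap the comma before the milliseconds for a dot. Here is a minimal Python sketch of that idea. The function name `srt_to_vtt` is illustrative, and the sketch assumes plain SRT with full `HH:MM:SS,mmm` timestamps and no positioning or styling extensions.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
_SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT caption text to WebVTT (basic cues only)."""
    normalized = srt_text.replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # Drop the numeric counter that SRT puts above each timing line.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # WebVTT uses a dot, not a comma, before the milliseconds.
        lines[0] = _SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        cues.append("\n".join(lines))
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Run against the SRT sample above, this should produce the same cue timings and text as the WebVTT sample, minus the music cue that the SRT sample does not include.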

Implementing Captions in HTML5

<video controls>
  <source src="episode.mp4" type="video/mp4">
  <track kind="captions" src="captions-en.vtt" srclang="en" label="English" default>
  <track kind="captions" src="captions-de.vtt" srclang="de" label="Deutsch">
</video>

The kind="captions" attribute tells assistive technology that this track includes non-speech information. Using kind="subtitles" is technically incorrect for accessibility purposes, though browsers render them the same way.

Generating Captions

Manual captioning costs $1-3 per minute and takes time. AI transcription has become accurate enough for first-draft captions that need only light editing:

Whisper (OpenAI’s open-source model) — 5-8% word error rate for clear English speech. Available as a local tool or through browser-based implementations like Hearably’s auto caption generator.
YouTube’s auto-captions — free, decent quality for English, but must be reviewed and corrected.
Professional services (Rev, 3Play Media) — human-reviewed captions with guaranteed accuracy rates above 99%.

For WCAG compliance, auto-generated captions must be reviewed for accuracy. Uncorrected AI captions with significant errors do not satisfy the requirement — the standard specifies that captions must be “equivalent” to the audio content.
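
To put a number on "significant errors", you can compare an AI draft against a human-corrected reference using word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a simplified illustration, not the scoring used by any particular vendor. The function name is made up for this example, and it only lowercases the text, so punctuation differences count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c d")` is 0.25: one substituted word out of four. A low WER is a useful signal, but it does not replace a human read-through, because one wrong name or number can change the meaning of a sentence.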

Live Captions (1.2.4)

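WCAG AA requires captions for live audio content in synchronized media, meaning live audio that accompanies video or other time-based visuals. This applies to: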

Live webinars and web events
Live-streamed video
Real-time audio in web applications

Live captioning options include:

CART (Communication Access Realtime Translation) — human stenographers producing captions in real time with 98%+ accuracy. Costs $100-250/hour.
AI live captioning — services like Otter.ai, Google’s live caption, or browser-based solutions. Accuracy is lower (90-95%) but improving rapidly.
Browser extensions — tools like Hearably can generate live captions for any audio playing in the browser using on-device AI models, without sending audio to external servers.

For deaf and hard-of-hearing users, browser-level live captions are particularly valuable because they work on any website, not just platforms that provide their own captioning.

Transcripts (1.2.1, 1.2.8, 1.2.9)

When Transcripts Are Required

At Level A, prerecorded audio-only content (like podcasts) needs a transcript under 1.2.1. At Level AAA, you need full text alternatives for all prerecorded synchronized media (1.2.8) and a text alternative for live audio-only content (1.2.9).

What a Compliant Transcript Includes

A transcript is a complete text version of the audio content. It must include:

All spoken content, attributed to speakers
Descriptions of relevant non-speech sounds
Any visual information that is necessary for understanding (for video transcripts)

A transcript does not need timestamps, but including them makes it more useful.

Transcript Best Practices

Place the transcript on the same page as the audio/video, or link to it immediately below the player.
Use semantic HTML — <details><summary>Transcript</summary>...</details> is a common pattern that keeps the page clean while making the transcript accessible.
Make it searchable — HTML text, not an image or PDF of text.
Include headings for long transcripts to allow navigation.
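
If you already have a reviewed WebVTT file, a first-draft transcript in the `<details>` pattern can be generated from it. The following Python sketch shows one way to do that. The function name is hypothetical, it treats each cue as one paragraph, and it escapes cue text as-is, so WebVTT markup such as voice tags would appear literally and should be stripped first if your files use it.

```python
import html
import re

# A WebVTT timing line; the hours part is optional in the format.
_TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3} --> (?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)


def vtt_to_transcript_html(vtt_text: str) -> str:
    """Build a collapsible HTML transcript from WebVTT cue text."""
    paragraphs = []
    normalized = vtt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # Skip blocks without a timing line (the WEBVTT header, NOTE blocks).
        timing_index = next(
            (i for i, line in enumerate(lines) if _TIMING.match(line)), None
        )
        if timing_index is None:
            continue
        # Join wrapped caption lines into one paragraph of text.
        text = " ".join(line.strip() for line in lines[timing_index + 1:])
        if text:
            paragraphs.append(f"  <p>{html.escape(text, quote=False)}</p>")
    body = "\n".join(paragraphs)
    return f"<details>\n  <summary>Transcript</summary>\n{body}\n</details>"
```

The output is real HTML text, so it stays searchable. For long episodes you would still want to merge cues into fuller paragraphs and add headings by hand.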

Audio Description (1.2.3, 1.2.5, 1.2.7)

What Audio Description Is

Audio description is a narration track that describes visual information in video that is not conveyed through dialogue or existing audio. It fills in the gaps for blind and low-vision users.

Examples of what audio description covers:

On-screen text that is not read aloud
Visual actions, gestures, and expressions
Scene changes and settings
Graphics, charts, and presentations

Implementation Approaches

Separate audio track: Create an additional audio file with descriptions inserted during pauses in dialogue. Link it as an alternative audio track.

Extended audio description (1.2.7, AAA): When there are not enough natural pauses for descriptions, the video pauses automatically while the description plays, then resumes. This is technically complex but required for AAA compliance.

Integrated description: Write your primary script to be self-describing. “As you can see on this chart, sales increased 40%” becomes “Sales increased 40% between January and March, as shown on the bar chart now on screen.” This approach is the most inclusive and avoids the need for a separate track.
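
Whether a separate track is enough or you need extended description depends on how much silence the dialogue leaves. One rough way to find out is to scan the caption timings for pauses. The Python sketch below does that under stated assumptions: cues are given as (start, end) pairs in seconds, the function name is invented for this example, and the 2-second minimum is an arbitrary placeholder rather than a figure from WCAG.

```python
from typing import List, Tuple

Cue = Tuple[float, float]  # (start_seconds, end_seconds) of one dialogue cue


def find_description_gaps(
    cues: List[Cue], video_duration: float, min_gap: float = 2.0
) -> List[Cue]:
    """Return silent windows of at least min_gap seconds around dialogue cues."""
    gaps = []
    cursor = 0.0  # end time of the latest dialogue seen so far
    for start, end in sorted(cues):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        # max() keeps the cursor correct when cues overlap.
        cursor = max(cursor, end)
    if video_duration - cursor >= min_gap:
        gaps.append((cursor, video_duration))
    return gaps
```

With cues at 1.0 to 4.5, 5.0 to 8.2, and 12.0 to 15.0 seconds in a 20-second video, the function reports two usable windows: 8.2 to 12.0 and 15.0 to 20.0. If the windows it finds are too short for the descriptions you need, that is a sign to consider the extended or integrated approach.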

Audio Control (1.4.2)

Any audio that plays automatically for more than 3 seconds must have a mechanism to pause, stop, or control its volume independently of the system volume.

This is a Level A requirement — meaning it is the bare minimum.

Common violations:

Auto-playing background music on a homepage with no pause button
Video ads that play audio automatically with no mute control
Notification sounds that cannot be silenced

Implementation: The simplest compliant approach is to never auto-play audio. If you must auto-play, provide a visible pause/mute button within the first 3 seconds and ensure it is keyboard-accessible.
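
A quick way to catch the most obvious violations is to scan your page markup for media elements that autoplay with sound and have no native controls. The sketch below uses Python's standard `html.parser` for that. It is a heuristic with made-up names: it cannot see custom JavaScript players, scripted playback, or how long the audio actually runs, so treat its findings as items to review by hand.

```python
from html.parser import HTMLParser
from typing import List


class AutoplayAudit(HTMLParser):
    """Collect <audio>/<video> tags that autoplay unmuted without controls."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("audio", "video"):
            return
        names = {name for name, _ in attrs}
        # A muted autoplay element produces no sound, so it is not flagged.
        if "autoplay" in names and "muted" not in names and "controls" not in names:
            line, _ = self.getpos()
            self.findings.append(f"line {line}: <{tag}> autoplays without controls")


def audit_autoplay(page_html: str) -> List[str]:
    """Return one message per media element that needs a manual look."""
    parser = AutoplayAudit()
    parser.feed(page_html)
    parser.close()
    return parser.findings
```

An element flagged here may still be compliant if the page supplies its own keyboard-accessible pause button, which is why the output is a review list and not a verdict.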

Background Audio (1.4.7)

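At Level AAA, prerecorded speech audio must either contain no background sounds, let the listener turn them off, or keep them at least 20 dB lower than the foreground speech (occasional sounds lasting only a second or two are exempt). A 20 dB difference means the background has at most one-tenth the amplitude of the speech, which WCAG describes as roughly four times quieter to the ear.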

Practical guidance: For podcast intros and outros where someone speaks over music, the music bed should sit at -20 dB relative to the voice or lower. During the body of the episode, the simplest way to stay within AAA is to drop background music entirely; if you keep it, hold the same 20 dB margin.
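
The decibel arithmetic behind the 20 dB figure is short enough to check yourself. For amplitudes, a level difference in dB is 20 times the base-10 logarithm of the ratio, so 20 dB corresponds to a factor of 10. The Python sketch below works from RMS amplitude values, which you would measure in your audio editor or analysis tool; the function names are illustrative and how you obtain the RMS values is outside this example.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio for a level change in decibels (-20 dB gives 0.1)."""
    return 10 ** (db / 20)


def level_difference_db(speech_rms: float, background_rms: float) -> float:
    """How many dB the background sits below the speech, from RMS amplitudes."""
    if speech_rms <= 0 or background_rms <= 0:
        raise ValueError("RMS values must be positive")
    return 20 * math.log10(speech_rms / background_rms)


def meets_1_4_7(speech_rms: float, background_rms: float,
                margin_db: float = 20.0) -> bool:
    """True when the background is at least margin_db below the speech."""
    return level_difference_db(speech_rms, background_rms) >= margin_db
```

For instance, speech at an RMS of 0.5 with music at 0.04 is about 21.9 dB apart and passes, while music at 0.1 is only about 14 dB down and fails. Measurements that land right on 20 dB are worth re-checking with a little headroom, since rounding can tip the result either way.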

Testing Audio Accessibility

Automated Testing

axe DevTools — scans for missing <track> elements on <video> and <audio> tags.
WAVE — flags media elements without text alternatives.
Lighthouse — checks for auto-playing audio without controls.

Automated tools can detect missing captions and controls, but they cannot evaluate caption quality or accuracy. Manual review is required.
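
To make the "missing captions" check concrete, here is roughly what such a scan looks for, written as a small Python sketch with the standard `html.parser`. This is an illustration of the idea, not how axe, WAVE, or Lighthouse are implemented, and the class and function names are invented. It only inspects static markup, so tracks added by JavaScript will be missed, and a video with no audio will be flagged even though it needs no captions.

```python
from html.parser import HTMLParser
from typing import List, Optional


class CaptionTrackAudit(HTMLParser):
    """Record the line of every <video> that has no <track kind="captions">."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: List[int] = []
        self._video_line: Optional[int] = None  # line of the <video> being read
        self._has_captions = False

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self._video_line = self.getpos()[0]
            self._has_captions = False
        elif tag == "track" and self._video_line is not None:
            kind = dict(attrs).get("kind") or ""
            if kind.lower() == "captions":
                self._has_captions = True

    def handle_endtag(self, tag):
        if tag == "video" and self._video_line is not None:
            if not self._has_captions:
                self.missing.append(self._video_line)
            self._video_line = None


def videos_missing_captions(page_html: str) -> List[int]:
    """Return the line numbers of <video> elements lacking a captions track."""
    parser = CaptionTrackAudit()
    parser.feed(page_html)
    parser.close()
    return parser.missing
```

A video that only has `kind="subtitles"` tracks is reported as missing captions, which matches the distinction drawn earlier in this guide. Passing the scan says nothing about whether the caption file itself is accurate.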

Manual Testing Checklist

Play every video and audio element with sound muted. Can you understand the content from captions alone?
Read every transcript without playing the media. Is the content fully conveyed?
Navigate to all media players using only a keyboard. Can you play, pause, and adjust volume?
Test with a screen reader (VoiceOver, NVDA). Are media controls properly labeled?
Check caption timing: do captions stay in sync with the spoken audio? WCAG sets no numeric tolerance; roughly 100 ms is a reasonable working target.
Verify speaker identification in captions for multi-speaker content.
Confirm that auto-playing audio can be paused within 3 seconds.

Quick Compliance Checklist

Level A (Minimum):

Transcript for all prerecorded audio-only content
Captions for all prerecorded video with audio
Audio description or text alternative for prerecorded video
Pause/stop/volume control for any auto-playing audio

Level AA (Standard, the level most accessibility laws reference):

All Level A requirements
Captions for live audio in synchronized media
Audio description for all prerecorded video

Level AAA (Enhanced):

All Level AA requirements
Sign language for prerecorded audio
Extended audio description
Full text alternative for all synchronized media
Transcript for live audio
Background audio 20+ dB below speech

Audio accessibility is not just a checkbox for legal compliance. It is about making your content available to the widest possible audience — including the 20% of users who prefer captions even when they can hear perfectly fine.
