Subtitles Workgroup Guidelines: Creating Consistent, Readable Captions

Creating captions that are accurate, consistent, and easy to read requires a combination of clear policies, practical formatting rules, and regular quality checks. These guidelines are intended for a Subtitles Workgroup responsible for producing and reviewing subtitles across video content—educational materials, public-service announcements, entertainment, and corporate communications. They cover principles, style choices, technical specs, workflow, tools, and monitoring practices to ensure captions serve viewers who are deaf or hard of hearing, people watching without sound, and non-native speakers.
Purpose and audience
The primary purpose of captions is to provide full access to audio information. Captions should:
- Convey spoken dialogue verbatim where appropriate.
- Indicate speaker changes and essential non-speech audio (music, sound effects).
- Preserve meaning and tone while remaining readable at a natural reading speed.
Primary audiences include:
- Deaf and hard-of-hearing viewers who rely on captions for comprehension.
- Viewers in sound-off environments (public transit, workplaces).
- Non-native speakers and language learners.
- Search engines and automated indexing systems that use captions for metadata.
Core principles
- Accuracy: Match what is spoken, including key disfluencies only when they affect meaning.
- Clarity: Use plain language and standard spelling; avoid unnecessary punctuation.
- Readability: Keep line length and timing suited to typical reading speeds.
- Consistency: Follow one style for punctuation, speaker identification, and formatting across all content.
- Respectfulness: Avoid captions that stigmatize speech differences; label sounds neutrally.
Formatting and style
- Line length: Aim for 32–42 characters per line, with a hard maximum of 42 for easy reading across devices.
- Lines per caption: 1–2 lines only; avoid 3-line captions unless unavoidable for long on-screen text.
- Timing: Display captions for a minimum of 1 second and a maximum of 7 seconds, with typical exposure around 2–6 seconds depending on reading complexity.
- Characters per second (CPS): Keep CPS below 17; for complex or technical content target 12–14 CPS.
- Breaks: Break lines at natural linguistic points—phrases, clauses, and after commas—never mid-word.
- Hyphenation: Avoid hyphenating words at line breaks.
- Capitalization: Use sentence case for most captions. Use ALL CAPS only for on-screen text that is presented that way or to indicate strong emphasis when necessary.
- Punctuation: Use punctuation to aid comprehension. Omit quotation marks for spoken dialogue unless needed for clarity.
- Numbers: Spell out numbers one through nine; use numerals for 10 and above, except when style or context dictates otherwise.
- Speaker identification:
- Use a speaker label when there is no visual cue (e.g., [HOST]: or Host:).
- For clear on-screen speakers, rely on positioning or arrows rather than labels when possible.
- Use consistent labels (e.g., Host, Interviewee, Announcer).
- Sound descriptions:
- Use concise, neutral descriptions in brackets for non-speech audio: [applause], [thunder], [cell phone vibrates].
- For music, indicate mood and lyrics when relevant: [somber piano music] or [choir singing: “Amazing grace”].
- Overlaps and interruptions:
- For overlapping speech, stagger captions and use ellipses to indicate cut-off or interruptions.
- For short interruptions, a dash at the beginning of the line signals interruption: — I thought you said…
- Two-line credit: When presenting names, titles, or location info, keep each item concise and on its own line if needed.
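The exposure and CPS rules above are easy to enforce automatically. The sketch below is a minimal illustration in Python; the cue representation (start, end, text) and the exact thresholds are assumptions to adapt to your own pipeline:

```python
# Minimal exposure / CPS check for one caption cue.
# Thresholds mirror the guidelines above (1–7 s exposure, <=17 CPS);
# adjust them to your organization's policy.

MAX_CPS = 17.0
MIN_EXPOSURE = 1.0   # seconds
MAX_EXPOSURE = 7.0   # seconds

def check_cue(start, end, text):
    """Return a list of rule violations for one caption cue."""
    problems = []
    duration = end - start
    if duration < MIN_EXPOSURE:
        problems.append(f"too short: {duration:.2f}s")
    if duration > MAX_EXPOSURE:
        problems.append(f"too long: {duration:.2f}s")
    chars = len(text.replace("\n", ""))
    cps = chars / duration if duration > 0 else float("inf")
    if cps > MAX_CPS:
        problems.append(f"CPS too high: {cps:.1f}")
    return problems

# A 1.5 s cue carrying 40 characters runs at about 26.7 CPS, over the limit.
print(check_cue(10.0, 11.5, "This sentence is forty characters long.."))
```

Run during the sync pass, a check like this catches cues that need to be split, shortened, or held on screen longer before QA ever sees them.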
Transcription approach
- Verbatim vs. clean read:
- Use verbatim transcription for legal, technical, or documentary content where exact words matter.
- Use “clean read” (remove filler words, stutters, false starts) for entertainment or conversational content unless the speech characteristic is meaningful.
- Handling dialects and accents:
- Use standard spelling by default; reflect a non-standard pronunciation only when the standard spelling would lose meaning.
- Avoid phonetic spellings that mock accents; instead add a brief bracketed note if accent affects comprehension: [thick regional accent].
- Censoring profanity:
- Follow the publishing platform’s policy. Common options: full spelling, partial masking (f***), or substitution (beep). Document the chosen standard and apply it consistently.
- Foreign language and code-switching:
- If untranslated content is brief and important, provide the original with an English translation on the following caption line in italics or separated by brackets: [Spanish] “¿Dónde está?” → [English] “Where is it?”
- For substantial foreign-language segments, provide full translated captions and, where possible, retain a short label indicating the original language.
Technical specifications
- File formats: Support industry-standard formats—SRT for basic workflows, WebVTT for web playback with styling, and TTML/DFXP for broadcast and advanced styling needs.
- Timecodes: Use frame-accurate timecodes; start each caption at the earliest perceivable sound and remove it when its audio ends or the maximum exposure is reached.
- Encoding: Save caption files in UTF-8 to support international characters.
- Styling:
- Use WebVTT/TTML when speaker position, color, or styling is necessary to indicate character or context.
- Avoid excessive styling that may reduce legibility (neon colors, tiny font).
- Quality checks: Run automated checks for overlapping captions, excessive CPS, and orphaned captions (very short single captions surrounded by silence).
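Two of the automated checks above—timecode parsing and overlap detection—fit in a few lines. This is a minimal sketch, not a full SRT parser; the cue tuple format is an assumption:

```python
# Sketch: convert SRT timecodes ("HH:MM:SS,mmm") to seconds and flag
# adjacent cues whose display windows overlap.

def srt_time_to_seconds(ts):
    """Convert an SRT timecode like '00:01:10,000' to seconds."""
    hms, millis = ts.split(",")
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s + int(millis) / 1000.0

def find_overlaps(cues):
    """cues: list of (start, end) pairs in seconds, sorted by start time.
    Returns each index i where cue i is still on screen when cue i+1 begins."""
    return [i for i in range(len(cues) - 1) if cues[i][1] > cues[i + 1][0]]

cues = [
    (srt_time_to_seconds("00:00:00,000"), srt_time_to_seconds("00:00:02,000")),
    (srt_time_to_seconds("00:00:01,800"), srt_time_to_seconds("00:00:04,000")),
    (srt_time_to_seconds("00:00:04,500"), srt_time_to_seconds("00:00:06,000")),
]
print(find_overlaps(cues))  # the first pair overlaps
```

Note that an overlap flagged here may be intentional (staggered captions for overlapping speech), so review flagged cues rather than auto-correcting them.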
Workflow and roles
- Workgroup composition:
- Lead editor: oversees style and final approval.
- Transcribers/captioners: produce initial captions.
- QA reviewers: check accuracy, timing, and adherence to style.
- Accessibility consultant: advises on reader needs and legal compliance.
- Review process:
- Transcription: create a timestamped draft.
- First pass edit: correct obvious errors, apply style.
- Sync pass: adjust timing and line breaks to meet CPS and exposure rules.
- QA pass: focus on accessibility, speaker IDs, and sound descriptions.
- Final approval: lead editor signs off.
- Turnaround times:
- Same-day turnaround for short content (<10 min) when necessary.
- Standard SLA examples: 24–48 hours for 30–60 minute programs; adjust for complexity and language needs.
- Training:
- Provide onboarding materials: style guide, sample projects, and scoring rubric.
- Hold periodic calibration sessions where multiple captioners caption the same clip and compare results.
Tools and automation
- Use ASR (automatic speech recognition) to speed initial transcript creation, but always apply human editing for timing, speaker labeling, and nuance.
- Recommended features: timestamp accuracy, speaker diarization, easy editing of line breaks, integrated CPS and exposure warnings.
- Machine translation: use cautiously; pair with professional translators and human QA for non-English content.
- Integrations: connect captioning tools to the CMS and video players to streamline publishing and version control.
Accessibility-specific considerations
- Identify on-screen text: Captions should reflect important on-screen text (lower-thirds, titles) either as captions or as separate metadata tracks.
- Multiple-language tracks: Offer captions in the original language (subtitles for the deaf and hard-of-hearing — SDH) and translated subtitle tracks separately.
- Positioning: Avoid placing captions over critical on-screen visual information; use positioning cues in WebVTT/TTML when necessary.
- Reader customization: Ensure captions support user-controlled font size and background opacity when rendered by players.
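As one illustration of the positioning cues mentioned above, WebVTT supports per-cue settings such as `line`, `position`, and `align`. The timecodes and values below are illustrative only:

```
WEBVTT

00:00:05.000 --> 00:00:08.000 line:0 align:center
[upbeat music]

00:00:09.000 --> 00:00:12.000 line:-3 position:50%
Host: Welcome to the show.
```

Here `line:0` moves a cue to the top of the frame (for example, to clear a lower-third), while `line:-3` keeps a cue near the bottom in the default caption area.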
Quality metrics and monitoring
- Objective metrics:
- Word Error Rate (WER) target: <10% for final, human-reviewed captions in standard speech.
- CPS compliance: >98% of captions within the CPS threshold (≤17 CPS).
- Timing accuracy: >95% of captions start within 200 ms of speech onset in QA samples.
- Subjective metrics:
- Viewer satisfaction surveys focusing on readability and helpfulness.
- Accessibility audits with community members who are deaf or hard of hearing.
- Continuous improvement:
- Track recurring errors and update the style guide.
- Maintain a changelog of style decisions and exceptions.
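The WER target above can be spot-checked with a standard word-level edit distance: insertions, deletions, and substitutions divided by the number of reference words. A minimal dynamic-programming sketch:

```python
# Word Error Rate (WER): word-level edit distance between a reference
# transcript and a caption hypothesis, divided by the reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One inserted word against a three-word reference gives WER = 1/3.
print(wer("the cat sat", "the cat sat down"))
```

For production monitoring, sample a few minutes per program, compare final captions against a carefully verified reference transcript, and track the ratio over time rather than judging single clips.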
Examples and edge cases
- Overlapping speech:
- Use two staggered captions, each anchored to a speaker if possible:
  00:01:10,000 --> 00:01:11,500
  — I can’t believe you—
  00:01:11,400 --> 00:01:12,800
  — It’s fine, just listen.
- Long technical terms:
- Break into syllable-friendly places if necessary and keep CPS low; include a glossary in accompanying metadata.
- Live captioning:
- For live events, favor immediacy over perfection; use concise captions and include a note when accuracy might lag: [Live captions — may contain errors].
Governance and updates
- Maintain a living subtitle style guide stored in a versioned repository.
- Review and update the guide quarterly or after major platform/policy changes.
- Log exceptions and rationale for future reference.
These guidelines provide a framework to produce captions that are consistent, readable, and respectful of viewers’ needs. Use them as a baseline and adapt specifics (CPS threshold, profanity policy, WER targets) to your organization’s audience and legal requirements.