Captions: Transcribing, Parsing & Display
Definition
@remotion/captions (v4.0.216+, MIT-licensed — install with the package name @remotion/captions) standardizes subtitles around one data structure, the Caption type, so that transcribing, parsing, formatting, and exporting interoperate. The typical pipeline: transcribe audio (or parse an .srt) into Caption[], group words into TikTok-style "pages" with createTikTokStyleCaptions(), then render each page in a <Sequence> with per-word highlighting.
How It Works
The Caption type
import type {Caption} from '@remotion/captions';
Fields: text (the caption text), startMs / endMs (start and end in milliseconds), timestampMs (a singular timestamp — the t_dtw value when using Whisper.cpp, otherwise possibly the start/end average or undefined), and confidence (0–1 transcription confidence).
Whitespace matters: include a space in text before each word — spaces are the delimiters. When rendering, set white-space: pre on the caption container so they survive.
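A minimal sketch of what a Caption[] with leading spaces looks like, and what is lost when the spaces are stripped. The type below is a local stand-in that mirrors the fields listed above; in a real project the type would be imported from @remotion/captions. The nullability of timestampMs and confidence, and all sample values, are assumptions made for illustration.

```typescript
// Local stand-in for the Caption type described above (assumption:
// timestampMs and confidence may be null when unavailable).
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hand-written sample data: each word carries its own leading space.
const captions: Caption[] = [
  {text: ' Hello', startMs: 0, endMs: 400, timestampMs: 200, confidence: 0.98},
  {text: ' world', startMs: 400, endMs: 900, timestampMs: 650, confidence: 0.95},
];

// Concatenating the text fields keeps the word boundaries intact.
const joined = captions.map((c) => c.text).join('');

// Stripping the spaces first loses the boundaries entirely.
const withoutSpaces = captions.map((c) => c.text.trim()).join('');

console.log(JSON.stringify(joined));
console.log(JSON.stringify(withoutSpaces));
```

Because the spaces live inside text, nothing downstream has to guess where one word ends and the next begins.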
Getting captions: transcription options
| | @remotion/install-whisper-cpp | @remotion/whisper-web | @remotion/openai-whisper | @remotion/elevenlabs |
|---|---|---|---|---|
| Environment | Server (Node.js) | Browser (WASM) | Cloud API | Cloud API |
| Speed | Fast (hardware-dependent) | Slow (WASM overhead) | Fast | Fast |
| Cost | Free | Free | Paid | Paid |
| Offline | Yes | Yes | No | No |
| Convert with | toCaptions() | toCaptions() | openaiWhisperApiToCaptions() | elevenLabsTranscriptToCaptions() |
You can also use your own caption format — the Caption type is recommended, not required (it also matches the paid Remotion Editor Starter and Animated Captions products).
Importing and exporting SRT
parseSrt({input}) takes the contents of a SubRip file as a string and returns {captions} — one Caption per cue with confidence: 1 and timestampMs as the cue midpoint. Load a file from public/ with fetch(staticFile('subtitles.srt')) inside a useDelayRender() flow: continueRender(handle) on success, cancelRender(e) on failure.
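To make the cue-to-Caption mapping concrete, here is a hand-rolled sketch of how one SRT cue could become a Caption with confidence: 1 and a midpoint timestampMs. This is not the library's parser; srtTimeToMs and cueToCaption are hypothetical helpers, and the Caption type is a local stand-in for the one exported by @remotion/captions.

```typescript
// Local stand-in for the Caption type (nullability is an assumption).
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical helper: "00:00:01,500" -> 1500
const srtTimeToMs = (time: string): number => {
  const match = /^(\d{2}):(\d{2}):(\d{2}),(\d{3})$/.exec(time.trim());
  if (!match) {
    throw new Error(`Invalid SRT timestamp: ${time}`);
  }
  const [, h, m, s, ms] = match;
  return Number(h) * 3600000 + Number(m) * 60000 + Number(s) * 1000 + Number(ms);
};

// Hypothetical helper: one cue (timing line plus text) -> one Caption.
const cueToCaption = (timing: string, text: string): Caption => {
  const parts = timing.split('-->');
  if (parts.length !== 2) {
    throw new Error(`Invalid SRT timing line: ${timing}`);
  }
  const startMs = srtTimeToMs(parts[0]);
  const endMs = srtTimeToMs(parts[1]);
  return {
    text,
    startMs,
    endMs,
    timestampMs: (startMs + endMs) / 2, // cue midpoint, as described above
    confidence: 1, // SRT cues carry no transcription uncertainty
  };
};

const caption = cueToCaption('00:00:01,500 --> 00:00:03,000', 'Hello world');
console.log(caption);
```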
The inverse, serializeSrt({lines}), takes a two-dimensional array — each top-level item is one SRT line, second-level items are words concatenated without added spaces; the line's timestamps come from the first word's startMs and last word's endMs. Empty arrays are skipped. (ensureMaxCharactersPerLine() exists but is internal/undocumented for now.)
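The serialization rules above (one cue per top-level array, words concatenated as-is, timestamps from the first and last word, empty arrays skipped) can be sketched as follows. This is an illustration of those rules, not the implementation of serializeSrt(); sketchSerializeSrt and msToSrtTime are hypothetical names, and trimming the outer whitespace of each line is a choice of this sketch rather than documented library behavior.

```typescript
// Reduced local stand-in: only the fields this sketch needs.
type Caption = {text: string; startMs: number; endMs: number};

const pad = (n: number, width: number): string => String(n).padStart(width, '0');

// Hypothetical helper: 1500 -> "00:00:01,500"
const msToSrtTime = (ms: number): string => {
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const milli = Math.floor(ms % 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(milli, 3)}`;
};

const sketchSerializeSrt = (lines: Caption[][]): string =>
  lines
    .filter((words) => words.length > 0) // empty arrays are skipped
    .map((words, index) => {
      const first = words[0];
      const last = words[words.length - 1];
      // Words are joined without inserting spaces; trimming the result is
      // this sketch's own choice.
      const text = words.map((w) => w.text).join('').trim();
      return `${index + 1}\n${msToSrtTime(first.startMs)} --> ${msToSrtTime(last.endMs)}\n${text}`;
    })
    .join('\n\n');

const srt = sketchSerializeSrt([
  [
    {text: ' Hello', startMs: 0, endMs: 500},
    {text: ' world', startMs: 500, endMs: 1000},
  ],
  [],
  [{text: ' Bye', startMs: 1500, endMs: 2000}],
]);
console.log(srt);
```

If the words had been stored without leading spaces, the first cue would read "Helloworld", which is why the whitespace convention has to be right before serializing.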
TikTok-style pages
const {pages} = createTikTokStyleCaptions({
  captions,
  combineTokensWithinMilliseconds: 1200,
});
Words closer together than combineTokensWithinMilliseconds merge into one page — a high value fits many words per page, a low value approaches word-by-word display. Each TikTokPage has text, startMs, durationMs (v4.0.261+), and tokens: [{text, fromMs, toMs}] with absolute millisecond times. Safe in browser, Node.js, and Bun.
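For orientation, a sketch of the page shape described above and two values commonly derived from it. The types are local stand-ins built from the field list in this section, and the sample page is hand-written, not actual output of createTikTokStyleCaptions().

```typescript
// Local stand-ins based on the fields listed above.
type TikTokToken = {text: string; fromMs: number; toMs: number};
type TikTokPage = {
  text: string;
  startMs: number;
  durationMs: number;
  tokens: TikTokToken[];
};

// Hand-written sample page (illustrative values only).
const page: TikTokPage = {
  text: 'Hello world',
  startMs: 1000,
  durationMs: 900,
  tokens: [
    {text: 'Hello', fromMs: 1000, toMs: 1400},
    {text: ' world', fromMs: 1400, toMs: 1900},
  ],
};

// The page's end time follows from startMs and durationMs.
const pageEndMs = page.startMs + page.durationMs;

// Token times are absolute, so a page-relative offset needs a subtraction.
const relativeTokens = page.tokens.map((token) => ({
  text: token.text,
  offsetMs: token.fromMs - page.startMs,
}));

console.log(pageEndMs, relativeTokens);
```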
Displaying captions
- Fetch captions.json (or parse SRT) under useDelayRender().
- Page them with createTikTokStyleCaptions() inside useMemo.
- Sequence each page: startFrame = (page.startMs / 1000) * fps; end at the next page's start or startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps, whichever is sooner; skip pages with non-positive duration.
- Highlight the active word inside the page component: convert the sequence-relative frame back to absolute time (absoluteTimeMs = page.startMs + (frame / fps) * 1000) and mark tokens where token.fromMs <= absoluteTimeMs && token.toMs > absoluteTimeMs (the example uses highlight color #39E508 on white text, fontSize: 80, whiteSpace: 'pre').
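The sequencing and highlighting arithmetic from the steps above can be sketched as two pure functions, kept separate from any React code so the math is easy to check. getPageTiming and getActiveTokens are hypothetical helper names, the types are local stand-ins, and setting SWITCH_CAPTIONS_EVERY_MS to 1200 (the same as the paging example) is an assumption.

```typescript
type TikTokToken = {text: string; fromMs: number; toMs: number};
type TikTokPage = {text: string; startMs: number; tokens: TikTokToken[]};

// Assumed value, matching the combineTokensWithinMilliseconds example.
const SWITCH_CAPTIONS_EVERY_MS = 1200;

// Where a page's <Sequence> would start and how long it would last,
// or null when the page has no positive duration and should be skipped.
const getPageTiming = (
  page: TikTokPage,
  nextPage: TikTokPage | null,
  fps: number,
): {startFrame: number; durationInFrames: number} | null => {
  const startFrame = (page.startMs / 1000) * fps;
  const maxEndFrame = startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps;
  const endFrame = nextPage
    ? Math.min((nextPage.startMs / 1000) * fps, maxEndFrame)
    : maxEndFrame;
  const durationInFrames = endFrame - startFrame;
  return durationInFrames > 0 ? {startFrame, durationInFrames} : null;
};

// `frame` is relative to the page's sequence; token times are absolute,
// so the frame is converted back to absolute milliseconds first.
const getActiveTokens = (page: TikTokPage, frame: number, fps: number): TikTokToken[] => {
  const absoluteTimeMs = page.startMs + (frame / fps) * 1000;
  return page.tokens.filter(
    (token) => token.fromMs <= absoluteTimeMs && token.toMs > absoluteTimeMs,
  );
};

// Hand-written sample pages.
const firstPage: TikTokPage = {
  text: 'Hello world',
  startMs: 1000,
  tokens: [
    {text: 'Hello', fromMs: 1000, toMs: 1400},
    {text: ' world', fromMs: 1400, toMs: 1900},
  ],
};
const secondPage: TikTokPage = {text: 'Bye', startMs: 2000, tokens: []};

const timing = getPageTiming(firstPage, secondPage, 30);
const active = getActiveTokens(firstPage, 15, 30);
console.log(timing, active);
```

In a component, startFrame and durationInFrames would feed the from and durationInFrames props of the page's <Sequence>, and the frame passed to getActiveTokens would come from useCurrentFrame() inside that sequence.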
Polish: fitText() from @remotion/layout-utils computes a font size that fits the text within a given width, such as the video width; add enter/exit effects using the techniques on the animation page; improve legibility with WebkitTextStroke: '4px black' plus paintOrder: 'stroke'.
Key Parameters
- combineTokensWithinMilliseconds: the single most important knob; controls page-switch cadence (example value 1200).
- Caption.text whitespace: leading space per word; render with white-space: pre.
- parseSrt({input}) / serializeSrt({lines}): parseSrt() takes the SRT contents as a string and serializeSrt() returns a string; lines is Caption[][].
- Token times (fromMs / toMs) are absolute; sequence frames are relative, so convert deliberately.
When To Use
- Social-format videos (Reels/TikTok/Shorts) with burned-in word-by-word captions → full pipeline above.
- Existing subtitle files → parseSrt() instead of transcribing.
- Generating .srt deliverables from edited captions → serializeSrt().
- Choosing a transcription backend: local server and free → Whisper.cpp; zero infrastructure in-browser → whisper-web; fastest setup with budget → OpenAI or ElevenLabs APIs.
Risks & Pitfalls
- Omitting spaces before words makes createTikTokStyleCaptions() merge everything into one line/page; this is the most common formatting bug.
- Forgetting white-space: pre collapses the spaces you carefully preserved.
- Mixing absolute and relative time when highlighting tokens: add page.startMs to the sequence-relative time.
- Fetching captions without delayRender() renders frames before data arrives; always pair with continueRender/cancelRender.
- serializeSrt() adds no spaces between words, so your Caption.text whitespace must already be correct.
Related Concepts
- compositions fundamentals: <Sequence> timing that drives caption pages
- audio: the audio being transcribed
- animation: entrance/exit effects for caption pages
- packages catalog: where @remotion/captions and the Whisper packages sit in the ecosystem
Sources
- raw/github_doc-packages-docs-docs-captions-api-mdx.md / -captions-index-mdx.md: package overview
- raw/github_doc-packages-docs-docs-captions-caption-mdx.md: Caption type
- raw/github_doc-packages-docs-docs-captions-parse-srt-mdx.md / -captions-serialize-srt-mdx.md: SRT round-trip
- raw/github_doc-packages-docs-docs-captions-importing-mdx.md: .srt import flow
- raw/github_doc-packages-docs-docs-captions-transcribing-mdx.md: transcription option comparison
- raw/github_doc-packages-docs-docs-captions-create-tiktok-style-captions-mdx.md: paging API
- raw/github_doc-packages-docs-docs-captions-displaying-mdx.md: display patterns and full example
- raw/github_doc-packages-docs-docs-captions-ensure-max-characters-per-line-m.md: internal API note
