
Captions: Transcribing, Parsing & Display

type: concept | confidence: high | updated: 2026-06-11 | sources: 10

Definition

@remotion/captions (available from v4.0.216, MIT-licensed, installed under that same package name) standardizes subtitles around one data structure, the Caption type, so that transcribing, parsing, formatting, and exporting interoperate. The typical pipeline: transcribe audio (or parse an .srt file) into Caption[], group words into TikTok-style "pages" with createTikTokStyleCaptions(), then render each page in a <Sequence> with per-word highlighting.

How It Works

The `Caption` type

import type {Caption} from '@remotion/captions';

Fields: text (the caption text), startMs / endMs (start and end in milliseconds), timestampMs (a singular timestamp — the t_dtw value when using Whisper.cpp, otherwise possibly the start/end average or undefined), and confidence (0–1 transcription confidence).

Whitespace matters: include a space in text before each word — spaces are the delimiters. When rendering, set white-space: pre on the caption container so they survive.
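
A minimal sketch of a Caption array that follows the whitespace convention above. The local Caption type mirrors the fields listed in this section and is declared inline only so the snippet stands alone; in a project you would import it from @remotion/captions. The nullability of timestampMs and confidence is an assumption, and the sample values are invented for illustration.

```typescript
// Local stand-in for the Caption type from @remotion/captions, declared
// here so the snippet runs on its own. Nullability is an assumption.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Invented sample data: each word carries its own leading space.
const captions: Caption[] = [
  {text: ' Hello', startMs: 0, endMs: 400, timestampMs: 200, confidence: 0.98},
  {text: ' world', startMs: 400, endMs: 900, timestampMs: 650, confidence: 0.95},
  {text: '!', startMs: 900, endMs: 1000, timestampMs: 950, confidence: 0.9},
];

// Joining the text fields without a separator reproduces the sentence,
// because the spaces already live inside each caption.
const sentence = captions.map((c) => c.text).join('');
```

Because the spaces are stored in the data rather than added at render time, the container needs white-space: pre to keep them visible.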

Getting captions: transcription options

| | `@remotion/install-whisper-cpp` | `@remotion/whisper-web` | `@remotion/openai-whisper` | `@remotion/elevenlabs` |
| --- | --- | --- | --- | --- |
| Environment | Server (Node.js) | Browser (WASM) | Cloud API | Cloud API |
| Speed | Fast (hardware-dependent) | Slow (WASM overhead) | Fast | Fast |
| Cost | Free | Free | Paid | Paid |
| Offline | Yes | Yes | No | No |
| Convert with | `toCaptions()` | `toCaptions()` | `openaiWhisperApiToCaptions()` | `elevenLabsTranscriptToCaptions()` |

You can also use your own caption format — the Caption type is recommended, not required (it also matches the paid Remotion Editor Starter and Animated Captions products).

Importing and exporting SRT

parseSrt({input}) takes the contents of a SubRip file as a string and returns {captions} — one Caption per cue with confidence: 1 and timestampMs as the cue midpoint. Load a file from public/ with fetch(staticFile('subtitles.srt')) inside a useDelayRender() flow: continueRender(handle) on success, cancelRender(e) on failure.
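
To make the shape of the result concrete, here is a hedged sketch that converts a single SRT cue into a Caption-like object with confidence: 1 and a midpoint timestampMs, as described above. srtTimeToMs and cueToCaption are hypothetical helpers written for this page, not part of @remotion/captions, and the sketch ignores multi-cue files; real code should call parseSrt() instead.

```typescript
// Local stand-in for the Caption type (assumption, see lead-in).
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical helper: convert "HH:MM:SS,mmm" to milliseconds.
const srtTimeToMs = (time: string): number => {
  const match = /^(\d{2,}):(\d{2}):(\d{2}),(\d{3})$/.exec(time.trim());
  if (!match) {
    throw new Error(`Invalid SRT timestamp: ${time}`);
  }
  const [, h, m, s, ms] = match;
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + Number(ms);
};

// Hypothetical helper: turn one cue (timing line plus text) into a Caption.
const cueToCaption = (timingLine: string, text: string): Caption => {
  // Missing parts default to '' so srtTimeToMs rejects malformed lines.
  const [start = '', end = ''] = timingLine.split('-->');
  const startMs = srtTimeToMs(start);
  const endMs = srtTimeToMs(end);
  return {
    text,
    startMs,
    endMs,
    timestampMs: (startMs + endMs) / 2, // cue midpoint
    confidence: 1, // subtitle files carry no transcription uncertainty
  };
};

const exampleCaption = cueToCaption('00:00:01,000 --> 00:00:02,500', 'Hello world');
```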

The inverse, serializeSrt({lines}), takes a two-dimensional array — each top-level item is one SRT line, second-level items are words concatenated without added spaces; the line's timestamps come from the first word's startMs and last word's endMs. Empty arrays are skipped. (ensureMaxCharactersPerLine() exists but is internal/undocumented for now.)
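
The serialization rules above can be sketched as a small standalone function. This is an illustrative re-implementation under stated assumptions, not the library's code: sketchSerializeSrt and formatSrtTimestamp are hypothetical names, and numbering only the emitted cues is an assumption about how skipped empty arrays affect the index. Use serializeSrt() in real projects.

```typescript
// Minimal Caption-like shape needed for serialization (assumption).
type Caption = {text: string; startMs: number; endMs: number};

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm.
const formatSrtTimestamp = (totalMs: number): string => {
  const ms = Math.round(totalMs);
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`;
};

// Build SRT text from lines of words, following the rules described above.
const sketchSerializeSrt = (lines: Caption[][]): string => {
  const cues: string[] = [];
  for (const words of lines) {
    const first = words[0];
    const last = words[words.length - 1];
    if (first === undefined || last === undefined) {
      continue; // empty lines are skipped
    }
    // Words are joined as-is: no spaces are added between them.
    const text = words.map((w) => w.text).join('');
    const timing = `${formatSrtTimestamp(first.startMs)} --> ${formatSrtTimestamp(last.endMs)}`;
    cues.push([String(cues.length + 1), timing, text].join('\n'));
  }
  return cues.join('\n\n');
};

const srt = sketchSerializeSrt([
  [
    {text: 'Hello', startMs: 0, endMs: 500},
    {text: ' world', startMs: 500, endMs: 1000},
  ],
  [],
  [{text: 'Bye', startMs: 61000, endMs: 62500}],
]);
```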

TikTok-style pages

const {pages} = createTikTokStyleCaptions({
  captions,
  combineTokensWithinMilliseconds: 1200,
});

Words closer together than combineTokensWithinMilliseconds merge into one page — a high value fits many words per page, a low value approaches word-by-word display. Each TikTokPage has text, startMs, durationMs (v4.0.261+), and tokens: [{text, fromMs, toMs}] with absolute millisecond times. Safe in browser, Node.js, and Bun.

Displaying captions

Fetch captions.json (or parse SRT) under useDelayRender().
Page them with createTikTokStyleCaptions() inside useMemo.
Sequence each page: startFrame = (page.startMs / 1000) * fps; end at the next page's start or startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps, whichever is sooner; skip pages with non-positive duration.
Highlight the active word inside the page component: convert the sequence-relative frame back to absolute time (absoluteTimeMs = page.startMs + (frame / fps) * 1000) and mark tokens where token.fromMs <= absoluteTimeMs && token.toMs > absoluteTimeMs (the example uses highlight color #39E508 on white text, fontSize: 80, whiteSpace: 'pre').
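
The timing arithmetic in the last two steps can be sketched as pure functions, separate from any React code. Token and Page are local types that mirror only the fields of TikTokPage used here, getPageFrames and getActiveTokenIndex are hypothetical helper names, and the value 1200 for SWITCH_CAPTIONS_EVERY_MS is assumed to match the paging example.

```typescript
// Local mirrors of the fields used from TikTokPage (assumption).
type Token = {text: string; fromMs: number; toMs: number};
type Page = {text: string; startMs: number; tokens: Token[]};

// Assumed cap on how long one page stays on screen.
const SWITCH_CAPTIONS_EVERY_MS = 1200;

// Frame range for a page: it ends at the next page's start or after the
// cap, whichever is sooner. Returns null for non-positive durations.
const getPageFrames = (page: Page, nextPage: Page | null, fps: number) => {
  const startFrame = (page.startMs / 1000) * fps;
  const cappedEnd = startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps;
  const nextStart = nextPage ? (nextPage.startMs / 1000) * fps : Infinity;
  const durationInFrames = Math.min(nextStart, cappedEnd) - startFrame;
  return durationInFrames > 0 ? {startFrame, durationInFrames} : null;
};

// Index of the token active at a sequence-relative frame, or -1 if none.
const getActiveTokenIndex = (page: Page, frame: number, fps: number): number => {
  // Token times are absolute, so shift the relative frame by page.startMs.
  const absoluteTimeMs = page.startMs + (frame / fps) * 1000;
  return page.tokens.findIndex(
    (t) => t.fromMs <= absoluteTimeMs && t.toMs > absoluteTimeMs,
  );
};

// Invented sample pages for illustration.
const pageA: Page = {
  text: ' Hello world',
  startMs: 1000,
  tokens: [
    {text: ' Hello', fromMs: 1000, toMs: 1400},
    {text: ' world', fromMs: 1400, toMs: 2000},
  ],
};
const pageB: Page = {
  text: ' Bye',
  startMs: 2000,
  tokens: [{text: ' Bye', fromMs: 2000, toMs: 2400}],
};

const framesA = getPageFrames(pageA, pageB, 30);
const activeAtStart = getActiveTokenIndex(pageA, 0, 30);
```

Keeping this logic outside the component makes the absolute-versus-relative conversion easy to unit test before wiring it into a <Sequence>.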

Polish: fitText() from @remotion/layout-utils computes a font size that fits the text within a given width (for example the video width); enter/exit effects can be added with the techniques on the animation page; WebkitTextStroke: '4px black' plus paintOrder: 'stroke' improves legibility.

Key Parameters

combineTokensWithinMilliseconds — the single most important knob; controls page-switch cadence (example value 1200).
Caption.text whitespace — leading space per word; render with white-space: pre.
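parseSrt({input}) / serializeSrt({lines}): parseSrt takes the SRT contents as a string and returns {captions}; serializeSrt takes lines as Caption[][] and returns the SRT contents as a string.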
Token times (fromMs/toMs) are absolute; sequence frames are relative — convert deliberately.

When To Use

Social-format videos (Reels/TikTok/Shorts) with burned-in word-by-word captions → full pipeline above.
Existing subtitle files → parseSrt() instead of transcribing.
Generating .srt deliverables from edited captions → serializeSrt().
Choosing a transcription backend: local server and free → Whisper.cpp; zero infrastructure in-browser → whisper-web; fastest setup with budget → OpenAI or ElevenLabs APIs.

Risks & Pitfalls

Omitting spaces before words makes createTikTokStyleCaptions() merge everything into one line/page — the most common formatting bug.
Forgetting white-space: pre collapses the spaces you carefully preserved.
Mixing absolute and relative time when highlighting tokens — add page.startMs to the sequence-relative time.
Fetching captions without delayRender() renders frames before data arrives; always pair with continueRender/cancelRender.
serializeSrt() adds no spaces between words — your Caption.text whitespace must already be correct.
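
As a guard against the first pitfall, a quick check can flag captions whose text lacks leading whitespace before they reach createTikTokStyleCaptions(). findMissingLeadingSpace is a hypothetical helper, not part of @remotion/captions, and it is only a heuristic: punctuation or word-fragment tokens may legitimately have no leading space, so treat the result as hints to review.

```typescript
// Minimal Caption-like shape needed for the check (assumption).
type Caption = {text: string; startMs: number; endMs: number};

// Return the indexes of captions whose text does not begin with whitespace.
const findMissingLeadingSpace = (captions: Caption[]): number[] =>
  captions
    .map((c, i) => (/^\s/.test(c.text) ? -1 : i))
    .filter((i) => i !== -1);

// Invented sample data: the second caption is missing its space.
const suspicious = findMissingLeadingSpace([
  {text: ' Hello', startMs: 0, endMs: 400},
  {text: 'world', startMs: 400, endMs: 900},
  {text: ' again', startMs: 900, endMs: 1300},
]);
```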

Related Concepts

compositions fundamentals — <Sequence> timing that drives caption pages
audio — the audio being transcribed
animation — entrance/exit effects for caption pages
packages catalog — where @remotion/captions and the Whisper packages sit in the ecosystem

Sources

raw/github_doc-packages-docs-docs-captions-api-mdx.md / -captions-index-mdx.md — package overview
raw/github_doc-packages-docs-docs-captions-caption-mdx.md — Caption type
raw/github_doc-packages-docs-docs-captions-parse-srt-mdx.md / -captions-serialize-srt-mdx.md — SRT round-trip
raw/github_doc-packages-docs-docs-captions-importing-mdx.md — .srt import flow
raw/github_doc-packages-docs-docs-captions-transcribing-mdx.md — transcription option comparison
raw/github_doc-packages-docs-docs-captions-create-tiktok-style-captions-mdx.md — paging API
raw/github_doc-packages-docs-docs-captions-displaying-mdx.md — display patterns and full example
raw/github_doc-packages-docs-docs-captions-ensure-max-characters-per-line-m.md — internal API note