---
title: "Captions: Transcribing, Parsing & Display"
type: concept
tags: [captions, subtitles, srt, whisper, tiktok-style]
created: 2026-06-11
updated: 2026-06-11
confidence: high
sources: ["raw/github_doc-packages-docs-docs-captions-api-mdx.md", "raw/github_doc-packages-docs-docs-captions-caption-mdx.md", "raw/github_doc-packages-docs-docs-captions-parse-srt-mdx.md", "raw/github_doc-packages-docs-docs-captions-serialize-srt-mdx.md", "raw/github_doc-packages-docs-docs-captions-create-tiktok-style-captions-mdx.md", "raw/github_doc-packages-docs-docs-captions-importing-mdx.md", "raw/github_doc-packages-docs-docs-captions-transcribing-mdx.md", "raw/github_doc-packages-docs-docs-captions-displaying-mdx.md", "raw/github_doc-packages-docs-docs-captions-ensure-max-characters-per-line-m.md", "raw/github_doc-packages-docs-docs-captions-index-mdx.md"]
---

# Captions: Transcribing, Parsing & Display

## Definition

`@remotion/captions` (available from v4.0.216, MIT-licensed) standardizes subtitles around one data structure, the **`Caption`** type, so that transcribing, parsing, formatting, and exporting interoperate. The typical pipeline: transcribe audio (or parse an `.srt` file) into `Caption[]`, group words into TikTok-style "pages" with `createTikTokStyleCaptions()`, then render each page in a `<Sequence>` with per-word highlighting.

## How It Works

### The `Caption` type

```tsx
import type {Caption} from '@remotion/captions';
```

Fields: `text` (the caption text), `startMs` / `endMs` (start and end in milliseconds), `timestampMs` (a single representative timestamp: the `t_dtw` value when using Whisper.cpp; with other sources it may be the average of start and end, or undefined), and `confidence` (transcription confidence from 0 to 1).

**Whitespace matters**: include a space in `text` before each word, because spaces are the word delimiters. When rendering, set `white-space: pre` on the caption container so those spaces are not collapsed.
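
As a minimal illustration of the whitespace convention, the sketch below declares a local `CaptionLike` type that mirrors the fields listed above. It is a stand-in for illustration, not the library's exported type, and the nullability of `timestampMs` and `confidence` is an assumption. The sample values are made up. Concatenating `text` with no separator reproduces the sentence only because each word after the first carries its own leading space.

```typescript
// Local stand-in mirroring the fields described above. In a real project,
// import the `Caption` type from '@remotion/captions' instead.
type CaptionLike = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null; // assumption: may be absent for some sources
  confidence: number | null; // assumption: may be absent for some sources
};

// Illustrative data only (not real transcription output).
const words: CaptionLike[] = [
  {text: 'Using', startMs: 40, endMs: 300, timestampMs: 200, confidence: 0.95},
  {text: " Remotion's", startMs: 300, endMs: 900, timestampMs: 440, confidence: 0.9},
  {text: ' captions', startMs: 900, endMs: 1400, timestampMs: 1100, confidence: 0.99},
];

// Joining with an empty separator relies on the leading spaces stored in `text`.
const joinCaptionText = (captions: CaptionLike[]): string =>
  captions.map((caption) => caption.text).join('');

const sentence = joinCaptionText(words);
```

If the leading spaces were stripped from `text`, the same join would yield one unbroken string, which is why the spaces have to be stored in the data rather than added at render time.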

### Getting captions: transcription options

| | `@remotion/install-whisper-cpp` | `@remotion/whisper-web` | `@remotion/openai-whisper` | `@remotion/elevenlabs` |
|---|---|---|---|---|
| Environment | Server (Node.js) | Browser (WASM) | Cloud API | Cloud API |
| Speed | Fast (hardware-dependent) | Slow (WASM overhead) | Fast | Fast |
| Cost | Free | Free | Paid | Paid |
| Offline | Yes | Yes | No | No |
| Convert with | `toCaptions()` | `toCaptions()` | `openaiWhisperApiToCaptions()` | `elevenLabsTranscriptToCaptions()` |

You can also use your own caption format — the `Caption` type is recommended, not required (it also matches the paid Remotion Editor Starter and Animated Captions products).

### Importing and exporting SRT

`parseSrt({input})` takes the contents of a SubRip file as a string and returns `{captions}`, with one `Caption` per cue, `confidence: 1`, and `timestampMs` set to the cue midpoint. To load a file from `public/`, call `fetch(staticFile('subtitles.srt'))` while the render is delayed via `useDelayRender()`: call `continueRender(handle)` on success and `cancelRender(e)` on failure.
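
To make the cue-to-`Caption` timing concrete, here is a rough sketch of how one SRT timing line maps to `startMs`, `endMs`, and a midpoint `timestampMs`. Both helpers are hypothetical and written for this note; they are not part of `@remotion/captions` and do not claim to match how `parseSrt()` is implemented.

```typescript
// Hypothetical helper: converts one SRT timestamp such as "00:01:02,500"
// into milliseconds.
const srtTimestampToMs = (timestamp: string): number => {
  const match = /^(\d{2,}):(\d{2}):(\d{2}),(\d{3})$/.exec(timestamp.trim());
  if (!match) {
    throw new Error(`Not an SRT timestamp: ${timestamp}`);
  }
  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const millis = Number(match[4]);
  return hours * 3600000 + minutes * 60000 + seconds * 1000 + millis;
};

// Hypothetical helper: parses a cue timing line ("start --> end") and derives
// the midpoint that the text above describes for `timestampMs`.
const parseCueTiming = (
  line: string,
): {startMs: number; endMs: number; timestampMs: number} => {
  const parts = line.split('-->');
  const start = parts[0];
  const end = parts[1];
  if (parts.length !== 2 || start === undefined || end === undefined) {
    throw new Error(`Not an SRT timing line: ${line}`);
  }
  const startMs = srtTimestampToMs(start);
  const endMs = srtTimestampToMs(end);
  return {startMs, endMs, timestampMs: (startMs + endMs) / 2};
};

const timing = parseCueTiming('00:00:01,500 --> 00:00:03,000');
```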

The inverse, `serializeSrt({lines})`, takes a **two-dimensional** array — each top-level item is one SRT line, second-level items are words concatenated without added spaces; the line's timestamps come from the first word's `startMs` and last word's `endMs`. Empty arrays are skipped. (`ensureMaxCharactersPerLine()` exists but is internal/undocumented for now.)
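
The serialization rules above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the library's code: `sketchSerializeSrt` and `msToSrtTimestamp` are hypothetical names, `Word` is a local stand-in for `Caption`, and numbering cues after dropping empty lines is a choice made for the sketch.

```typescript
// Local stand-in carrying only the fields this sketch needs.
type Word = {text: string; startMs: number; endMs: number};

// Formats milliseconds as an SRT timestamp: HH:MM:SS,mmm
const msToSrtTimestamp = (ms: number): string => {
  const total = Math.max(0, Math.round(ms));
  const pad = (value: number, width: number): string =>
    String(value).padStart(width, '0');
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`;
};

// Hypothetical sketch of the rules described above.
const sketchSerializeSrt = (lines: Word[][]): string =>
  lines
    .filter((words) => words.length > 0) // empty arrays are skipped
    .map((words, index) => {
      const first = words[0]!;
      const last = words[words.length - 1]!;
      return [
        String(index + 1), // cue number (assumption: counted after filtering)
        `${msToSrtTimestamp(first.startMs)} --> ${msToSrtTimestamp(last.endMs)}`,
        words.map((word) => word.text).join(''), // no spaces are added here
      ].join('\n');
    })
    .join('\n\n');

const srt = sketchSerializeSrt([
  [
    {text: 'Hello', startMs: 0, endMs: 500},
    {text: ' world', startMs: 500, endMs: 1200},
  ],
  [],
  [{text: 'Bye', startMs: 61000, endMs: 62500}],
]);
```

Because words are joined with an empty separator, the leading space in `' world'` is the only thing that keeps the two words apart in the output.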

### TikTok-style pages

```tsx
const {pages} = createTikTokStyleCaptions({
  captions,
  combineTokensWithinMilliseconds: 1200,
});
```

Words closer together than `combineTokensWithinMilliseconds` merge into one page: a high value fits many words on a page, while a low value approaches word-by-word display. Each `TikTokPage` has `text`, `startMs`, `durationMs` (v4.0.261+), and `tokens: [{text, fromMs, toMs}]` with **absolute** millisecond times. The function runs in the browser, Node.js, and Bun.

### Displaying captions

1. **Fetch** `captions.json` (or parse SRT) under `useDelayRender()`.
2. **Page** them with `createTikTokStyleCaptions()` inside `useMemo`.
3. **Sequence** each page: `startFrame = (page.startMs / 1000) * fps`; end at the next page's start or `startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps`, whichever is sooner; skip pages with non-positive duration.
4. **Highlight the active word** inside the page component: convert the sequence-relative frame back to absolute time (`absoluteTimeMs = page.startMs + (frame / fps) * 1000`) and mark tokens where `token.fromMs <= absoluteTimeMs && token.toMs > absoluteTimeMs` (the example uses highlight color `#39E508` on white text, `fontSize: 80`, `whiteSpace: 'pre'`).
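
The timing arithmetic in steps 3 and 4 can be sketched as two pure functions. `Page` and `Token` are local stand-ins for the `TikTokPage` shape, the helper names are hypothetical, and the `1200` assigned to `SWITCH_CAPTIONS_EVERY_MS` is an assumed value chosen for illustration.

```typescript
// Local stand-ins for the page and token shapes described earlier.
type Token = {text: string; fromMs: number; toMs: number};
type Page = {text: string; startMs: number; tokens: Token[]};

// Assumed value for illustration.
const SWITCH_CAPTIONS_EVERY_MS = 1200;

// Step 3: frame range for a page, or null when the duration is not positive.
const pageFrameRange = (
  page: Page,
  nextPage: Page | null,
  fps: number,
): {startFrame: number; durationInFrames: number} | null => {
  const startFrame = (page.startMs / 1000) * fps;
  const endFrame = Math.min(
    nextPage ? (nextPage.startMs / 1000) * fps : Infinity,
    startFrame + (SWITCH_CAPTIONS_EVERY_MS / 1000) * fps,
  );
  const durationInFrames = endFrame - startFrame;
  return durationInFrames > 0 ? {startFrame, durationInFrames} : null;
};

// Step 4: index of the token to highlight at a sequence-relative frame,
// or -1 when no token is active.
const activeTokenIndex = (page: Page, frame: number, fps: number): number => {
  // Token times are absolute, so shift the relative time by the page start.
  const absoluteTimeMs = page.startMs + (frame / fps) * 1000;
  return page.tokens.findIndex(
    (token) => token.fromMs <= absoluteTimeMs && token.toMs > absoluteTimeMs,
  );
};

const examplePage: Page = {
  text: 'Hi there',
  startMs: 1000,
  tokens: [
    {text: 'Hi', fromMs: 1000, toMs: 1400},
    {text: ' there', fromMs: 1400, toMs: 2000},
  ],
};
const followingPage: Page = {text: ' again', startMs: 2000, tokens: []};

const range = pageFrameRange(examplePage, followingPage, 30);
const highlighted = activeTokenIndex(examplePage, 15, 30);
```

In a component, `range` would feed the `from` and `durationInFrames` props of a `<Sequence>`, and `activeTokenIndex` would be called with the frame from `useCurrentFrame()` inside that sequence. Keeping both as pure functions makes the absolute-versus-relative conversion easy to unit test.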

Polish: `fitText()` from `@remotion/layout-utils` auto-scales text to the video width; add enter/exit animation via [[concepts/animation]]; improve legibility with `WebkitTextStroke: '4px black'` plus `paintOrder: 'stroke'`.

## Key Parameters

- `combineTokensWithinMilliseconds` — the single most important knob; controls page-switch cadence (example value `1200`).
- `Caption.text` whitespace — leading space per word; render with `white-space: pre`.
- `parseSrt({input})` takes an SRT string and returns `{captions}`; `serializeSrt({lines})` takes `lines` as `Caption[][]` and returns an SRT string.
- Token times (`fromMs`/`toMs`) are absolute; sequence frames are relative — convert deliberately.

## When To Use

- Social-format videos (Reels/TikTok/Shorts) with burned-in word-by-word captions → full pipeline above.
- Existing subtitle files → `parseSrt()` instead of transcribing.
- Generating `.srt` deliverables from edited captions → `serializeSrt()`.
- Choosing a transcription backend: local server and free → Whisper.cpp; zero infrastructure in-browser → whisper-web; fastest setup with budget → OpenAI or ElevenLabs APIs.

## Risks & Pitfalls

- **Omitting spaces before words** makes `createTikTokStyleCaptions()` merge everything into one line/page — the most common formatting bug.
- **Forgetting `white-space: pre`** collapses the spaces you carefully preserved.
- **Mixing absolute and relative time** when highlighting tokens — add `page.startMs` to the sequence-relative time.
- **Fetching captions without `delayRender()`** renders frames before data arrives; always pair with `continueRender`/`cancelRender`.
- **`serializeSrt()` adds no spaces between words** — your `Caption.text` whitespace must already be correct.

## Related Concepts

- [[concepts/compositions-fundamentals]] — `<Sequence>` timing that drives caption pages
- [[concepts/audio]] — the audio being transcribed
- [[concepts/animation]] — entrance/exit effects for caption pages
- [[entities/packages-catalog]] — where `@remotion/captions` and the Whisper packages sit in the ecosystem

## Sources

- raw/github_doc-packages-docs-docs-captions-api-mdx.md / -captions-index-mdx.md — package overview
- raw/github_doc-packages-docs-docs-captions-caption-mdx.md — `Caption` type
- raw/github_doc-packages-docs-docs-captions-parse-srt-mdx.md / -captions-serialize-srt-mdx.md — SRT round-trip
- raw/github_doc-packages-docs-docs-captions-importing-mdx.md — `.srt` import flow
- raw/github_doc-packages-docs-docs-captions-transcribing-mdx.md — transcription option comparison
- raw/github_doc-packages-docs-docs-captions-create-tiktok-style-captions-mdx.md — paging API
- raw/github_doc-packages-docs-docs-captions-displaying-mdx.md — display patterns and full example
- raw/github_doc-packages-docs-docs-captions-ensure-max-characters-per-line-m.md — internal API note