Doublage

Works in any language. The screenshots here show French dubs, but nothing is French-specific: Doublage works with content in any language. Use the video's existing subtitles (any language), or auto-generate them with Whisper (multilingual, ~100 languages). Voice/background separation, speaker diarization, recording and assembly are all language-independent.
Dub your favorite scenes with friends, straight from the browser. You (the host) pick a video, prepare the scene and host the game. Your friends open a link in their browser (zero install) and dub a character. Everything is then assembled into a final video.
Voices go through an external voice channel (Discord or anything else): the game does not transmit audio, it only synchronizes the recording.
YOU (host)                RELAY (small server)            YOUR FRIENDS
preprocess + assembly <-> rooms + WebSocket + storage <-> browser (0 install)
Python + GPU + ffmpeg                                     (click the link)

The host app runs locally. The relay can run on any always-on server (a Raspberry Pi is plenty, but any machine works): see "Hosting the relay" below.
Two game modes
- Live Take: a single, synchronized take. The host starts the recording, everyone plays at the same time (3-2-1 countdown), the host can pause and resume. A "one shot" vibe.
- Pro Take: everyone at their own pace. Each player records their lines one by one (with a 3-2-1 countdown before each line), several takes possible (configurable limit, or unlimited), picks their best take and tunes the timing. A banner shows who is done; once everyone is finished, the host wraps up and moves on to assembly.
For your friends (nothing to install)
- Open the invite link (e.g. https://your-relay/?room=ABC12) or enter the code.
- Pick a nickname, then a character in the lobby (the host can also assign roles at random).
- Headphones recommended, otherwise the scene's sound leaks back into the mic. On Bluetooth, calibrate your mic delay (popup at the start, ~15 s) so your voice lands on beat.
- Play: the video rolls and a rhythmo band (dubbing-studio style) scrolls your lines word by word; your text is highlighted when it's your turn.
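
The word-by-word highlighting on the rhythmo band boils down to finding which word is active at the current playback time. The sketch below is only an illustration of that lookup: the `(start, end, word)` tuple layout, the sample timings and the function name are assumptions, not the project's actual data model.

```python
from bisect import bisect_right

# Hypothetical word timings for one line: (start_seconds, end_seconds, word).
WORDS = [
    (12.00, 12.35, "Hello"),
    (12.35, 12.80, "there"),
    (12.80, 13.40, "my"),
    (13.40, 14.10, "friend"),
]

def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None."""
    starts = [w[0] for w in words]
    i = bisect_right(starts, t) - 1   # last word that starts at or before t
    if i < 0:
        return None                   # t is before the first word
    _, end, _ = words[i]
    return i if t < end else None     # None in a gap or after the last word
```

A binary search keeps the lookup cheap even if it runs on every animation frame.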

For you (the host)
1. Install (once)
- ffmpeg on the PATH (an NVIDIA GPU is used automatically for encoding when h264_nvenc is available, otherwise it falls back to CPU libx264).
- Python 3.10+, then:
python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Linux/macOS
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124 # GPU recommended
pip install -r requirements.txt
- Copy .env.example to .env and fill in at least:
  - HUGGINGFACE_TOKEN for pyannote diarization (first accept the terms of the pyannote/speaker-diarization-3.1 model on huggingface.co).
  - RELAY_URL: the address of the relay the host targets (defaults to http://127.0.0.1:21826, i.e. a relay launched locally).
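
To make the two settings concrete, here is a minimal sketch of how they could be read with a fallback to the documented default relay address. How the host app actually loads its .env is not shown in this README, so `load_settings` and the returned dictionary keys are hypothetical.

```python
import os

# Default relay address as stated in this README.
DEFAULT_RELAY_URL = "http://127.0.0.1:21826"

def load_settings(env=None):
    """Read the two settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    relay_url = env.get("RELAY_URL", "").strip() or DEFAULT_RELAY_URL
    return {
        "huggingface_token": token or None,   # None means diarization cannot authenticate
        "relay_url": relay_url.rstrip("/"),   # normalise a trailing slash
    }
```

Passing a plain dictionary instead of the real environment makes the behaviour easy to check, for example `load_settings({"RELAY_URL": "https://your-relay/"})`.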
2. Launch the host app
python -m dubbing.host
# http://127.0.0.1:8765
3. Prepare the scene (everything in the app)
Enter your nickname, then "New scene":
- Source: browse to a video (mp4/mkv), set the range to keep (timeline, HH:MM:SS fields, "at cursor"), check the tracks (the VFF, i.e. the France French dub, is pre-selected, never the VFQ, i.e. the Quebec French dub, nor forced/SDH tracks). Subtitles come from the embedded track by default; if the video has no usable one, tick "Generate subtitles automatically" (local Whisper large-v3 transcription). Then run preprocessing.
- Preprocessing: voice/background separation (Demucs), diarization (pyannote) and word-level forced alignment (for the karaoke rhythmo band), assembled into a project.json.
- Editing: fix who speaks, the text, the timestamps (start/end per line), rename or merge the detected voices, split a line with two speakers, listen to excerpts. The word-level karaoke is re-aligned automatically when you edit a text.
- Game: choose the mode and the number of takes.
- Host: the scene is transcoded (H.264 720p, browser-playable) and sent to the relay. You get a code and a link to share.
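
As an illustration of the "split a line with two speakers" edit, the sketch below cuts one line into two at a chosen time. The `Line` fields and the `split_line` helper are hypothetical; the real project.json schema is not described in this README.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Line:
    # Hypothetical fields; the actual bundle format may differ.
    speaker: str
    text: str
    start: float  # seconds
    end: float    # seconds

def split_line(line, at, first_text, second_text, second_speaker):
    """Cut one line in two at time `at`, giving the second half to another speaker."""
    if not (line.start < at < line.end):
        raise ValueError("split point must fall strictly inside the line")
    first = replace(line, text=first_text, end=at)
    second = Line(second_speaker, second_text, at, line.end)
    return first, second
```

After such a split, each half would need its own word-level alignment, which matches the automatic re-alignment described in the editing step.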


Diarization tip: for 2-3 characters with similar voices, set "Num speakers" = the real count + 1, then merge the extra clusters in the editor. Setting this number enables embedding-based diarization (one embedding per line), far more robust than overlap for close voices.
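
Merging the extra cluster afterwards amounts to relabelling lines. A minimal sketch, assuming each line carries one speaker label; the labels and the `merge_speakers` helper are illustrative, not the editor's actual code:

```python
def merge_speakers(line_speakers, merges):
    """Relabel lines; `merges` maps an extra cluster label to the label to keep."""
    def resolve(label):
        seen = set()
        # Follow chains such as C -> B -> A, stopping if a cycle is detected.
        while label in merges and label not in seen:
            seen.add(label)
            label = merges[label]
        return label
    return [resolve(label) for label in line_speakers]

# Two real characters, diarized with "Num speakers" = 3, then the spare cluster is merged.
labels = ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02", "SPEAKER_01"]
merged = merge_speakers(labels, {"SPEAKER_02": "SPEAKER_00"})
```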
From the command line, without the UI:
python -m dubbing.pipeline --input film.mkv --output output_bundles/out_film --start 37:06 --end 39:30 --num-speakers 3
python -m dubbing.host --bundle output_bundles/out_film
Useful pipeline options: --list-tracks (list the tracks and exit), --auto-subs (Whisper subtitles instead of the embedded track), --resume (resume by skipping stages already produced), --skip-separation / --skip-diarization / --skip-alignment, --num-speakers / --min-speakers / --max-speakers. python -m dubbing.pipeline -h lists everything.
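
For reference, the range in the example above (37:06 to 39:30) is 144 seconds long. The sketch below shows one way to convert such timecodes to seconds; it is an assumption about the accepted formats (SS, MM:SS or HH:MM:SS), not the pipeline's actual parser.

```python
def parse_timecode(value):
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' to seconds (assumed formats)."""
    parts = value.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported timecode: {value!r}")
    seconds = 0.0
    for part in parts:
        seconds = seconds * 60 + float(part)  # each field is worth 60x the next one
    return seconds

clip_length = parse_timecode("39:30") - parse_timecode("37:06")
```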
4. Play then assemble
- In the lobby, click "Start the game". In Live you drive the recording; in Pro everyone records their lines. Each take is automatically compensated for its player's measured mic delay (calibration).
- "Finish" opens the assembly screen: each player's kept takes are pre-selected, you can re-choose, re-time, trim/split a take, set the background/voice levels and a per-character latency adjustment. Audio mix choice:
  - separated (default): the Demucs background (music/SFX without voices) everywhere.
  - OWS: the original audio everywhere, except during dubbed lines (where it switches to background + the take).
- Choose mp4/mkv (re-encode to H.264 ticked by default), then "Assemble" and download the final video.
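
The difference between the two audio mixes can be pictured as a timeline of segments. The following sketch builds such a timeline for the OWS mix from the dubbed line intervals; the function name, the segment labels and the tuple layout are assumptions made for illustration, and the real assembly code may work differently.

```python
def ows_plan(duration, dubbed):
    """Return [(start, end, source)] for the OWS mix.

    `dubbed` is a list of (start, end) intervals covered by kept takes.
    Outside them the original audio plays; inside, background + take.
    """
    plan = []
    cursor = 0.0
    for start, end in sorted(dubbed):
        # Clamp to the scene and skip anything already covered.
        start, end = max(start, cursor), min(end, duration)
        if end <= start:
            continue
        if start > cursor:
            plan.append((cursor, start, "original"))
        plan.append((start, end, "background+take"))
        cursor = end
    if cursor < duration:
        plan.append((cursor, duration, "original"))
    return plan

plan = ows_plan(10.0, [(2.0, 4.0), (6.0, 7.5)])
```

In the separated mix the same timeline would use the Demucs background for every segment, with takes layered on top during dubbed lines.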

Hosting the relay
The relay is standalone (relay/server.py, light dependencies, no ffmpeg and no ML). On the server:
pip install -r relay/requirements.txt
python relay/server.py
It is configured via the .env at the root (see .env.example): HOST, PORT, STORAGE_DIR, ROOM_TTL_MIN, EMPTY_ROOM_GRACE_MIN, MAX_ROOM_MB, ALLOWED_ORIGINS, and PUBLIC_URL (used for the Open Graph link preview). The defaults are fine (port 21826, ephemeral rooms purged after inactivity and shortly after everyone has left).
For a production setup: a reverse proxy (Caddy, nginx) for TLS, and a systemd service to restart the relay automatically (doublage.service + update.bat templates to adapt to your server).
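
The two purge rules (inactivity, and a short grace period once a room is empty) can be sketched as a single predicate. This is a reading of how ROOM_TTL_MIN and EMPTY_ROOM_GRACE_MIN might interact, not the relay's actual code; the function and parameter names are hypothetical, and the values in the example are arbitrary rather than the real defaults.

```python
def should_purge(now, last_activity, emptied_at, ttl_min, grace_min):
    """Decide whether a room can be deleted.

    now, last_activity, emptied_at: timestamps in seconds (emptied_at is None
    while at least one player is connected).
    ttl_min, grace_min: assumed meaning of ROOM_TTL_MIN / EMPTY_ROOM_GRACE_MIN.
    """
    if now - last_activity >= ttl_min * 60:
        return True   # nobody did anything for too long
    if emptied_at is not None and now - emptied_at >= grace_min * 60:
        return True   # everyone left and the grace period is over
    return False

# Arbitrary example values: 120 min TTL, 10 min grace, room emptied 15 min ago.
purge_now = should_purge(now=3600, last_activity=2400, emptied_at=2700,
                         ttl_min=120, grace_min=10)
```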
Structure
| Path | Role |
|---|---|
| | Preprocessing orchestration (CLI |
| | ffmpeg extraction: muted video, WAV audio, SRT subtitles |
| | ffprobe probing + track selection (prefers VFF, excludes VFQ/forced/SDH) |
| | Subtitle parsing + splitting of multi-speaker cues |
| | Auto subtitles (Whisper large-v3, local) for |
| | Voice/background separation (Demucs) |
| | pyannote diarization (overlap, or per-line embedding when |
| | Assigning lines to characters, merging lines |
| | Word-level forced alignment (torchaudio MMS_FA) for the rhythmo band |
| | Bundle dataclasses + reading/writing |
| | VRAM management + GPU-accelerated encoding (NVENC, CPU |
| | Local preprocessing UI (file pick, crop, preview) |
| | Local host app: editing, hosting, collection, final assembly |
| | Lightweight distribution bundle (H.264 720p video, AAC audio) |
| | Mix/mux helpers for the final export |
| | Multi-player assembly (mix + mux, |
| | Single web app (landing, lobby, game, rhythmo band, assembly) |
| | Relay: code-based rooms, WebSocket, storage of bundles and takes |
| | Dev tool: compare several diarization settings |
Notes
- Headphones on the players' side, to avoid the background audio leaking into the mic. On Bluetooth, calibrate the mic delay (wireless headsets add 150-300 ms).
- The distribution video is H.264 720p (the original HEVC does not play everywhere); the original files are used for the final assembly.
- ffmpeg encoding goes through the GPU (NVENC) when possible, with an automatic fallback to CPU (libx264).
- Language-agnostic: bring subtitles in any language, or auto-generate them with Whisper (multilingual). The screenshots show French scenes only as an example.
- "Between friends" security: a room = an invite code, no accounts. Whoever has the code gets in.
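
The NVENC-to-libx264 fallback mentioned in the notes can be approximated by checking the encoder list that ffmpeg prints. This is a sketch under assumptions, not the project's detection logic: the helper names are hypothetical, and an encoder being listed only means ffmpeg was built with it, not that a usable GPU is present at runtime.

```python
import subprocess

def pick_h264_encoder(encoders_listing):
    """Pick h264_nvenc if it appears in `ffmpeg -encoders` output, else libx264."""
    for line in encoders_listing.splitlines():
        fields = line.split()
        # Listing rows look like: " V....D h264_nvenc  NVIDIA NVENC H.264 encoder"
        if len(fields) >= 2 and fields[1] == "h264_nvenc":
            return "h264_nvenc"
    return "libx264"

def list_encoders():
    """Return the text printed by `ffmpeg -encoders` (requires ffmpeg on the PATH)."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Typical use would be `pick_h264_encoder(list_encoders())`; a more robust check would also attempt a very short test encode and fall back to libx264 if it fails.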