Exploring Next

Exploring Next — Ep 451 w/ Justy & Cody — SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Justy and Cody dig into SwanVoice, a zero-shot text-to-speech paper aimed at long monologues and multi-speaker dialogue. They focus on the bottleneck the paper targets: keeping a whole conversation acoustically and emotionally coherent instead of generating each turn separately and stitching the turns together. Cody breaks down the pipeline, covering data construction, VAE compression, the flow-matching DiT, speaker-turn conditioning, and the training curriculum. Justy keeps pulling the discussion back to production reality for podcasts, dramas, and multi-voice tools, and both flag the paper’s biggest caveat: content accuracy still looks like its weak spot.
