SwanVoice: Expressive Long Form Zero Shot Speech Synthesis for Both Monologue and Dialogue
Justy and Cody dig into SwanVoice, a zero-shot text-to-speech paper aimed at long monologues and multi-speaker dialogue. They focus on the real bottleneck the paper targets: keeping a whole conversation acoustically and emotionally coherent instead of generating each turn separately and stitching it together. Cody breaks down the pipeline, data construction, VAE compression, flow-matching DiT, speaker-turn conditioning, and the training curriculum. Justy keeps pulling it back to production reality for podcasts, dramas, and multi-voice tools, while both note the paper’s strongest caveat: content accuracy still looks like the main weak spot.
Script: GPT-5.4
Voice: Inworld TTS 2
Transcript
Justy The funny part is, the boring workaround still kind of runs the industry. Generate each voice turn by turn, glue it together, hope nobody notices.
Cody Yeah. And for long dialogue, people absolutely notice. The room changes, the pauses feel fake, somebody suddenly sounds like they moved three feet closer to a different mic.
Justy Which is why this paper got me. SwanVoice is basically saying, stop treating a conversation like a zip file full of separate clips. Make the whole scene together.
Cody Right.
Cody And that matters because the stuck part here is not single-speaker zero-shot anymore. That's improved a lot. The stubborn problem is expressive long-form dialogue where you need speaker switching, emotional continuity, and monologue quality all at once instead of trading one off for another.
Justy Also, before I forget, I am on terrible sleep. I got in late, my bag is still half-unpacked, and your coffee situation is way too effective.
Cody I warned you. DC coffee is just anxiety with better branding. I also spent this morning fighting a printer, which weirdly set me up for this paper because both experiences were about systems that almost work.
Justy That is such an episode four hundred fifty-one thing for us to say out loud. Anyway, this one solves a real product pain. If you're making podcasts, audio stories, character tools, even dubbing-ish workflows, stitching turns is expensive and it sounds assembled.
Cody Mm-hm.
Cody The paper is pretty smart about admitting the model alone won't save you. They build SwanData-Speech first, which is a pipeline for turning messy in-the-wild audio into monologue and dialogue training sets. Podcasts, dramas, film and TV style material, long recordings, variable speakers, variable acoustics.
Justy And they go big on data prep, right? Not just scrape audio and pray.
Cody Exactly.
Cody They do vocal separation, then diarization with 3D-Speaker, then merge same-speaker chunks using silence rules. Monologue segments are capped at sixty seconds, dialogue segments at one hundred twenty seconds, and a dialogue segment needs two to four speakers. The pause handling matters too. They built Swan Forced Aligner for word-level timestamps so the text includes pause-aware symbols instead of clean written punctuation that teaches the wrong rhythm.
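Editor's sketch: the segmentation rules Cody lists can be written as a small filter over diarization output. This is a minimal illustration, not the SwanData-Speech code. The `Chunk` type, the function names, and the 0.5-second `max_gap` silence threshold are assumptions made for the example. Only the sixty-second and one-hundred-twenty-second caps and the two-to-four speaker range come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_same_speaker(chunks, max_gap=0.5):
    """Merge consecutive chunks from one speaker when the silence between
    them is at most max_gap seconds (max_gap is an assumed value)."""
    merged = []
    for c in sorted(chunks, key=lambda c: c.start):
        last = merged[-1] if merged else None
        if last and last.speaker == c.speaker and c.start - last.end <= max_gap:
            last.end = max(last.end, c.end)
        else:
            # Copy so the caller's chunks are never mutated.
            merged.append(Chunk(c.speaker, c.start, c.end))
    return merged


def _duration(chunks):
    return max(c.end for c in chunks) - min(c.start for c in chunks)


def is_valid_monologue(chunks, max_seconds=60.0):
    """One speaker, at most sixty seconds."""
    speakers = {c.speaker for c in chunks}
    return bool(chunks) and len(speakers) == 1 and _duration(chunks) <= max_seconds


def is_valid_dialogue(chunks, max_seconds=120.0, min_speakers=2, max_speakers=4):
    """Two to four speakers, at most one hundred twenty seconds."""
    speakers = {c.speaker for c in chunks}
    return (
        bool(chunks)
        and min_speakers <= len(speakers) <= max_speakers
        and _duration(chunks) <= max_seconds
    )


chunks = [
    Chunk("A", 0.0, 2.0),
    Chunk("A", 2.3, 4.0),   # short gap: merged into the previous A chunk
    Chunk("B", 4.5, 7.0),
    Chunk("A", 8.5, 9.0),   # B spoke in between, so this stays separate
]
merged = merge_same_speaker(chunks)
```

With these inputs, `merged` holds three chunks and passes `is_valid_dialogue` but not `is_valid_monologue`.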
Justy That's one of those details product people underestimate until a demo sounds weird. Written punctuation is not spoken timing. A transcript that looks nice can still produce speech that feels socially off.
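Editor's sketch: one way to see the difference between written punctuation and spoken timing is to derive pause markers from word-level timestamps. The function below assumes aligner output in the form `(token, start, end)`. The `<pause>` and `<long_pause>` symbols and both thresholds are hypothetical, since the episode does not say which symbols or cutoffs Swan Forced Aligner and the paper use.

```python
def pause_aware_text(words, short_pause=0.3, long_pause=0.8):
    """Insert pause symbols wherever the silence between two aligned words
    crosses a threshold. Symbols and thresholds are illustrative only."""
    out = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long_pause:
                out.append("<long_pause>")
            elif gap >= short_pause:
                out.append("<pause>")
        out.append(token)
        prev_end = end
    return " ".join(out)


aligned = [
    ("well", 0.00, 0.40),
    ("I", 0.80, 0.90),      # 0.40 s of silence before this word
    ("guess", 0.95, 1.30),
    ("so", 2.50, 2.80),     # 1.20 s of silence before this word
]
text = pause_aware_text(aligned)
# text == "well <pause> I guess <long_pause> so"
```

A cleaned-up transcript of the same audio would likely read "Well, I guess so." and lose the long hesitation before the last word.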
Cody Yeah.
Cody And they keep raw text as the main conditioning signal, which preserves semantics, but they patch the ugly edge cases. For Chinese pronunciation and mixed-language weirdness, they add pinyin substitution variants and this synthetic set called RobustMegaTTS3 for hard pronunciations, polyphonic characters, code-switching, irregular spellings, that whole mess.
Justy I kind of love that they didn't pretend pronunciation corner cases are beneath the glamour of the architecture. Because in production, the glamorous model dies on the weird proper noun.
Cody Yes. The architecture is solid too, though. They use a twenty-five hertz VAE to compress speech so the sequence is shorter but still reconstructs well enough for long-form generation. Then a flow-matching DiT does the generation, and it's conditioned on speaker-turn IDs so it knows not just who is speaking, but where the conversational handoff is happening across the whole span.
Justy So in plain English, they shrink the audio representation into something manageable, then generate the whole conversation with awareness of turn structure instead of sampling one line at a time.
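Editor's sketch: the arithmetic behind "something manageable" is simple. At twenty-five latent frames per second, a full one-hundred-twenty-second dialogue is 3,000 frames. The second function shows one hypothetical way to attach speaker-turn IDs to text, as a label per token. The episode does not describe how the paper actually encodes turn IDs, so treat that layout and the whitespace tokenization as assumptions.

```python
import math

LATENT_RATE_HZ = 25  # frame rate of the VAE latents, as stated in the episode


def num_latent_frames(seconds, rate_hz=LATENT_RATE_HZ):
    """Number of latent frames needed to cover a clip of this length."""
    return math.ceil(seconds * rate_hz)


def turn_id_sequence(turns):
    """Flatten (speaker, text) turns into parallel token and speaker-ID lists.
    Speakers get integer IDs starting at 1, in order of first appearance."""
    speaker_ids = {}
    tokens, ids = [], []
    for speaker, text in turns:
        sid = speaker_ids.setdefault(speaker, len(speaker_ids) + 1)
        for token in text.split():
            tokens.append(token)
            ids.append(sid)
    return tokens, ids


monologue_frames = num_latent_frames(60)    # 1500
dialogue_frames = num_latent_frames(120)    # 3000

tokens, ids = turn_id_sequence([
    ("justy", "stop stitching clips"),
    ("cody", "right"),
    ("justy", "exactly"),
])
# tokens == ["stop", "stitching", "clips", "right", "exactly"]
# ids    == [1, 1, 1, 2, 1]
```

The points where `ids` changes value mark the conversational handoffs, so the generator sees the turn structure for the whole scene up front.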
Cody Right, right.
Cody And the training schedule matters. They start from monologue speech, then move to mixed and real dialogue data, then do post-training with DiffusionNFT. The rewards they mention are phone-level accuracy and speaker similarity. I think that's their attempt to avoid the classic failure where dialogue fine-tuning improves turn control but trashes monologue quality.
Justy This is where I stop rolling my eyes at research claims a little. It feels less like a lab-only trick and more like somebody actually thought about deployment. Not fully shippable from a paper, obviously, but shippable-shaped.
Cody I agree, mostly. My real caution is the one they state themselves. Content accuracy is still the main limitation. And in speech, if the words drift, skip, or repeat, all the expressive coherence in the world does NOT save you.
Justy Sure. Because the user hears the mistake once and the magic is gone. But if they really are beating open-source baselines on richness and hierarchy in both monologue and dialogue, that's a meaningful step. Especially for teams trying to get beyond flat narrator voice plus awkward character swaps.
Cody And I do like that they chose non-autoregressive generation here. For long dialogue, autoregressive setups invite latency and exposure-bias problems. This approach gets to condition on the whole text and turn sequence at once, which is just more aligned with the task.
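Editor's sketch: to make "non-autoregressive" concrete, here is a toy flow-matching sampler. It starts from noise and takes a few Euler steps along a velocity field, updating every position of the sequence in parallel at each step instead of producing one token after another. In a real system the velocity would come from the trained DiT, conditioned on the text and turn IDs. Here `toy_velocity` is a stand-in that knows the target, and the Euler solver and step count are assumptions, not details from the paper.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network: the exact velocity of a
    straight-line path that reaches `target` at t = 1."""
    return [(y - xi) / (1.0 - t) for xi, y in zip(x, target)]


def sample(target, steps=8, seed=0):
    """Integrate from noise (t = 0) to data (t = 1) with Euler steps.
    Every element of x is updated at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 2.0, 0.0]  # pretend these are four latent frames
result = sample(target)
```

Because the toy field is exact, `result` matches `target` up to floating-point error. The cost is a fixed number of full-sequence passes, independent of how long the dialogue is.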
Justy Okay, tiny detour. The phrase exposure bias still sounds like a fake condition invented by tired engineers at one in the morning.
Cody It does. It sounds like I need less screen time and more vegetables. But, yes, back to the paper, if I were pushing on this next, I'd want more evidence around content fidelity over really long spans and with very similar voices. That's where dialogue systems love to get smug in the demo and then wobble.
Justy No, that's fair. And for builders, I think the immediate read is: if you're doing audio scenes, synthetic hosts, character conversations, maybe multilingual-ish products, this paper looks closer to a systems recipe than a single model trick. Data curation, pause-aware transcripts, turn labels, then the generator.
Cody Oh interesting.
Cody Also, they do have audio demos at swanaigc dot github dot io slash swanvoice. I didn't see a linked code release in the paper excerpt we have, so I wouldn't call this Build Next material yet. More like listen-next.
Justy That's probably the right note. Go hear the demos, keep one eyebrow up for word accuracy, and maybe don't insult your printer before coffee tomorrow, Cody.