In this work, we focus on timbre conversion, a key type of voice conversion (VC). Current VC methods have two shortcomings: source speaker information leaks into the extracted content, and timbre features are captured inadequately. Both often lead to suboptimal speaker similarity in the converted speech. To address these issues, we propose TDE-VC, a zero-shot voice conversion framework. It incorporates a content extractor trained in phases, which combines the strengths of an adversarial speaker classifier and data perturbation to extract cleaner content. Critically, we introduce a timbre disentanglement and extraction strategy based on a multi-level consistency constraint, which effectively disentangles timbre from content and guides the timbre encoder to focus solely on timbre extraction. Additionally, we present an effective multi-scale timbre encoder. Experimental results demonstrate that TDE-VC significantly improves speaker similarity, especially for unseen target speakers, while maintaining competitive naturalness compared to existing methods.
The reconstructed speech after the SR operation with different segment length settings on the mel-spectrogram is shown below.
Raw waveform | 25 frames (adopted) | Other lengths | |
---|---|---|---|
raw wav | length: 25 | length: 5 length: 10 | length: 20 length: 30 |
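
The page does not spell out the SR operation itself, so the following is only a minimal sketch of the segmentation step it relies on: cutting a mel-spectrogram into consecutive fixed-length segments along the time axis. The function name `split_into_segments`, the 80 mel bins, and the handling of a shorter final segment are assumptions made for illustration, not details taken from TDE-VC.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one mel-spectrogram frame (one value per mel bin)


def split_into_segments(
    mel_frames: Sequence[Frame], segment_length: int = 25
) -> List[List[Frame]]:
    """Cut a time-ordered sequence of mel frames into consecutive segments.

    Hypothetical helper: the final segment is kept even if it is shorter
    than `segment_length`; the actual method may pad or drop it instead.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        list(mel_frames[start:start + segment_length])
        for start in range(0, len(mel_frames), segment_length)
    ]


# Toy input: 60 frames with 80 mel bins each (both numbers are assumptions).
mel = [[0.0] * 80 for _ in range(60)]

segments = split_into_segments(mel, segment_length=25)
segment_sizes = [len(segment) for segment in segments]
print(segment_sizes)  # [25, 25, 10]
```

Changing `segment_length` to 5, 10, 20, or 30 gives the other settings compared in the table above.
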
Further experiments demonstrated that using a segment length of 25 frames achieved the highest speaker similarity under this strategy. The corresponding experimental results are presented below:
Source | Target | Conversion | |
---|---|---|---|
p227_231 | p323_023 | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
p225_322 | p243_003 | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
Source | Target | Conversion | |
---|---|---|---|
p238_315 | Libri7729M | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
p230_248 | F_022 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
p362_003 | M_009 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
p227_231 | F_001 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
p228_132 | F_018 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
Source | Target | Conversion | |
---|---|---|---|
1995-1826-0019 | Libri5412W | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
1284-1181-0009 | Libri237F | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
6930-75918-0009 | M_011 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
1580-141083-0002 | F_022 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
1580-141083-0002 | Libri2830M | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
2961-960-0013 | M_009 (real-world) | YourTTS FreeVC | DDDMVC TDE-VC (ours) |
The proposed TDE-VC model adopts the VITS architecture [1] for its excellent reconstruction capability and is inspired by FreeVC [2] for its robust VC performance. It also incorporates the vocoder from MS-iSTFT-VITS [3] for faster training and inference. Our proposed timbre disentanglement and extraction strategy is inspired by Disentangling SV [4].
[1] https://github.com/jaywalnut310/vits
[2] https://github.com/OlaWod/FreeVC
[3] https://github.com/MasayaKawamura/MB-iSTFT-VITS
[4] https://proceedings.neurips.cc/paper_files/paper/2023/Paper-Conference.pdf