TDE-VC Demo

TDE-VC: Timbre Disentanglement and Extraction Via Consistency for Zero-Shot Voice Conversion

Abstract

In this work, we focus on timbre conversion, a key type of voice conversion (VC). Current VC methods face two challenges: source speaker information is retained in the extracted content, and timbre features are inadequately captured. Both often lead to suboptimal speaker similarity in the converted speech. To address these issues, we propose TDE-VC, a zero-shot voice conversion framework that incorporates a content extractor trained in phases, combining the strengths of an adversarial speaker classifier and data perturbation to extract cleaner content. Critically, we introduce a timbre disentanglement and extraction strategy, based on a multi-level consistency constraint, which effectively disentangles timbre from content and guides the timbre encoder to focus solely on timbre extraction. Additionally, we present an effective multi-scale timbre encoder. Experimental results demonstrate that TDE-VC significantly improves speaker similarity, especially for unseen target speakers, while maintaining competitive naturalness compared to existing methods.

Timbre disentanglement and extraction strategy

SR Operation

The reconstructed speech after the SR operation with different segment length settings on the mel-spectrogram is shown below.

Raw waveform 25 frames (adopted) Other lengths
raw wav length: 25 length: 5 length: 10 length: 20 length: 30
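
To make the role of the segment length concrete, here is a minimal sketch of a segment-level operation on a mel-spectrogram. This page does not expand "SR", so the sketch assumes it means cutting the mel-spectrogram into fixed-length segments along the time axis and shuffling the segment order. The function name `segment_and_shuffle`, its parameters, and the shuffling step are illustrative assumptions, not the authors' implementation.

```python
import random


def segment_and_shuffle(mel, segment_len=25, seed=None):
    """Cut a time-major mel-spectrogram into segments and shuffle their order.

    ASSUMPTION: this is only a guess at what a segment-level "SR" operation
    could look like; the actual TDE-VC procedure may differ.

    mel         -- list of frames, each frame a list of mel-bin values
    segment_len -- number of frames per segment (the page reports using 25)
    seed        -- optional seed for a reproducible shuffle
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Consecutive, non-overlapping segments; the last one may be shorter.
    segments = [mel[i:i + segment_len] for i in range(0, len(mel), segment_len)]
    rng = random.Random(seed)
    rng.shuffle(segments)
    # Concatenate the shuffled segments back into one frame sequence.
    return [frame for segment in segments for frame in segment]


if __name__ == "__main__":
    # Toy input: 103 frames with 4 mel bins, each frame tagged by its index.
    toy_mel = [[float(t)] * 4 for t in range(103)]
    shuffled = segment_and_shuffle(toy_mel, segment_len=25, seed=0)
    print(len(shuffled), "frames after shuffling")
```

Under this assumption, frame order inside each segment is preserved while the order across segments is scrambled, so shorter segments break up more of the temporal structure and longer segments keep more of it.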

Further experiments demonstrated that using a segment length of 25 frames achieved the highest speaker similarity under this strategy. The corresponding experimental results are presented below:

Voice Conversion

Source Target Conversion
p227_231 p323_023 YourTTS FreeVC DDDMVC TDE-VC (ours)
p225_322 p243_003 YourTTS FreeVC DDDMVC TDE-VC (ours)

Source Target Conversion
p238_315 Libri7729M YourTTS FreeVC DDDMVC TDE-VC (ours)
p230_248 F_022 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)
p362_003 M_009 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)
p227_231 F_001 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)
p228_132 F_018 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)

Source Target Conversion
1995-1826-0019 Libri5412W YourTTS FreeVC DDDMVC TDE-VC (ours)
1284-1181-0009 Libri237F YourTTS FreeVC DDDMVC TDE-VC (ours)
6930-75918-0009 M_011 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)
1580-141083-0002 F_022 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)
1580-141083-0002 Libri2830M YourTTS FreeVC DDDMVC TDE-VC (ours)
2961-960-0013 M_009 (real-world) YourTTS FreeVC DDDMVC TDE-VC (ours)

Overview of the TDE-VC model

Acknowledgements

The proposed TDE-VC model adopts the VITS architecture [1] for its excellent reconstruction capability and is inspired by FreeVC [2] for its robust VC performance. It also incorporates the vocoder from MS-iSTFT-VITS [3] for faster training and inference. Our proposed timbre disentanglement and extraction strategy is inspired by Disentangling SV [4].
[1] https://github.com/jaywalnut310/vits
[2] https://github.com/OlaWod/FreeVC
[3] https://github.com/MasayaKawamura/MB-iSTFT-VITS
[4] https://proceedings.neurips.cc/paper_files/paper/2023/Paper-Conference.pdf