TDE-VC Demo Page

Abstract

In this work, we focus on timbre conversion, a key type of voice conversion (VC). Current VC methods face two challenges: source speaker information is retained in the extracted content, and timbre features are inadequately captured. Both often lead to suboptimal speaker similarity in the converted speech. To address these issues, we propose TDE-VC, a zero-shot voice conversion framework that incorporates a phased-trained content extractor, combining the strengths of an adversarial speaker classifier and data perturbation to extract cleaner content. Critically, we introduce a timbre disentanglement and extraction strategy, based on a multi-level consistency constraint, which effectively disentangles timbre from content and guides the timbre encoder to focus solely on timbre extraction. Additionally, we present an effective multi-scale timbre encoder. Experimental results demonstrate that TDE-VC significantly improves speaker similarity, especially for unseen target speakers, while maintaining competitive naturalness compared to existing methods.

Timbre disentanglement and extraction strategy

The reconstructed speech after the SR operation with different segment length settings on the mel-spectrogram is shown below.

Raw waveform	25 frames (adopted)	Other lengths
raw wav	length: 25	length: 5 length: 10	length: 20 length: 30
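
This page does not spell out what "SR" stands for. Purely as an illustration of what a segment length setting on a mel-spectrogram means, the sketch below assumes an operation that cuts the frame sequence into fixed-length segments and reorders them. That reading is an assumption, not the authors' confirmed method, and the function name `shuffle_segments` and the toy input are hypothetical.

```python
import random


def shuffle_segments(frames, segment_len=25, seed=None):
    """Split a frame sequence into fixed-length segments and permute them.

    frames: list of per-frame feature vectors (e.g. mel-spectrogram frames).
    segment_len: frames per segment; the last segment may be shorter.
    NOTE: an assumed reading of the "SR" operation, for illustration only.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Cut the sequence into consecutive segments of `segment_len` frames.
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    # Permute the segment order; frame order inside a segment is preserved.
    rng = random.Random(seed)
    rng.shuffle(segments)
    # Flatten back into a single frame sequence of the original length.
    return [frame for segment in segments for frame in segment]


# Toy input: 100 "frames", each a 1-dimensional vector holding its own index.
mel = [[t] for t in range(100)]
shuffled = shuffle_segments(mel, segment_len=25, seed=0)
```

Under this assumed reading, a shorter segment length breaks up more of the temporal structure, while a longer one leaves more of the original ordering intact within each segment.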

Further experiments demonstrated that using a segment length of 25 frames achieved the highest speaker similarity under this strategy. The corresponding experimental results are presented below:

Voice Conversion

Seen-to-Seen (The differences are minimal, so only two demos are provided.)

Source	Target	Conversion
p227_231	p323_023	YourTTS FreeVC	DDDMVC TDE-VC (ours)
p225_322	p243_003	YourTTS FreeVC	DDDMVC TDE-VC (ours)

Seen-to-Unseen (✨Zero-shot VC)

Source	Target	Conversion
p238_315	Libri7729M	YourTTS FreeVC	DDDMVC TDE-VC (ours)
p230_248	F_022 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)
p362_003	M_009 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)
p227_231	F_001 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)
p228_132	F_018 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)

Unseen-to-Unseen (✨Zero-shot VC)

Source	Target	Conversion
1995-1826-0019	Libri5412W	YourTTS FreeVC	DDDMVC TDE-VC (ours)
1284-1181-0009	Libri237F	YourTTS FreeVC	DDDMVC TDE-VC (ours)
6930-75918-0009	M_011 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)
1580-141083-0002	F_022 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)
1580-141083-0002	Libri2830M	YourTTS FreeVC	DDDMVC TDE-VC (ours)
2961-960-0013	M_009 (real-world)	YourTTS FreeVC	DDDMVC TDE-VC (ours)