End-to-end One-shot Voice Conversion with Perturbation based Speech Disentanglement

Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su
Northwestern Polytechnical University, Xi'an, China; Tencent AI Lab, China

0. Contents

Abstract
Demos -- Voice Conversion
Demos -- Pitch Control

1. Abstract

The ideal goal of voice conversion is to make the source speaker's speech sound naturally like the target speaker while preserving the linguistic content and prosody of the source speech. However, current approaches fall short of comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content; the linguistic information is then modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is achieved through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of quality, naturalness, and speaker similarity.
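
As a rough illustration of the information-perturbation idea mentioned in the abstract, the sketch below randomly distorts speaker-dependent cues of a waveform while leaving the spoken content recognizable. The abstract does not say which perturbation functions are used, so random-rate resampling (which shifts pitch and formants together) is only an assumed stand-in, and the names `resample_linear`, `perturb`, and `max_semitones` are hypothetical.

```python
import math
import random


def resample_linear(samples, rate):
    """Read the signal at `rate` times the original speed using linear interpolation.

    Played back at the original sample rate, rate > 1 raises pitch and formants
    (and shortens the signal); rate < 1 lowers them (and lengthens it).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = min(i * rate, last)      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def perturb(samples, rng, max_semitones=3.0):
    """Apply one random perturbation (assumed stand-in, not the paper's functions)."""
    semitones = rng.uniform(-max_semitones, max_semitones)
    rate = 2.0 ** (semitones / 12.0)   # semitone shift expressed as a speed ratio
    return resample_linear(samples, rate), rate


if __name__ == "__main__":
    # 0.1 s of a 220 Hz tone at 16 kHz as a toy "utterance".
    signal = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
    perturbed, rate = perturb(signal, random.Random(0))
    print(f"rate={rate:.3f}, length {len(signal)} -> {len(perturbed)}")
```

In a training loop, a fresh random perturbation would be drawn for every utterance, so that a content encoder fed the perturbed audio cannot rely on stable speaker cues.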

2. Demos -- Voice Conversion

Scenario	Target Speaker	Source speech	AGAIN-VC	NVC-Net	Proposed
seen2seen	p314 (F)	p225 (F)
seen2seen	p334 (M)	p260 (M)
seen2seen	p334 (M)	p245 (F)
seen2seen	p314 (F)	p260 (M)
unseen2seen	p314 (F)	p351 (F)
unseen2seen	p334 (M)	p351 (F)
unseen2seen	p314 (F)	p363 (M)
unseen2seen	p334 (M)	p363 (M)
seen2unseen	p351 (F)	p239 (F)
seen2unseen	p347 (M)	p239 (F)
seen2unseen	p351 (F)	p260 (M)
seen2unseen	p347 (M)	p260 (M)
unseen2unseen	p351 (F)	p262 (F)
unseen2unseen	p347 (M)	p262 (F)
unseen2unseen	p351 (F)	p363 (M)
unseen2unseen	p347 (M)	p363 (M)

3. Demos -- Pitch Control

Target Speaker	Source speech	Pitch decreased	Pitch not adjusted	Pitch increased
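
One plausible reading of the three pitch settings compared in this demo (decreased, not adjusted, increased) is a scaling of how far the pitch contour moves around its average. The sketch below shows that idea on a frame-level F0 track. It is an assumption made for illustration, not the pitch encoder described in the abstract; the function name `scale_pitch_variation` and the `factor` parameter are hypothetical.

```python
import math


def scale_pitch_variation(f0_hz, factor):
    """Scale the log-F0 deviation from the voiced mean by `factor`.

    factor < 1 flattens the contour, factor == 1 leaves it unchanged (up to
    rounding), and factor > 1 exaggerates it. Frames with F0 <= 0 are treated
    as unvoiced and returned as 0.0.
    """
    voiced_logs = [math.log(f) for f in f0_hz if f > 0]
    if not voiced_logs:
        return [0.0 for _ in f0_hz]
    mean_log = sum(voiced_logs) / len(voiced_logs)
    scaled = []
    for f in f0_hz:
        if f > 0:
            scaled.append(math.exp(mean_log + factor * (math.log(f) - mean_log)))
        else:
            scaled.append(0.0)
    return scaled


if __name__ == "__main__":
    contour = [0.0, 100.0, 200.0, 0.0, 400.0]  # toy F0 track in Hz, 0 = unvoiced
    for name, factor in [("decrease", 0.5), ("not adjust", 1.0), ("increase", 2.0)]:
        print(name, [round(f, 1) for f in scale_pitch_variation(contour, factor)])
```

Working in the log domain keeps the scaling symmetric in musical terms: a frame one octave above the mean and a frame one octave below it are moved by the same number of semitones.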