End-to-end One-shot Voice Conversion with Perturbation based Speech Disentanglement

Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su
Northwestern Polytechnical University, Xi'an, China
Tencent AI Lab, China

0. Contents

  1. Abstract
  2. Demos -- Voice Conversion
  3. Demos -- Pitch Control


1. Abstract

The ideal goal of voice conversion is to make the source speaker's speech sound naturally like the target speaker while preserving the linguistic content and prosody of the source speech. However, current approaches fall short of comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content; the linguistic information is then modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while allowing the pitch intensity of the generated speech to be modified flexibly. Finally, one-shot voice conversion is enabled through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of quality, naturalness, and speaker similarity.



2. Demos -- Voice Conversion

Methods compared: AGAIN-VC, NVC-Net, Proposed

Scenario         Target speaker   Source speech
seen2seen        p314 (F)         p225 (F)
                 p334 (M)         p260 (M)
                 p334 (M)         p245 (F)
                 p314 (F)         p260 (M)
unseen2seen      p314 (F)         p351 (F)
                 p334 (M)         p351 (F)
                 p314 (F)         p363 (M)
                 p334 (M)         p363 (M)
seen2unseen      p351 (F)         p239 (F)
                 p347 (M)         p239 (F)
                 p351 (F)         p260 (M)
                 p347 (M)         p260 (M)
unseen2unseen    p351 (F)         p262 (F)
                 p347 (M)         p262 (F)
                 p351 (F)         p363 (M)
                 p347 (M)         p363 (M)

3. Demos -- Pitch Control

Target speaker   Source speech   Method
                                 decrease   not adjust   increase
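
To give a rough intuition for the decrease / not adjust / increase settings, the sketch below scales how far an F0 contour deviates from its mean in the log domain. It is only a signal-level illustration under stated assumptions: the proposed model controls pitch through a learned pitch encoder, not through this function, and reading "pitch intensity" as the spread of the contour around its mean is our assumption. The function name `scale_pitch_variation` and the parameter `factor` are hypothetical.

```python
import math


def scale_pitch_variation(f0_hz, factor):
    """Scale the deviation of voiced F0 values from their mean log-F0.

    f0_hz:  per-frame F0 values in Hz; 0 marks an unvoiced frame.
    factor: 1.0 leaves the contour unchanged, values below 1.0 flatten it,
            values above 1.0 exaggerate it (illustrative assumption only).
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        # Nothing to scale when every frame is unvoiced.
        return list(f0_hz)

    # Mean of log-F0 over voiced frames (i.e. the log of the geometric mean).
    mean_log = sum(math.log(f) for f in voiced) / len(voiced)

    scaled = []
    for f in f0_hz:
        if f > 0:
            # Keep the overall pitch level, rescale only the deviation from it.
            scaled.append(math.exp(mean_log + factor * (math.log(f) - mean_log)))
        else:
            # Unvoiced frames stay unvoiced.
            scaled.append(0.0)
    return scaled


contour = [0.0, 100.0, 200.0, 0.0]
decreased = scale_pitch_variation(contour, 0.5)
unchanged = scale_pitch_variation(contour, 1.0)
increased = scale_pitch_variation(contour, 2.0)
```

Working in the log domain keeps the scaling symmetric in musical terms (equal ratios rather than equal Hz offsets), and the general shape of the contour is preserved for any positive `factor`.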