
Abstract


Recently, there has been an increase in research on zero-shot voice conversion. Many conventional studies use dynamic layers to perform conversion for unseen speakers; our aim is to extend such dynamic methods to content information as well. To this end, we propose AGRN-VC, which uses ConvNeXt v2 modules with adaptive global response normalization (AGRN) layers to convey content information. In addition, we adopt auxiliary learning with cluster-based pseudo labels to prevent source speaker information from being transmitted along with the content information: a pseudo-label classification task applied to the content encoder output encourages the encoder to focus on content while excluding speaker information. We conduct comparative experiments between several baseline models and the proposed model using subjective and objective metrics, and the proposed approach produces converted speech of higher quality in terms of speaker similarity and naturalness.
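
To make the adaptive normalization concrete, below is a minimal PyTorch sketch of a global response normalization layer whose per-channel affine parameters are predicted from a conditioning embedding instead of being learned as fixed weights (as they are in ConvNeXt v2). The class name, tensor shapes, and the single linear projection for the conditioning are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an adaptive GRN layer (not the AGRN-VC code).
import torch
import torch.nn as nn


class AdaptiveGRN(nn.Module):
    """GRN whose gamma/beta are predicted from a conditioning vector."""

    def __init__(self, channels: int, cond_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Predict per-channel gamma and beta from the conditioning embedding.
        self.to_affine = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels), cond: (batch, cond_dim)
        gamma, beta = self.to_affine(cond).unsqueeze(1).chunk(2, dim=-1)
        # Global feature aggregation: L2 norm over the time axis.
        gx = torch.norm(x, p=2, dim=1, keepdim=True)           # (B, 1, C)
        # Divisive normalization across channels.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)   # (B, 1, C)
        # Conditionally scaled response calibration plus residual path.
        return gamma * (x * nx) + beta + x
```

The auxiliary objective can likewise be pictured as a per-frame classification of cluster-based pseudo labels from the content encoder output. The sketch below assumes the pseudo labels are discrete cluster indices obtained offline (for example by k-means over speech features) and that a simple linear classifier head is used; the actual labeling procedure and classifier in AGRN-VC may differ.

```python
# Hypothetical sketch of the cluster-based pseudo-label auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def auxiliary_pseudo_label_loss(content_feats: torch.Tensor,
                                pseudo_labels: torch.Tensor,
                                classifier: nn.Linear) -> torch.Tensor:
    """content_feats: (batch, time, dim) content-encoder output.
    pseudo_labels: (batch, time) long tensor of cluster indices,
    obtained offline, e.g. by k-means over speech features."""
    logits = classifier(content_feats)              # (B, T, n_clusters)
    return F.cross_entropy(logits.transpose(1, 2),  # (B, n_clusters, T)
                           pseudo_labels)
```

Because this loss is computed only from the content encoder output, its gradient pushes the encoder toward frame-level content categories rather than speaker identity, which is the role the abstract describes for the auxiliary task.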

Unseen Scenario

F2F (female to female)


Audio clips per sample: Source, Target, VQMIVC, TriAAN-VC, AGRN-VC
Sample 1. Text: I let go the telephone and ran.
Sample 2. Text: Anything is possible in football and we can beat Celtic again.
Sample 3. Text: It was horrible, but there is still more.

F2M (female to male)


Audio clips per sample: Source, Target, VQMIVC, TriAAN-VC, AGRN-VC
Sample 1. Text: We must provide a long-term solution to tackle this attitude.
Sample 2. Text: People look, but no one ever finds it.
Sample 3. Text: He seems to be pleased with the picture.

M2F (male to female)


Audio clips per sample: Source, Target, VQMIVC, TriAAN-VC, AGRN-VC
Sample 1. Text: By that time, however, it was already too late.
Sample 2. Text: That has given me great confidence.
Sample 3. Text: It would still have been a good film, but very different.

M2M (male to male)


Audio clips per sample: Source, Target, VQMIVC, TriAAN-VC, AGRN-VC
Sample 1. Text: That's what it's all about, isn't it?
Sample 2. Text: It cannot go on for ever like this.
Sample 3. Text: That review will look at management across the sector as a whole.