Zero-shot voice conversion has recently attracted increasing research attention. Many conventional approaches use dynamic layers to perform conversion for unseen speakers. Our aim is to extend such dynamic methods to content information as well. To this end, we propose AGRN-VC, which employs ConvNeXt v2 modules with adaptive global response normalization (AGRN) layers to convey content information. In addition, we adopt auxiliary learning with cluster-based pseudo labels to prevent source speaker information from being transmitted along with the content information: a pseudo-label classification task is performed on the content encoder output, encouraging the encoder to capture content information while excluding speaker information. We conduct comparative experiments between several baseline models and the proposed model using subjective and objective metrics, and the proposed approach yields converted speech of higher quality in terms of speaker similarity and naturalness.
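
To make the adaptive normalization idea concrete, the following is a minimal PyTorch sketch of what an AGRN-style layer could look like. It follows ConvNeXt v2's global response normalization (per-channel global L2 aggregation, divisive normalization across channels, scale and shift with a residual connection), but predicts the scale and shift from a conditioning embedding instead of using fixed learned parameters. The module name `AdaptiveGRN`, the linear projections, the tensor layout, and the choice of conditioning signal are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class AdaptiveGRN(nn.Module):
    """Global response normalization whose scale/shift are predicted from a
    conditioning embedding (illustrative sketch; the paper's exact AGRN
    formulation may differ)."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # Hypothetical projections producing per-channel gamma and beta.
        self.to_gamma = nn.Linear(cond_dim, dim)
        self.to_beta = nn.Linear(cond_dim, dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels), cond: (batch, cond_dim)
        gx = torch.norm(x, p=2, dim=1, keepdim=True)      # global L2 energy per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # divisive normalization across channels
        gamma = self.to_gamma(cond).unsqueeze(1)           # (batch, 1, channels)
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * (x * nx) + beta + x                  # GRN with residual, as in ConvNeXt v2


if __name__ == "__main__":
    agrn = AdaptiveGRN(dim=256, cond_dim=192)
    feats = torch.randn(2, 100, 256)   # frame-level features: (batch, frames, channels)
    cond = torch.randn(2, 192)         # conditioning embedding (assumed shape)
    out = agrn(feats, cond)
    print(out.shape)                   # torch.Size([2, 100, 256])
```

Replacing the fixed gamma and beta of standard GRN with condition-dependent projections is one straightforward way to make the normalization dynamic; the cluster-based pseudo-label auxiliary task described above would then be applied to the content encoder output with an ordinary classification head and cross-entropy loss.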