Zero-shot voice conversion has recently attracted increasing research attention. Many conventional approaches use dynamic layers to perform conversion for unseen speakers. Our aim is to extend such dynamic methods to content information as well. To this end, we propose AGRN-VC, which employs ConvNeXt v2 modules with adaptive global response normalization (AGRN) layers to convey content information. In addition, we adopt auxiliary learning with cluster-based pseudo labels to prevent source speaker information from being transmitted along with the content information: a pseudo-label classification task is performed on the content encoder output, encouraging the encoder to capture content information while excluding speaker information. We conduct comparative experiments between several baseline models and the proposed model using subjective and objective metrics, and the proposed approach yields converted speech of higher quality in terms of speaker similarity and naturalness.
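
To make the adaptive normalization idea concrete, the following is a minimal PyTorch sketch of what an AGRN-style layer could look like. It follows ConvNeXt v2's global response normalization (per-channel global L2 aggregation, divisive normalization across channels, scale and shift with a residual connection), but predicts the scale and shift from a conditioning embedding instead of using fixed learned parameters. The module name `AdaptiveGRN`, the linear projections, the tensor layout, and the choice of conditioning signal are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class AdaptiveGRN(nn.Module):
    """Global response normalization whose scale/shift are predicted from a
    conditioning embedding (illustrative sketch; the paper's exact AGRN
    formulation may differ)."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # Hypothetical projections producing per-channel gamma and beta.
        self.to_gamma = nn.Linear(cond_dim, dim)
        self.to_beta = nn.Linear(cond_dim, dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels), cond: (batch, cond_dim)
        gx = torch.norm(x, p=2, dim=1, keepdim=True)      # global L2 energy per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # divisive normalization across channels
        gamma = self.to_gamma(cond).unsqueeze(1)           # (batch, 1, channels)
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * (x * nx) + beta + x                  # GRN with residual, as in ConvNeXt v2


if __name__ == "__main__":
    agrn = AdaptiveGRN(dim=256, cond_dim=192)
    feats = torch.randn(2, 100, 256)   # frame-level features: (batch, frames, channels)
    cond = torch.randn(2, 192)         # conditioning embedding (assumed shape)
    out = agrn(feats, cond)
    print(out.shape)                   # torch.Size([2, 100, 256])
```

Replacing the fixed gamma and beta of standard GRN with condition-dependent projections is one straightforward way to make the normalization dynamic; the cluster-based pseudo-label auxiliary task described above would then be applied to the content encoder output with an ordinary classification head and cross-entropy loss.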