Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations
summary: One-shot voice conversion (VC) is a method that enables the transformation
between any two speakers using only a single target speaker utterance. Existing
methods often rely on complex architectures and pre-trained speaker
verification (SV) models to improve the fidelity of converted speech. Recent
works utilizing K-means quantization (KQ) with self-supervised learning (SSL)
features have proven capable of capturing content information from speech.
However, they often struggle to preserve speaking variation, such as prosodic
detail and phonetic variation, particularly with smaller codebooks. In this
work, we propose a simple yet effective one-shot VC model that utilizes the
characteristics of SSL features and speech attributes. Our approach addresses
the issue of losing speaking variation, enabling high-fidelity voice conversion
trained with only reconstruction losses, without requiring external speaker
embeddings. We demonstrate the performance of our model across 6 evaluation
metrics, with results highlighting the benefits of the speaking variation
compensation method.
id: http://arxiv.org/abs/2411.16147v1
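For context on the core idea the abstract describes (K-means quantization of self-supervised speech features, where small codebooks keep content but drop speaking variation), here is a minimal sketch. It is not the paper's implementation: the SSL features are a random placeholder (in practice they would come from a pre-trained encoder such as WavLM or HuBERT), and the codebook size of 128 is an arbitrary illustrative choice.

```python
# Minimal sketch (not SKQVC's code): K-means quantization (KQ) of SSL features.
# Each frame-level feature is replaced by its nearest codebook centroid; the
# residual between the original feature and its quantized version is the
# "speaking variation" (prosodic/phonetic detail) that the abstract says is
# lost with smaller codebooks and that the proposed method compensates for.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder SSL features, shape (num_frames, feature_dim).
# Assumption: real features would come from a pre-trained SSL encoder.
rng = np.random.default_rng(0)
ssl_features = rng.standard_normal((2000, 768)).astype(np.float32)

# Fit a small codebook with K-means; smaller codebooks discard more
# speaker/variation detail and keep mostly content information.
codebook_size = 128  # hypothetical value for illustration
kmeans = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
kmeans.fit(ssl_features)

# Quantize: map each frame to its nearest centroid.
codes = kmeans.predict(ssl_features)        # discrete content tokens
quantized = kmeans.cluster_centers_[codes]  # (num_frames, feature_dim)

# The residual is what quantization throws away.
residual = ssl_features - quantized
print("mean quantization residual norm:",
      float(np.linalg.norm(residual, axis=1).mean()))
```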
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.