Fig. 1: Overall architecture of UniTok-Audio, where the H-Codec Encoder is only used to generate label tokens during training and excluded during inference.
UniTok-Audio is a unified audio generation framework based on discrete tokens. It consists of four parts: (1) a WavLM or HuBERT audio encoder, (2) a CLAP text encoder, (3) a LLaMA-based language model (LM), and (4) a novel H-Codec decoder. The WavLM or HuBERT encoder extracts continuous speech features from the input audio. The LM takes these features as input and autoregressively predicts the discrete speech tokens defined by H-Codec. Finally, the H-Codec decoder reconstructs the enhanced speech from the predicted tokens. Guided by task identifiers and timestep embeddings, the LM processes the latent sequences with cross-attention to generate task-specific outputs.
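The inference flow described above (encode, predict tokens autoregressively, decode) can be sketched as a toy loop. This is only an illustration of the control flow under stated assumptions: every function name, the codebook size, the hop sizes, and the stopping rule are hypothetical stand-ins, not the UniTok-Audio implementation, and the "LM" here simply samples random token ids.

```python
import random

# All names and constants below are hypothetical stand-ins for illustration.
VOCAB_SIZE = 1024   # assumed codebook size
EOS = VOCAB_SIZE    # assumed end-of-sequence id, outside the codec vocabulary


def encode_audio(waveform, hop=320):
    """Stand-in for the WavLM/HuBERT encoder: one scalar feature per hop."""
    return [sum(waveform[i:i + hop]) / hop
            for i in range(0, len(waveform) - hop + 1, hop)]


def lm_next_token(task_id, features, tokens, rng):
    """Stand-in for the LLaMA-based LM: returns the next token id.

    A real model would condition on the task identifier, the encoder
    features, and the token prefix. This toy version samples a random id
    and emits EOS once it has produced half as many tokens as features.
    """
    if len(tokens) >= len(features) // 2:
        return EOS
    return rng.randrange(VOCAB_SIZE)


def generate_tokens(task_id, features, max_len=1000, seed=0):
    """Autoregressive loop: each step sees all previously generated tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        nxt = lm_next_token(task_id, features, tokens, rng)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens


def decode_tokens(tokens, hop=640):
    """Stand-in for the H-Codec decoder: expands each token to `hop` samples."""
    return [t / VOCAB_SIZE for t in tokens for _ in range(hop)]


if __name__ == "__main__":
    noisy = [0.0] * 16000                      # one second of audio at an assumed 16 kHz
    feats = encode_audio(noisy)
    toks = generate_tokens("enhance", feats)   # "enhance" is a made-up task id
    audio = decode_tokens(toks)
    print(len(feats), len(toks), len(audio))
```

Note that only the encoder, the LM loop, and the decoder appear in this path; no codec encoder is needed at inference, which matches the description in Fig. 1.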
Fig. 2: Overall architecture of H-Codec.
H-Codec has two encoding streams: a self-supervised learning (SSL) stream and a waveform stream. The SSL stream captures semantically rich information and injects it into the first-layer codec tokens by encoding HuBERT/WavLM features directly. The waveform stream uses a proven DAC-like framework to encode and decode high-quality audio. Both streams are downsampled to a low frame rate of 25 Hz. During training, both streams are used to obtain the target tokens. During inference, only the decoder is used to generate high-fidelity audio.
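As a rough worked example of what a 25 Hz frame rate implies, the sketch below computes the downsampling factor each stream would need. It assumes 16 kHz input audio, which the text above does not state, and uses the 50 Hz frame rate (one frame per 20 ms) that standard HuBERT and WavLM checkpoints produce at that sample rate. The helper names are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 16_000    # assumption: not stated in the text above
SSL_FRAME_RATE_HZ = 50     # standard HuBERT/WavLM: one frame per 20 ms at 16 kHz
CODEC_FRAME_RATE_HZ = 25   # frame rate given in the text


def downsample_factor(input_rate_hz, output_rate_hz):
    """Integer ratio between an input rate and a lower output rate."""
    if input_rate_hz % output_rate_hz != 0:
        raise ValueError("output rate must divide the input rate evenly")
    return input_rate_hz // output_rate_hz


def num_frames(duration_s, frame_rate_hz):
    """Number of whole frames covering a clip of the given duration."""
    return math.floor(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # Waveform stream: raw samples per codec frame.
    print(downsample_factor(SAMPLE_RATE_HZ, CODEC_FRAME_RATE_HZ))
    # SSL stream: SSL frames merged into each codec frame.
    print(downsample_factor(SSL_FRAME_RATE_HZ, CODEC_FRAME_RATE_HZ))
    # Codec frames in a 4-second clip.
    print(num_frames(4.0, CODEC_FRAME_RATE_HZ))
```

Under these assumptions, the waveform stream compresses 640 samples into each frame, the SSL stream merges 2 feature frames into each frame, and a 4-second clip yields 100 frames per codec layer.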
Fig. 3: Speech Reconstruction and Semantic Performance.
H-Codec achieves the best performance at a token rate of 50 on most metrics. Moreover, its UTMOS score closely matches that of the ground truth, indicating that the reconstructed audio faithfully preserves the original speech quality. We also observe that certain models exceed the ground truth in UTMOS when operating at low token rates. We suspect this occurs because, under a limited token budget, the decoder behaves partly as a generative model: it yields plausible speech, but the output is less precisely aligned with the input.
| Clean | Noisy | Enhanced |
|---|---|---|

| Target | Mixture | Enhanced |
|---|---|---|

| Mixture | Speaker 1 | Speaker 2 |
|---|---|---|

| Source | Reference | Converted |
|---|---|---|

| Text Query | Source | Converted |
|---|---|---|
| the crowd is cheering and giving applause | | |
| someone is beating the drum continuously | | |
| Someone is typing on a keyboard | | |
| a person is pressing the shutter button of the camera to check something | | |