UniTok-Audio: A Unified Audio Generation Framework Via Universal Discrete Token

1Intelligent Connectivity, Alibaba Group, 2Tongyi Lab, Alibaba Group, 3Zhejiang University
†Equal contribution: Haoyin Yan, Chengwei Liu; *Corresponding author: Shaofei Xue

Abstract

Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in audio quality and in generalization across tasks. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation. Specifically, 1) UniTok-Audio extracts continuous features from the condition signals and generates discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec comprising acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance compared with state-of-the-art task-specific and multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase.

UniTok-Audio

UniTok-Audio Architecture

Fig. 1: Overall architecture of UniTok-Audio, where the H-Codec Encoder is only used to generate label tokens during training and excluded during inference.

UniTok-Audio is a unified discrete-token-based audio generation framework that consists of four parts: (1) a WavLM or HuBERT audio encoder, (2) a CLAP text encoder, (3) a LLaMA-based language model, and (4) a novel H-Codec decoder. The WavLM/HuBERT encoder extracts continuous speech features from the input audio. The LLaMA-based LM takes these features as input and autoregressively predicts the discrete speech tokens produced by H-Codec. Finally, the H-Codec decoder reconstructs the target audio from the predicted tokens. Guided by task identifiers and timestep embeddings, the LM processes the latent sequences with cross-attention to generate task-specific outputs.
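The sketch below illustrates this inference flow in PyTorch: continuous condition features enter a toy autoregressive LM through cross-attention, a task-identifier embedding selects the task, and discrete codec tokens are decoded greedily for the H-Codec decoder. All module names (TaskLM), dimensions, the codebook size, and the task-ID mapping are illustrative assumptions, not the released UniTok-Audio implementation.

```python
# Minimal sketch of the UniTok-Audio inference flow described above.
# Module names, sizes, and the task-ID mapping are assumptions for illustration.
import torch
import torch.nn as nn

VOCAB = 1024          # assumed H-Codec codebook size
TASK_IDS = {"SR": 0, "TSE": 1, "SS": 2, "VC": 3, "LASS": 4}

class TaskLM(nn.Module):
    """Toy autoregressive LM standing in for the LLaMA-based backbone."""
    def __init__(self, d_model=256):
        super().__init__()
        self.task_emb = nn.Embedding(len(TASK_IDS), d_model)
        self.tok_emb = nn.Embedding(VOCAB + 1, d_model)   # +1 for a BOS token
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, cond, task_id, tokens):
        # Prepend a task-identifier embedding so one LM serves all tasks.
        task = self.task_emb(task_id).unsqueeze(1)              # (B, 1, D)
        tgt = torch.cat([task, self.tok_emb(tokens)], dim=1)    # (B, T+1, D)
        T = tgt.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        # Condition features enter through cross-attention (memory).
        h = self.decoder(tgt, memory=cond, tgt_mask=mask)
        return self.head(h[:, -1])                              # next-token logits

@torch.no_grad()
def generate(lm, cond, task="SR", steps=50):
    """Greedy autoregressive decoding of discrete codec tokens."""
    task_id = torch.tensor([TASK_IDS[task]])
    tokens = torch.full((1, 1), VOCAB, dtype=torch.long)        # BOS token
    for _ in range(steps):
        logits = lm(cond, task_id, tokens)
        tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
    return tokens[:, 1:]                                        # drop BOS

# Example: 100 frames of 256-dim condition features (e.g., projected WavLM embeddings).
lm = TaskLM()
cond = torch.randn(1, 100, 256)
codec_tokens = generate(lm, cond, task="SR", steps=25)
print(codec_tokens.shape)   # torch.Size([1, 25]); these would feed the H-Codec decoder
```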

H-Codec

H-Codec Architecture

Fig. 2: Overall architecture of H-Codec.

H-Codec features two encoding streams: a self-supervised learning (SSL) stream and a waveform stream. The SSL stream captures semantically rich information and injects it into the first-layer codec tokens via direct encoding of HuBERT/WavLM features. The waveform stream uses a proven DAC-like framework to encode and decode high-quality audio. Both streams are downsampled to a low frame rate of 25 Hz. During training, both streams are used to obtain target tokens; during inference, only the decoder is active to generate high-fidelity audio.
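A minimal PyTorch sketch of this dual-stream layout is given below, assuming 16 kHz audio, 50 Hz SSL features, and one semantic plus one acoustic codebook; the module names (HCodecSketch, VectorQuantizer) and sizes are illustrative, and the real waveform branch is a DAC-style residual VQ rather than the single toy quantizer shown here.

```python
# Minimal sketch of the dual-stream H-Codec encoding described above.
# All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (toy version, no commitment loss)."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                                    # x: (B, T, D)
        d = torch.cdist(x, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        idx = d.argmin(-1)                                      # discrete tokens
        return self.codebook(idx), idx

class HCodecSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Semantic branch: SSL features (e.g., WavLM, 50 Hz) -> 25 Hz tokens.
        self.ssl_down = nn.Conv1d(1024, dim, kernel_size=2, stride=2)
        self.sem_vq = VectorQuantizer(dim=dim)
        # Acoustic branch: stand-in for the DAC-like waveform encoder at 25 Hz.
        self.wav_enc = nn.Conv1d(1, dim, kernel_size=640, stride=640)  # 16 kHz / 640 = 25 Hz
        self.ac_vq = VectorQuantizer(dim=dim)

    def encode(self, wav, ssl_feat):
        # wav: (B, 1, samples) at 16 kHz; ssl_feat: (B, 1024, T_ssl) at 50 Hz.
        sem = self.ssl_down(ssl_feat).transpose(1, 2)        # (B, 25*t, D)
        ac = self.wav_enc(wav).transpose(1, 2)               # (B, 25*t, D)
        _, sem_tokens = self.sem_vq(sem)                     # first-layer (semantic) tokens
        _, ac_tokens = self.ac_vq(ac)                        # acoustic tokens
        return sem_tokens, ac_tokens                         # one 25 Hz stream per branch

codec = HCodecSketch()
wav = torch.randn(1, 1, 16000)          # 1 s of audio
ssl = torch.randn(1, 1024, 50)          # 50 SSL frames for the same second
sem_tok, ac_tok = codec.encode(wav, ssl)
print(sem_tok.shape, ac_tok.shape)      # torch.Size([1, 25]) torch.Size([1, 25])
```

Under this assumption of one semantic and one acoustic stream at 25 Hz each, the total budget would be 50 tokens per second, matching the 50 token/s operating point discussed in the results below.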

H-Codec Results


Fig. 3: Speech Reconstruction and Semantic Performance.

H-Codec achieves the best performance on most metrics at a token rate of 50 tokens per second. Moreover, its UTMOS score closely matches that of the ground truth, indicating that the reconstructed audio faithfully preserves the original speech quality. We also observe that certain models exceed the ground truth in UTMOS when operating at low token rates. We suspect this occurs because, under a limited token budget, the decoder behaves partly as a generative model, yielding plausible speech whose alignment with the input is less precise.

Speech Restoration (SR)

Clean Noisy Enhanced

Target Speech Extraction (TSE)

Target Mixture Enhanced

Speech Separation (SS)

Mixture Speaker 1 Speaker 2

Voice Conversion (VC)

Source Reference Converted

Language-Queried Audio Source Separation (LASS)

Text Query Source Separated
the crowd is cheering and giving applause
someone is beating the drum continuously
Someone is typing on a keyboard
a person is pressing the shutter button of the camera to check something