Modeling strategies for speech enhancement in the latent space of a neural audio codec

Authors
Affiliations

Sofiene Kammoun

CentraleSupélec, IETR (UMR CNRS 6164)

Xavier Alameda-Pineda

Inria at Univ. Grenoble Alpes, CNRS, LJK

Simon Leglaive

CentraleSupélec, IETR (UMR CNRS 6164)

Published

September 18, 2025

Abstract

Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline in which the NAC encoder is fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction.

Keywords

Speech enhancement, neural audio codec, autoregressive modeling, latent representations, discrete tokens.

🚧 Code will be made available upon publication

📃 Read Paper on arXiv

Models

The figure illustrates the four speech enhancement variants explored in our study, organized along two main axes: representation type (discrete tokens vs. continuous vectors) and modeling strategy (autoregressive vs. non-autoregressive). The top row shows models operating in the discrete latent space of the neural audio codec: D-AR (autoregressive) predicts clean speech tokens sequentially, while D-NAR (non-autoregressive) predicts them in parallel. The bottom row shows their continuous-space counterparts, which predict continuous latent vectors instead of discrete tokens. For detailed descriptions of each model and additional variants, please refer to the paper.
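To make the sequential vs. parallel distinction concrete, here is a minimal toy sketch of the two decoding patterns for discrete tokens. It is not the models from the paper: the sizes `T`, `D`, `K`, the random weights `W_in` and `W_prev`, and the functions `decode_nar` and `decode_ar` are all hypothetical stand-ins, and the toy scorer looks at one frame at a time, whereas a Conformer would use context from the whole sequence.

```python
import random

random.seed(0)
T, D, K = 6, 4, 8  # hypothetical sizes: frames, latent dim, codebook size

# Toy random weights standing in for a trained model (assumption).
W_in = [[random.gauss(0, 1) for _ in range(K)] for _ in range(D)]
W_prev = [[random.gauss(0, 1) for _ in range(K)] for _ in range(K)]

def frame_logits(frame):
    """Score each of the K codebook entries for one noisy latent frame."""
    return [sum(frame[d] * W_in[d][k] for d in range(D)) for k in range(K)]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def decode_nar(noisy):
    """Non-autoregressive: each token depends only on the noisy input,
    so all frames could be processed in parallel."""
    return [argmax(frame_logits(frame)) for frame in noisy]

def decode_ar(noisy):
    """Autoregressive: token t also depends on the token chosen at t-1,
    so frames must be processed one after another."""
    tokens = []
    for frame in noisy:
        logits = frame_logits(frame)
        if tokens:  # add the toy contribution of the previous token
            logits = [l + W_prev[tokens[-1]][k] for k, l in enumerate(logits)]
        tokens.append(argmax(logits))
    return tokens

# Random vectors standing in for the latent frames of a noisy utterance.
noisy_latents = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
tokens_nar = decode_nar(noisy_latents)
tokens_ar = decode_ar(noisy_latents)
```

The loop in `decode_ar` cannot be parallelized across frames because each step needs the previous output, which is one way to see why autoregressive decoding costs more at inference time. A continuous-space variant would follow the same two patterns but output a vector per frame instead of a codebook index.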

Speech enhancement audio examples