
Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content from low-resolution speech utterances. This is commonly achieved by conditioning the noise-predictor network on the low-resolution audio. In this paper, we propose a novel sampling algorithm that injects the information from the low-resolution audio through the reverse sampling process of DMs. The proposed method can serve as a drop-in replacement for the vanilla sampling process and can significantly improve the performance of existing works. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. We also attain state-of-the-art results on the VCTK Multi-Speaker benchmark with this novel formulation.
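To make the idea of passing low-resolution information through the reverse process more concrete, here is a minimal sketch of one common approach: low-frequency replacement. It is not necessarily the exact procedure proposed in the paper. At each reverse step, the low band of the current sample is overwritten with a noised copy of the low-resolution reference, and the high band is left to the model. The function names (`lowpass_fft`, `replace_low_band`), the brick-wall FFT filter, and the `alpha_t`/`sigma_t` noise-level parameters are all assumptions made for illustration.

```python
import numpy as np

def lowpass_fft(x, sr, cutoff_hz):
    """Ideal (brick-wall) low-pass filter via the real FFT; illustrative only."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0  # zero every bin above the cutoff
    return np.fft.irfft(spec, n=len(x))

def replace_low_band(z_t, x_low, alpha_t, sigma_t, sr, cutoff_hz, rng):
    """Hypothetical helper: keep the high band of the current sample z_t and
    overwrite its low band with a noised copy of the reference x_low.

    x_low is assumed to be already upsampled to the target rate sr, and
    alpha_t / sigma_t are the assumed signal and noise scales at step t.
    """
    noised_ref = alpha_t * x_low + sigma_t * rng.standard_normal(len(x_low))
    high_band = z_t - lowpass_fft(z_t, sr, cutoff_hz)
    return high_band + lowpass_fft(noised_ref, sr, cutoff_hz)
```

In a sampling loop, a call such as `replace_low_band(z_t, x_low, alpha_t, sigma_t, 48000, 12000, rng)` would follow each reverse update. Because the constraint is applied to the sample and not to the network input, the same step could in principle be used with either a conditional or an unconditional noise predictor.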

Animation of the Bandwidth Extension Process (200 steps, 12k to 48k)

We recommend that readers listen to the samples while reading the paper. All utterances were randomly picked from the VCTK test set.

Samples: 24k to 48k, 50 steps

We recommend using headphones for this section.

Utterances: p360_001, p361_002
Systems compared: Input, Target, NU-Wave, NU-Wave+, NU-Wave 2, NU-Wave 2+, WSRGlow, UDM+

Samples: 16k to 48k, 50 steps

We recommend using headphones for this section.

Utterances: p363_004, p364_005
Systems compared: Input, Target, NU-Wave, NU-Wave+, NU-Wave 2, NU-Wave 2+, WSRGlow, UDM+

Samples: 8k to 16k, 50 steps

Utterances: p374_012, p376_233
Systems compared: Input, Target, NU-Wave 2, NU-Wave 2+, NU-Wave 2+ w/o MCG, NVSR, UDM+, UDM+ w/o MCG

Unconditional Samples

The samples in this section were generated unconditionally with our noise predictor, using 100 sampling steps.
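For readers unfamiliar with the two samplers named below, the following sketch shows how a single reverse step differs between them, using the standard DDPM/DDIM notation. The page does not state the noise schedule or parameterization actually used, so `ab_t` and `ab_prev` (cumulative signal levels at the current and previous step), `eps_hat` (the noise predictor's output), and the function name `reverse_step` are assumptions for illustration. Setting `eta=0` gives the deterministic DDIM update, while `eta=1` gives the stochastic ancestral update.

```python
import numpy as np

def reverse_step(x_t, eps_hat, ab_t, ab_prev, eta, rng):
    """One generic reverse-diffusion step (illustrative sketch).

    eta = 0.0 -> DDIM (deterministic); eta = 1.0 -> ancestral sampling.
    """
    # Estimate the clean signal implied by the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    # Standard deviation of the fresh noise injected at this step.
    sigma = eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * np.sqrt(1.0 - ab_t / ab_prev)
    # Deterministic part that points back toward x_t along the predicted noise.
    direction = np.sqrt(1.0 - ab_prev - sigma**2) * eps_hat
    noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(ab_prev) * x0_hat + direction + noise
```

With `eta=0` the result depends only on `x_t` and `eps_hat`, so repeated runs from the same starting noise give the same output. With `eta=1` fresh noise enters at every step.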

Ancestral Sampling

DDIM