Publications

Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

Published in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content given low-resolution speech utterances. This is commonly achieved by conditioning the noise-predictor network on the low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information of the low-resolution audio through the reverse sampling process of DMs. The proposed method can be a drop-in replacement for the vanilla sampling process and significantly improves the performance of existing works. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. With this novel formulation, we also attain state-of-the-art results on the VCTK Multi-Speaker benchmark.
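
The core idea is to inject the low-resolution signal into the reverse sampling loop rather than into the noise predictor. Below is a minimal, illustrative sketch of this general family of approaches, not the paper's exact algorithm: the `noise_predictor` callable, the noise schedule `alphas_bar`, and the low-band replacement rule are all placeholders/assumptions.

```python
import numpy as np

def lowpass(x, keep_bins):
    """Crude low-pass: zero out rFFT bins above keep_bins (illustrative only)."""
    X = np.fft.rfft(x)
    X[keep_bins:] = 0.0
    return np.fft.irfft(X, n=len(x))

def sr_reverse_sampling(noise_predictor, y_lr, alphas_bar, keep_bins, seed=0):
    """Hypothetical DDPM-style reverse loop that injects low-resolution
    information at every step (data-consistency style; not the paper's exact rule)."""
    rng = np.random.default_rng(seed)
    T = len(alphas_bar)
    x = rng.standard_normal(len(y_lr))            # start from pure noise
    for t in reversed(range(T)):
        a_bar = alphas_bar[t]
        eps = noise_predictor(x, t)               # unconditional noise estimate
        x0_hat = (x - np.sqrt(1 - a_bar) * eps) / np.sqrt(a_bar)  # predicted clean signal
        # Data consistency: keep the model's high band, trust the input's low band.
        x0_hat = (x0_hat - lowpass(x0_hat, keep_bins)) + lowpass(y_lr, keep_bins)
        a_bar_prev = alphas_bar[t - 1] if t > 0 else 1.0
        noise = rng.standard_normal(len(x)) if t > 0 else 0.0
        x = np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1 - a_bar_prev) * noise  # re-noise to step t-1
    return x
```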

Recommended citation: Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, and Hao Tang, "Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution", IEEE International Conference on Acoustics, Speech and Signal Processing, June 2023. https://ieeexplore.ieee.org/abstract/document/10095103

Music Demixing Challenge 2021

Published in Frontiers in Signal Processing, 2022

Music source separation has been intensively studied in the last decade, and tremendous progress has been made with the advent of deep learning. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models and corresponding papers, which helps researchers integrate the best practices into their models. In recent years, the widely used MUSDB18 dataset has played an important role in measuring the performance of music source separation. While the dataset made a considerable contribution to the advancement of the field, it is also subject to several biases resulting from a focus on Western pop music and the limited number of mixing engineers involved. To address these issues, we designed the Music Demixing Challenge on a crowd-based machine learning competition platform, where the task is to separate stereo songs into four instrument stems (Vocals, Drums, Bass, Other). The main differences compared with past challenges are that 1) the competition is designed to more easily allow machine learning practitioners from other disciplines to participate, 2) evaluation is done on a hidden test set created by music professionals exclusively for the challenge to ensure its transparency, i.e., the test set is not accessible to anyone except the challenge organizers, and 3) the dataset provides a wider range of music genres and involves a greater number of mixing engineers. In this paper, we provide the details of the datasets, baselines, evaluation metrics, evaluation results, and technical challenges for future competitions.
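
Evaluation in such campaigns is typically based on a per-stem signal-to-distortion ratio (SDR). A minimal sketch is given below; the exact per-song and per-stem averaging used in the challenge is an assumption here, see the paper for the precise definition.

```python
import numpy as np

def stem_sdr(reference, estimate, eps=1e-10):
    """Signal-to-Distortion Ratio in dB for one stem (reference, estimate: sample arrays)."""
    num = np.sum(reference ** 2) + eps
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)

def song_score(references, estimates):
    """Average SDR over the four stems of one song (dicts keyed by stem name)."""
    stems = ("vocals", "drums", "bass", "other")
    return np.mean([stem_sdr(references[s], estimates[s]) for s in stems])
```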

Recommended citation: Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, and Kin-Wai Cheuk, "Music Demixing Challenge 2021", Front. Signal Process. 1:808395, January 2022. https://doi.org/10.3389/frsip.2021.808395

Danna-Sep: Unite to separate them all

Published in 2021 Music Demixing Workshop, ISMIR, 2021

Deep learning-based music source separation has gained a lot of interest in the last decade. Most existing methods operate on either spectrograms or waveforms. Spectrogram-based models learn suitable masks for separating the magnitude spectrogram into different sources, while waveform-based models directly generate the waveforms of individual sources. The two types of models have complementary strengths: the former is superior for harmonic sources such as vocals, while the latter demonstrates better results for percussion and bass instruments. In this work, we improved upon the state-of-the-art (SoTA) models and successfully combined the best of both worlds. The backbones of the proposed framework, dubbed Danna-Sep, are two spectrogram-based models, a modified X-UMX and a U-Net, together with an enhanced Demucs as the waveform-based model. Given an input mixture, we linearly combine the respective outputs from the three models to obtain the final result. We showed in our experiments that, despite its simplicity, Danna-Sep surpassed the SoTA models by a large margin in terms of Source-to-Distortion Ratio.
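
Because the fusion is a simple linear combination of the three separators' outputs, it can be sketched in a few lines. The weights, model names, and per-source outputs below are hypothetical placeholders, not the values used in Danna-Sep.

```python
import numpy as np

def blend_separations(outputs, weights):
    """Linearly combine per-source waveforms from several separators.

    outputs: list of dicts mapping source name -> waveform (np.ndarray), one dict per model
    weights: list of dicts with the same keys giving the blending weight per model and source
    """
    sources = outputs[0].keys()
    return {
        src: sum(w[src] * out[src] for out, w in zip(outputs, weights))
        for src in sources
    }

# Hypothetical usage with three models (e.g., X-UMX, U-Net, and Demucs variants):
# blended = blend_separations([xumx_out, unet_out, demucs_out], [w_xumx, w_unet, w_demucs])
```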

Recommended citation: Chin-Yun Yu and Kin-Wai Cheuk, "Danna-Sep: Unite to separate them all", The ISMIR 2021 Workshop on Music Source Separation, November 2021. https://arxiv.org/abs/2112.03752

Harmonic Preserving Neural Networks for Efficient and Robust Multipitch Estimation

Published in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020

Multi-pitch estimation (MPE) is a fundamental yet challenging task in audio processing. Recent MPE techniques based on deep learning have shown improved performance, but they are computation-hungry and relatively sensitive to variations in the data, such as noise contamination and cross-domain data. In this paper, we present the harmonic preserving neural network (HPNN), a model that incorporates deep learning and domain knowledge in signal processing to improve the efficiency and robustness of MPE. The proposed method starts from the multilayered cepstrum (MLC), a feature representation that uses repeated Fourier transforms and nonlinear scaling to suppress the non-periodic components in signals. Following the combined frequency and periodicity (CFP) principle, the last two layers of the MLC are integrated to suppress the harmonics of pitches in the spectrum and enhance the components of the true fundamental frequencies. A convolutional neural network (CNN) is then applied to further optimize the pitch activation. The whole system is constructed as an end-to-end learning scheme. Improved time efficiency and robustness to noise and cross-domain data are demonstrated through experiments on polyphonic music at various noise levels and on multi-talker speech.
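
A rough sketch of the CFP-style fusion described above, assuming one MLC layer lives on a linear-frequency axis and the other on a lag (quefrency) axis; the exact scaling and alignment, as well as the subsequent CNN stage, are omitted, and all details here are assumptions rather than the authors' implementation.

```python
import numpy as np

def cfp_combine(spec_layer, ceps_layer, sr):
    """Combined frequency and periodicity (CFP) fusion: multiply a frequency-domain
    salience with a periodicity-domain salience mapped onto the same frequency axis,
    so that harmonics (present only in the former) and subharmonics (present only in
    the latter) cancel out while true fundamental frequencies are reinforced.

    spec_layer: non-negative salience over rFFT bins k (frequency f_k = k * sr / N)
    ceps_layer: non-negative salience over quefrency lags q (frequency f_q = sr / q)
    """
    N = (len(spec_layer) - 1) * 2                      # FFT size implied by the rFFT length
    freqs = np.arange(len(spec_layer)) * sr / N        # frequency of each spectral bin
    lags = np.arange(1, len(ceps_layer))               # skip lag 0 to avoid division by zero
    lag_freqs = sr / lags                              # frequency implied by each lag
    # Resample the periodicity salience onto the spectral frequency axis.
    order = np.argsort(lag_freqs)
    ceps_on_freq = np.interp(freqs, lag_freqs[order], ceps_layer[1:][order],
                             left=0.0, right=0.0)
    return spec_layer * ceps_on_freq                   # elementwise product = CFP salience
```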

Recommended citation: Chin-Yun Yu, Jing-Hua Lin, and Li Su, "Harmonic Preserving Neural Networks for Efficient and Robust Multipitch Estimation", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, December 2020. https://ieeexplore.ieee.org/abstract/document/9306345

Multi-layered Cepstrum for Instantaneous Frequency Estimation

Published in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2018

We propose the multi-layered cepstrum (MLC) method to estimate the multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filter noise. By applying the cepstrum operations (i.e., Fourier transform, filtering, and nonlinear activation) recursively, the MLC is shown to be an efficient method for enhancing MF0 saliency in a step-by-step manner. Evaluation on a real-world polyphonic music dataset under both normal and low-fidelity conditions demonstrates the potential of the MLC.
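
A minimal sketch of the recursive cepstrum idea, assuming magnitude FFTs, a crude high-pass "filtering" step, and power-law compression as the nonlinearity; the actual filter design, activation, and number of layers in the paper may differ.

```python
import numpy as np

def multilayered_cepstrum(frame, n_layers=3, power=0.6, cutoff=30):
    """Illustrative multi-layered cepstrum: repeatedly apply a Fourier transform,
    high-pass filtering, and a nonlinear scaling to enhance (M)F0 saliency layer by layer."""
    x = frame.astype(float)
    layers = []
    for _ in range(n_layers):
        spec = np.abs(np.fft.fft(x))       # Fourier transform -> magnitude
        spec[:cutoff] = 0.0                # high-pass filtering: drop slowly-varying bins
        spec[-cutoff:] = 0.0               #   (and their mirrored counterparts)
        x = spec ** power                  # nonlinear activation (power-law compression)
        layers.append(x.copy())
    return layers                          # successive layers with increasingly salient F0 peaks
```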

Recommended citation: Chin-Yun Yu and Li Su, "Multi-layered Cepstrum for Instantaneous Frequency Estimation", IEEE Global Conference on Signal and Information Processing, November 2018. https://ieeexplore.ieee.org/document/8646684