Token-based Audio Inpainting via Discrete Diffusion

We introduce Audio Inpainting using Discrete Diffusion (AIDD), a novel approach to audio inpainting that applies discrete diffusion modeling over tokenized representations of audio. Using a pretrained WavTokenizer and a discrete diffusion model, AIDD reconstructs missing audio segments with high perceptual fidelity. Operating in token space lets the model capture semantic structure more effectively, and AIDD outperforms existing methods on long audio gaps.
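To make the token-space idea concrete, below is a minimal sketch of one common formulation of this reverse process: absorbing-state discrete diffusion with confidence-based unmasking (MaskGIT-style). Gap tokens start in a [MASK] state and are progressively replaced by the denoiser's most confident predictions, while the observed tokens stay fixed and condition every step. The `model` interface, `MASK_ID`, and the cosine schedule are illustrative assumptions, not AIDD's exact components.

```python
import math
import torch

MASK_ID = 4096   # hypothetical id of the absorbing [MASK] token
STEPS = 16       # number of reverse-diffusion steps (illustrative)

@torch.no_grad()
def inpaint_tokens(model, tokens, gap_mask, steps=STEPS):
    """Fill positions where gap_mask is True; observed tokens stay fixed.

    tokens:   LongTensor [T] of codebook indices from the tokenizer
    gap_mask: BoolTensor [T], True at the missing positions
    """
    x = tokens.clone()
    x[gap_mask] = MASK_ID              # corrupt the gap to the [MASK] state
    remaining = gap_mask.clone()       # gap positions not yet committed
    total = int(gap_mask.sum())

    for step in range(steps):
        logits = model(x.unsqueeze(0)).squeeze(0)        # [T, vocab]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~remaining, -1.0)        # rank masked slots only

        # Cosine schedule: commit a growing share of the gap at each step.
        frac = 1.0 - math.cos(math.pi / 2 * (step + 1) / steps)
        target = total if step == steps - 1 else int(total * frac)
        k = min(max(0, target - (total - int(remaining.sum()))),
                int(remaining.sum()))
        if k == 0:
            continue
        idx = conf.topk(k).indices
        x[idx] = pred[idx]             # keep the k most confident predictions
        remaining[idx] = False
    return x
```

The completed token sequence would then be decoded back to a waveform by the tokenizer's decoder; the exact scheduling and conditioning details of AIDD may differ from this sketch.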

Model Overview

[Figure: Architecture of AIDD]

MusicNet Inpainting Results

Use headphones for the best experience.

For each segment, we introduced four synthetic gaps at fixed, evenly spaced locations. The gap durations varied across experiments, ranging from 50ms to 300ms. These corrupted inputs were then passed to our model, which reconstructed the missing audio to produce complete, inpainted waveforms.
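As a rough illustration of this setup, the sketch below zeroes out four evenly spaced segments of a waveform. It assumes a 24 kHz sample rate (the rate WavTokenizer operates at) and places gap centers at equal fractions of the clip; the paper states only that the locations are fixed and evenly spaced, so the exact positions and the `add_gaps` helper are hypothetical.

```python
import numpy as np

SR = 24000  # assumed sample rate (WavTokenizer operates at 24 kHz)

def add_gaps(wav, gap_ms, num_gaps=4, sr=SR):
    """Zero out num_gaps evenly spaced segments of gap_ms milliseconds.

    Returns the corrupted waveform and a boolean mask of the gap samples.
    """
    out = wav.copy()
    mask = np.zeros(len(wav), dtype=bool)
    gap_len = int(sr * gap_ms / 1000)
    # Gap centers at i/(num_gaps+1) of the clip keep the gaps evenly spaced.
    for i in range(1, num_gaps + 1):
        center = i * len(wav) // (num_gaps + 1)
        start = max(0, center - gap_len // 2)
        out[start:start + gap_len] = 0.0
        mask[start:start + gap_len] = True
    return out, mask

# Example: the 200ms-gap setting on a 10-second clip
wav = np.random.randn(10 * SR).astype(np.float32)
corrupted, gap_mask = add_gaps(wav, gap_ms=200)
```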

[Audio examples for gaps of 50ms, 100ms, 200ms, and 300ms; each row provides Original, Masked, and Generated clips]

MTG Dataset Inpainting Results

Use headphones for the best experience.

[Audio examples for a 500ms gap: Original, Masked, and Generated clips]