Audio Editing by Following Instructions with Latent Diffusion Models

Abstract. Audio editing serves many purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods have achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have several problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, AUDIT has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model that takes the instruction and the input (to-be-edited) audio as conditions and generates the output (edited) audio; 2) it automatically learns to modify only the segments that need editing by comparing the input and output audio; 3) it needs only edit instructions, rather than full descriptions of the target audio, as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics on several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution).


AUDIT consists of a VAE, a T5 text encoder, and a diffusion network. It takes the mel-spectrogram of the input audio and the edit instruction as conditional inputs and generates the edited audio as output.

Samples for Different Audio Editing Tasks

Instruction Input Audio Output Audio
Add a car horn honks several times loudly
Drop the sound of a woman talking
Replace laughter to trumpet
Perform Super-resolution

More Samples for Adding

The adding task is to add another sound event to the input audio. For instance, an input audio with the caption "A baby is crying" is transformed into an output audio with the semantic content "A baby is crying while thundering in the background." The adding task must ensure not only that the output audio contains both the semantic content of the input audio and the newly added content, but also that the content of the input audio remains as unchanged as possible in the output audio.
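
One way to picture an (instruction, input audio, output audio) triplet for the adding task is to overlay an event clip on a base clip. The sketch below is only an illustration and not the authors' data pipeline: the helper names `mix_at` and `make_add_triplet`, the `gain` parameter, and the "Add ..." instruction template are hypothetical, and waveforms are represented as plain lists of float samples in [-1, 1].

```python
from typing import List, Tuple

Waveform = List[float]


def mix_at(base: Waveform, event: Waveform, start: int, gain: float = 1.0) -> Waveform:
    """Overlay `event` on a copy of `base`, beginning at sample index `start`."""
    if start < 0:
        raise ValueError("start must be non-negative")
    out = list(base)  # copy so the input clip itself stays untouched
    for i, sample in enumerate(event):
        pos = start + i
        if pos >= len(out):
            break  # ignore any part of the event that runs past the base clip
        # Sum the two signals and clip to the valid [-1, 1] range.
        out[pos] = max(-1.0, min(1.0, out[pos] + gain * sample))
    return out


def make_add_triplet(
    base: Waveform, event: Waveform, event_label: str, start: int, gain: float = 1.0
) -> Tuple[str, Waveform, Waveform]:
    """Return (instruction, input audio, output audio) for one adding example."""
    instruction = f"Add {event_label}"  # hypothetical instruction template
    return instruction, list(base), mix_at(base, event, start, gain)


base = [0.25, 0.25, 0.25, 0.25, 0.25, 0.25]
event = [0.5, -0.5]
instruction, input_audio, output_audio = make_add_triplet(base, event, "a bell", start=2)
```

Outside the overlaid span, the output samples equal the input samples, which mirrors the requirement that the original content stay unchanged. Under the same assumption, swapping the two waveforms would give a pair for the dropping task.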

Text Input Audio SDEdit (N=1/2T) SDEdit (N=1/4T) Ours
Add clip-clop of horse hooves
Add a motorboat speeding in the background
Add the sound of knocking in the middle
Add a bell in the beginning
Add a short sound of hi-hat in the end

More Samples for Dropping

The dropping task aims to remove one or more sound events from the input audio. For example, removing the sound event "dog barking" from an input audio with the caption "A man giving a speech while a dog barking" yields an output audio with the semantic description "A man giving a speech".

Text Input Audio SDEdit (N=1/2T) SDEdit (N=1/4T) Ours
Drop the sound of a duck quacking in water
Drop the sound of dishes and pots and pans in the middle
Drop: pouring water
Drop people cheering
Drop a short firework explosion in the end

More Samples for Replacement

The replacement task aims to substitute one sound event in an input audio with another sound event. For example, replacing the sound event "bell ringing" with "fireworks" in an audio with the caption "the sound of gun shooting and bell ringing" results in an output audio with the semantic description "the sound of gun shooting and fireworks".

Text Input Audio SDEdit (N=1/2T) SDEdit (N=1/4T) Ours
Replace: wind instrument to drum kit
Replace dropping coin with the sound of something tearing
Replace the sound of squeak to the sound of clapping
Replace clink with fart
Replace a people yelling to insects buzzing

More Samples for Inpainting

The audio inpainting task is to complete a masked segment of an audio based on the context or a provided textual description. SDEdit-Rough and SDEdit-Precise are two baseline methods; see our paper for details.
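
To make "masked segment" concrete, the sketch below builds an inpainting input by silencing a span of a clip. It is an illustration under stated assumptions: the helper `mask_segment` is hypothetical, masking is shown on a list of waveform samples for simplicity (this page does not say whether masking is applied to the waveform or to the mel-spectrogram), and zero is an assumed fill value.

```python
from typing import List, Tuple


def mask_segment(samples: List[float], start: int, end: int) -> Tuple[List[float], List[bool]]:
    """Silence samples[start:end] and return (masked copy, per-sample mask flags)."""
    if not 0 <= start <= end <= len(samples):
        raise ValueError("need 0 <= start <= end <= len(samples)")
    masked = list(samples)  # copy so the clean clip can serve as the target
    is_masked = [False] * len(samples)
    for i in range(start, end):
        masked[i] = 0.0  # assumed fill value for the missing span
        is_masked[i] = True
    return masked, is_masked


clean = [0.5, -0.25, 0.75, 0.125, -0.5]
masked, is_masked = mask_segment(clean, 1, 3)
# (masked, clean) then form an (input audio, output audio) pair for inpainting.
```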

Text Input Audio SDEdit-Rough SDEdit-Precise Ours
A toilet flushing.
A group of people are laughing
A person rapidly types on a keyboard
A sudden horn
A baby cries followed by rustling and heavy breathing

More Samples for Super-Resolution

The audio super-resolution task can be viewed as restoring the high-frequency content of an input audio with a low sampling rate, that is, converting it into an output audio with a higher sampling rate.
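
As a rough illustration of how a low-rate input relates to its high-rate target, the sketch below reduces the sampling rate by averaging blocks of samples. This is an assumption for illustration only: `downsample` is a hypothetical helper, and block averaging is a crude stand-in for a proper anti-aliasing resampler, not the procedure used to build the samples on this page.

```python
from typing import List


def downsample(samples: List[float], factor: int) -> List[float]:
    """Average each block of `factor` samples (crude low-pass plus decimation)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(samples) - len(samples) % factor  # drop a trailing partial block
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


high_rate = [0.0, 1.0, 0.5, 0.5, -1.0, 0.0, 0.25, 0.75]  # target (output) audio
low_rate = downsample(high_rate, 2)                       # model input, half the rate
```

Averaging discards the fast sample-to-sample variation, which is the high-frequency information a super-resolution model is asked to restore.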

Text Input Audio SDEdit SDEdit-Precise Ours
A baby cries and a young girl speaks briefly
A car is shifting gears
Insects buzzing followed by rattling and rustling
Gunfire sounds
Continuous crinkling in a quiet environment

Text-to-Audio Generation

Because the baselines we compare against are editing methods built on generative models, we also train a text-to-audio latent diffusion model. Our model achieves the best performance on three objective metrics: FD, KL, and IS. Compared to the previously best-performing model (AudioLDM), our model reduces FD by 3.12 (23.31 to 20.19), reduces KL by 0.27 (1.59 to 1.32), and increases IS by 1.10 (8.13 to 9.23). This demonstrates that our generation model can serve as a strong baseline model for generation-based editing methods.
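
The reported gains follow directly from the quoted scores. The short check below recomputes them; the dictionary layout and the `improvement` helper are illustrative names, and the only data used are the numbers quoted above (FD and KL are lower-is-better, IS is higher-is-better).

```python
# (AudioLDM score, our score) for each metric, as quoted in the text.
reported = {"FD": (23.31, 20.19), "KL": (1.59, 1.32), "IS": (8.13, 9.23)}
lower_is_better = {"FD": True, "KL": True, "IS": False}


def improvement(metric: str) -> float:
    """Return the gain over the baseline, positive when our model is better."""
    baseline, ours = reported[metric]
    delta = baseline - ours if lower_is_better[metric] else ours - baseline
    return round(delta, 2)  # scores are reported to two decimals


for name in reported:
    print(name, improvement(name))
```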

Text Our Text-to-Audio Model
Jazz music
A person snoring
Someone typing on a computer
Train passing and a short honk
Birds singing while ocean waves crashing
Wind blows and insects buzz while birds chirp
A woman giving a speech while group of people applauding in the end