Lejun Min, Junyan Jiang, Gus Xia, Jingwei Zhao @MusicXLab
[Paper] | [Code Repo]

We propose Polyffusion, a diffusion model that generates polyphonic music scores by treating music as image-like piano-roll representations. The model supports controllable music generation under two paradigms: internal control and external control. Internal control refers to the process in which users pre-define part of the music and then let the model infill the rest. External control conditions the model on external yet related information, such as chords, texture, or other features, via the cross-attention mechanism. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, arbitrary music segment inpainting, and music arrangement given chords or textures.
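To make the "image-like piano roll" concrete, here is a minimal sketch of how a list of notes can be rasterized into such a representation. The resolution, pitch range, and two-channel (onset/sustain) layout below are illustrative assumptions, not Polyffusion's exact settings:

```python
import numpy as np

# Assumed, purely illustrative settings (not necessarily Polyffusion's exact ones):
# 8 bars, 16 time steps per bar, 128 MIDI pitches, and two channels
# (note onsets and note sustains).
N_BARS, STEPS_PER_BAR, N_PITCH = 8, 16, 128
T = N_BARS * STEPS_PER_BAR

def notes_to_piano_roll(notes):
    """notes: list of (onset_step, pitch, duration_in_steps) tuples."""
    roll = np.zeros((2, N_PITCH, T), dtype=np.float32)   # (channel, pitch, time)
    for onset, pitch, dur in notes:
        roll[0, pitch, onset] = 1.0                       # onset channel
        roll[1, pitch, onset:min(onset + dur, T)] = 1.0   # sustain channel
    return roll

# A C major triad held for the first bar.
roll = notes_to_piano_roll([(0, 60, 16), (0, 64, 16), (0, 67, 16)])
print(roll.shape)   # (2, 128, 128) -- an "image" that a diffusion model can denoise
```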

In each MIDI visualization block, blue notes denote the generated results. Except for the iterative inpainting demo, all demos are 8 bars long (16 s).

Unconditional Generation


Melody generation given accompaniment

With the pre-defined accompaniment as follows,

the model inpaints the upper melody via masked generation:
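Under the hood, this kind of masked generation can be realized with a standard diffusion sampler in which the user-defined region is overwritten with an appropriately noised copy of the known content at every reverse step, so only the masked-out region is actually synthesized. Below is a minimal sketch of that idea (the `model`, `alphas_cumprod`, `x_known`, and `mask` names are assumptions, and the update rule is a plain DDPM step rather than Polyffusion's exact sampler):

```python
import torch

@torch.no_grad()
def inpaint(model, x_known, mask, alphas_cumprod, n_steps=1000):
    """Masked diffusion sampling (sketch).
    x_known: piano roll containing the user-defined part, shape (B, C, H, W).
    mask:    1 where content is given (the accompaniment here),
             0 where the model should generate (the melody here).
    """
    x = torch.randn_like(x_known)                 # start from pure noise
    for t in reversed(range(n_steps)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_bar)
        # Noise the known region to the current noise level and paste it in,
        # so the given part is respected throughout sampling.
        noised_known = a_bar.sqrt() * x_known + (1 - a_bar).sqrt() * torch.randn_like(x_known)
        x = mask * noised_known + (1 - mask) * x
        # One plain DDPM reverse step on the whole piano roll.
        eps = model(x, torch.full((x.shape[0],), t, device=x.device))
        alpha_t = a_bar / a_bar_prev
        x = (x - (1 - alpha_t) / (1 - a_bar).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + (1 - alpha_t).sqrt() * torch.randn_like(x)
    return mask * x_known + (1 - mask) * x        # keep the given part exact
```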


Accompaniment generation given melody

With the pre-defined melody as follows,

the model inpaints the lower accompaniment via masked generation:

Here is another example. Given the melody as follows,

the model inpaints the lower accompaniment:


Arbitrary segment inpainting

With the pre-defined segment as follows,

the model re-generates the 3rd, 4th, 5th, and 7th bars via masked generation:
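In terms of the masked-generation sketch above, only the mask changes: time steps inside the bars to be rewritten are left for the model, and everything else is kept. Assuming 16 time steps per bar for illustration:

```python
import numpy as np

N_BARS, STEPS_PER_BAR, N_PITCH = 8, 16, 128

def bar_mask(bars_to_regenerate):
    """1 = keep the original content, 0 = let the model re-generate."""
    mask = np.ones((1, N_PITCH, N_BARS * STEPS_PER_BAR), dtype=np.float32)
    for b in bars_to_regenerate:                           # 1-indexed bars
        start = (b - 1) * STEPS_PER_BAR
        mask[:, :, start:start + STEPS_PER_BAR] = 0.0
    return mask

mask = bar_mask([3, 4, 5, 7])   # re-generate the 3rd, 4th, 5th, and 7th bars
```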


Iterative inpainting for long-term generation

Polyffusion can generate long-term music by iteratively inpainting the future given the past. Here is a 64-bar generation:
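The idea can be pictured as a sliding-window loop: each new window keeps the last few already-generated bars as the "given" part and inpaints the remaining bars, then the song is extended by the newly generated portion. The 4-bar hop and the `model_inpaint` callable below are assumptions for illustration (the callable stands for the masked-generation procedure sketched earlier):

```python
import numpy as np

N_BARS, STEPS_PER_BAR = 8, 16
HOP_BARS = 4                                   # assumed hop size
HOP = HOP_BARS * STEPS_PER_BAR
WINDOW = N_BARS * STEPS_PER_BAR

def generate_long(model_inpaint, total_bars=64):
    """Iteratively inpaint the future given the past until `total_bars` are produced."""
    song = model_inpaint(x_known=None, mask=None)          # first window: unconditional
    while song.shape[-1] < total_bars * STEPS_PER_BAR:
        context = song[..., -(WINDOW - HOP):]              # last 4 generated bars
        x_known = np.concatenate(                          # known past + empty future
            [context, np.zeros_like(song[..., :HOP])], axis=-1)
        mask = np.zeros_like(x_known)
        mask[..., :WINDOW - HOP] = 1.0                     # past is given, future is free
        new_window = model_inpaint(x_known=x_known, mask=mask)
        song = np.concatenate([song, new_window[..., -HOP:]], axis=-1)
    return song
```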


Generation with chord conditioning

By feeding the following chord progression as the condition for generation (the chords are first encoded by a pre-trained VAE),

the model can generate music scores that follow the given chord progression:
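Mechanically, the external condition reaches the denoiser through cross-attention: queries come from the piano-roll feature map being denoised, while keys and values come from the encoded condition (here, the chord progression after the pre-trained chord VAE). The snippet below shows only that attention pattern with assumed dimensions; it is not the actual Polyffusion network:

```python
import torch
import torch.nn as nn

d_model = 256                                    # assumed feature width
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

# Queries: flattened spatial features of the noisy piano roll inside the denoiser.
latent_tokens = torch.randn(1, 16 * 16, d_model)     # (batch, H*W, d_model)
# Keys/values: the chord progression encoded by the pre-trained chord VAE
# (one embedding per chord step; 8 steps here, purely illustrative).
chord_tokens = torch.randn(1, 8, d_model)

conditioned, _ = attn(query=latent_tokens, key=chord_tokens, value=chord_tokens)
print(conditioned.shape)                          # torch.Size([1, 256, 256])
```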

Moreover, by setting the guidance scale to 5 with classifier-free guidance, the generation contains more chord tones:
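Classifier-free guidance mixes a conditional and an unconditional noise prediction at every sampling step; a larger scale pushes the sample further toward the condition, which is why a scale of 5 produces more chord tones. A minimal sketch, using one common convention for the scale (`model` and `null_cond` are placeholders):

```python
def guided_eps(model, x, t, cond, null_cond, w=5.0):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = model(x, t, cond)          # noise prediction with the chord condition
    eps_uncond = model(x, t, null_cond)   # noise prediction with the condition dropped
    return eps_uncond + w * (eps_cond - eps_uncond)
```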


Generation with texture conditioning

By feeding the following music texture as the condition for generation (the texture is encoded by a pre-trained VAE),

the model can generate music scores that inherit the given music texture:


Chord-specified accompaniment generation given melody

This is a hybrid control case where we want to generate the corresponding accompaniment for a melody, while further specifying its chord progression. The melody is given as follows:

The chord progression we want the generated accompaniment to comply with is as follows:

The model takes both controls and inpaints the accompaniment:
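Hybrid control combines the two mechanisms in one sampling loop: the melody enters through the mask (internal control) and the chord progression enters through the conditioned noise prediction (external control). Reusing the `inpaint` and guidance sketches above, a hypothetical combination could look like this:

```python
def inpaint_with_condition(model, x_known, mask, cond, null_cond,
                           alphas_cumprod, w=5.0):
    """Hybrid control (sketch): masked generation with a classifier-free-guided,
    chord-conditioned noise prediction plugged into the sampler above."""
    def guided_model(x, t):
        eps_uncond = model(x, t, null_cond)
        eps_cond = model(x, t, cond)
        return eps_uncond + w * (eps_cond - eps_uncond)
    return inpaint(guided_model, x_known, mask, alphas_cumprod)
```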


Texture-specified melody generation given accompaniment

This is a hybrid control case where we want to generate the corresponding melody for an accompaniment, while further specifying its music texture. The accompaniment is given as follows:

The music texture we want the generated melody to inherit is as follows:

The model takes both controls and inpaints the melody:


Thanks to html-midi-player for the excellent MIDI visualization.