ViolinDiff:
Enhancing Expressive Violin Synthesis with Pitch Bend Conditioning
Abstract
Modeling the natural contour of the fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. Given a violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates the mel spectrogram incorporating these expressive details. Quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than a model without explicit pitch bend modeling.
MIDI Encoding
To encode the MIDI input, we employed a piano roll representation, as it encodes polyphonic pitch while remaining aligned to the mel spectrogram at the frame level. Given a mel spectrogram \( M \in \mathbb{R}^{F\times T} \) with \( F \) mel bins and \( T \) time frames, the input piano roll is a 2D binary matrix \( R \in \{0, 1\}^{P \times T} \), where \( P \) denotes the number of pitches. To emphasize the onset and offset of each note along with its frame-wise activation, we employed three types of rolls: \( R_{\text{frame}}, R_{\text{onset}}, R_{\text{offset}} \).
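As a concrete illustration, the following NumPy sketch builds the three binary rolls. It is our own minimal example, not the paper's code; notes are assumed to arrive as (pitch, onset_frame, offset_frame) tuples already quantized to the mel frame grid, with the offset frame treated as one past the last active frame.

```python
import numpy as np

def encode_rolls(notes, num_pitches, num_frames):
    """Build binary frame/onset/offset piano rolls from a note list.

    notes: iterable of (pitch, onset_frame, offset_frame) tuples, assumed
    already quantized to the mel-spectrogram frame grid.
    """
    r_frame = np.zeros((num_pitches, num_frames), dtype=np.float32)
    r_onset = np.zeros_like(r_frame)
    r_offset = np.zeros_like(r_frame)
    for pitch, onset, offset in notes:
        r_frame[pitch, onset:offset] = 1.0   # frame-wise note activation
        r_onset[pitch, onset] = 1.0          # note start
        r_offset[pitch, offset - 1] = 1.0    # last active frame of the note
    return r_frame, r_onset, r_offset
```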
In addition to these binary piano rolls, we also include pitch bend information in the model's input. Pitch bends in MIDI are discrete events, each with a pitch bend value and a corresponding time, which enable continuous changes of pitch. These events often occur at finer intervals than traditional piano roll representations can capture. Therefore, we compute a frame-wise pitch bend value as a weighted average over the durations of the pitch bend events active within each frame, which results in \( R_{\text{bend}} \in (-1, 1)^{P \times T} \).
Each entry of \( R_{\text{bend}} \) is computed using the equation below:
\[ R_\text{bend}(p, n) = \frac{\sum_{i} b_i^p \, d_i^p(n)}{\sum_{i} d_i^p(n)} \]
Here, \( b_i^p \) is the \( i \)-th pitch bend value (in semitones) for pitch \( p \), and \( d_i^p(n) \) is the duration for which that event is active within time frame \( n \). The duration of a pitch bend event is defined as the time it remains active until the next event. This formulation generalizes to both monophonic and polyphonic music, where multiple pitches may undergo different pitch bends simultaneously.
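The computation can be sketched as follows. This is our illustration under stated assumptions: per-pitch event lists of (time, value) pairs sorted by time, bend values already normalized to \( (-1, 1) \), and the last event held until the end of the clip.

```python
import numpy as np

def encode_bend_roll(bend_events, num_pitches, num_frames, frame_dur):
    """Duration-weighted average of pitch bend events per frame.

    bend_events: dict mapping pitch p -> list of (time_sec, bend_value)
    pairs sorted by time; bend values assumed pre-normalized to (-1, 1).
    Each event stays active until the next one; the last is held to the end.
    """
    r_bend = np.zeros((num_pitches, num_frames), dtype=np.float32)
    clip_end = num_frames * frame_dur
    for p, events in bend_events.items():
        # End time of each event's active span [t_i, t_{i+1})
        ends = [t for t, _ in events[1:]] + [clip_end]
        for n in range(num_frames):
            f_start, f_end = n * frame_dur, (n + 1) * frame_dur
            num = den = 0.0
            for (t, b), t_end in zip(events, ends):
                # Overlap of the event's active span with frame n
                d = max(0.0, min(t_end, f_end) - max(t, f_start))
                num += b * d
                den += d
            if den > 0.0:
                r_bend[p, n] = num / den  # weighted average within frame n
    return r_bend
```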
Model Architecture
Synthesis Module
To denoise the noisy mel spectrogram \( M_t \in \mathbb{R}^{F\times T} \), we first pass a time step embedding \( e_t \in \mathbb{R}^D \) through a linear layer \( f: \mathbb{R}^D \rightarrow \mathbb{R}^F \) and add the result to each time frame of \( M_t \). The combined input is then modulated via a FiLM layer, using the auxiliary mel spectrogram \( \tilde{M} \) as the conditioning input.
Each residual block in the denoiser receives the conditioning inputs \( e_t \), \( e_p \), and \( E_R \). The performer embedding \( e_p \) is applied through FiLM layers to modulate the activations inside the residual block, while the time step embedding \( e_t \) and the note features \( E_R \) are added directly.
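The PyTorch sketch below illustrates this conditioning pattern. It is our own minimal example, not the paper's architecture: the channel counts, the gated activation, and the assumption that \( E_R \) arrives as a \( (B, C, T) \) feature map are ours.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: per-channel scale and shift."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x, cond):
        # x: (B, C, T), cond: (B, cond_dim)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

class ResidualBlock(nn.Module):
    """One denoiser block conditioned on e_t, e_p, and E_R (shapes assumed)."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.film = FiLM(emb_dim, channels)         # modulation by e_p
        self.t_proj = nn.Linear(emb_dim, channels)  # projection of e_t
        self.conv = nn.Conv1d(channels, 2 * channels, 3, padding=1)
        self.out = nn.Conv1d(channels, channels, 1)

    def forward(self, x, e_t, e_p, e_r):
        # e_t and the note features E_R are added; e_p modulates via FiLM
        h = x + self.t_proj(e_t).unsqueeze(-1) + e_r
        h = self.film(h, e_p)
        gate, filt = self.conv(h).chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)  # gated activation
        return x + self.out(h)                      # residual connection
```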
The model is trained to minimize the following loss function: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{simple}} + \mathcal{L}_{\text{aux}} \] where \( \mathcal{L}_{\text{simple}} \) follows the standard denoising diffusion objective, training the network \( \epsilon_\theta \) to predict the noise \( \epsilon \) added to the noisy mel spectrogram \( M_t \): \[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{M_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta \left( \sqrt{\bar{\alpha}_t} M_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t, c \right) \right\|^2 \right]. \] In addition, the auxiliary loss \( \mathcal{L}_{\text{aux}} \) minimizes the L2 reconstruction error between the predicted auxiliary mel spectrogram \( \tilde{M} \) and the ground-truth mel spectrogram \( M \).
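A compact training-step sketch under standard DDPM assumptions is shown below; the interface in which the model returns both the noise estimate and the auxiliary mel prediction is our assumption based on the description above.

```python
import torch
import torch.nn.functional as F

def training_loss(model, m0, cond, alpha_bar):
    """One training step: L_total = L_simple + L_aux.

    model: assumed to return (eps_pred, m_aux), the predicted noise and the
    auxiliary mel spectrogram estimate, given (M_t, t, conditioning c).
    m0: clean mel spectrogram batch of shape (B, F, T).
    alpha_bar: 1-D tensor of cumulative noise-schedule products.
    """
    b = m0.size(0)
    t = torch.randint(0, alpha_bar.size(0), (b,), device=m0.device)
    a = alpha_bar[t].view(b, 1, 1)
    eps = torch.randn_like(m0)
    m_t = a.sqrt() * m0 + (1.0 - a).sqrt() * eps  # forward process q(M_t | M_0)
    eps_pred, m_aux = model(m_t, t, cond)
    l_simple = F.mse_loss(eps_pred, eps)  # noise-prediction loss
    l_aux = F.mse_loss(m_aux, m0)         # auxiliary mel reconstruction (L2)
    return l_simple + l_aux
```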
Bend Estimation Module
Results
To evaluate the advantage of modeling pitch bends explicitly, we also designed a baseline model, NoBend, that does not use pitch bend information during synthesis. The NoBend model uses only the synthesis module and takes \( \mathcal{R}_{\text{w/o bend}} \) instead of \( \mathcal{R} \), meaning it must model the F0 trajectory implicitly, without access to explicit pitch bend information.
ViolinDiff outperformed the baseline model in vibrato accuracy and Fréchet Audio Distance (FAD), demonstrating the benefit of explicit pitch bend modeling.
Listening tests showed that ViolinDiff achieved the highest realism score among the synthesized models. For detailed results, please refer to the full paper.