Paper: OpenReview
Authors: Anonymous authors
Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised methods. We demonstrate that our model is able to reliably discover and control important but rarely labeled attributes of speech, such as affect and speaking rate, with as little as 0.1% (3 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline.
This page contains a set of audio samples in support of the paper. All the utterances are unseen during training. Sections 1-3 demo controlling speaking rate, pitch variation (continuous labels), and affect (discrete labels), via semi-supervision of these factors, at multiple supervision levels. Sections 4-6 demo the transfer of controllability to unlabeled speakers, for whom we do not observe labels of the factors above (i.e., domain transfer).
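As a rough guide to what the supervision percentages mean in labeled audio, the sketch below scales the one figure stated in the abstract (0.1% corresponds to about 3 minutes) to the other levels. It assumes, hypothetically, that every percentage is a fraction of the same total training duration; the page does not state the corpus size, so the numbers are estimates only.

```python
from fractions import Fraction

# Stated in the abstract: 0.1% supervision corresponds to about 3 minutes.
ANCHOR_FRACTION = Fraction(1, 1000)
ANCHOR_MINUTES = Fraction(3)

# Assumption (not stated on this page): every supervision level is a
# fraction of the same total training duration.
implied_total_minutes = ANCHOR_MINUTES / ANCHOR_FRACTION

# Supervision levels used for speaking rate and pitch variation.
levels = {
    "0.1%": Fraction(1, 1000),
    "1%": Fraction(1, 100),
    "10%": Fraction(1, 10),
}

# Estimated amount of labeled audio at each level, in minutes.
labeled_minutes = {name: implied_total_minutes * frac for name, frac in levels.items()}

for name, minutes in labeled_minutes.items():
    print(f"{name:>5} supervision ~ {float(minutes):g} labeled minutes")
```

Exact fractions are used instead of floats so that the scaling introduces no rounding error.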
In this section, we demonstrate the degree to which we are able to control the variation in speaking rate at different levels of supervision. Speaking rate is normalized to match the standard normal prior (zs~N(μ=0, σ=1)).
zs (Speaking Rate) | 0.1% supervision | 1% supervision | 10% supervision |
---|---|---|---|
-5σ |   |   |   |
-3σ |   |   |   |
-1σ |   |   |   |
0 |   |   |   |
1σ |   |   |   |
3σ |   |   |   |
5σ |   |   |   |
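The row labels above are values of the latent variable in units of σ. One plausible way to obtain such a scale is z-score normalization of the raw attribute, sketched below. This is an assumption for illustration, not necessarily the normalization used in the paper; the function names and the speaking-rate values are hypothetical.

```python
from statistics import mean, pstdev


def fit_normalizer(raw_values):
    """Return (mu, sigma) of the raw attribute over the labeled data."""
    mu = mean(raw_values)
    sigma = pstdev(raw_values)
    if sigma == 0:
        raise ValueError("attribute has zero variance; cannot normalize")
    return mu, sigma


def to_latent(raw, mu, sigma):
    """Map a raw attribute value onto the zero-mean, unit-variance scale."""
    return (raw - mu) / sigma


def from_latent(z, mu, sigma):
    """Map a latent value in units of sigma back to the raw attribute scale."""
    return mu + z * sigma


# Hypothetical speaking rates (for example, syllables per second).
rates = [4.0, 4.5, 5.0, 5.5, 6.0]
mu, sigma = fit_normalizer(rates)
z_values = [to_latent(r, mu, sigma) for r in rates]

# The same sweep of latent values as the rows of the table.
sweep = [-5, -3, -1, 0, 1, 3, 5]
for z in sweep:
    print(f"z_s = {z:+d} sigma -> raw rate {from_latent(z, mu, sigma):.2f}")
```

Under this normalization, the ±3σ and ±5σ rows correspond to raw values well outside the range of the hypothetical data, so those samples probe extrapolation rather than interpolation.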
In this section, we demonstrate the degree to which we are able to control the variation in F0 (fundamental frequency, a proxy for arousal or excitement) at different levels of supervision. F0 variation is normalized to match the standard normal prior (zs~N(μ=0, σ=1)).
zs (Pitch Variation) | 0.1% supervision | 1% supervision | 10% supervision |
---|---|---|---|
-5σ |   |   |   |
-3σ |   |   |   |
-1σ |   |   |   |
0 |   |   |   |
1σ |   |   |   |
3σ |   |   |   |
5σ |   |   |   |
In this section, we demonstrate the degree to which we are able to control the variation in affect at different levels of supervision. Controlling affect requires higher levels of supervision than controlling speaking rate or pitch variation. The variations are generally subtle, so we also present the fully supervised case (100% supervision) as an upper bound.
zs (Affect) | 1% supervision | 10% supervision | 20% supervision | 100% supervision |
---|---|---|---|---|
Arousal=-2 (low), Valence=-2 (angry) |   |   |   |   |
Arousal=-2 (low), Valence=-1 (sad) |   |   |   |   |
Arousal=-2 (low), Valence=2 (happy) |   |   |   |   |
Arousal=2 (high), Valence=-2 (angry) |   |   |   |   |
Arousal=2 (high), Valence=-1 (sad) |   |   |   |   |
Arousal=2 (high), Valence=2 (happy) |   |   |   |   |
In this section, we demonstrate the degree to which speaking rate control at 10% supervision generalizes to speakers who have no speaking rate labels in training. These speakers are colored blue.
zs (Speaking Rate) | female w/ label | male w/ label | female w/o label | female w/o label | male w/o label |
---|---|---|---|---|---|
-5σ |   |   |   |   |   |
-3σ |   |   |   |   |   |
-1σ |   |   |   |   |   |
0 |   |   |   |   |   |
1σ |   |   |   |   |   |
3σ |   |   |   |   |   |
5σ |   |   |   |   |   |
In this section, we demonstrate the degree to which pitch variation control at 10% supervision generalizes to speakers who have no pitch variation labels in training. These speakers are colored blue.
zs (Pitch Variation) | female w/ label | male w/ label | female w/o label | female w/o label | male w/o label |
---|---|---|---|---|---|
-5σ |   |   |   |   |   |
-3σ |   |   |   |   |   |
-1σ |   |   |   |   |   |
0 |   |   |   |   |   |
1σ |   |   |   |   |   |
3σ |   |   |   |   |   |
5σ |   |   |   |   |   |
In this section, we demonstrate the degree to which affect control at 10% supervision generalizes to speakers who have no affect labels in training. These speakers are colored blue.
zs (Affect) | male w/ label | female w/ label | female w/o label | female w/o label | male w/o label |
---|---|---|---|---|---|
Arousal=-2 (low), Valence=-2 (angry) |   |   |   |   |   |
Arousal=-2 (low), Valence=-1 (sad) |   |   |   |   |   |
Arousal=-2 (low), Valence=2 (happy) |   |   |   |   |   |
Arousal=2 (high), Valence=-2 (angry) |   |   |   |   |   |
Arousal=2 (high), Valence=-1 (sad) |   |   |   |   |   |
Arousal=2 (high), Valence=2 (happy) |   |   |   |   |   |