Unit-selection synthesis

Paul Taylor

doi:10.1017/CBO9780511816338.018

We now turn to unit-selection synthesis, which is the dominant synthesis technique in text-to-speech today. Unit selection is the natural extension of second-generation concatenative systems; it addresses how to manage large numbers of units, how to extend prosody beyond F0 and timing control, and how to alleviate the distortions caused by signal processing.

From concatenative synthesis to unit selection

The main progression from first- to second-generation systems was a move away from fully explicit synthesis models. Of the first-generation techniques, classical LP synthesis differs from formant synthesis in that it uses data, rather than rules, to specify vocal-tract behaviour. Both first-generation techniques, however, still used explicit source models. The improved quality of second-generation techniques stems largely from abandoning explicit source models as well, regardless of whether TD-PSOLA (no model), RELP (use of real residuals) or a sinusoidal model (no strict source/filter model) is employed. The direction of progress is therefore clear: a movement away from explicit, hand-written rules, towards implicit, data-driven techniques.

By the early 1990s, a typical second-generation system was a concatenative diphone system in which the pitch and timing of the original waveforms were modified by a signal-processing technique to match the pitch and timing of the specification. These second-generation systems assume that the specification from the text-analysis system comprises a list of items, as before, where each item is specified by its phonetic/phonemic identity, a pitch value and a timing value. Hence, these systems assume the following.

16 - Unit-selection synthesis
