Book contents
- Frontmatter
- Contents
- Foreword
- Preface
- 1 Introduction
- 2 Communication and language
- 3 The text-to-speech problem
- 4 Text segmentation and organisation
- 5 Text decoding: finding the words from the text
- 6 Prosody prediction from text
- 7 Phonetics and phonology
- 8 Pronunciation
- 9 Synthesis of prosody
- 10 Signals and filters
- 11 Acoustic models of speech production
- 12 Analysis of speech signals
- 13 Synthesis techniques based on vocal-tract models
- 14 Synthesis by concatenation and signal-processing modification
- 15 Hidden-Markov-model synthesis
- 16 Unit-selection synthesis
- 17 Further issues
- 18 Conclusion
- Appendix A Probability
- Appendix B Phone definitions
- References
- Index
14 - Synthesis by concatenation and signal-processing modification
Published online by Cambridge University Press: 25 January 2011
Summary
We saw in Chapter 13 that, while vocal-tract methods can often generate intelligible speech, they seem fundamentally limited in their ability to generate natural-sounding speech. In the case of formant synthesis, the main limitation lies not so much in generating the speech from the parametric representation as in generating these parameters from the input specification created by the text-analysis process. The mapping between the specification and the parameters is highly complex, and seems beyond what we can express in explicit human-derived rules, no matter how “expert” the rule designer. We face the same problems with articulatory synthesis, and in addition must deal with two further facts: acquiring data is fundamentally difficult, and improving naturalness often necessitates a considerable increase in the complexity of the synthesiser.
A partial solution to the complexities of specification-to-parameter mapping is found in the classical LP technique, whereby we bypassed the issue of generating the vocal-tract parameters explicitly and instead measured them from data. The source parameters, however, were still specified by an explicit model, which was identified as the main source of the unnaturalness.
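The division of labour in the classical LP technique can be illustrated with a minimal sketch. It is not the book's implementation: it assumes the autocorrelation method with the Levinson-Durbin recursion for measuring the filter coefficients, and all function names (`autocorrelation`, `levinson_durbin`, `impulse_train`, `lp_synthesise`) are hypothetical. The vocal-tract filter is measured from a frame of data, while the source is still an explicit model, here a bare impulse train.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    Returns (a, err), where a[k-1] is the coefficient of x[n-k] in the
    predictor x[n] ~ sum_k a[k-1] * x[n-k], and err is the residual energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def impulse_train(n_samples, period):
    """Explicit (over-simple) voiced source: one unit impulse per pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def lp_synthesise(excitation, a, gain=1.0):
    """Pass an excitation through the all-pole filter defined by a."""
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                y += ak * out[n - k]
        out.append(y)
    return out


# Stand-in for a recorded frame (an assumption for illustration only):
# the response of a known two-pole filter to a single impulse.
true_a = [1.3, -0.8]
frame = lp_synthesise([1.0] + [0.0] * 399, true_a)

# Filter parameters are measured from the data ...
a_hat, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)

# ... but the source is still generated by an explicit model.
resynth = lp_synthesise(impulse_train(400, period=80), a_hat)
```

In this sketch only `a_hat` comes from data; the excitation is fixed by `impulse_train`, which is the part of the classical technique that the summary identifies as the main source of unnaturalness.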
In this chapter we introduce a set of techniques that attempt to get around these limitations. In a way, these can be viewed as extensions of the classical LP technique in that they use a data-driven approach: the increase in quality, however, largely arises from the abandonment of the over-simplistic impulse/noise source model.
- Type: Chapter
- Information: Text-to-Speech Synthesis, pp. 412-434. Publisher: Cambridge University Press. Print publication year: 2009