Book contents
- Frontmatter
- Contents
- Foreword
- Preface
- 1 Introduction
- 2 Communication and language
- 3 The text-to-speech problem
- 4 Text segmentation and organisation
- 5 Text decoding: finding the words from the text
- 6 Prosody prediction from text
- 7 Phonetics and phonology
- 8 Pronunciation
- 9 Synthesis of prosody
- 10 Signals and filters
- 11 Acoustic models of speech production
- 12 Analysis of speech signals
- 13 Synthesis techniques based on vocal-tract models
- 14 Synthesis by concatenation and signal-processing modification
- 15 Hidden-Markov-model synthesis
- 16 Unit-selection synthesis
- 17 Further issues
- 18 Conclusion
- Appendix A Probability
- Appendix B Phone definitions
- References
- Index
15 - Hidden-Markov-model synthesis
Published online by Cambridge University Press: 25 January 2011
Summary
We saw in Chapter 13 that, despite the approximations in all the vocal-tract models concerned, the limiting factor in generating high-quality speech lies not so much in converting the parameters into speech as in knowing which parameters to use for a given synthesis specification. Determining these by hand-written rules can produce fairly intelligible speech, but the inherent complexities of speech seem to place an upper limit on the quality that can be achieved in this way. The various second-generation synthesis techniques explained in Chapter 14 solve the problem by simply measuring the values from real speech waveforms. Although this is successful to a certain extent, it is not a perfect solution. As we will see in Chapter 16, we can never collect enough data to cover all the effects we wish to synthesize, and the coverage we do have in the database is often very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense all we are doing is reordering the original data.
An alternative is to use statistical, machine-learning techniques to infer the specification-to-parameter mapping from data. While this and the concatenative approach can both be described as data-driven, in the concatenative approach we are effectively memorising the data, whereas in the statistical approach we are attempting to learn the general properties of the data.
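The contrast between memorising and learning can be illustrated with a toy sketch. Everything in it is hypothetical and chosen only for illustration: the specification is reduced to a (phone, stressed) pair, the parameter is a single number, the data values are invented, and the "statistical" model is a simple global mean plus per-feature average offsets, not the HMM method this chapter describes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical training data: (phone, stressed) -> one acoustic parameter value.
DATA = [
    (("a", True), 130.0),
    (("a", True), 134.0),
    (("a", False), 110.0),
    (("i", True), 140.0),
]

def build_lookup(data):
    """Concatenative-style memory: store every observed value verbatim."""
    table = defaultdict(list)
    for spec, value in data:
        table[spec].append(value)
    return dict(table)

def lookup_predict(table, spec):
    """Return a stored value, or None if the specification was never recorded."""
    values = table.get(spec)
    return values[0] if values else None

def fit_additive(data):
    """Statistical-style model: global mean plus an average offset per feature value."""
    overall = mean(value for _, value in data)
    phone_residuals = defaultdict(list)
    stress_residuals = defaultdict(list)
    for (phone, stressed), value in data:
        phone_residuals[phone].append(value - overall)
        stress_residuals[stressed].append(value - overall)
    phone_offsets = {k: mean(v) for k, v in phone_residuals.items()}
    stress_offsets = {k: mean(v) for k, v in stress_residuals.items()}
    return overall, phone_offsets, stress_offsets

def additive_predict(model, spec):
    """Combine the learned offsets; unseen feature values contribute nothing."""
    overall, phone_offsets, stress_offsets = model
    phone, stressed = spec
    return overall + phone_offsets.get(phone, 0.0) + stress_offsets.get(stressed, 0.0)

if __name__ == "__main__":
    table = build_lookup(DATA)
    model = fit_additive(DATA)
    unseen = ("i", False)  # this combination never occurs in DATA
    print(lookup_predict(table, unseen))
    print(additive_predict(model, unseen))
```

For the unseen combination, the lookup has nothing to return, while the fitted model still produces an estimate by combining what it learned separately about the phone and about stress. That is the sense in which the statistical approach generalises beyond the recorded data, at the cost of relying on modelling assumptions.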
- Type: Chapter
- Information: Text-to-Speech Synthesis, pp. 435-473
- Publisher: Cambridge University Press
- Print publication year: 2009