Representing Tone in Levenshtein Distance

12 - Representing Tone in Levenshtein Distance

Published online by Cambridge University Press: 12 September 2012

Cathryn Yang and Andy Castro

Edited by John Nerbonne, Charlotte Gooskens, Sebastian Kürschner and Renée van Bezooijen

Cathryn Yang: La Trobe University
John Nerbonne: University of Groningen
Charlotte Gooskens: University of Groningen
Sebastian Kürschner: Friedrich-Alexander-Universität Erlangen-Nürnberg
Renée van Bezooijen: University of Groningen

Summary

Abstract: Levenshtein distance, also known as string edit distance, has been shown to correlate strongly with both perceived distance and intelligibility in various Indo-European languages (Gooskens and Heeringa, 2004; Gooskens, 2006). We apply Levenshtein distance to dialect data from Bai (Allen, 2004), a Sino-Tibetan language, and Hongshuihe (HSH) Zhuang (Castro and Hansen, accepted), a Tai language. In applying Levenshtein distance to languages with contour tone systems, we ask the following questions: 1) How much variation in intelligibility can tone alone explain? and 2) Which representation of tone results in the Levenshtein distance that shows the strongest correlation with intelligibility test results? This research evaluates six representations of tone: onset, contour and offset; onset and contour only; contour and offset only; target approximation (Xu and Wang, 2001); autosegments of H and L; and Chao's (1930) pitch numbers. For both languages, the more fully explicit onset-contour-offset and onset-contour representations showed significantly stronger inverse correlations with intelligibility. This suggests that, for cross-dialectal listeners, the optimal representation of tone in Levenshtein distance should be at a phonetically explicit level and include information on both onset and contour.

INTRODUCTION

The Levenshtein distance algorithm measures the phonetic distance between closely related language varieties by computing the minimum total cost of the insertions, deletions and substitutions needed to transform the phonetic segment string of one cognate into that of another. After Kessler (1995) first applied the algorithm to Irish Gaelic dialect data, Heeringa (2004) showed that cluster analysis based on Levenshtein distances agreed remarkably well with expert consensus on Dutch dialect groupings.
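The edit-cost idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `levenshtein`, the uniform default costs of 1, and the toy segment strings are all assumptions made for illustration. The example forms are invented and are not Bai or Zhuang data; the tone is written as Chao-style pitch digits appended as separate segments, which is only one of several conceivable encodings.

```python
from typing import Sequence


def levenshtein(source: Sequence[str], target: Sequence[str],
                indel_cost: float = 1.0, sub_cost: float = 1.0) -> float:
    """Minimum total cost of turning `source` into `target`.

    Each element is one phonetic segment. Costs are uniform here
    (an assumption); real studies often weight substitutions.
    """
    # previous[j] = cost of turning the empty prefix into target[:j]
    previous = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        # current[0] = cost of deleting the first i source segments
        current = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + indel_cost,                         # delete s
                current[j - 1] + indel_cost,                      # insert t
                previous[j - 1] + (0.0 if s == t else sub_cost),  # match or substitute
            ))
        previous = current
    return previous[-1]


# Invented toy cognates: segments followed by two Chao-style pitch digits.
variety_a = ["m", "a", "5", "5"]
variety_b = ["m", "a", "3", "5"]
variety_c = ["m", "a", "2", "1"]

print(levenshtein(variety_a, variety_b))  # one pitch digit differs
print(levenshtein(variety_a, variety_c))  # both pitch digits differ
```

Because the function operates on lists of segments rather than on raw characters, the same code can be reused with a different tone encoding simply by changing how each syllable is split into segments.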

Type: Chapter
Information: Computing and Language Variation: International Journal of Humanities and Arts Computing Volume 2, pp. 205-220

Publisher: Edinburgh University Press

Print publication year: 2009
