Compound noun segmentation based on lexical data extracted from corpus

Published online by Cambridge University Press: 25 July 2001

JUNTAE YOON
Affiliation:
IRCS, University of Pennsylvania, 3401 Walnut St., Suite 400A, Philadelphia, PA 19104-6228, USA; e-mail: jtyoon@linc.cis.upenn.edu

Abstract

Compound noun segmentation is one of the crucial problems in Korean language processing because a series of nouns in Korean may appear without spaces in real text, which makes it difficult to identify their morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. The segmentation consists of two tasks. First, a Hand-Built Segmentation Dictionary (HBSD) is used to segment compound nouns that occur frequently or need exceptional processing. Second, a segmentation algorithm using data from a corpus is applied, in which simple nouns and their frequencies are stored in a Simple Noun Dictionary (SND). The analysis is based on modified tabular parsing using a min-max operation. Our experiments show an accuracy rate of about 97.29%.
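
The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the min-max operation, so the sketch assumes one plausible reading, in which a candidate segmentation is scored by the lowest frequency among its nouns and the segmentation with the highest such score is kept. The function name `segment`, the table layout, and the toy English entries and frequencies in `HBSD` and `SND` are all hypothetical.

```python
# Hypothetical toy dictionaries; entries and frequencies are invented for illustration.
HBSD = {"informationretrievalsystem": ["information", "retrieval", "system"]}
SND = {
    "information": 120, "retrieval": 45, "system": 300,
    "in": 5, "formation": 30, "sys": 2, "tem": 1,
}


def segment(compound, hbsd=HBSD, snd=SND):
    """Return a list of simple nouns, or None if no segmentation is found."""
    # Stage 1: frequent or exceptional compounds are looked up directly.
    if compound in hbsd:
        return list(hbsd[compound])

    # Stage 2: fill a table over prefixes of the compound.
    # best[i] holds (score, pieces) for compound[:i], where score is the
    # minimum frequency among the pieces (assumed reading of "min-max").
    n = len(compound)
    best = [None] * (n + 1)
    best[0] = (float("inf"), [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue  # the prefix compound[:start] cannot be segmented
            freq = snd.get(compound[start:end])
            if freq is None:
                continue  # not a known simple noun
            score = min(best[start][0], freq)
            # Keep the candidate whose weakest noun is strongest (max of mins).
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [compound[start:end]])
    return best[n][1] if best[n] is not None else None


print(segment("informationsystem"))
```

Under these assumptions, "informationsystem" is split as "information" + "system" rather than "in" + "formation" + "system", because the weakest noun in the first split is far more frequent than the weakest noun in the second.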

Type
Research Article
Copyright
© 2001 Cambridge University Press
