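Master's Thesis, Dalian University of Technology
Abstract (Chinese)
Word sense disambiguation (WSD) plays an important role in many areas, including machine translation, information retrieval, sentence analysis, and speech recognition. Research on WSD methods therefore has considerable theoretical and practical significance within natural language processing. This thesis studies WSD methods that are based on supervised learning algorithms and supported by annotated corpora.
The supervised AdaBoost.MH algorithm is introduced into the WSD model. First, a simple decision-tree algorithm learns from the knowledge sources in the context of a polysemous word and produces weak rules of low accuracy. AdaBoost.MH then strengthens these weak rules; after a number of iterations it yields a single rule of higher accuracy, which serves as the final disambiguation model. With the learning efficiency and practicality of the system in mind, the thesis also gives a simple method for terminating the iterations.
To obtain knowledge sources from the context of polysemous words, the thesis builds on traditional knowledge sources such as part-of-speech tags and local collocation sequences, and introduces a new one: semantic categories. Experimental results show that adding the semantic-category knowledge source helps improve both the learning efficiency of the algorithm and the disambiguation accuracy.
Building the large-scale manually annotated corpora required by supervised learning algorithms is quite difficult. The thesis proposes a method that uses WWW resources to automatically construct annotated corpora suitable for disambiguating Chinese polysemous words, and verifies the usability of such a corpus through experiments.
In WSD experiments on 6 typical Chinese polysemous words and on 20 Chinese polysemous words from the SENSEVAL3 Chinese corpus, the AdaBoost.MH algorithm achieved relatively high open-test accuracy (85.75% and 75.84%, respectively).
Key Words: Natural Language Processing; Word Sense Disambiguation; AdaBoost.MH Algorithm; Knowledge Sources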
A Chinese Word Sense Disambiguation Method Based on the AdaBoost.MH Algorithm
Abstract
Word sense disambiguation (WSD) plays an important role in many areas of natural language processing, such as machine translation, information retrieval, sentence analysis, and speech recognition. Research on WSD therefore has considerable theoretical and practical significance. This thesis studies supervised learning algorithms that acquire WSD knowledge from multiple knowledge sources, supported by a large sense-tagged Chinese corpus.
An approach to Chinese word sense disambiguation based on the supervised AdaBoost.MH learning algorithm is presented. Simple decision trees (stumps) first learn weak rules of low accuracy from the knowledge sources in the context of a polysemous word. AdaBoost.MH then calls this weak learner repeatedly and combines the resulting rules into a single, more accurate rule. A simple stopping criterion is also presented, with the learning efficiency and practical utility of the system in mind.
To extract more contextual information for Chinese WSD, we introduce a new knowledge source, semantic categorization, alongside two classical ones: the part of speech of neighboring words and local collocations. Experimental results show that semantic categorization improves both the learning efficiency of the algorithm and the disambiguation accuracy.
Because building a broad-coverage, semantically annotated corpus by hand is difficult, an approach that uses WWW search engines to automatically obtain an annotated corpus for Chinese WSD is presented, and experiments confirm the usability of the resulting corpus.
In open tests, the AdaBoost.MH algorithm achieves relatively high disambiguation accuracy: 85.75% for 6 typical polysemous Chinese words and 75.84% for 20 polysemous words from the SENSEVAL3 Chinese corpus, respectively.
Key Words: Natural Language Processing; Word Sense Disambiguation; AdaBoost.MH Algorithm; Multiple Knowledge Sources
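The boosting procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes the common textbook formulation of AdaBoost.MH with one-feature decision stumps and smoothed confidence-rated predictions, it runs for a fixed number of rounds instead of applying the thesis's stopping criterion, and the feature strings, the toy data, the smoothing constant `eps`, and the function names `train_adaboost_mh` and `predict` are all hypothetical.

```python
import math

def train_adaboost_mh(X, y, n_labels, n_rounds=10, eps=1e-6):
    """X: list of sets of active binary context features (hypothetical encoding).
    y: list of correct sense indices. Returns a list of (feature, scores) stumps."""
    m = len(X)
    features = sorted(set().union(*X))
    # Y[i][l] = +1 if sense l is the correct sense of example i, else -1.
    Y = [[1 if y[i] == l else -1 for l in range(n_labels)] for i in range(m)]
    # Uniform initial distribution over (example, sense) pairs.
    D = [[1.0 / (m * n_labels)] * n_labels for _ in range(m)]
    model = []
    for _ in range(n_rounds):
        best = None
        for f in features:
            # W[b][l] = [weight with Y = -1, weight with Y = +1] for the block
            # b = 1 (feature present) or b = 0 (feature absent).
            W = [[[0.0, 0.0] for _ in range(n_labels)] for _ in range(2)]
            for i in range(m):
                b = 1 if f in X[i] else 0
                for l in range(n_labels):
                    W[b][l][(Y[i][l] + 1) // 2] += D[i][l]
            # Normalization factor Z of this stump; smaller is better.
            Z = 2.0 * sum(math.sqrt(w[0] * w[1]) for blk in W for w in blk)
            if best is None or Z < best[0]:
                # Smoothed confidence-rated prediction for each block and sense.
                c = [[0.5 * math.log((w[1] + eps) / (w[0] + eps)) for w in blk]
                     for blk in W]
                best = (Z, f, c)
        _, f, c = best
        model.append((f, c))
        # Increase the weight of pairs the chosen stump gets wrong, then renormalize.
        total = 0.0
        for i in range(m):
            b = 1 if f in X[i] else 0
            for l in range(n_labels):
                D[i][l] *= math.exp(-Y[i][l] * c[b][l])
                total += D[i][l]
        for i in range(m):
            for l in range(n_labels):
                D[i][l] /= total
    return model

def predict(model, x, n_labels):
    """Sum the stump scores for every sense and return the highest-scoring one."""
    scores = [0.0] * n_labels
    for f, c in model:
        b = 1 if f in x else 0
        for l in range(n_labels):
            scores[l] += c[b][l]
    return max(range(n_labels), key=scores.__getitem__)

# Toy data for a hypothetical two-sense word; the feature names are invented.
train_X = [
    {"prev_pos=v", "next_word=water"},
    {"prev_pos=n", "next_word=water"},
    {"prev_pos=v", "next_word=money"},
    {"prev_pos=n", "next_word=money"},
]
train_y = [0, 0, 1, 1]

model = train_adaboost_mh(train_X, train_y, n_labels=2, n_rounds=5)
print(predict(model, {"prev_pos=a", "next_word=money"}, n_labels=2))
```

In this sketch each weak rule tests a single context feature and assigns a score to every sense, and the final rule is the sum of those scores over all rounds. Selecting the stump with the smallest normalization factor is one standard way to choose a weak rule per round; other weak learners could be substituted without changing the outer loop.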
Contents
Abstract (Chinese)
Abstract
Introduction
1 Problem Description
1.1 The Word Sense Disambiguation Problem and Its Significance
1.1.1 Word Sense Disambiguation
1.1.2 Significance of Word Sense Disambiguation Research
1.2 Research Status in China and Abroad
1.2.1 Supervised Disambiguation Methods
1.2.2 Dictionary-Based Disambiguation Methods
1.2.3 Unsupervised Disambiguation Methods
1.3 Main Open Problems
1.3.1 Context Selection
1.3.2 Division of Word Senses
1.4 Evaluation Methods for Word Sense Disambiguation
1.5 Work of This Thesis
2 An AdaBoost.MH Algorithm Model for WSD
2.1 Basic Concepts
2.2 Overview of the AdaBoost.MH Algorithm
2.2.1 Background of the AdaBoost Algorithm
2.2.2 Basic Idea of the AdaBoost Algorithm
2.2.3 Analysis of the Algorithm's Error
2.2.4 Multiclass Classification
2.2.5 Advantages and Disadvantages of the AdaBoost Algorithm
2.3 Description of the AdaBoost.MH Algorithm for WSD
2.4 Design of the Weak Learner and Choice of Z_t
3 Selection of Contextual Features
3.1 Part-of-Speech (POS) Tags of Neighboring Words
3.2 Local Collocation Information
3.3 Semantic Category Information
3.3.1 Introduction to Tongyici Cilin (《同义词词林》)
3.3.2 Handling Words Not Listed in Tongyici Cilin
3.3.3 Selection of Semantic Category Information
4 Chinese WSD Experiments with AdaBoost.MH
4.1 Corpora
4.1.1 People's Daily Corpus
4.1.2 SENSEVAL3 Chinese Corpus
4.2 Experimental Evaluation and Results
4.2.1 Results and Evaluation on the People's Daily Corpus
4.2.2 Results and Evaluation on the SENSEVAL3 Chinese Corpus
4.3 Determining the Number of Iterations
4.4 Effect of Introducing Semantic Information on Disambiguation Performance
4.4.1 Experiments on the People's Daily Corpus
4.4.2 Experiments on the SENSEVAL3 Chinese Corpus
5 A Method for Automatically Building an Annotated Corpus
5.1 A Model for Automatically Constructing an Annotated Corpus
5.1.1 Constructing Search Keywords
5.1.2 Building and Pruning the Corpus
5.2 Usability Evaluation Experiments and Analysis of the Corpus
5.2.1 Corpus
5.2.2 New Collocations Found in the Corpus
5.2.3 Selection of Contextual Features
5.2.4 Experimental Results and Evaluation
Conclusion
References
Appendix A SENSEVAL3 Chinese Corpus Samples
Appendix B Corpus Samples Annotated with Semantic Category Information
Appendix C Semantic Information Samples from Tongyici Cilin (《同义词词林》)
Appendix D Semantic Information Samples from Tongyici Cilin, Extended Edition (《同义词词林扩展版》)
Papers Published During the Master's Program
Acknowledgements
Dalian University of Technology Thesis Copyright Authorization