毕业设计 (论 文)
模式识别中聚类分析算法综述
院 别 专业名称 班级学号 学生姓名 指导教师
信息与计算科学
2013年06月10日
模式识别中聚类分析算法综述
摘 要
聚类分析是将数据分类到不同的类或者簇的过程,聚类分析是一种探索性的分析,在分类的过程中,人们不必事先给出一个分类的标准,聚类分析能够从样本数据出发,自动进行分类。从实际应用的角度看,聚类分析是数据挖掘的主要任务之一。而且聚类能够作为一个独立的工具获得数据的分布状况,观察每一簇数据的特征,集中对特定的聚簇集合作进一步地分析。聚类分析还可以作为其他算法(如分类和定性归纳算法)的预处理步骤。
本文对模式识别中聚类分析算法进行了综述,主要论述了顺序算法、层次算法和基于代价函数最优的聚类算法,其中层次算法分为合并算法和分裂算法,其中合并算法又包括最短距离法、最长距离法、中间距离法、重心法、类平均距离法;而基于代价函数最优的聚类算法则分为K均值算法和迭代自组织的数据分析算法。本文首先介绍了聚类算法的应用范围及其意义,并对聚类算法的基本分类进行了简单介绍,同时对可能聚类的数量进行了阐述。之后,详细介绍了上述各类算法的算法思想及其具体的实现步骤,并在顺序算法一章中给出了BSAS算法的改进,并运用MATLAB对层次算法和基于代价函数最优的聚类算法中的几个具体算法进行了代码实现,通过对样品图片的识别分类认识了聚类算法的具体应用,并且认识到了几类算法各自的特点。其中,层次算法中的五个算法实现步骤较为简单,但在其实现过程中需要输入一个合适的阈值,阈值的大小直接影响最后的结果,而且相同的阈值,不同的算法可能得到不同的结果。而K均值算法的实现结果则与阈值无关,只需定义迭代次数和类中心个数。与之相比,ISODATA算法则具有自组织性,会在计算过程中不断调整类中心的个数。
关键词: 聚类分析,顺序算法,层次算法,基于代价函数最优的聚类算法
The Overview of Pattern Recognition Clustering Algorithm
Author:Whuenkmnkn Tutor:Cnunnknhcfjuj
Abstract
Cluster analysis is a data classification into different classes or clusters in the process, Cluster analysis is an exploratory analysis, in the classification process, people do not give a classification criterion in advance, cluster analysis to the data from the sample starting, automatic classification. From a practical perspective, Cluster analysis is one of the main tasks of data mining. Moreover clustering can be used as a separate tool to obtain the distribution of the data, observe characteristics of the data in each cluster and make a further analysis on particular clustered sets. Cluster analysis can also be used as other algorithms’ (such as classification and qualitative induction algorithm) preprocessing step.
In this paper, clustering algorithms in pattern recognition are reviewed, mainly discussing the sequential algorithm, hierarchical algorithms and clustering algorithm based on cost function optimization. Hierarchical algorithm is divided into division algorithm and merging algorithm, which also includes the shortest distance algorithm, the longest distance algorithm, the middle distance algorithm, center of gravity algorithm, the class average distance algorithm; while the clustering algorithm based on cost function optimization is divided into K-means algorithm and iterative self-organizing data analysis algorithms. At first this paper describes the application of clustering algorithm and its significance, and give a brief introduction of the basic clustering algorithm, while the possible number of clusters are described. And then the algorithm ideas and concrete steps to achieve of various algorithms above are detailed. At the same time, the improved BSAS algorithm is gave in the chapter about the sequential algorithm and several specific algorithms in the hierarchical clustering algorithm and the algorithm based on cost function optimization are coded by MATLAB. Through identifying sample images, I get to know the specific application and the characteristics of different clustering algorithms. The five specific hierarchical algorithms’ are easy to achieve by several simple steps, while its implementation process need to enter an appropriate threshold value. The threshold value directly affects the final clustering results and different algorithms may produce different results with the same threshold value. While the results of K-means algorithm is independent of the threshold, simply define the number of
iterations and the number of cluster center. In contrast, ISODATA algorithm is self-organization and will adjust the number of cluster center continuously during the calculation process.
Key Words: Cluster Analysis, Sequential Algorithm, Hierarchical Algorithm, Clustering
Algorithm Based on Cost Function Optimization
目 录
1 绪论 ...................................................................................................................................... 1
1.1 课题背景及意义 ......................................................................................................... 1 1.2 聚类算法的种类 ......................................................................................................... 1 1.3 可能聚类的数量 ......................................................................................................... 2 2 聚类算法Ⅰ:顺序算法 ...................................................................................................... 4
2.1 基本顺序算法方案描述 ............................................................................................. 4 2.2 聚类数的估计 ............................................................................................................. 5 2.3 BSAS的改进 .............................................................................................................. 6 2.4 改进阶段 ..................................................................................................................... 7 3 聚类算法Ⅱ:层次算法 ...................................................................................................... 9
3.1 合并算法 ..................................................................................................................... 9
3.1.1 最短距离法 ...................................................................................................... 10 3.1.2 最长距离法 ...................................................................................................... 11 3.1.3 中间距离法 ...................................................................................................... 12 3.1.4 重心法 .............................................................................................................. 12 3.1.5 类平均距离法 .................................................................................................. 13 3.2 分裂算法 ................................................................................................................... 14 4 聚类算法Ⅲ:基于代价函数最优的聚类算法 ................................................................ 16
4.1 K均值算法 ............................................................................................................... 16 4.2 迭代自组织的数据分析算法 ................................................................................... 16 结 论 .................................................................................................................................. 19 致 谢 .................................................................................................................................. 20 参考文献 .................................................................................................................................. 20 附 录 A ................................................................................................................................... 20 附 录 B ................................................................................................................................... 24