A Survey of Text Classification Research (3)

2020-06-05 09:37

Zhang Bofeng et al.: A Survey of Text Classification Based on Machine Learning


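[Fig. 1: Performance comparison of classification methods (data from []). Axis: breakeven point (or F1) values, 0.65 to 0.9. Methods shown: Naive Bayes, Decision Tree, Decision Rule, LLSF, On Line, Rocchio, NN, GIS, k-NN, SVM, AdaBoost.MH. Legend: Reuters Version #3.]

Although not every experiment necessarily satisfies the three conditions of Section 5.2, and the set of experiments is not necessarily complete, we still take the results to reflect some facts. Boosting, SVM, example-based methods, and regression methods rank at the top; NN and on-line methods come slightly behind these; Rocchio and Naive Bayes rank last, consistently about 10% behind the best-performing groups.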

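6 Chinese Text Classification

In fact, there is no theoretical obstacle to applying all of the classification methods discussed in this paper to text in Chinese (or other Asian scripts). Chinese text classification research is special in two respects:

(1) Obtaining terms

Chinese text has no delimiters between words, so if words are used as terms, explicit word information cannot be read directly from the text. The most direct approach is to segment the text into words (word segmentation) before classification. Segmentation is itself an important area of Chinese language research and is not yet fully solved; [] discusses several segmentation methods. Segmentation errors may cause problems for classification, but because the research focus is different, the TC literature still uses simple segmentation methods. Most papers use the N-gram method, which splits a sentence into runs of N consecutive characters. For N=2 (called Bi-gram), "文本分类" ("text classification") is split into three terms: "文本", "本分", and "分类". On specific collections, the performance obtained with N-grams differs little from that obtained after word segmentation [][][].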
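The Bi-gram split described above can be sketched in a few lines of Python. This is only an illustration of overlapping character N-grams; the function name `char_ngrams` is a hypothetical helper, not something taken from the cited papers.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of `text`, in order.

    Illustrative helper (name is an assumption): slides a window of
    n characters over the string one position at a time.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("文本分类", n=2)
print(bigrams)  # ['文本', '本分', '分类']
```

Note that the middle term "本分" straddles the boundary between the two real words, which is the kind of noise a proper word segmenter would avoid.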

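(2) Test collections

Chinese text classification research has not yet produced a public, standard test collection. Many corpora are still being built and refined, and most experiments are run on text resources that the researchers collected and organized themselves [][]. The Chinese Natural Language Processing Open Platform of Fudan University provides a text classification corpus [], consisting of topical articles in 20 categories such as politics, economics, and military affairs.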

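7 Summary

This paper has discussed the state of research on machine-learning-based text classification, summarizing it mainly in terms of text indexing, classification methods, and performance evaluation. The main points are as follows:

(1) Text indexing mainly adopts the VSM model; term weighting and dimensionality reduction techniques mostly rely on statistical criteria;

(2) Text classification methods are diverse; optimizing existing methods and combining single classifiers are effective ways to improve classifier performance, and TC has become an important application area and challenge for the various methods of ML;

(3) Comprehensive evaluation of classifier performance shows that, on some benchmark collections, combined classifiers and methods such as SVM achieve better classification performance;

(4) Chinese text classification has certain special processing requirements, and it needs more standards to be established and more resources to be built.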
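As a concrete reminder of what point (1) refers to, the sketch below builds VSM vectors with one common statistical weighting, tf-idf with cosine normalization. The choice of tf-idf, the function name `tfidf_vectors`, and the toy documents are all assumptions made for illustration; this section does not state which weighting formula the surveyed systems use.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document (a list of terms) to a {term: weight} dict.

    Assumed weighting: w(t, d) = tf(t, d) * log(N / df(t)),
    followed by cosine (unit-length) normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # An all-zero vector (every term occurs in every document) stays zero.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Toy documents already split into terms (hypothetical data).
docs = [["文本", "本分", "分类"], ["文本", "索引"], ["分类", "方法", "分类"]]
vecs = tfidf_vectors(docs)
```

In the second toy document, "索引" occurs in only one document and therefore receives a larger weight than "文本", which occurs in two; this is the statistical intuition behind such weighting schemes.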

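Finally, many problems in TC remain to be solved, such as maximizing classification performance, learning from and classifying noisy text [], classifying very short text [], and classifying text in real time. Any of these may become a new research direction in ML-based TC.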


Journal of Software (软件学报), 2005, 16(6)

References:


[1] Diao Y, Lu H, Wu D. A comparative study of classification-based personal e-mail filtering. In: Proceedings of PAKDD-00, 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, T Terano, H Liu, and ALP Chen, Editors. 2000, Springer Verlag, Heidelberg, DE: Kyoto, JP. p. 408-419.

[2] Lewis DD. Representation and learning in information retrieval. Ph.D. thesis. 1992, Department of Computer Science, University of Massachusetts: Amherst, US.

[3] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 2002. 34(1): p. 1-47.

[4] Chiang JH, Chen YC. An intelligent news recommender agent for filtering and categorizing large volumes of text corpus. International Journal of Intelligent Systems, 2004. 19(3): p. 201-216.

[5] Adam CK, Ng HT, Chieu HL. Bayesian Online Classifiers for Text Classification and Filtering. In: Proceedings of SIGIR-02, 25th ACM International Conference on Research and Development in Information Retrieval. 2002. Tampere, FI: ACM Press, New York, US.

[6] Attardi G, Gullí A, Sebastiani F. Automatic Web Page Categorization by Link and Context Analysis. In: Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, C Hutchison and G Lanzarone, Editors. 1999: Varese, IT. p. 105-119.

[7] Hwang BY, Lee BJ. An efficient e-mail monitoring system for detecting proprietary information outflow using broad concept learning. In: Metainformatics. 2004. p. 72-78.

[8] Mladenic D, Grobelnik M. Feature selection on hierarchy of Web documents. Decision Support Systems, 2003. 35(1): p. 45-87.

[9] Ceci M, Malerba D. Hierarchical Classification of HTML Documents with WebClassII. In: Proceedings of ECIR-03, 25th European Conference on Information Retrieval, F Sebastiani, Editor. 2003, Springer Verlag: Pisa, IT. p. 57-72.

[10] Liao Y, Vemuri VR. Using Text Categorization Techniques for Intrusion Detection. In: Proceedings of the 11th USENIX Security Symposium, D Boneh, Editor. 2002: San Francisco, US. p. 51-59.

[11] Zhang ZH, Shen H. Suppressing false alarms of intrusion detection using improved text categorization method. In: 2004 IEEE International Conference on E-Technology, E-Commerce and E-Service, Proceedings. 2004. p. 163-166.

[12] Hayes PJ, Weinstein SP. Construe/Tis: a system for content-based indexing of a database of news stories. In: Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence, A Rappaport and R Smith, Editors. 1990, AAAI Press, Menlo Park, US. p. 49-66.

[13] Mitchell TM. Machine Learning. 1996, New York: McGraw Hill.

[14] Cavnar WB, Trenkle JM. N-Gram-Based Text Categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. 1994: Las Vegas, US. p. 161-175.

[15] Peng F, Schuurmans D, Wang S. Language and Task Independent Text Categorization with Simple Language Models. In: Proceedings of HLT-03, 3rd Human Language Technology Conference. 2003: Edmonton, CA.

[16] Kehagias A, Petridis V, Kaburlasos VG, et al. A Comparison of Word- and Sense-based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems, 2003. 21(3): p. 227-247.

[17] Moschitti A, Basili R. Complex Linguistic Features for Text Classification: A Comprehensive Study. In: Proceedings of ECIR-04, 26th European Conference on Information Retrieval Research, S McDonald and J Tait, Editors. 2004, Springer Verlag, Heidelberg, DE: Sunderland, UK. p. 181-196.


[18] Apte C, Damerau FJ, Weiss SM. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 1994. 12(3): p. 233-251.

[19] Cohen WW, Singer Y. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 1999. 17(2): p. 141-173.

[20] Weiss SM, Apté C, Damerau FJ, et al. Maximizing text-mining performance. IEEE Intelligent Systems, 1999. 14(4): p. 63-69.

[21] Salton G. Automatic Text Processing. 1998: Addison-Wesley Publishing Company.

[22] Lewis DD, Schapire RE, Callan JP, et al. Training algorithms for linear text classifiers. In: Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, H-P Frei, et al., Editors. 1996, ACM Press, New York, US: Z…

[23] Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1998. 24(5): p. 513-523.

[24] Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, DH Fisher, Editor. 1997, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US. p. 143-151.

[25] Debole F, Sebastiani F. Supervised Term Weighting for Automated Text Categorization. In: Text Mining and its Applications, S Sirmakessis, Editor. 2004, Physica-Verlag, Heidelberg, DE. p. 81-98.

[26] Xue D, Sun M. A study on feature weighting in Chinese text categorization. In: Computational Linguistics and Intelligent Text Processing, Proceedings. 2003. p. 592-601.

[27] Xue D, Sun M. Chinese text categorization based on the binary weighting model with non-binary smoothing. In: Proceedings of ECIR-03, 25th European Conference on Information Retrieval, F Sebastiani, Editor. 2003, Springer Verlag: Pisa, IT. p. 408-419.

[28] Aas K, Eikvil L. Text categorization: A survey. 1999, Norwegian Computing Center: Oslo.

[29] Govert N, Lalmas M, Fuhr N. A probabilistic description-oriented approach for categorising Web documents. In: Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management. 1999, ACM Press, New York, US: Kansas City, US. p. 475-482.

[30] Larkey LS, Croft WB. Combining classifiers in text categorization. In: Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, H-P Frei, et al., Editors. 1996, ACM Press, New York, US: Z… 289-297.

[31] Dagan I, Karov Y, Roth D. Mistake-driven learning in text categorization. In: Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing, C Cardie and R Weischedel, Editors. 1997, Association for Computational Linguistics, Morristown, US: Providence, US. p. 55-63.

[32] Bigi B. Using Kullback-Leibler distance for text categorization. In: Proceedings of ECIR-03, 25th European Conference on Information Retrieval, F Sebastiani, Editor. 2003, Springer Verlag: Pisa, IT. p. 305-319.

[33] Nunzio GMD. A Bidimensional View of Documents for Text Categorisation. In: Proceedings of ECIR-04, 26th European Conference on Information Retrieval Research, S McDonald and J Tait, Editors. 2004, Springer Verlag, Heidelberg, DE: Sunderland, UK. p. 112-126.

[34] Fuhr N. A probabilistic model of dictionary-based automatic indexing. In: Proceedings of RIAO-85, 1st International Conference "Recherche d'Information Assistee par Ordinateur". 1985: Grenoble, FR. p. 207-216.

[35] Xie C-f, Li X. A Sequence-Based Automatic Text Classification Algorithm. Journal of Software, 2002. 13(4): p. 783-789.

[36] Lodhi H, Saunders C, Shawe-Taylor J, et al. Text Classification using String Kernels. Journal of Machine Learning Research, 2002. 2: p. 419-444.

[37] Caropreso MF, Matwin S, Sebastiani F. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In: Text Databases and Document Management: Theory and Practice, AG Chin, Editor. 2001, Idea Group Publishing: Hershey, US. p. 78-102.

[38] Jacobs PS. Joining statistics with NLP for text categorization. In: Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, M Bates and O Stock, Editors. 1992, Association for Computational Linguistics, Morristown, US: Trento, IT. p. 178-185.

[39] Basili R, Moschitti A, Pazienza MT. NLP-driven IR: Evaluating Performances over a Text Classification task. In: Proceedings of IJCAI-01, 17th International Joint Conference on Artificial Intelligence, B Nebel, Editor. 2001: Seattle, US. p. 1286-1291.

[40] Yang Y, Chute CG. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 1994. 12(3): p. 252-277.

[41] Lewis DD, Ringuette M. A comparison of two learning algorithms for text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. 1994: Las Vegas, US. p. 81-93.

[42] Li YH, Jain AK. Classification of text documents. The Computer Journal, 1998. 41(8): p. 537-546.


[43] Ng HT, Goh WB, Low KL. Feature selection, perceptron learning, and a usability case study for text categorization. In: Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, NJ Belkin, AD Narasimhalu, and P Willett, Editors. 1997, ACM Press, New York, US: Philadelphia, US. p. 67-73.

[44] Sable CL, Hatzivassiloglou V. Text-based approaches for non-topical image categorization. International Journal of Digital Libraries, 2000. 3(3): p. 261-275.

[45] Schutze H, Hull DA, Pedersen JO. A comparison of classifiers and document representations for the routing problem. In: Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, EA Fox, P Ingwersen, and R Fidel, Editors. 1995, ACM Press, New York, US: Seattle, US. p. 229-237.

[46] Wiener ED, Pedersen JO, Weigend AS. A neural network approach to topic spotting. In: Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval. 1995: Las Vegas, US. p. 317-332.

[47] Mladenic D. Feature subset selection in text learning. In: Proceedings of ECML-98, 10th European Conference on Machine Learning, C Nédellec and Ce Rouveirol, Editors. 1998, Springer Verlag, Heidelberg, DE: Chemnitz, DE. p. 95-100.

[48] Yang Y. An evaluation of statistical approaches to text categorization. Information Retrieval, 1999. 1(1/2): p. 69-90.

[49] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, DH Fisher, Editor. 1997, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US. p. 412-420.

[50] Forman G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, March 2003. 3: p. 1289-1305.

[51] Moulinier I, Raškinis G, Ganascia J-G. Text categorization: a symbolic approach. In: Proceedings of SDAIR-96, 5th Annual Symposium on Document Analysis and Information Retrieval. 1996: Las Vegas, US. p. 87-99.

[52] Selamat A, Omatu S. Web page feature selection and classification using neural networks. Information Sciences, 2004. 158(1): p. 69-88.

[53] Fuhr N, Hartmann S, Knorz G, et al. AIR/X - a Rule-Based Multistage Indexing System for Large Subject Fields. In: Proceedings of RIAO-91, 3rd International Conference "Recherche d'Information Assistee par Ordinateur", Ae Lichnerowicz, Editor. 1991, Elsevier Science Publishers, Amsterdam, NL: Barcelona, ES. p. 606-623.

[54] Galavotti L, Sebastiani F, Simi M. Experiments on the use of feature selection and negative evidence in automated text categorization. In: Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, JeL Borbinha and T Baker, Editors. 2000, Springer Verlag, Heidelberg, DE: Lisbon, PT. p. 59-68.

[55] Forman G. A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of ICML-04, 21st International Conference on Machine Learning, CE Brodley, Editor. 2004, Morgan Kaufmann Publishers, San Francisco, US: Banff, CA.

[56] Lewis DD. An evaluation of phrasal and clustered representations on a text categorization task. In: Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, NJ Belkin, P Ingwersen, and AM Pejtersen, Editors. 1992, ACM Press, New York, US: Kobenhavn, DK. p. 37-50.

[57] Slonim N, Tishby N. The Power of Word Clusters for Text Classification. In: Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research. 2001: Darmstadt, DE.

[58] Bekkerman R, El-Yaniv R, Tishby N, et al. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 2003. 3: p. 1183-1208.

[59] Baker LD, McCallum AK. Distributional clustering of words for text classification. In: Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, WB Croft, et al., Editors. 1998, ACM Press, New York, US: Melbourne, AU. p. 96-103.

[60] Zelikovitz S, Hirsh H. Using LSI for Text Classification in the Presence of Background Text. In: Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, H Paques, L Liu, and D Grossman, Editors. 2001, ACM Press, New York, US: Atlanta, US. p. 113-118.

[61] Chen L, Tokuda N, Nagai A. A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 2003. 88(5): p. 203-212.

[62] Koller D, Sahami M. Hierarchically classifying documents using very few words. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, DH Fisher, Editor. 1997, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US. p. 170-178.

[63] McCallum AK, Nigam K. A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI-98 Workshop on Learning for Text Categorization. 1998. Menlo Park CA: AAAI Press.

[64] Ruiz ME, Srinivasan P. Hierarchical neural networks for text categorization. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, MA Hearst, F Gey, and R Tong, Editors. 1999, ACM Press, New York, US: Berkeley, US. p. 281-282.


[65] Weigend AS, Wiener ED, Pedersen JO. Exploiting hierarchy in text categorization. Information Retrieval, 1999. 1(3): p. 193-216.


[66] Tsay J-J, Wang J-D. Improving linear classifier for Chinese text categorization. Information Processing and Management, 2004. 40(2): p. 223-237.

