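Graduation Project (Thesis) Paper
Graduation Project (Thesis)
Title: A Comparison of Text Similarity Based on the VSM Model
Name: X X X X X
Student ID: A A A A A
College: B B B B B
Major and Class: C C C C C
Supervisor: D D D D D
Date: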
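Abstract
In an era of rapid Internet growth, the amount of information online keeps increasing and its variety is bewildering. Although this offers plenty of choices when we search for information, finding the most needed and most relevant items in such a vast repository by manual browsing is troublesome for users and highly inefficient. Techniques for judging text similarity emerged to address this problem and are now widely used in the computer and telecommunications industries. This thesis focuses on the difficulties encountered when computing text similarity and the algorithms needed to resolve them. Finally, it applies the vector space model (VSM) in a simple design to complete a web-based program for detecting similar web pages.
Keywords: text similarity; similar web page detection; VSM model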
ABSTRACT
With the rapid development of the Internet, there is more and more information online, and its variety is becoming more complex. Although this gives us more opportunities to use information, it is difficult and inefficient for users to find the information they need most in such an information database. To solve this problem, techniques for measuring text similarity were developed and are now widely used in the computer and telecommunications fields. This thesis mainly describes the problems that arise when calculating text similarity and the algorithms that solve them. Finally, we use the VSM model to design and complete the project: a web-based program for detecting similar web pages.
Key Words: text similarity; similar web page detection; VSM model
Contents
Abstract
ABSTRACT
Contents
Chapter 1 Introduction
  1.1 Background of the Topic
  1.2 Significance of the Research
  1.3 Research Status at Home and Abroad
    1.3.1 Text Similarity Research Abroad
    1.3.2 Text Similarity Research in China
  1.4 Development Language
  1.5 Main Work and Structure of the Thesis
    1.5.1 Main Work
    1.5.2 Thesis Structure
Chapter 2 System Principles
  2.1 Overview of Principles
  2.2 Overview of Relevant Concepts
    2.2.1 Vector Space Model
    2.2.2 Chinese Word Segmentation
    2.2.3 TF Statistical Method
    2.2.4 TF-IDF Algorithm
    2.2.5 Dimensionality Reduction
    2.2.6 Similarity Calculation Methods
  2.3 System Implementation Approach
Chapter 3 System Analysis and Design
  3.1 System Requirements Analysis
  3.2 Overview of System Functions
    3.2.1 System Workflow
    3.2.2 Functional Modules
  3.3 System Performance Requirements
Chapter 4 System Implementation
  4.1 System Runtime Environment
  4.2 Analysis of Core Code
    4.2.1 The Word Segmentation Class
    4.2.2 Walkthrough of the Core Code
Chapter 5 System Testing
  5.1 Word Segmentation Test
  5.2 Keyword Extraction Test
  5.3 Web Page Content Fetching Test
  5.4 Text Similarity Calculation
Chapter 6 Conclusion and Outlook