
Research on Text Classification Algorithms Based on Support Vector Machines
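Master's Thesis

Abstract

In recent years, with the rapid development of the Internet, the volume of document information on the web has grown exponentially. Text classification, as an important part of information retrieval, has become an important research topic. The support vector machine (SVM) is a machine learning method based on statistical learning theory that can handle practical problems such as nonlinearity, high dimensionality, and local minima. This thesis studies the application of SVMs to text classification.

The thesis first reviews the development of text classification and its related techniques, then introduces statistical learning theory and SVMs, providing the theoretical basis for the later chapters.

An experimental platform was built and the performance of several commonly used classification algorithms was compared. The experimental results show that SVMs are very well suited to text classification.

To address the sensitivity of the SVM algorithm to noise, this thesis proposes an SVM method based on repeated training. The method selects the samples that would affect the decision surface if trained repeatedly and, according to each sample's class membership degree, trains on it a corresponding number of times, thereby changing the sample weights and reducing the influence of noise. The algorithm is applied to text classification, and the experimental results show that, at the cost of a moderate increase in training time, it is more robust to noise than the standard SVM and can also improve classification performance.

To improve the generalization ability of SVMs, this thesis proposes a new training algorithm. The algorithm first computes the class centers; then, based on each sample's distance to its class center, it selects a small number of samples close to the center, renames them, and adds them to the original training set. This changes the sample weights so that representative samples are emphasized. Finally, the SVM is retrained on the expanded training set.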
Applied to Chinese web page classification, the algorithm achieved better classification performance than the standard SVM at the cost of a moderate increase in training time.

Keywords: text classification; Chinese web page classification; support vector machine; membership degree; class center

Abstract

With the development of the Internet, the information on the Internet increases exponentially. One important research focus is how to deal with this large volume of online documents. As one of the crucial parts of information retrieval, text classification has become an important research direction. The support vector machine (SVM), a machine learning method based on statistical learning theory, can resolve such practical problems as nonlinearity, high dimensionality, and local minima. This thesis mainly focuses on the drawbacks of SVM in practical applications, including text categorization.

This thesis first introduces the general development and some techniques of text categorization. Then statistical learning theory and SVM are introduced, laying the theoretical basis for the research in the following chapters. We set up an experimental platform and test some common text categorization algorithms; the results show that SVMs are particularly suited to text categorization.

Since SVM is very sensitive to noise in the training set, a support vector machine algorithm based on repeated training is proposed in this thesis. Samples that affect the decision surface after being trained repeatedly are chosen, and they are then trained repeatedly a number of times according to their fuzzy membership. In this way the weights of these samples are changed and the influence of noise is reduced. The improved SVM algorithm is applied to text categorization; although the training time increases, it achieves better results than the traditional support vector machine and effectively distinguishes valid samples from noise.

This thesis also presents another improved support vector machine to enhance classification performance.
In the proposed algorithm, the class center is calculated, and the samples close to the class center are chosen, renamed, and added to the training set to strengthen their weight. The representative samples are thereby emphasized. The expanded training set is then used to train the support vector machine. The improved algorithm is applied to Chinese text categorization; although the training time increases, it achieves better performance than the traditional support vector machine.

Key Words: Text categorization; Chinese web categorization; Support vector machine; Membership; Class center

List of Figures

Figure 2.1 Text classifier model
Figure 2.2 Term-frequency vector of a document in the bag-of-words model
Figure 2.3 The k-nearest-neighbor algorithm
Figure 2.4 Comparison of the average precision of SVM, kNN, and NB on the mini newsgroups dataset
Figure 2.5 Comparison of the average precision of SVM, kNN, and NB on the 20 Newsgroups dataset
Figure 3.1 Structural risk minimization
Figure 3.2 The optimal separating hyperplane
Figure 3.3 The support vector machine
Figure 4.1 Decision surface in the linearly separable case
Figure 4.2 The decision surface is unchanged after non-support vectors are removed
Figure 4.3 The decision surface is unchanged after samples outside the margin are added

List of Tables

Table 2.1 Related variables
Table 2.2 Adjacency table of class c_k
Table 2.3 Global adjacency table
Table 2.4 Topic partition of the 20 Newsgroups dataset
Table 3.1 Optional parameters of svmtrain and their meanings
Table 4.1 Changes in document counts for the 10 categories with the most documents
Table 4.2 Experimental results on the Reuters-21578 dataset
Table 4.3 Experimental results on the 20 Newsgroups dataset
Table 5.1 Numbers of web pages in the test and training sets of dataset 1 before and after expansion
