Automatic News Classification with the Cosine Theorem

Date: 2019-11-19

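Preface

The cosine theorem, a formula that already appears in junior high school textbooks, is something probably everyone knows. Another concept that fewer people may have heard of is the space vector, usually denoted e, which is covered in high school textbooks. With the cosine theorem and vector spaces we can do many interesting things, and computing text similarity with the cosine theorem is a typical example. Admittedly the topic is old and much discussed, so there is little new to say. Still, over the weekend I happened to read Dr. Wu Jun's book The Beauty of Mathematics (數學之美), which describes classifying news with the cosine theorem, so I decided to build a first model of the algorithm. If you are interested, read on.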

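Background

In the past, news was usually categorized by hand: skim the headline and the first and last paragraphs, and you can tell whether a story is about finance, sports, or health. In today's age of information explosion that is clearly an impossible task, so we badly want machines to do the "classifying" for us. Ideally, I feed the computer a large amount of already-classified data, and once the classification model is trained, the computer handles everything afterwards. This may look deep and difficult, but we can actually write one ourselves; the results just may not be as good.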

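How the Classifier Works

At its core, the automatic news classifier also uses the cosine theorem to compare text similarity, so the difficulty lies in where the feature vector comes from and how to obtain it. The key word in "feature vector" is "feature", and the features of a news item are its keywords. My simple understanding is that these are specialized terms, in other words the words particular to one category of news. For financial news, for example, typical keywords are "stock", "company", "listing", and so on. Such words can be found by counting word frequencies; the resulting keywords are sorted in descending order of frequency, and each keyword represents one dimension. That raises a new problem: to count word frequencies I first have to segment the text, pulling the subject, verb, and object out of every sentence, which seems more complicated than my whole algorithm. Fortunately, someone has already solved this problem for us. In this algorithm I use the ICTCLAS word segmentation system from the Institute of Computing Technology, Chinese Academy of Sciences, which works very well. As an example, here is my original news content: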

教育部副部長：教育公平是社會公平重要基礎
7月23日，教育部黨組副書記、副部長杜玉波爲全國學聯全體代表作《教育綜合改革與青年學生成長成才》的專題報告。中國青年網記者張炎良攝
人民網北京7月24日電（記者賀迎春實習生王斯慧

The result after the segmentation system has processed it:

教育部/nt 副/b 部長/n ：/wm 教育/v 公平/an 是/vshi 社會/n 公平/a 重要/a 基礎/n
7月/t 23日/t ，/wd 教育部/nt 黨組/n 副/b 書記/n 、/wn 副/b 部長/n 杜玉波/nr 爲/p 全國學聯/nt 全體/n 代表作/n 《/wkz 教育/vn 綜合/vn 改革/vn 與/cc 青年/n 學生/n 成長/vi 成才/vi 》/wky 的/ude1 專題/n 報告/n 。/wj 中國/ns 青年/n 網/n 記者/n 張/q 炎/ng 良/d 攝/vg
人民/n 網/n 北京/ns 7月/t 24日/t 電/n （/wkz 記者/n 賀/vg 迎春/n 實習生/n 王斯慧/nr ）/wky 昨日/t ，/wd 教育部/nt 副/b 部長

OK, with the segmentation result in hand, everything that follows falls into place.
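To make the next stage concrete, here is a minimal sketch of turning segmenter output of the "word/tag" form shown above into word counts and a top-N keyword list. The class SegmentedTextStats and its methods are hypothetical names of mine, not part of the article's code. Skipping every tag that starts with "w" is an assumption based only on the sample above, where those tags are all attached to punctuation. The sample string in main is an ASCII stand-in for real segmenter output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original code.
class SegmentedTextStats {

    // Splits "word/tag word/tag ..." text and counts how often each word occurs.
    // Tokens whose tag starts with 'w' are skipped (assumed to be punctuation).
    static Map<String, Integer> countWords(String segmented) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : segmented.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // not a word/tag pair
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (tag.startsWith("w")) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the n most frequent words, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // ASCII stand-in for segmenter output.
        String sample = "stock/n company/n ,/wd stock/n listing/vn ./wj company/n stock/n";
        Map<String, Integer> counts = countWords(sample);
        System.out.println(topWords(counts, 2)); // prints [stock, company]
    }
}
```

The top-N words returned here play the role of the feature words: each one becomes a dimension of the vector space.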

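Implementation Steps

1. Provide a training set of news articles.

2. Run the segmentation system and count word frequencies; the N most frequent words become the feature words, i.e. the feature vector.

3. Read in the test data, count its word frequencies in the same way, and divide them by the corresponding frequencies in the training data to obtain the feature vector values.

4. Finally, compute the similarity with the cosine theorem and compare it against the minimum threshold.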
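Steps 3 and 4 can be sketched as follows. This is my reading of the description, not the article's actual News class: I assume each feature value is the test frequency of a feature word divided by its training frequency, and that the result is compared against a reference vector of all ones. The class CosineSteps, its methods, and the numbers in main are all illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of steps 3 and 4; names and data are illustrative only.
class CosineSteps {

    // Step 3: for each feature word, test frequency divided by training frequency.
    // A word that is missing on either side contributes 0.
    static double[] ratioVector(List<String> featureWords,
            Map<String, Integer> trainCounts, Map<String, Integer> testCounts) {
        double[] vector = new double[featureWords.size()];
        for (int i = 0; i < vector.length; i++) {
            String word = featureWords.get(i);
            int train = trainCounts.getOrDefault(word, 0);
            int test = testCounts.getOrDefault(word, 0);
            vector[i] = train == 0 ? 0.0 : (double) test / train;
        }
        return vector;
    }

    // Step 4: cosine of the angle between two vectors of equal length.
    // Returns 0 when either vector is all zeros, to avoid dividing by zero.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("stock", "company", "listing");
        Map<String, Integer> train = new HashMap<>();
        train.put("stock", 10);
        train.put("company", 8);
        train.put("listing", 4);
        Map<String, Integer> test = new HashMap<>();
        test.put("stock", 5);
        test.put("company", 4);

        double[] v = ratioVector(features, train, test); // [0.5, 0.5, 0.0]
        double[] ones = {1, 1, 1};
        double similarity = cosine(v, ones);             // about 0.816
        double threshold = 0.45;
        System.out.println(similarity >= threshold ? "in category" : "not in category");
    }
}
```

In this made-up example two of the three feature words appear in the test text in the same proportion as in the training text, so the similarity clears a 0.45 threshold comfortably.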

Code Implementation

ICTCLAS utility class, ICTCLAS50.java:

package NewsClassify;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.StringTokenizer;

public class ICTCLAS50 {
    static {
        try {
            String libpath = System.getProperty("user.dir") + "\\lib";
            String path = null;
            StringTokenizer st = new StringTokenizer(libpath,
                    System.getProperty("path.separator"));
            if (st.hasMoreElements()) {
                path = st.nextToken();
            }
            // Copy the DLL into the lib directory if it is not there yet.
            File dllFile = new File(new File(path), "ICTCLAS50.dll");
            if (!dllFile.exists()) {
                // try-with-resources closes both streams even if the copy fails.
                try (InputStream inputStream = ICTCLAS50.class
                        .getResource("/lib/ICTCLAS50.dll").openStream();
                        FileOutputStream outputStream = new FileOutputStream(dllFile)) {
                    byte[] array = new byte[1024];
                    for (int i = inputStream.read(array); i != -1; i = inputStream
                            .read(array)) {
                        outputStream.write(array, 0, i);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            // Load the native library ICTCLAS50.
            System.loadLibrary("ICTCLAS50");
        } catch (Error e) {
            e.printStackTrace();
        }
    }

    public native boolean ICTCLAS_Init(byte[] sPath);

    public native boolean ICTCLAS_Exit();

    public native int ICTCLAS_ImportUserDictFile(byte[] sPath, int eCodeType);

    public native int ICTCLAS_SaveTheUsrDic();

    public native int ICTCLAS_SetPOSmap(int nPOSmap);

    public native boolean ICTCLAS_FileProcess(byte[] sSrcFilename,
            int eCodeType, int bPOSTagged, byte[] sDestFilename);

    public native byte[] ICTCLAS_ParagraphProcess(byte[] sSrc, int eCodeType,
            int bPOSTagged);

    public native byte[] nativeProcAPara(byte[] sSrc, int eCodeType,
            int bPOStagged);
}

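Word entity class, Word.java (the News entity class, News.java, is used by the classifier below, but its source is not included here):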

package NewsClassify;

/**
 * Word entity class.
 *
 * @author lyq
 */
public class Word implements Comparable<Word> {
    // The word itself
    String name;
    // How often the word occurs
    Integer count;

    public Word(String name, Integer count) {
        this.name = name;
        this.count = count;
    }

    @Override
    public int compareTo(Word o) {
        // Reversed comparison, so sorting puts the highest count first.
        return o.count.compareTo(this.count);
    }
}
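The reversed compareTo is what lets the classifier call Collections.sort and then simply take the first N entries. Here is a small self-contained check of that behaviour. The wrapper class WordSortDemo and the sample counts are made up for illustration, and Word is repeated as a nested class only so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative demo; WordSortDemo is a hypothetical name.
class WordSortDemo {

    // Same shape as the Word class above, repeated to keep the example self-contained.
    static class Word implements Comparable<Word> {
        String name;
        Integer count;

        Word(String name, Integer count) {
            this.name = name;
            this.count = count;
        }

        @Override
        public int compareTo(Word o) {
            // Reversed comparison: larger counts sort first.
            return o.count.compareTo(this.count);
        }
    }

    // Sorts a copy of the list and returns the names in the resulting order.
    static List<String> sortedNames(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        Collections.sort(copy);
        List<String> names = new ArrayList<>();
        for (Word w : copy) {
            names.add(w.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("stock", 7));
        words.add(new Word("the", 3));
        words.add(new Word("bank", 12));
        System.out.println(sortedNames(words)); // prints [bank, stock, the]
    }
}
```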

Classification algorithm class, NewsClassifyTool.java:

package NewsClassify;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;

/**
 * Classification algorithm model.
 *
 * @author lyq
 */
public class NewsClassifyTool {
    // Number of dimensions of the cosine vector space
    private int vectorNum;
    // Minimum cosine similarity required to accept the category
    private double minSupportValue;
    // News category of the current training data
    private String newsType;
    // Paths of the training news files
    private ArrayList<String> trainDataPaths;

    public NewsClassifyTool(ArrayList<String> trainDataPaths, String newsType,
            int vectorNum, double minSupportValue) {
        this.trainDataPaths = trainDataPaths;
        this.newsType = newsType;
        this.vectorNum = vectorNum;
        this.minSupportValue = minSupportValue;
    }

    /**
     * Reads the content of a file into one string (line breaks are dropped).
     * Returns an empty string if the file cannot be read.
     */
    private String readDataFile(String filePath) {
        StringBuilder strBuilder = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(filePath))) {
            String str;
            while ((str = in.readLine()) != null) {
                strBuilder.append(str);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return strBuilder.toString();
    }

    /**
     * Returns the path of the segmented output for a news file, e.g.
     * "trainNews1.txt" becomes "trainNews1-split.txt".
     */
    private String splitPathOf(String srcPath) {
        int index = srcPath.lastIndexOf('.');
        String base = index < 0 ? srcPath : srcPath.substring(0, index);
        return base + "-split.txt";
    }

    /**
     * Computes the feature vector of the test data.
     */
    private double[] calCharacterVectors(String filePath) {
        // Segment the test news and count its word frequencies.
        News testNews = new News(readDataFile(filePath));
        parseNewsContent(filePath);
        testNews.statWords(readDataFile(splitPathOf(filePath)));

        double[] vectorDimensions = new double[vectorNum];
        // Compute the feature vector against each training file of the category.
        for (String path : this.trainDataPaths) {
            News news = new News(readDataFile(path));
            // Segment the training news.
            parseNewsContent(path);
            news.statWords(readDataFile(splitPathOf(path)));
            ArrayList<Word> wordList = news.wordDatas;
            // Sort the word counts in descending order.
            Collections.sort(wordList);
            // Keep the vectorNum most frequent words (fewer if the text is short).
            int limit = Math.min(vectorNum, wordList.size());
            ArrayList<Word> frequentWords = new ArrayList<Word>(
                    wordList.subList(0, limit));
            double[] temp = testNews.calVectorDimension(frequentWords);
            // Accumulate the feature vector values.
            int n = Math.min(temp.length, vectorDimensions.length);
            for (int i = 0; i < n; i++) {
                vectorDimensions[i] += temp[i];
            }
        }
        // Average over the training files to get the final feature vector.
        for (int i = 0; i < vectorDimensions.length; i++) {
            vectorDimensions[i] /= trainDataPaths.size();
        }
        return vectorDimensions;
    }

    /**
     * Computes the cosine similarity for the given feature vector.
     *
     * @param vectorDimension
     *            feature vector of the test data
     * @return cosine similarity against the standard vector
     */
    private double calCosValue(double[] vectorDimension) {
        // Standard feature vector: 1 in every dimension
        double[] standardVector = new double[vectorNum];
        for (int i = 0; i < vectorNum; i++) {
            standardVector[i] = 1;
        }
        double num1 = 0;
        double temp1 = 0;
        double temp2 = 0;
        for (int i = 0; i < vectorNum; i++) {
            // Accumulate the numerator (dot product).
            num1 += vectorDimension[i] * standardVector[i];
            // Accumulate the squared lengths for the denominator.
            temp1 += vectorDimension[i] * vectorDimension[i];
            temp2 += standardVector[i] * standardVector[i];
        }
        double num2 = Math.sqrt(temp1) * Math.sqrt(temp2);
        // An all-zero vector has no direction; treat it as no similarity.
        if (num2 == 0) {
            return 0;
        }
        // Apply the cosine formula.
        return num1 / num2;
    }

    /**
     * Classifies a news item.
     *
     * @param filePath
     *            path of the test news file
     */
    public void newsClassify(String filePath) {
        double[] vectorDimension = calCharacterVectors(filePath);
        double result = calCosValue(vectorDimension);
        // The item belongs to the target category if the similarity
        // reaches the minimum threshold.
        if (result >= minSupportValue) {
            System.out.println(String.format(
                    "Final similarity is %s, not below the threshold %s, so this item belongs to the %s category",
                    result, minSupportValue, newsType));
        } else {
            System.out.println(String.format(
                    "Final similarity is %s, below the threshold %s, so this item does not belong to the %s category",
                    result, minSupportValue, newsType));
        }
    }

    /**
     * Segments the news content with the segmentation system.
     *
     * @param srcPath
     *            path of the news file
     */
    private void parseNewsContent(String srcPath) {
        String dirApi = System.getProperty("user.dir") + "\\lib";
        // Build the output path.
        String desPath = splitPathOf(srcPath);
        try {
            ICTCLAS50 testICTCLAS50 = new ICTCLAS50();
            // Initialize with the directory that holds the segmenter's libraries.
            if (!testICTCLAS50.ICTCLAS_Init(dirApi.getBytes("GB2312"))) {
                System.out.println("Init Fail!");
                return;
            }
            // Convert the input file name from String to bytes.
            byte[] inputFileName = srcPath.getBytes();
            // Convert the output file name from String to bytes.
            byte[] outputFileName = desPath.getBytes();
            // Segment the file. Arguments: input file name, file encoding type,
            // whether to tag parts of speech (1 yes, 0 no), output file name.
            testICTCLAS50.ICTCLAS_FileProcess(inputFileName, 0, 1,
                    outputFileName);
            // Shut down the segmenter.
            testICTCLAS50.ICTCLAS_Exit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
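Because calCosValue always compares against a standard vector whose components are all 1, the cosine formula simplifies. The derivation below is a worked note on that method, writing v for the test feature vector and n for vectorNum (the symbols are mine):

```latex
\cos\theta
  = \frac{\sum_{i=1}^{n} v_i \cdot 1}
         {\sqrt{\sum_{i=1}^{n} v_i^{2}}\;\sqrt{\sum_{i=1}^{n} 1^{2}}}
  = \frac{\sum_{i=1}^{n} v_i}
         {\sqrt{n}\,\sqrt{\sum_{i=1}^{n} v_i^{2}}}

% Worked values for n = 4:
% v = (1,1,1,1):  4 / (2 \cdot 2) = 1
% v = (2,2,2,2):  8 / (2 \cdot 4) = 1
% v = (1,0,0,0):  1 / (2 \cdot 1) = 0.5
```

The score depends only on the direction of v, not on its length, so scaling every component by the same factor changes nothing. What it measures is how evenly the values are spread across the feature words: a test item that matches only one of four feature words scores 0.5, which would still pass a threshold of 0.45.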

Test scenario class, Client.java:

package NewsClassify;

import java.util.ArrayList;

/**
 * Test class for the news classification algorithm.
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        // Paths of the test files and the training files
        String testFilePath1 = "C:\\Users\\lyq\\Desktop\\icon\\test\\testNews1.txt";
        String testFilePath2 = "C:\\Users\\lyq\\Desktop\\icon\\test\\testNews2.txt";
        String testFilePath3 = "C:\\Users\\lyq\\Desktop\\icon\\test\\testNews3.txt";

        ArrayList<String> trainDataPaths = new ArrayList<String>();
        trainDataPaths.add("C:\\Users\\lyq\\Desktop\\icon\\test\\trainNews1.txt");
        trainDataPaths.add("C:\\Users\\lyq\\Desktop\\icon\\test\\trainNews2.txt");

        // Category of the training data, vector size, and similarity threshold
        String newsType = "finance";
        int vectorNum = 10;
        double minSupportValue = 0.45;

        NewsClassifyTool classifyTool = new NewsClassifyTool(trainDataPaths,
                newsType, vectorNum, minSupportValue);
        classifyTool.newsClassify(testFilePath1);
        classifyTool.newsClassify(testFilePath2);
        classifyTool.newsClassify(testFilePath3);
    }
}