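This post walks through the basic process of text classification with libSVM. It follows the article "Text classification with libsvm" (使用libsvm實現文本分類) and verifies both the data preparation and the subsequent classification test. The original author's tokenizer component is replaced with HanLP, digits are filtered out, and only words longer than one character are kept.
The classification workflow described by the original author is reproduced here:
In the text preprocessing stage, HanLP-based tokenization was added. The code is as follows: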
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

import com.google.common.collect.Maps;
import com.hankcs.hanlp.HanLP;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
// AbstractDocumentAnalyzer, DocumentAnalyzer, ConfigReadable, Configuration, Term and TermImpl
// come from the original author's project; import them from wherever they live in your copy.

/**
 * Tokenizes documents with HanLP.
 * Created by zhouyh on 2018/5/30.
 */
public class HanLPDocumentAnalyzer extends AbstractDocumentAnalyzer implements DocumentAnalyzer {

    private static final Log LOG = LogFactory.getLog(HanLPDocumentAnalyzer.class);

    // Matches decimals, ASCII digit runs and full-width digit runs; compiled once instead of per word.
    private static final Pattern NUMBER =
            Pattern.compile("(\\d+\\.\\d+)|(\\d+)|([\\uFF10-\\uFF19]+)");

    public HanLPDocumentAnalyzer(ConfigReadable configuration) {
        super(configuration);
    }

    @Override
    public Map<String, Term> analyze(File file) {
        String doc = file.getAbsolutePath();
        LOG.debug("Process document: file=" + doc);
        Map<String, Term> terms = Maps.newHashMap();
        // try-with-resources closes the reader even when reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charSet))) {
            String line;
            while ((line = br.readLine()) != null) {
                LOG.debug("Process line: " + line);
                List<com.hankcs.hanlp.seg.common.Term> termList = HanLP.segment(line);
                if (termList == null) {
                    continue;
                }
                for (com.hankcs.hanlp.seg.common.Term hanLPTerm : termList) {
                    String word = hanLPTerm.word;
                    if (word.isEmpty() || super.isStopword(word)) {
                        LOG.debug("Filter out stop word: file=" + file + ", word=" + word);
                        continue;
                    }
                    // Keep only words longer than one character that contain no digits.
                    if (word.trim().length() > 1 && !NUMBER.matcher(word).find()) {
                        Term term = terms.get(word);
                        if (term == null) {
                            term = new TermImpl(word);
                            terms.put(word, term);
                        }
                        term.incrFreq();
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read " + doc, e);
        } finally {
            LOG.debug("Done: file=" + file + ", termCount=" + terms.size());
        }
        return terms;
    }

    public static void main(String[] args) {
        String filePath = "/Users/zhouyh/work/yanfa/xunlianji/UTF8/train/ClassFile/C000008/0.txt";
        HanLPDocumentAnalyzer hanLPDocumentAnalyzer = new HanLPDocumentAnalyzer(new Configuration());
        hanLPDocumentAnalyzer.analyze(new File(filePath));
        String str = "測試hanLP分詞";
        System.out.println(str);
    }
}
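The filtering rule used in the analyzer can be tried in isolation. The sketch below reuses the same regular expression; the class and method names (TokenFilterSketch, keep, filter) are made up for illustration and are not part of the original project. Note that because the analyzer calls find(), any token that contains a digit anywhere is dropped, not only tokens made up entirely of digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TokenFilterSketch {
    // Same expression as in the analyzer above.
    private static final Pattern NUMBER =
            Pattern.compile("(\\d+\\.\\d+)|(\\d+)|([\\uFF10-\\uFF19]+)");

    /** Keeps a token only if it is longer than one character and contains no digit. */
    static boolean keep(String word) {
        return word.trim().length() > 1 && !NUMBER.matcher(word).find();
    }

    /** Applies keep() to every token, preserving order. */
    static List<String> filter(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (keep(w)) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical tokens; real input would come from HanLP.segment(line).
        List<String> tokens = Arrays.asList("hanLP", "a", "2018", "3.14", "svm", "v2");
        System.out.println(filter(tokens)); // prints [hanLP, svm]
    }
}
```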
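The training resources provided by the original author were merged here, expanding the training set to 10 categories. Of the 8000 documents in each category, the first 6000 are used as the training set and the last 2000 as the test set. The directory layout is shown in the figure below:
The test set has the same structure.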
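The 6000/2000 split per category can be sketched as follows. This is only an illustration: the class name TrainTestSplitSketch and the file names generated in main are assumptions, and the post does not say how the files were actually moved.

```java
import java.util.ArrayList;
import java.util.List;

final class TrainTestSplitSketch {
    /** Returns two lists: the first trainCount items, then the remainder. */
    static <T> List<List<T>> split(List<T> items, int trainCount) {
        int cut = Math.min(trainCount, items.size());
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(items.subList(0, cut)));
        parts.add(new ArrayList<>(items.subList(cut, items.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical file names for one category of 8000 documents.
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 8000; i++) {
            files.add(i + ".txt");
        }
        List<List<String>> parts = split(files, 6000);
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " test");
    }
}
```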
The generated feature vectors, in the training-set format that libsvm requires, look like this:
libsvm training-set format file:
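For reference, each line of a libsvm data file has the form "label index:value index:value ...", with feature indices in ascending order and zero-valued features left out. The sketch below writes one such line; the class name LibsvmLineSketch and the sample feature values are illustrative assumptions, not output from the original project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class LibsvmLineSketch {
    /** Formats one document as "label index:value ..." with ascending indices. */
    static String toLine(int label, Map<Integer, Double> features) {
        StringBuilder sb = new StringBuilder().append(label);
        // Copying into a TreeMap yields the indices in ascending order.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            if (e.getValue() != 0.0) { // sparse format: zero-valued features are omitted
                sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical term weights keyed by term index.
        Map<Integer, Double> features = new HashMap<>();
        features.put(12, 0.5);
        features.put(4, 0.25);
        features.put(7, 0.0);
        System.out.println(toLine(3, features)); // prints 3 4:0.25 12:0.5
    }
}
```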
The test set is processed in the same way.
Training the text classifier with libSVM
Scale the feature files:
./svm-scale -l 0 -u 1 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train.txt > /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train-scale.txt
Apply the same scaling to the test set:
./svm-scale -l 0 -u 1 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test.txt > /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt
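One caveat: the two commands above compute the scaling range of each file independently. libsvm's documentation recommends scaling the test data with the ranges learned from the training data, and svm-scale has -s (save scaling parameters) and -r (restore them) options for that purpose. The idea is sketched below on dense arrays; the class MinMaxScaleSketch is a hypothetical illustration of the arithmetic, not libsvm code.

```java
import java.util.Arrays;

final class MinMaxScaleSketch {
    /** Learns per-feature {min, max} from the training rows only. */
    static double[][] fitRange(double[][] train) {
        int n = train[0].length;
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : train) {
            for (int j = 0; j < n; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        return new double[][] {min, max};
    }

    /** Maps a row into [lower, upper] using a previously learned range. */
    static double[] scale(double[] row, double[][] range, double lower, double upper) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            double min = range[0][j];
            double max = range[1][j];
            // Constant feature in training: emit the lower bound (a choice made for this sketch).
            out[j] = (max == min)
                    ? lower
                    : lower + (upper - lower) * (row[j] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] train = {{1, 10}, {3, 30}};
        double[][] range = fitRange(train);
        // A test row is scaled with the training range, so values may fall outside [0, 1].
        System.out.println(Arrays.toString(scale(new double[] {2, 40}, range, 0, 1))); // prints [0.5, 1.5]
    }
}
```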
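Train the model (-t 0 selects the linear kernel, -h 0 turns off the shrinking heuristic). This step takes a long time: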
./svm-train -h 0 -t 0 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train-scale.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt
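The training process is shown in the figure below:
When training finishes, a model file is generated.
Run the classification test on the preprocessed test file: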
./svm-predict /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/predict.txt
The result is: Accuracy = 81.6568% (16333/20002) (classification)
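The reported figure is simply the share of test lines whose predicted label equals the true label: 16333 / 20002 is about 81.6568%. A minimal sketch of that check follows, assuming the test file starts each line with the true label and the prediction file holds one predicted label per line; the class AccuracySketch and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class AccuracySketch {
    /** Percentage of correct predictions. */
    static double percent(int correct, int total) {
        return 100.0 * correct / total;
    }

    /** Compares the leading label of each test line with the matching prediction line. */
    static int countCorrect(List<String> testLines, List<String> predictedLines) {
        int correct = 0;
        for (int i = 0; i < testLines.size(); i++) {
            double truth = Double.parseDouble(testLines.get(i).trim().split("\\s+")[0]);
            double predicted = Double.parseDouble(predictedLines.get(i).trim());
            if (truth == predicted) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Hypothetical three-document example.
        List<String> test = Arrays.asList("1 3:0.2 9:0.7", "2 1:0.5", "3 4:1.0");
        List<String> predicted = Arrays.asList("1.0", "2.0", "1.0");
        int correct = countCorrect(test, predicted);
        System.out.println(String.format(Locale.ROOT, "Accuracy = %.4f%% (%d/%d)",
                percent(correct, test.size()), correct, test.size()));
        // The figure reported in this post:
        System.out.println(String.format(Locale.ROOT, "%.4f", percent(16333, 20002))); // prints 81.6568
    }
}
```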
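With the whole workflow complete, the resulting files are listed in the figure below:
This completes one hands-on pass through the libsvm classification workflow, following the original author's approach.
Testing from Java code
Create a Java project and add the libsvm jar. I use Maven here, with the following dependency: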
<!-- https://mvnrepository.com/artifact/tw.edu.ntu.csie/libsvm -->
<!-- libsvm jar -->
<dependency>
    <groupId>tw.edu.ntu.csie</groupId>
    <artifactId>libsvm</artifactId>
    <version>3.17</version>
</dependency>
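svm_predict.java and svm_train.java from the libsvm distribution also need to be copied into the project, with a small change to the svm_predict class so that it returns the prediction result. The test code is as follows: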
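import java.io.IOException;

public class LibSvmAlgorithm {

    public static void main(String[] args) {
        // Arguments in svm-predict order: test file, model file, output file.
        String[] testArgs = {
            "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt",
            "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt",
            "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/predict1.txt"
        };
        try {
            // This is the locally modified svm_predict.main, changed to return the accuracy;
            // the version shipped with libsvm returns void.
            Double accuracy = svm_predict.main(testArgs);
            System.out.println(accuracy);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}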