Extracting Web Page Information with JRegex

Date: 2019-11-12


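After a web crawler has downloaded pages to disk, their content must be extracted in preparation for indexing. Much of the data in a page is HTML markup, which the index has no reason to store, so it is useless for retrieval and has to be filtered out during extraction. A page usually also carries advertisements, anchor text, and other material we are not interested in, all of which counts as junk. If this content is not filtered out, the extracted data wastes storage space, and once it is indexed, end users searching the index get many useless junk results, which inevitably hurts their experience.
Here, for forums, extraction is driven by configurable templates. The tool used can be downloaded from http://jregex.sourceforge.net. JRegex is a Java regular-expression library: you name the variables to be extracted inside the regex template, and during matching the captured text is assigned to those variables, which yields the information of interest. JRegex also supports multi-level group matching.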
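To illustrate what "capturing into a named variable" means, here is a minimal sketch. It deliberately uses the JDK's java.util.regex instead of JRegex so that it runs without the extra library; it is not JRegex code, and the named-group syntax differs between the two. The class name NamedGroupSketch, the method extractTitle, and the sample input are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupSketch {

    // java.util.regex writes a named group as (?<title>...).
    // Pattern.DOTALL lets '.' match line terminators as well.
    private static final Pattern TITLE = Pattern.compile(
            "<a\\sid=\"anchor\">(?<title>.{1,100}?)</a>", Pattern.DOTALL);

    // Returns the text captured by the "title" group, or null if nothing matches.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group("title") : null;
    }

    public static void main(String[] args) {
        // Hypothetical input; prints "Some thread title".
        System.out.println(extractTitle("<a id=\"anchor\">Some thread title</a>"));
    }
}
```

The lazy quantifier {1,100}? stops at the first closing tag, so the capture stays inside a single element.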
To make this concrete, suppose the source of a forum page looks like this:

<a id="anchor">標題</a>
<cont>
     <author>a1</author>
     <time>2009</time>
     <post>p1</post>
     <author>a2</author>
     <time>2008</time>
     <post>p2</post>
     <author>a3</author>
     <time>2007</time>
     <post>p3</post>
     <author>a4</author>
     <time>2006</time>
     <post>p4</post>
     <author>2005</author>
     <time>t5</time>
     <post>p5</post>
</cont>

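Save this page source as bbsPage.txt, ready for processing.
Our goal now is to extract the title, author, time, and post content. The title could of course be taken from the TITLE tag, but a site usually appends directory or site-name text after the actual title, for example "Savoring the Beijing Olympic Center_Olympic Cheering Station_My Travels My Photos_XXX Community_XXX Community is one of the most active communities". Such junk makes up most of the title, so we do not extract the title from the TITLE tag.
Next, create the regex templates for extracting information from the page above: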

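(?s)<a\sid="anchor">({title}.{1,100}?)</a>\s*<cont>(.{1,10240}?)</cont> <author>({name}.{1,100}?)</author>\s*<time>({when}.{1,100}?)</time>\s*<post>({content}.{1,100}?)</post>

The first part is (?s)<a\sid="anchor">({title}.{1,100}?)</a>\s*<cont>(.{1,10240}?)</cont>. It contains two groups. The first is named title; it captures the page title directly and stores it in the variable title. The second group has no name, which indicates that it contains subgroups from which extraction continues.
The second part is <author>({name}.{1,100}?)</author>\s*<time>({when}.{1,100}?)</time>\s*<post>({content}.{1,100}?)</post>, which matches exactly one iteration of the repeating content inside the parent's unnamed second group.
The two templates are separated by a single space character and saved in pattern.txt. Because the code below splits the file content on that space, neither template may contain a literal space, which is why the first one writes \s between a and id.
As you may have noticed, the page has only one title, while the remaining fields form a repeating group that can be split off from the parent group and extracted separately, so the structure is very regular. When coding the extraction with the JRegex library, the work therefore centers on these two groups.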
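The parent/child idea can be sketched independently of any particular regex library. The example below is an assumption-laden illustration using java.util.regex, not the JRegex implementation from this article: the group name body, the class name TwoStageExtractionSketch, and the shortened sample page are invented for the sketch. The parent pattern is matched once, and the child pattern is then applied repeatedly to the text captured by the parent's second group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageExtractionSketch {

    // Stage 1: the title plus the whole <cont> block (named "body" here for convenience).
    private static final Pattern PARENT = Pattern.compile(
            "<a\\sid=\"anchor\">(?<title>.{1,100}?)</a>\\s*<cont>(?<body>.{1,10240}?)</cont>",
            Pattern.DOTALL);

    // Stage 2: one author/time/post record inside the body.
    private static final Pattern CHILD = Pattern.compile(
            "<author>(?<name>.{1,100}?)</author>\\s*"
            + "<time>(?<when>.{1,100}?)</time>\\s*"
            + "<post>(?<content>.{1,100}?)</post>",
            Pattern.DOTALL);

    // Returns "key=value" strings in document order; empty if the parent does not match.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher parent = PARENT.matcher(html);
        if (!parent.find()) {
            return out;
        }
        out.add("title=" + parent.group("title"));
        // Apply the child pattern repeatedly to the parent's captured block.
        Matcher child = CHILD.matcher(parent.group("body"));
        while (child.find()) {
            out.add("name=" + child.group("name"));
            out.add("when=" + child.group("when"));
            out.add("content=" + child.group("content"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the sample page.
        String html = "<a id=\"anchor\">T</a>\n<cont>\n"
                + " <author>a1</author>\n <time>2009</time>\n <post>p1</post>\n"
                + " <author>a2</author>\n <time>2008</time>\n <post>p2</post>\n"
                + "</cont>";
        for (String s : extract(html)) {
            System.out.println(s);
        }
    }
}
```

Keeping the repeating record in its own pattern means the parent only has to locate the block once, and the number of posts on a page does not need to be known in advance.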
I implemented the extraction based on the ideas and data above.
First, a generic key-value pair entity class is defined:

package org.shirdrn.test;

public class Pair<K, V> {

     private K key;
     private V value;

     public Pair(K key, V value) {
          this.key = key;
          this.value = value;
     }

     public K getKey() {
          return key;
     }

     public void setKey(K key) {
          this.key = key;
     }

     public V getValue() {
          return value;
     }

     public void setValue(V value) {
          this.value = value;
     }

}

The core class that performs the extraction is InfomationExtraction:

package org.shirdrn.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import jregex.Matcher;
import jregex.Pattern;

public class InfomationExtraction {

     private String htmlString;
     private String patternString;
     private List<Pair<String, String>> dataList = new ArrayList<Pair<String, String>>();

     public InfomationExtraction() {

     }

     public InfomationExtraction(String htmlFileName, String patternFileName) {
          this.htmlString = this.readString(htmlFileName);
          this.patternString = this.readString(patternFileName);
     }

     public Pattern[] getPatternArray() {
          String[] psa = this.patternString.split(" ");
          Pattern[] pa = new Pattern[psa.length]; // one Pattern per template in the file
          for (int i = 0; i < psa.length; i++) {
               Pattern p = new Pattern(psa[i]);
               pa[i] = p;
          }
          return pa;
     }

     public void extract(Integer sgIndex) { // sgIndex: index of the parent group whose text is extracted further with the child templates
          Pattern[] pa = this.getPatternArray();
          Pattern pBase = pa[0];
          Matcher mBase = pBase.matcher(this.htmlString);
          if (mBase.find()) {
               for (int i = 0; i < mBase.groupCount(); i++) {
                    String gn = pBase.groupName(i);
                    if (gn != null) {
                         String gv = mBase.group(i);
                         this.dataList.add(new Pair<String, String>(gn, gv));
                    }
               }
               String subText = mBase.group(sgIndex);
               if (subText != null) {
                    this.dataList.addAll(this.getSubGroupDataList(pa, subText)); // run the child templates over the captured text
               }
          }
     }

     public List<Pair<String, String>> getSubGroupDataList(Pattern[] pa, String subText) { // extract with the child (subgroup) templates
          List<Pair<String, String>> list = new ArrayList<Pair<String, String>>();
          for (int i = 1; i < pa.length; i++) {
               Pattern subp = pa[i];
               Matcher subm = subp.matcher(subText);
               while (subm.find()) {
                    for (int k = 0; k < subm.groupCount(); k++) {
                         String gn = subp.groupName(k);
                         if (gn != null) {
                              String gv = subm.group(k);
                              list.add(new Pair<String, String>(gn, gv));
                         }
                    }
               }
          }
          return list;
     }

     public String readString(String fileName) {
          // Loads the file from the classpath root; readLine() drops line terminators,
          // so the content is returned as one line.
          InputStream in = this.getClass().getResourceAsStream("/" + fileName);
          StringBuilder sb = new StringBuilder();
          try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
               String line;
               while ((line = reader.readLine()) != null) {
                    sb.append(line);
               }
          } catch (IOException e) {
               e.printStackTrace();
          }
          return sb.toString();
     }

     public List<Pair<String, String>> getDataList() {
          return this.dataList;
     }

     public static void main(String[] args) {
          InfomationExtraction ie = new InfomationExtraction("bbsPage.txt", "pattern.txt");
          ie.extract(2);
          for (Pair<String, String> p : ie.getDataList()) {
               System.out.println("[" + p.getKey() + " " + p.getValue() + "]");
          }
     }
}

Running a test gives the following output:

[title 標題]
[name a1]
[when 2009]
[content p1]
[name a2]
[when 2008]
[content p2]
[name a3]
[when 2007]
[content p3]
[name a4]
[when 2006]
[content p4]
[name 2005]
[when t5]
[content p5]

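As for how to organize the extracted information: if, for example, you index it with Lucene, you need to build Field and Document objects, so you would design an entity that holds all the Fields of one Document. A Document might consist of five items: URL, title, author, post time, and post content. This is easy to do.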
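One possible way to build such an entity from the flat list of key-value pairs is sketched below. It is only an illustration under stated assumptions: it uses the JDK's Map.Entry in place of the Pair class so that it is self-contained, it leaves out Lucene entirely, and the names PostGroupingSketch, ForumPost, kv, and the example URL are hypothetical. A new record is started whenever a name key appears, and the page title and URL are copied into every record.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PostGroupingSketch {

    // Hypothetical entity: one forum post, i.e. one future index document.
    static final class ForumPost {
        final String url;
        final String title;
        String author;
        String time;
        String content;

        ForumPost(String url, String title) {
            this.url = url;
            this.title = title;
        }

        @Override
        public String toString() {
            return url + " | " + title + " | " + author + " | " + time + " | " + content;
        }
    }

    // Small helper to build a key-value pair.
    static Map.Entry<String, String> kv(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Folds the flat (key, value) list into one ForumPost per "name" key.
    static List<ForumPost> group(String url, List<Map.Entry<String, String>> pairs) {
        List<ForumPost> posts = new ArrayList<>();
        String title = null;
        ForumPost current = null;
        for (Map.Entry<String, String> p : pairs) {
            switch (p.getKey()) {
                case "title":
                    title = p.getValue();
                    break;
                case "name": // a new author marks the start of a new post
                    current = new ForumPost(url, title);
                    current.author = p.getValue();
                    posts.add(current);
                    break;
                case "when":
                    if (current != null) current.time = p.getValue();
                    break;
                case "content":
                    if (current != null) current.content = p.getValue();
                    break;
                default:
                    break; // ignore keys this sketch does not know
            }
        }
        return posts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(kv("title", "T"));
        pairs.add(kv("name", "a1"));
        pairs.add(kv("when", "2009"));
        pairs.add(kv("content", "p1"));
        pairs.add(kv("name", "a2"));
        pairs.add(kv("when", "2008"));
        pairs.add(kv("content", "p2"));
        // The URL is a placeholder, not a real address.
        for (ForumPost post : group("http://example.invalid/thread/1", pairs)) {
            System.out.println(post);
        }
    }
}
```

Each ForumPost then maps naturally onto one document with five fields when it is handed to an indexer.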
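The JRegex library lets you configure templates very flexibly, especially in how multiple groups are designed; how you lay them out depends on your needs.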

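Original statement: This article is published under the Attribution-NonCommercial-ShareAlike 4.0 license. You are welcome to repost, use, and republish it, provided that you keep the attribution to the author 時延軍 (including the link http://shiyanjun.cn), do not use it for commercial purposes, and release any work derived from it under the same license. If you have any questions, please contact me.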
