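Part of this post comes from other blogs on the Internet, but it was so long ago that I have forgotten which ones I referenced. My apologies.
First, here is the HTML page: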
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=GBK">
<title>HTML Parser</title>
<meta name="generator" content="Namo WebEditor">
</head>
<body>
<table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc>
  <tr>
    <td width=100%>
      <table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB>
        <tr bgcolor=#D3E5FB>
          <td width=20%><font size="2" face="Arial,Verdana"><b>想學習 Name</b></font><br> </td>
          <td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br> </td>
          <td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br> </td>
          <td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br> </td>
        </tr>
        <tr bgcolor=#eeeeee>
          <td width=20%><font size="1" face="Arial,Verdana"><b>9</b> 想學習</font><br> </td>
          <td width=13%><font size="1" face="Arial,Verdana"><font color=#ff0033>+FAIL</font> <a href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想學習</a></font><br> </td>
          <td width=8%><font size="1" face="Arial,Verdana">12:31</font><br> </td>
          <td width=59%><font size="1" face="Arial,Verdana">想學習</font><br> </td>
        </tr>
        <tr bgcolor=#ffffff>
          <td width=20%><font size="1" face="Arial,Verdana"><b>1</b> cdrouter_basic_1</font><br> </td>
          <td width=13%><font size="1" face="Arial,Verdana">Pass <a href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想學習</a></font><br> </td>
          <td width=8%><font size="1" face="Arial,Verdana">00:00</font><br> </td>
          <td width=59%><font size="1" face="Arial,Verdana">想學習</font><br> </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
</body>
</html>
After searching online I found the jericho-html-3.3 library, which can be used to parse tables. It really is very convenient.
The code is as follows:
package com.xxx.hbuassys.test;

import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "test.html";
        // Treat a bare file name as a file: URL.
        if (sourceUrlString.indexOf(':') == -1) {
            sourceUrlString = "file:" + sourceUrlString;
        }
        Source source = new Source(new URL(sourceUrlString));

        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we want the second (inner) one, so drop the first.
        tables.remove(0);

        for (Element table : tables) {
            List<Element> rows = table.getContent().getAllElements(HTMLElementName.TR);
            for (Element row : rows) {
                List<Element> fonts = row.getContent().getAllElements(HTMLElementName.FONT);
                int i = 1;
                for (Element font : fonts) {
                    // Print the plain text of each font element, numbered per row.
                    System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}

Result:
1 = 想學習 Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 想學習
2 = +FAIL 想學習
3 = +FAIL
4 = 12:31
5 = 想學習
1 = 1 cdrouter_basic_1
2 = Pass 想學習
3 = 00:00
4 = 想學習
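The general idea is: first get all the table tags, then parse the table you need and take out its tr elements, and then from each tr take out the cell contents (the code above reads the font elements inside each td). That gives us the content we need.
If I stopped here, this would be no different from what everyone else online has already written.
While using this library for a project, I ran into a problem:
If the HTML page is encoded as UTF-8, the parsed content comes out garbled. Re-encoding the garbled text directly, with operations such as new String(str.getBytes(),"GBK"), does not solve the problem. I tested this myself.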
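One possible explanation for why re-encoding cannot help: when bytes are decoded with the wrong charset, the invalid sequences are replaced by U+FFFD and the original bytes are gone. The sketch below illustrates this under assumptions that are not confirmed by the post: that the file's bytes are GBK while the text is decoded as UTF-8, and that `getBytes()` would use UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyDecodeSketch {
    static final Charset GBK = Charset.forName("GBK");

    /** Encodes the text as GBK, then decodes those bytes as UTF-8 (the assumed mismatch). */
    static String decodeWrongly(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.UTF_8);
    }

    /** The re-encoding attempt: turn the garbled text back into bytes and decode as GBK. */
    static String naiveRepair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.UTF_8), GBK);
    }

    public static void main(String[] args) {
        // The sample cell text, written as Unicode escapes so the source file
        // encoding does not matter.
        String original = "\u60F3\u5B78\u7FD2";
        String garbled = decodeWrongly(original);
        String repaired = naiveRepair(garbled);
        System.out.println("garbled differs from original: " + !garbled.equals(original));
        System.out.println("repair restored the text: " + repaired.equals(original));
    }
}
```

In this sketch the repair does not restore the text, because the information was already lost at decoding time. The decoding step itself has to use the right charset.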
For example, if the HTML page is changed to the following, the output becomes garbled, as shown after it:
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
<title>HTML Parser</title>
<meta name="generator" content="Namo WebEditor">
</head>
<body>
<table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc>
  <tr>
    <td width=100%>
      <table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB>
        <tr bgcolor=#D3E5FB>
          <td width=20%><font size="2" face="Arial,Verdana"><b>想學習 Name</b></font><br> </td>
          <td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br> </td>
          <td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br> </td>
          <td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br> </td>
        </tr>
        <tr bgcolor=#eeeeee>
          <td width=20%><font size="1" face="Arial,Verdana"><b>9</b> 想學習</font><br> </td>
          <td width=13%><font size="1" face="Arial,Verdana"><font color=#ff0033>+FAIL</font> <a href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想學習</a></font><br> </td>
          <td width=8%><font size="1" face="Arial,Verdana">12:31</font><br> </td>
          <td width=59%><font size="1" face="Arial,Verdana">想學習</font><br> </td>
        </tr>
        <tr bgcolor=#ffffff>
          <td width=20%><font size="1" face="Arial,Verdana"><b>1</b> cdrouter_basic_1</font><br> </td>
          <td width=13%><font size="1" face="Arial,Verdana">Pass <a href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想學習</a></font><br> </td>
          <td width=8%><font size="1" face="Arial,Verdana">00:00</font><br> </td>
          <td width=59%><font size="1" face="Arial,Verdana">想學習</font><br> </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
</body>
</html>
1 = ???
? Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 ???
?
2 = +FAIL ?
???
3 = +FAIL
4 = 12:31
5 = ?
?
??
1 = 1 cdrouter_basic_1
2 = Pass ??
??
3 = 00:00
4 = ?
?
??
The approach I used is to change <meta http-equiv="content-type" content="text/html;charset=UTF-8"> into <meta http-equiv="content-type" content="text/html;charset=GBK">.
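An exact-string replacement of the meta tag only matches when the tag is written character for character as shown. As a hedged sketch, the same rewrite could be done with a regular expression that tolerates different quoting, spacing and letter case. The class name, method name and pattern here are hypothetical and not part of the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetRewrite {
    // Group 1: everything from "<meta" up to and including "charset=" and an optional quote.
    // Group 2: the charset name currently declared.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(<meta\\b[^>]*\\bcharset\\s*=\\s*[\"']?)([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Replaces the charset declared in the first matching meta tag; other text is untouched. */
    static String declareCharset(String html, String charset) {
        return META_CHARSET.matcher(html)
                .replaceFirst("$1" + Matcher.quoteReplacement(charset));
    }

    public static void main(String[] args) {
        String before = "<meta http-equiv=\"content-type\" content=\"text/html;charset=UTF-8\">";
        System.out.println(declareCharset(before, "GBK"));
    }
}
```

If no meta tag declares a charset, the input is returned unchanged, so the call is safe to apply to any page.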
For details, see the following code:
package com.xxx.hbuassys.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class HtmlParser {
    public static void main(String[] args) throws Exception {
        // Read the file into a String. No charset is passed to InputStreamReader,
        // so the platform default charset is used for decoding.
        StringBuilder sbf = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File("test.html"))))) {
            String str;
            while ((str = reader.readLine()) != null) {
                sbf.append(str).append("\n");
            }
        }

        // Fix for the garbled Chinese text: declare GBK instead of UTF-8 in the meta tag.
        String html = sbf.toString().replace(
                "<meta http-equiv=\"content-type\" content=\"text/html;charset=UTF-8\">",
                "<meta http-equiv=\"content-type\" content=\"text/html;charset=GBK\">");
        Source source = new Source(html);

        List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
        // The tables are nested and we want the second (inner) one, so drop the first.
        tables.remove(0);

        for (Element table : tables) {
            List<Element> rows = table.getContent().getAllElements(HTMLElementName.TR);
            for (Element row : rows) {
                List<Element> fonts = row.getContent().getAllElements(HTMLElementName.FONT);
                int i = 1;
                for (Element font : fonts) {
                    // Print the plain text of each font element, numbered per row.
                    System.out.println(i + " = " + font.getContent().getTextExtractor().toString());
                    i++;
                }
                System.out.println();
            }
        }
    }
}

Result:
1 = 想學習 Name
2 = Result
3 = Time
4 = Synopsis
1 = 9 想學習
2 = +FAIL 想學習
3 = +FAIL
4 = 12:31
5 = 想學習
1 = 1 cdrouter_basic_1
2 = Pass 想學習
3 = 00:00
4 = 想學習
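The program above passes Jericho a String that has already been decoded, so the bytes are turned into characters by the InputStreamReader, which falls back to the platform default charset. As an alternative that does not depend on that default, here is a minimal sketch that decodes the file with an explicitly chosen charset. It assumes the file's real byte encoding is known, and `readHtml` and the class name are hypothetical. The resulting String could then be passed to `new Source(html)` as in the program above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {
    /** Reads the whole file and decodes its bytes with the given charset. */
    static String readHtml(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 sample to a temporary file, then read it back.
        // The Chinese text is written as Unicode escapes so the source file
        // encoding does not matter.
        String html = "<td>\u60F3\u5B78\u7FD2</td>";
        Path file = Files.createTempFile("test", ".html");
        try {
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
            String decoded = readHtml(file, StandardCharsets.UTF_8);
            System.out.println("decoded matches: " + decoded.equals(html));
        } finally {
            Files.delete(file);
        }
    }
}
```

For a file saved as GBK, the same call would take `Charset.forName("GBK")` instead of `StandardCharsets.UTF_8`.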