HtmlAgilityPack: A Powerful Tool for Parsing HTML

Date: 2021-01-19


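In an earlier project I had to parse HTML. At the time I used regular expressions to strip out the irrelevant HTML comments and JavaScript step by step, and then more regular expressions to pull out the parts I needed. Doing this with regular expressions is tedious, especially if you are not very familiar with them or the HTML is complex. A while ago I used this same approach to import the bookmarks saved at http://wz.csdn.net/zhoufoxcn into http://cang.baidu.com. I had also wanted to organize the articles on my blog properly, but because the regular-expression route is so tedious and troublesome, I never got started.
That changed two days ago, when I came across HtmlAgilityPack, an HTML parsing library for .NET that supports querying HTML with XPath. After spending a little time learning its API and XPath, I built a simple tool that does the job. The index of my CSDN posts is now at http://blog.csdn.net/zhoufoxcn/archive/2011/06/23/6564578.aspx, and the index of my 51CTO posts is at http://zhoufoxcn.blog.51cto.com/792419/595327. HtmlAgilityPack is an open-source .NET library. Its home page is http://htmlagilitypack.codeplex.com/, where you can download the latest version of the library and the API manual, as well as a helper tool for debugging.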

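A Brief Introduction to XPath
XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path, which is made up of steps.
The most useful path expressions are listed below:
nodename: Selects all child elements of the current node that are named nodename.
/: Selects from the root node.
//: Selects matching nodes anywhere in the document, starting from the current node, regardless of their position.
.: Selects the current node.
..: Selects the parent of the current node.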
For example, consider the following XML:

    <?xml version="1.0" encoding="utf-8"?>
    <Articles>
      <Article>
        <Title>Using Highcharts JS Charts in ASP.NET</Title>
        <Url>http://zhoufoxcn.blog.51cto.com/792419/537324</Url>
        <CreateAt type="en">2011-04-07</CreateAt>
      </Article>
      <Article>
        <Title lang="eng">Log4Net in Detail (Continued)</Title>
        <Url>http://blog.csdn.net/zhoufoxcn/archive/2010/11/23/6029021.aspx</Url>
        <CreateAt type="zh-cn">2010年11月23日</CreateAt>
      </Article>
      <Article>
        <Title>General Steps in J2ME Development</Title>
        <Url>http://blog.csdn.net/zhoufoxcn/archive/2011/06/12/6540223.aspx</Url>
        <CreateAt type="zh-cn">2011年06月12日</CreateAt>
      </Article>
      <Article>
        <Title lang="eng">Advanced Uses of PowerDesign</Title>
        <Url>http://zhoufoxcn.blog.51cto.com/792419/166415</Url>
        <CreateAt type="zh-cn">2007-09-08</CreateAt>
      </Article>
    </Articles>
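
The basic path expressions listed earlier can be tried out with the XPath support built into .NET's System.Xml namespace. The following is a minimal sketch, not part of the original tool: the class name PathExpressionDemo and its cut-down two-article sample are assumptions made for illustration.

```csharp
using System;
using System.Xml;

public static class PathExpressionDemo
{
    // Cut-down stand-in for the Articles XML (two articles, shortened titles).
    public const string SampleXml =
        "<Articles>" +
        "<Article><Title>Highcharts</Title></Article>" +
        "<Article><Title lang='eng'>Log4Net</Title></Article>" +
        "</Articles>";

    public static XmlDocument LoadSample()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    public static void Demo()
    {
        XmlDocument doc = LoadSample();

        // "/" starts at the document root.
        XmlNode root = doc.SelectSingleNode("/Articles");

        // A bare node name selects children of the context node with that name.
        XmlNodeList articles = root.SelectNodes("Article");

        // "//" matches at any depth in the document.
        XmlNodeList titles = doc.SelectNodes("//Title");

        // "." is the context node itself, ".." is its parent.
        XmlNode firstTitle = titles[0];
        Console.WriteLine(articles.Count);                          // 2
        Console.WriteLine(firstTitle.SelectSingleNode(".").Name);   // Title
        Console.WriteLine(firstTitle.SelectSingleNode("..").Name);  // Article
    }
}
```

Calling PathExpressionDemo.Demo() prints the three values noted in the comments. Note that a bare name is relative to the context node: evaluated on the document itself, "Title" matches nothing, because Title is not a direct child of the document.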

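For the XML above, here are some path expressions with predicates, together with what each one selects:
/Articles/Article[1]: Selects the first Article element that is a child of Articles.
/Articles/Article[last()]: Selects the last Article element that is a child of Articles.
/Articles/Article[last()-1]: Selects the second-to-last Article element that is a child of Articles.
/Articles/Article[position()<3]: Selects the first two Article elements that are children of Articles.
//Title[@lang]: Selects all Title elements that have an attribute named lang. (XML names are case-sensitive, so the expression must use Title, not title.)
//CreateAt[@type='zh-cn']: Selects all CreateAt elements whose type attribute has the value zh-cn.
/Articles/Article[Order>2]: Selects every Article element under Articles whose Order child element has a value greater than 2. (The sample above has no Order element; assume here that each Article also carries a numeric Order child.)
/Articles/Article[Order<3]/Title: Selects the Title element of every Article under Articles whose Order child element has a value less than 3, under the same assumption.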
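
To check what these predicate expressions return, they can be evaluated with System.Xml. This is a hedged sketch: PredicateDemo, Count and TitleOf are hypothetical helper names, and the sample is a shortened copy of the Articles XML (same structure and attributes, abbreviated titles and dates, no Url elements).

```csharp
using System;
using System.Xml;

public static class PredicateDemo
{
    // Shortened copy of the Articles XML: four articles, two Titles with a
    // lang attribute, three CreateAt elements with type='zh-cn'.
    public const string SampleXml =
        "<Articles>" +
        "<Article><Title>Highcharts</Title><CreateAt type='en'>2011-04-07</CreateAt></Article>" +
        "<Article><Title lang='eng'>Log4Net</Title><CreateAt type='zh-cn'>2010-11-23</CreateAt></Article>" +
        "<Article><Title>J2ME</Title><CreateAt type='zh-cn'>2011-06-12</CreateAt></Article>" +
        "<Article><Title lang='eng'>PowerDesign</Title><CreateAt type='zh-cn'>2007-09-08</CreateAt></Article>" +
        "</Articles>";

    private static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    // Number of nodes selected by the given XPath expression.
    public static int Count(string xpath)
    {
        return Load().SelectNodes(xpath).Count;
    }

    // Text of the Title child of the single Article selected by articleXPath,
    // or null when the expression selects nothing.
    public static string TitleOf(string articleXPath)
    {
        XmlNode title = Load().SelectSingleNode(articleXPath + "/Title");
        return title == null ? null : title.InnerText;
    }
}
```

With this sample, Count("/Articles/Article[position()<3]") is 2, Count("//Title[@lang]") is 2, Count("//CreateAt[@type='zh-cn']") is 3, and TitleOf("/Articles/Article[last()-1]") is "J2ME". Count("//title[@lang]") is 0, which shows why the element name must match the document's casing.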

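A Brief Introduction to the HtmlAgilityPack API
The classes used most often in HtmlAgilityPack are HtmlDocument, HtmlNodeCollection, HtmlNode and HtmlWeb.
The usual workflow starts by obtaining the HTML. Static content can be loaded with HtmlDocument's Load() or LoadHtml() method. For content on the network, HtmlWeb's Load() method fetches the HTML at a URL and returns an HtmlDocument (its Get() method saves the downloaded page to a file instead).
Once you have an HtmlDocument instance, use its DocumentNode property, which is the root node of the whole HTML document and is itself an HtmlNode. From there, HtmlNode's SelectNodes() method returns an HtmlNodeCollection containing multiple HtmlNode objects, and HtmlNode's SelectSingleNode() method returns a single HtmlNode.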
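
The workflow just described can be sketched in a few lines. This is an illustrative example rather than code from the original tool: the class LinkExtractor, the method ExtractLinks and the id value categories are assumptions, and the block needs the HtmlAgilityPack package referenced by the project.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns (link text, href) pairs for every <a> directly inside the
    // list items of <ul id="categories">. The id is a made-up example.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var result = new List<KeyValuePair<string, string>>();

        // Step 1: load the HTML into an HtmlDocument.
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Step 2: start from the root node and query it with XPath.
        HtmlNode root = document.DocumentNode;
        HtmlNodeCollection anchors = root.SelectNodes("//ul[@id='categories']/li/a");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (anchors == null)
        {
            return result;
        }

        // Step 3: read text and attributes from each matched node.
        foreach (HtmlNode anchor in anchors)
        {
            string text = anchor.InnerText.Trim();
            string href = anchor.GetAttributeValue("href", string.Empty);
            result.Add(new KeyValuePair<string, string>(text, href));
        }
        return result;
    }
}
```

For the input `<ul id='categories'><li><a href='/cat/1'>ASP.NET</a></li></ul>`, ExtractLinks returns one pair with the key ASP.NET and the value /cat/1. The null check matters in practice: a page layout change that breaks the XPath would otherwise surface as a NullReferenceException in the foreach loop.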

HtmlAgilityPack in Practice
Here is a code example that parses a CSDN blog:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using HtmlAgilityPack;
    using System.Text.RegularExpressions;

    namespace CrawlPageApplication
    {
        /**
         * Author: Zhou Gong
         * Date: 2011-06-23
         * Blog: http://blog.csdn.net/zhoufoxcn or http://zhoufoxcn.blog.51cto.com
         * Weibo: http://weibo.com/zhoufoxcn
         */
        public class CSDN_Parser
        {
            // Base URI for resolving relative category links. The listing uses
            // UriBase without declaring it, so this value is an assumption.
            private static readonly Uri UriBase = new Uri("http://blog.csdn.net");
            private const string CategoryListXPath = "//html[1]/body[1]/div[1]/div[1]/div[2]/div[1]/div[1]/dl[1]/dd[3]/div[1]/ul[1]/li";
            private const string CategoryNameXPath = "//li[1]/a[2]";

            /// <summary>
            /// Parses the blog index page.
            /// </summary>
            /// <param name="url">URL of the blog index page</param>
            /// <returns>The categories found on the page</returns>
            public static List<Category> ParseIndexPage(string url)
            {
                Uri uriCategory = null;
                List<Category> list = new List<Category>(40);

                HtmlDocument document = new HtmlDocument();
                // Note: the class from my other library that loads a URL is omitted
                // here; a local HTML file is loaded directly instead.
                //string html = HttpWebUtility.ReadFromUrl(url, Encoding.UTF8);
                //document.LoadHtml(html);
                document.Load("CSDN_index.html", Encoding.UTF8);
                HtmlNode rootNode = document.DocumentNode;
                HtmlNodeCollection categoryNodeList = rootNode.SelectNodes(CategoryListXPath);
                // SelectNodes returns null when nothing matches.
                if (categoryNodeList == null)
                {
                    return list;
                }
                HtmlNode temp = null;
                Category category = null;
                foreach (HtmlNode categoryNode in categoryNodeList)
                {
                    // Re-parse the <li> as a standalone fragment so that
                    // CategoryNameXPath is evaluated against this item only.
                    temp = HtmlNode.CreateNode(categoryNode.OuterHtml);
                    category = new Category();
                    category.Subject = temp.SelectSingleNode(CategoryNameXPath).InnerText;
                    Uri.TryCreate(UriBase, temp.SelectSingleNode(CategoryNameXPath).Attributes["href"].Value, out uriCategory);
                    category.IndexUrl = uriCategory.ToString();
                    category.PageUrlFormat = category.IndexUrl + "?PageNumber={0}";
                    list.Add(category);
                    Category.CategoryDetails.Add(category.IndexUrl, category);
                }
                return list;
            }
        }
    }
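
The parser resolves each category link against a base URI with Uri.TryCreate and then derives a paging URL from it. The sketch below isolates that step using only the standard library. UrlHelper, Resolve and PageUrl are hypothetical names, and the paths in the usage note are invented for illustration.

```csharp
using System;

public static class UrlHelper
{
    // Resolves an href taken from a page against the site's base URI.
    // Returns null if the combination is not a valid URI.
    public static string Resolve(Uri baseUri, string href)
    {
        Uri resolved;
        return Uri.TryCreate(baseUri, href, out resolved)
            ? resolved.ToString()
            : null;
    }

    // Fills a page-number placeholder such as "{0}" in a URL format string.
    public static string PageUrl(string pageUrlFormat, int pageNumber)
    {
        return string.Format(pageUrlFormat, pageNumber);
    }
}
```

For example, Resolve(new Uri("http://blog.csdn.net"), "/zhoufoxcn/category/1.aspx") gives "http://blog.csdn.net/zhoufoxcn/category/1.aspx", and PageUrl of that value plus "?PageNumber={0}" with page 3 ends in "?PageNumber=3". Checking the return value of TryCreate, as Resolve does, avoids dereferencing a null Uri when an href is malformed.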

The code for parsing the article data of a 51CTO blog is similar:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using HtmlAgilityPack;
    using System.Text.RegularExpressions;

    namespace CrawlPageApplication
    {
        /**
         * Author: Zhou Gong
         * Date: 2011-06-23
         * Blog: http://blog.csdn.net/zhoufoxcn or http://zhoufoxcn.blog.51cto.com
         * Weibo: http://weibo.com/zhoufoxcn
         */
        public class CTO_Parser
        {
            private static Encoding PageEncoding = Encoding.GetEncoding("gb2312");
            private static readonly Uri UriBase = new Uri("http://zhoufoxcn.blog.51cto.com");
            private static string CategoryListXPath = "/html[1]/body[1]/div[5]/div[1]/div[1]/div[2]/ul[1]/li";
            private static string CategoryNameXPath = "/li[1]/a[1]";

            /// <summary>
            /// Parses the blog index page.
            /// </summary>
            /// <param name="url">URL of the blog index page</param>
            /// <returns>The categories found on the page</returns>
            public static List<Category> ParseIndexPage(string url)
            {
                Uri uriCategory = null;
                List<Category> list = new List<Category>(40);

                HtmlDocument document = new HtmlDocument();
                //string html = HttpWebUtility.ReadFromUrl(url, PageEncoding);
                //document.LoadHtml(html);
                document.Load("51cto_index.html", PageEncoding);
                HtmlNode rootNode = document.DocumentNode;
                HtmlNodeCollection categoryNodeList = rootNode.SelectNodes(CategoryListXPath);
                // SelectNodes returns null when nothing matches.
                if (categoryNodeList == null)
                {
                    return list;
                }
                HtmlNode temp = null;
                Category category = null;
                foreach (HtmlNode categoryNode in categoryNodeList)
                {
                    // Re-parse the <li> as a standalone fragment so that
                    // CategoryNameXPath is evaluated against this item only.
                    temp = HtmlNode.CreateNode(categoryNode.OuterHtml);
                    // Skip the "all articles" entry, which is not a real category.
                    if (temp.SelectSingleNode(CategoryNameXPath).InnerText != "所有文章")
                    {
                        category = new Category();
                        category.Subject = temp.SelectSingleNode(CategoryNameXPath).InnerText;
                        Uri.TryCreate(UriBase, temp.SelectSingleNode(CategoryNameXPath).Attributes["href"].Value, out uriCategory);
                        category.IndexUrl = uriCategory.ToString();
                        category.PageUrlFormat = category.IndexUrl + "/page/{0}";
                        list.Add(category);
                        Category.CategoryDetails.Add(category.IndexUrl, category);
                    }
                }
                return list;
            }
        }
    }
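
The 51CTO parser reads its page as gb2312. One thing to be aware of when running such code today: on .NET Core and .NET 5 or later, code-page encodings like gb2312 are not available until a provider is registered, whereas the .NET Framework of 2011 shipped them built in. The following sketch assumes a modern .NET runtime. EncodingDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns the gb2312 encoding, registering the code-page provider first.
    // Registering the same provider more than once is harmless.
    public static Encoding GetGb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Encodes text to gb2312 bytes and decodes it back again.
    public static string RoundTrip(string text)
    {
        Encoding gb2312 = GetGb2312();
        byte[] bytes = gb2312.GetBytes(text);
        return gb2312.GetString(bytes);
    }
}
```

The Encoding returned by GetGb2312() can be passed as the second argument of HtmlDocument.Load, in place of the PageEncoding field used above.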

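The code above uses a Category class, which holds each category's Subject, IndexUrl and PageUrlFormat, along with a static CategoryDetails collection keyed by IndexUrl.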
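
A minimal sketch of what such a class might look like follows. It is inferred purely from how the two parsers use it and is not the author's actual definition. The Dictionary type for CategoryDetails and the GetPageUrl helper are assumptions.

```csharp
using System;
using System.Collections.Generic;

public class Category
{
    // Category name as shown on the blog's index page.
    public string Subject { get; set; }

    // Absolute URL of the category's first page.
    public string IndexUrl { get; set; }

    // Format string with a {0} placeholder for the page number.
    public string PageUrlFormat { get; set; }

    // Lookup of all parsed categories, keyed by IndexUrl (assumed type).
    public static readonly Dictionary<string, Category> CategoryDetails =
        new Dictionary<string, Category>();

    // Hypothetical helper: builds the URL of a given page of this category.
    public string GetPageUrl(int pageNumber)
    {
        return string.Format(PageUrlFormat, pageNumber);
    }
}
```

Because Dictionary.Add throws on a duplicate key, calling ParseIndexPage twice in one process would fail under this sketch. Using the indexer (CategoryDetails[key] = value) would be one way around that.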
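To encourage readers to try this themselves, and because the project uses my private class library, I am not providing the full source code for download. Here is the final interface of the tool:
[Screenshot: final interface of the tool]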

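Summary: HtmlAgilityPack really is a powerful, lightweight, open-source HTML parsing library. This article covers only a few of its classes, yet those alone were enough for me to quickly build something I had been putting off for a long time. Doing the same with regular expressions would certainly have taken much longer.
Note: I have also been exploring some Weibo (microblog) applications lately. If you share that interest or use Weibo, please follow me at http://weibo.com/zhoufoxcn.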

2011-06-24
Zhou Gong