A WeChat Official Account Crawler Based on Sogou Search (C# Version)

Date: 2019-11-19

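Author: Hoyho Luo

Email: luohaihao@gmail.com

Please keep this attribution when reposting.

This article describes a targeted crawler for WeChat official accounts that goes through Sogou search. It is written in C#, hence the name WeGouSharp. The project is hosted on GitHub; you can get the source from the WeGouSharp repository, and stars are welcome. There are already plenty of WeChat official account crawlers online, but almost all of them are written in Python. As a .NET developer, I built this wheel for fellow developers in the Microsoft ecosystem. This C# crawler is modeled on Chyroc/WechatSogou. Feedback from more experienced readers is welcome.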

1. Project structure

2. Data structures

3. Introduction to XPath

4. Parsing web pages with HtmlAgilityPack

5. CAPTCHA handling and file caching

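1. Project structure

The project is organized as follows:

API class:

All user-facing operations are wrapped in the API class; call its methods directly.

Basic class:

The main processing logic.

FileCache:

When a CAPTCHA appears, a cookie is needed to verify identity. This class can encrypt and then serialize values such as UIN, BIZ and COOKIE so they can be reused later.

HttpHelper class:

Network requests, including image downloads.

Tools class:

Image handling, cookie loading, and similar helpers.

For dependencies, you can use the versions already in the package folder, or add them yourself through NuGet (Visual Studio --> Tools --> NuGet Package Manager --> Package Manager Console), for example: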

Install-Package HtmlAgilityPack

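2. Data structures

The project defines several structures modeled on WeChat official accounts and Sogou search results; see the model classes. The main ones are:

Official account structure: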

public struct OfficialAccount
{
    public string AccountPageurl;
    public string WeChatId;
    public string Name;
    public string Introduction;
    public bool IsAuth;
    public string QrCode;
    public string ProfilePicture;
    //public string RecentArticleUrl;
}

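Field meanings

Field	Meaning
AccountPageurl	URL of the official account page
WeChatId	Official account ID (unique)
Name	Name
Introduction	Introduction
IsAuth	Whether the account is officially verified
QrCode	QR code link
ProfilePicture	Profile picture link

Broadcast message structure (including image-and-text pushes)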

public struct BatchMessage
{
    public int Meaasgeid;
    public string SendDate;
    public string Type; // 49: image-and-text, 1: text, 3: image, 34: audio, 62: video

    public string Content;

    public string ImageUrl;

    public string PlayLength;
    public int FileId;
    public string AudioSrc;

    // for type 49 (image-and-text)
    public string ContentUrl;
    public int Main;
    public string Title;
    public string Digest;
    public string SourceUrl;
    public string Cover;
    public string Author;
    public string CopyrightStat;

    // for type 62 (video)
    public string CdnVideoId;
    public string Thumb;
    public string VideoSrc;

    // others
}

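Field meanings

Field	Meaning
Meaasgeid	Message ID
SendDate	Send time (Unix timestamp)
Type	Message type: 49: image-and-text, 1: text, 3: image, 34: audio, 62: video
Content	Text content (type 1, text)
ImageUrl	Image (type 3, image)
PlayLength	Playback length (type 34, audio; the same applies to the next two fields)
FileId	Audio file ID
AudioSrc	Audio source
ContentUrl	Article source URL (type 49, image-and-text; the same applies to the fields below)
Main	Unclear
Title	Article title
Digest	Unclear
SourceUrl	Probably the "Read the original" link
Cover	Cover image
Author	Author
CopyrightStat	Possibly whether the article is original
CdnVideoId	Video ID (type 62, video; the same applies to the fields below)
Thumb	Video thumbnail
VideoSrc	Video link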
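
As a small illustration of the table above, the sketch below maps the Type codes to readable names and converts a SendDate value into a DateTime. The class and method names (BatchMessageHelpers, DescribeType, ParseSendDate) are hypothetical and not part of WeGouSharp, and the sketch assumes SendDate holds Unix seconds as a decimal string.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class BatchMessageHelpers
{
    // Type codes exactly as listed in the field table.
    private static readonly Dictionary<string, string> TypeNames = new Dictionary<string, string>
    {
        { "49", "image-and-text" },
        { "1", "text" },
        { "3", "image" },
        { "34", "audio" },
        { "62", "video" },
    };

    // Returns a readable name for a Type code, or "unknown" for anything not in the table.
    public static string DescribeType(string type)
    {
        string name;
        return TypeNames.TryGetValue(type ?? "", out name) ? name : "unknown";
    }

    // Assumption: SendDate is Unix time in seconds, stored as a string.
    public static DateTime ParseSendDate(string sendDate)
    {
        long seconds = long.Parse(sendDate, CultureInfo.InvariantCulture);
        return new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).AddSeconds(seconds);
    }
}
```

Keeping the codes in one lookup table means a new message type only needs one extra entry, and unrecognized codes degrade to "unknown" instead of throwing.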

Article structure

public struct Article
{
    public string Url;
    public List<string> Imgs;
    public string Title;
    public string Brief;
    public string Time;
    public string ArticleListUrl;
    public OfficialAccount officialAccount;
}

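Field meanings

Field	Meaning
Url	Article link
Imgs	Cover images (there may be several)
Title	Article title
Brief	Summary
Time	Publication date (Unix timestamp)
officialAccount	The associated official account (incomplete information, for reference only)

Hot search ranking structure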

public struct HotWord
{
    public int Rank; // rank
    public string Word;
    public string JumpLink; // related link
    public int HotDegree; // popularity
}

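3. Introduction to XPath

What is XPath?

XPath uses path expressions to navigate XML documents
XPath contains a library of standard functions
XPath is a major element in XSLT
XPath is a W3C standard

In short, an XPath expression describes where elements sit in an XML document. What follows is tutorial material from W3School; experienced readers can skip it.

XML example document

We will use this XML document in the examples below.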

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

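Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, or steps.

The most useful path expressions are listed below:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

Examples

The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they are.
//@lang	Selects all attributes named lang.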
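
The path expressions in this table can be tried out with .NET's built-in System.Xml before moving on to HTML. The following is a minimal sketch: the class name XPathPathDemo and the Select helper are made up for illustration, and the XML is the bookstore sample with the XML declaration left out.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XPathPathDemo
{
    // The bookstore sample document, without the XML declaration.
    public const string BookstoreXml =
        "<bookstore>" +
        "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>" +
        "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>" +
        "</bookstore>";

    // Evaluates an XPath expression with the document node as context and
    // returns the text of every selected node (element text or attribute value).
    public static List<string> Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(BookstoreXml);
        var results = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

For example, XPathPathDemo.Select("//@lang") returns two values, both "eng", and XPathPathDemo.Select("bookstore/book") returns one entry per book.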

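Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are written inside square brackets.

Examples

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects the book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of the book elements under bookstore whose price element has a value greater than 35.00.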
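
Predicates are easier to trust once they have been run against real data. The sketch below evaluates a predicate expression against the bookstore sample with System.Xml and returns the titles of the selected books. The names XPathPredicateDemo and TitlesOf are hypothetical helpers for this illustration only.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XPathPredicateDemo
{
    // The bookstore sample document, without the XML declaration.
    private const string BookstoreXml =
        "<bookstore>" +
        "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>" +
        "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>" +
        "</bookstore>";

    // Runs an XPath expression that selects book elements and returns
    // the title text of each selected book, in document order.
    public static List<string> TitlesOf(string bookXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(BookstoreXml);
        var titles = new List<string>();
        foreach (XmlNode book in doc.SelectNodes(bookXPath))
        {
            titles.Add(book.SelectSingleNode("title").InnerText);
        }
        return titles;
    }
}
```

With only two books in the sample, /bookstore/book[last()-1] and /bookstore/book[1] select the same element, while /bookstore/book[price>35.00] selects only "Learning XML".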

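Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.

Examples

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.

Source: http://www.w3school.com.cn/xpath/xpath_syntax.asp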

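Suppose I want to grab a banner image from the home page. In Chrome, press F12, inspect the element, and copy its XPath.

The XPath for that image is: //*[@id="loginWrap"]/div[4]/div[1]/div[1]/div/a[4]/img

Reading it: the image is the img tag reached by starting at the element with id loginWrap, going into its 4th div, and so on down the path.

XPath is covered here because the pages are parsed with HtmlAgilityPack, which can extract the elements we need from an XPath expression.

For example, once debugging has pinned down the XPaths for the title, image, author, content and link on an article page, all of that information can be extracted accurately and in bulk.

4. Parsing web pages with HtmlAgilityPack

The HttpTool class wraps an HTTP GET operation that takes quite a few parameters and is used to fetch Sogou pages.

Because Sogou is itself a search engine, its anti-crawling measures are very strict. The HTTP GET method therefore has to carry many parameters, and different pages have different requirements. In general, send the default Referer and Host headers and spoof the User-Agent request header. Commonly used user agents include: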

public static List<string> _agent = new List<string>
{
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
};
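
To show how a list like this is typically consumed, here is a minimal sketch that picks one user agent at random and builds the Referer, Host and User-Agent headers mentioned above. HeaderSketch, PickUserAgent and BuildSogouHeaders are hypothetical names, not part of WeGouSharp. Note that Random.Next(n) returns a value from 0 up to but not including n, so the list's Count is passed directly.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class HeaderSketch
{
    // A single shared Random avoids repeated seeds when called in quick succession.
    // It is not thread-safe; add locking if several threads call PickUserAgent.
    private static readonly Random Rng = new Random();

    // Picks one entry uniformly at random. Passing Count (not Count - 1)
    // keeps the last entry of the list reachable.
    public static string PickUserAgent(IList<string> agents)
    {
        if (agents == null || agents.Count == 0)
        {
            throw new ArgumentException("agent list must not be empty", nameof(agents));
        }
        return agents[Rng.Next(agents.Count)];
    }

    // Builds the default header set: Referer, Host and a spoofed User-Agent.
    public static WebHeaderCollection BuildSogouHeaders(IList<string> agents)
    {
        var headers = new WebHeaderCollection();
        headers.Add("Referer", "http://weixin.sogou.com/");
        headers.Add("Host", "weixin.sogou.com");
        headers.Add("User-Agent", PickUserAgent(agents));
        return headers;
    }
}
```

The resulting collection can be handed to a GET helper that copies each entry onto the matching HttpWebRequest property.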

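The custom GET method:

        /// <summary>
        /// HTTP GET with caller-specified header values
        /// </summary>
        /// <param name="headers">Headers to apply to the request</param>
        /// <param name="url">URL to request</param>
        /// <returns>The response body, or Base64 text if the response is a JPEG image</returns>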
        public string Get(WebHeaderCollection headers, string url ,string responseEncoding="UTF-8",bool isUseCookie = false)
        {
            string responseText = "";
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                    request.Method = "GET";
                foreach (string key in headers.Keys)
                {
                    switch (key.ToLower())
                    {
                        case "user-agent":
                            request.UserAgent = headers[key];
                            break;
                        case "referer":
                            request.Referer = headers[key];
                            break;
                        case "host":
                            request.Host = headers[key];
                            break;
                        case "contenttype":
                        case "content-type": // the standard header name, lower-cased
                            request.ContentType = headers[key];
                            break;
                        case "accept":
                            request.Accept = headers[key];
                            break;
                        default:
                            break;
                    }
               }
                if (string.IsNullOrEmpty(request.Referer))
                {
                    request.Referer = "http://weixin.sogou.com/";
                }
                if (string.IsNullOrEmpty(request.Host))
                {
                    request.Host = "weixin.sogou.com";
                }
                if (string.IsNullOrEmpty(request.UserAgent))
                {
                    // Random.Next(n) excludes n, so pass Count to keep the last agent selectable
                    Random r = new Random();
                    int index = r.Next(WechatSogouBasic._agent.Count);
                    request.UserAgent = WechatSogouBasic._agent[index];
                }
                if (isUseCookie)
                {
                    CookieCollection cc = Tools.LoadCookieFromCache();
                    request.CookieContainer = new CookieContainer();
                    request.CookieContainer.Add(cc);
                }
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();
                if (isUseCookie && response.Cookies.Count >0)
                {
                    var cookieCollection = response.Cookies;
                    WechatCache cache = new WechatCache(Config.CacheDir, 3000);
                    if (!cache.Add("cookieCollection", cookieCollection, 3000)) { cache.Update("cookieCollection", cookieCollection, 3000); }
                }
                // Get the stream containing content returned by the server.
                Stream dataStream = response.GetResponseStream();
                //If the response is an image, return its content as Base64; otherwise return the HTML
                if (response.Headers.Get("Content-Type") == "image/jpeg" || response.Headers.Get("Content-Type") == "image/jpg")
                {
                    Image img = Image.FromStream(dataStream, true);
                    using (MemoryStream ms = new MemoryStream())
                    {
                        // Convert Image to byte[]
                        //img.Save("myfile.jpg");
                        img.Save(ms,System.Drawing.Imaging.ImageFormat.Jpeg);
                        byte[] imageBytes = ms.ToArray();
                        // Convert byte[] to Base64 String
                        string base64String = Convert.ToBase64String(imageBytes);
                        responseText = base64String;
                    }
                }
                else //read response string
                {
                    // Open the stream using a StreamReader for easy access.
                    Encoding encoding;
                    switch (responseEncoding.ToLower())
                    {
                        case "utf-8":
                            encoding = Encoding.UTF8;
                            break;
                        case "unicode":
                            encoding = Encoding.Unicode;
                            break;
                        case "ascii":
                            encoding = Encoding.ASCII;
                            break;
                        default:
                            encoding = Encoding.Default;
                            break;
                               
                    }
                    StreamReader reader = new StreamReader(dataStream, encoding);//System.Text.Encoding.Default
                    // Read the content.
                    if (response.StatusCode == HttpStatusCode.OK)
                    {
                        responseText = reader.ReadToEnd();
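                        // Sogou's block page ("your visits are too frequent ... please help verify") means a CAPTCHA is required;
                        // the literal is in Simplified Chinese, as the page serves it
                        if (responseText.Contains("用户您好，您的访问过于频繁，为确认本次访问为正常用户行为，需要您协助验证"))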
                        {
                            _vcode_url = url;
                            throw new Exception("weixin.sogou.com verification code");
                        }
                    }
                    else
                    {
                        logger.Error("requests status_code error" + response.StatusCode);
                        throw new Exception("requests status_code error");
                    }
                    reader.Close();
                }

                dataStream.Close();
                response.Close();
            }
            catch (Exception e)
            {
                logger.Error(e);
            }
            return responseText;
        }
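
Because the Get method above returns image responses as Base64 text, the caller has to decode that text before the image can be shown or saved. The sketch below shows one way to do so. Base64ImageSketch and its methods are hypothetical names for illustration and do not reproduce the project's own image helper.

```csharp
using System;
using System.IO;

public static class Base64ImageSketch
{
    // Decodes the Base64 text produced for image responses back into raw bytes.
    public static byte[] Decode(string base64)
    {
        return Convert.FromBase64String(base64);
    }

    // Writes the decoded bytes to disk, for example so a CAPTCHA image
    // can be opened and read by a person. Returns the absolute path written.
    public static string SaveToFile(string base64, string path)
    {
        File.WriteAllBytes(path, Decode(base64));
        return Path.GetFullPath(path);
    }
}
```

Returning Base64 keeps the method's return type a plain string for both HTML and images, at the cost of this extra decoding step on the caller's side.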

Enough preamble about XPath. Here is a concrete case, parsing the official account search results page:

public List<OfficialAccount> SearchOfficialAccount(string keyword, int page = 1)
        {
            List<OfficialAccount> accountList = new List<OfficialAccount>();
            string text = this._SearchAccount_Html(keyword, page); // returns the HTML of a search results page
            HtmlDocument pageDoc = new HtmlDocument();
            pageDoc.LoadHtml(text);
            HtmlNodeCollection targetArea = pageDoc.DocumentNode.SelectNodes("//ul[@class='news-list2']/li");
            if (targetArea != null)
            {
                foreach (HtmlNode node in targetArea)
                {
                    try
                    {
                        OfficialAccount accountInfo = new OfficialAccount();
                        // The link contains the HTML entity &amp;, so use HtmlDecode rather than UrlDecode
                        accountInfo.AccountPageurl = WebUtility.HtmlDecode(node.SelectSingleNode("div/div[@class='img-box']/a").GetAttributeValue("href", ""));
                        //accountInfo.ProfilePicture = node.SelectSingleNode("div/div[1]/a/img").InnerHtml;
                        accountInfo.ProfilePicture = WebUtility.HtmlDecode(node.SelectSingleNode("div/div[@class='img-box']/a/img").GetAttributeValue("src", ""));
                        accountInfo.Name = node.SelectSingleNode("div/div[2]/p[1]").InnerText.Trim().Replace("<!--red_beg-->", "").Replace("<!--red_end-->", "");
                        accountInfo.WeChatId = node.SelectSingleNode("div/div[2]/p[2]/label").InnerText.Trim();
                        accountInfo.QrCode = WebUtility.HtmlDecode(node.SelectSingleNode("div/div[3]/span/img").GetAttributeValue("src", ""));
                        accountInfo.Introduction = node.SelectSingleNode("dl[1]/dd").InnerText.Trim().Replace("<!--red_beg-->","").Replace("<!--red_end-->", "");
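                        // Older and newer accounts seem to display verification differently; compare the bitsea and NUAA_1952 accounts
                        // Current rule: an entry that contains this script is treated as verified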
                        if (node.InnerText.Contains("document.write(authname('2'))"))
                        {
                            accountInfo.IsAuth = true;
                        }
                        else
                        {
                            accountInfo.IsAuth = false;
                        }
                        accountList.Add(accountInfo);
                    }
                    catch (Exception e)
                    {
                        logger.Warn(e);
                    }
                }
            }
            
          
            return accountList; 
        }

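In short, parsing comes down to debugging XPath expressions. The key question is whether the target is the content of an element or an attribute of a tag.

If it is element content, in the form <h>I am the content</h>,

then use: node.SelectSingleNode("div/div[2]/p[2]/label").InnerText.Trim();

If the target is a tag attribute, as in <a href="/im_target_url.htm">Click the link</a>,

then use: node.SelectSingleNode("div/div[@class='img-box']/a").GetAttributeValue("href", "")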
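
The two cases can be wrapped into small helpers. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. HapSketch, TextOf and AttributeOf are hypothetical names and not part of WeGouSharp. Unlike the snippets above, the helpers check for a null node first, because SelectSingleNode returns null when nothing matches.

```csharp
using System.Net;
using HtmlAgilityPack;

public static class HapSketch
{
    // Case 1: the target is element content.
    // Returns the trimmed inner text of the first match, or "" if nothing matches.
    public static string TextOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? "" : node.InnerText.Trim();
    }

    // Case 2: the target is a tag attribute.
    // Returns the attribute of the first match, HTML-decoded so "&amp;" becomes "&",
    // or "" if the node or the attribute is missing.
    public static string AttributeOf(string html, string xpath, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? "" : WebUtility.HtmlDecode(node.GetAttributeValue(attribute, ""));
    }
}
```

Returning an empty string for a missing node lets a caller fill in as many fields as the page provides, instead of losing the whole entry to an exception when one XPath stops matching.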

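5. CAPTCHA handling and file caching

The official account home page (for example, the Guangzhou University account: https://mp.weixin.qq.com/profile?src=3&timestamp=1505923231&ver=1&signature=gWXdb*Jzt1oByDAzW5aTzEWnXo6mkUwg3Ynjm3CYvKV0kdCLxALBR7JJ-EheLBI-v6UcocJqGmPbUY2KMXuSsg==) belongs to WeChat itself, whose anti-crawling measures are very strict, so refreshing it repeatedly tends to produce a page that asks for a CAPTCHA.

For example, after several refreshes the account home page shows a CAPTCHA.

At that point the CAPTCHA answer has to be POSTed to a specific URL before access is unblocked.

The unblocking routine is as follows: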

/// <summary>
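        /// The page shows a CAPTCHA that must be answered before continuing. Verification depends on a cookie: the request that fetches the CAPTCHA receives a cookie that differs every time, and it must be sent along when posting the answer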
        /// </summary>
        /// <returns></returns>
        public bool VerifyCodeForContinute(string url ,bool isUseOCR)
        {
            bool isSuccess = false;
            logger.Debug("vcode appear, use VerifyCodeForContinute()");
            DateTime Epoch = new DateTime(1970, 1, 1,0,0,0,0);
            var timeStamp17 = (DateTime.UtcNow - Epoch).TotalMilliseconds.ToString("R"); // timestamp of about 17 characters
            string codeurl = "https://mp.weixin.qq.com/mp/verifycode?cert=" + timeStamp17;
            WebHeaderCollection headers = new WebHeaderCollection();
            var content = this.Get(headers, codeurl,"UTF-8",true);
            ShowImageHandle showImageHandle = new ShowImageHandle(DisplayImageFromBase64);
            showImageHandle.BeginInvoke(content, null, null);
            Console.WriteLine("請輸入驗證碼：");
            string verifyCode = Console.ReadLine();
            string postURL = "https://mp.weixin.qq.com/mp/verifycode";
            timeStamp17 = (DateTime.UtcNow - Epoch).TotalMilliseconds.ToString("R"); // timestamp of about 17 characters
            string postData = string.Format("cert={0}&input={1}",timeStamp17,verifyCode );// "{" + string.Format(@"'cert':'{0}','input':'{1}'", timeStamp17, verifyCode) + "}";
            headers.Add("Host", "mp.weixin.qq.com");
            headers.Add("Referer", url);
            string remsg = this.Post(postURL, headers, postData,true);
            try
            {
                JObject jo = JObject.Parse(remsg); // parse the JSON string into a JSON object
                int statusCode = (int)jo.GetValue("ret");
                if (statusCode == 0)
                {
                    isSuccess = true;
                }
                else
                {
                    logger.Error("cannot unblock because " + jo.GetValue("msg"));
                    var vcodeException = new WechatSogouVcodeException();
                    vcodeException.MoreInfo = "cannot unblock because " + jo.GetValue("msg");
                    throw vcodeException;
                }
            }catch(Exception e)
            {
                logger.Error(e);
            }
            return isSuccess;
        }

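To explain:

First request the CAPTCHA image URL, passing a timestamp of about 17 characters:

var timeStamp17 = (DateTime.UtcNow - Epoch).TotalMilliseconds.ToString("R"); // timestamp of about 17 characters

Then POST your CAPTCHA answer, with a fresh timestamp, to https://mp.weixin.qq.com/mp/verifycode as the form fields cert and input.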
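
The two steps can be separated from the network calls so that the URL and the form body are easy to inspect. The sketch below does that. VerifyCodeSketch and its methods are hypothetical names; formatting with CultureInfo.InvariantCulture and escaping the answer with Uri.EscapeDataString are additions made here and are not in the original routine.

```csharp
using System;
using System.Globalization;

public static class VerifyCodeSketch
{
    private static readonly DateTime Epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    private const string VerifyUrl = "https://mp.weixin.qq.com/mp/verifycode";

    // Milliseconds since the Unix epoch, formatted with the round-trip ("R") specifier.
    // The invariant culture keeps the decimal separator a dot on every machine.
    public static string Timestamp(DateTime utcNow)
    {
        return (utcNow - Epoch).TotalMilliseconds.ToString("R", CultureInfo.InvariantCulture);
    }

    // Step 1: the URL from which the CAPTCHA image is fetched.
    public static string ImageUrl(DateTime utcNow)
    {
        return VerifyUrl + "?cert=" + Timestamp(utcNow);
    }

    // Step 2: the form body that carries the answer back to the same endpoint.
    public static string PostBody(DateTime utcNow, string captchaText)
    {
        return "cert=" + Timestamp(utcNow) + "&input=" + Uri.EscapeDataString(captchaText);
    }
}
```

Passing the current time in as a parameter, rather than reading DateTime.UtcNow inside the helpers, makes the output reproducible in tests.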

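Remember to enable cookies for both requests. When cookies are enabled, the cookie is saved to a cache file through the FileCache class, and the next request that turns on the cookie container carries it: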

CookieCollection cc = Tools.LoadCookieFromCache();
request.CookieContainer = new CookieContainer();
request.CookieContainer.Add(cc);
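
To make the cookie round trip concrete, here is a minimal sketch of saving cookies to a file and loading them back into a CookieContainer. It is not the project's FileCache: it stores plain tab-separated text with no encryption, and CookieFileSketch, Save and Load are hypothetical names. It also assumes every cookie has a non-empty domain, which CookieContainer.Add requires.

```csharp
using System.IO;
using System.Net;

public static class CookieFileSketch
{
    // Writes one cookie per line as tab-separated name, value, domain, path.
    public static void Save(CookieCollection cookies, string path)
    {
        using (var writer = new StreamWriter(path, false))
        {
            foreach (Cookie c in cookies)
            {
                writer.WriteLine(string.Join("\t", c.Name, c.Value, c.Domain, c.Path));
            }
        }
    }

    // Reads the file back and returns a container ready to assign to
    // HttpWebRequest.CookieContainer. Malformed lines are skipped.
    public static CookieContainer Load(string path)
    {
        var container = new CookieContainer();
        foreach (string line in File.ReadAllLines(path))
        {
            string[] parts = line.Split('\t');
            if (parts.Length != 4)
            {
                continue;
            }
            // Cookie constructor order is (name, value, path, domain).
            container.Add(new Cookie(parts[0], parts[1], parts[3], parts[2]));
        }
        return container;
    }
}
```

A real cache would also want expiry handling and, as the FileCache description says, encryption, since these cookies identify the session that passed the CAPTCHA.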

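6. Afterword

The above covers only part of the project. When I started I did not expect so many pitfalls, but there is nothing for it except to fill them in one at a time. OCR, third-party CAPTCHA-solving services, multithreading and so on will come later. One person's energy is limited. Compared with the Python crawlers found everywhere, crawler projects in C# are few. The code is rough, but I chose to open-source it in the hope that more people will take part. Feel free to bookmark it and, if you can, give it a star or contribute code.

Project page: WeGouSharp on GitHub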