Old Snail Writes a Scraper: Extracting Data (Regex Edition)

Date: 2019-11-19


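Apology

First, thank you to the readers who have supported this series. Many people who joined the group chat kept asking when the next post would come; I kept answering "soon", and then a whole year slipped by. Work and life have taken up most of my time, so all I can do is say sorry to everyone.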

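Extracting data with regular expressions

The previous two posts covered how to fetch HTML. Once the HTML is fetched, we need to cut out the parts that are actually useful. As an example, let's scrape the society section of Sohu News, at this address:

http://news.sohu.com/shehuixinwen.shtml

We start with the news list. As shown in the previous two posts, we use xNet to fetch the HTML source of the society section; you could just as well use HttpWebRequest or another third-party component. The code: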

            var html = string.Empty;
            using (var request = new xNet.HttpRequest())
            {
                html = request.Get("http://news.sohu.com/shehuixinwen.shtml").ToString();
            }

The resulting html value:

<!doctype html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/ xhtml1-transitional.dtd">


<script type="text/javascript">
  var pvinsight_page_ancestors = '143746642;143746651';
</script>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=gb2312" />
<title>社會新聞-搜狐新聞</title>
<script type="text/javascript" src="http://www.sohu.com/sohuflash_1.js"></script>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<meta name="description" content="搜狐社會新聞關注社會民生、百姓問題。精品欄目：社會萬象、百姓生活" />
<meta name="keywords" content="社會,社會新聞,萬象,百姓,口述" />
<meta name="robots" content="all" />


<link type="text/css" rel="stylesheet" href="http://css.sohu.com/upload/global1.4.1.css" />
<link rel="stylesheet" href="http://news.sohu.com/upload/zhengyufeng/xinwenerjiye/style1.css" />
<script type="text/javascript" src="http://js.sohu.com/library/jquery-1.7.1.min.js"></script>
<script type="text/javascript" src="http://news.sohu.com/upload/2013page/j.js"></script>
<!--[if IE 6]>
    <script src="http://news.sohu.com/upload/zhengyufeng/pngAlaph.js"></script>
    <script>DD_belatedPNG.fix('#header,h3,h3 span,.shipin,.dashiye_list li,.liushengji_list li,#imgText b,img,.followScroll a,#contentA .left .tit span');
    </script>
<![endif]-->

.....



<script language="javascript">
if(_wratingId !=null){
document.write('<scr'+'ipt type="text/javascript">');
document.write('var vjAcc="'+_wratingId+'";');
document.write('var wrUrl="http://sohu.wrating.com/";');
document.write('try{vjTrack();}catch(e){}');
document.write('</scr'+'ipt>');
}
</script>
<script> require(["sjs/matrix/ad/passion"]);</script>
<!--SOHU:SUB_FOOT_DIV-->
 


</body>
</html>

The HTML is large, so not all of it is shown. In case Sohu redesigns the page, here is a sample fragment:

					<div class="article-list">
					<div class="article">
                        <h3><span class="com-num"><a target="_blank" href="#">comment num</a></span><a target="_blank" href="http://www.sohu.com/a/190778382_119562">「五假副部」現形始末：被指講話稿都念不順</a></h3>
                        <p>...<a target="_blank" href="http://www.sohu.com/a/190778382_119562">閱讀全文>></a></p>
					</div>
					
					<div class="pic-group">
					</div>
					
					<div class="fun clear">
                        <div class="share">
                                <ul>
                                        <li class="s-t">分享到 |</li>
                                        <li class="blg"><a title="·??í?????ü????" href="javascript:void(0)"></a></li>
                                        <li class="qq"><a title="·??í??QQ????" href="javascript:void(0)"></a></li>
                                        <li class="rrw"><a title="·??í????????" href="javascript:void(0)"></a></li>
                                        <li class="db"><a title="·??í????°ê" href="javascript:void(0)"></a></li>
                                        <li class="itb"><a title="·??í??i?ù°?" href="javascript:void(0)"></a></li>
                                </ul>
                        </div>
					<!--	<div class="label">±ê????<a target="_blank" href="#">??????</a> <a target="_blank" href="#">???ú</a> <a target="_blank" href="#">????</a></div> -->
						<div class="time"> 發表於 2017-09-09 13:03</div>
					</div>
					</div>

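So how do we pull the headline and link of each item out of the news list? That brings us to the core of this post: regular expressions. Mention regex and many people think it is hard, because the syntax looks like Martian. The second hurdle is testing the expression. There are plenty of testing tools around, online ones included, so pick whichever you like. The one I want to recommend is an extremely handy tester called RegExBuilder; you can download it online or from the download link at the end of this post. What makes it good is that it matches as you type, so a beginner can write and debug an expression one step at a time. Using it, I arrived at the following regex for the headlines and links in the news list: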

<h3>[^>]*>[^>]*>[^>]*>[^>]*><a.target="_blank"\shref="(?<url>[^"]*)[^>]*>(?<title>[^<]*)

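It is written rather crudely; ask a hundred people and you will probably get a hundred versions. That is part of the magic of regex: hard to get into, easy once you are in. One note: (?<title>[^<]*) is a named group, so the captured value can be read through the name title, as the code further down shows. Groups can also be read by index, which is the traditional approach in JavaScript, Java and PHP (their current versions support named groups too); C#'s access by name is the friendlier option.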
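To try the named groups without touching the network, here is a minimal sketch that runs the pattern above against a trimmed copy of the sample markup. The class name NamedGroupDemo and the method FirstLink are illustrative names chosen for this sketch, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Same pattern as above, written as a C# verbatim string (quotes doubled).
    const string Pattern =
        @"<h3>[^>]*>[^>]*>[^>]*>[^>]*><a.target=""_blank""\shref=""(?<url>[^""]*)[^>]*>(?<title>[^<]*)";

    // Returns the first match of the list pattern in the given HTML.
    public static Match FirstLink(string html)
    {
        return Regex.Match(html, Pattern);
    }

    static void Main()
    {
        // Trimmed copy of the sample <h3> line shown earlier.
        string sample =
            "<h3><span class=\"com-num\"><a target=\"_blank\" href=\"#\">comment num</a></span>" +
            "<a target=\"_blank\" href=\"http://www.sohu.com/a/190778382_119562\">「五假副部」現形始末：被指講話稿都念不順</a></h3>";

        Match m = FirstLink(sample);
        if (m.Success)
        {
            // Access by group name.
            Console.WriteLine(m.Groups["url"].Value);
            Console.WriteLine(m.Groups["title"].Value);

            // Access by index: with no unnamed groups in the pattern,
            // url is group 1 and title is group 2.
            Console.WriteLine(m.Groups[1].Value == m.Groups["url"].Value);
            Console.WriteLine(m.Groups[2].Value == m.Groups["title"].Value);
        }
    }
}
```

Reading by name keeps the code stable if another capture group is later added in front, which is the practical reason to prefer it over indexes.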

To pull out the data in code, first define a news class:

    class NewsItem
    {
        public string Title { get; set; }
        public string Url { get; set; }
        public string Content { get; set; }
    }

The logic code:

            // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
            var html = string.Empty;
            using (var request = new xNet.HttpRequest())
            {
                html = request.Get("http://news.sohu.com/shehuixinwen.shtml").ToString();
            }

            var newsList = new List<NewsItem>();
            var mc = Regex.Matches(html, @"<h3>[^>]*>[^>]*>[^>]*>[^>]*><a.target=""_blank""\shref=""(?<url>[^""]*)[^>]*>(?<title>[^<]*)");
            foreach (Match m in mc)
            {
                var newsItem = new NewsItem();

                // Read the captures by group name
                newsItem.Title = m.Groups["title"].Value;
                newsItem.Url = m.Groups["url"].Value;

                // Or read them by index (RegExBuilder shows each group's index);
                // these two lines assign the same values as the two above
                newsItem.Url = m.Groups[1].Value;
                newsItem.Title = m.Groups[2].Value;

                newsList.Add(newsItem);
            }

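Mind the string escaping: inside a C# verbatim string (@"...") every double quote is written twice, so this is the same regex as before, not a different one. With that you have the whole news list. Someone will ask: what about the next page? Heh... this is as far as this ride goes.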
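The escaping point can be checked directly. As a small sketch (the class and member names are illustrative), the following writes a piece of the pattern twice: once as a verbatim string, where each " is doubled, and once as a regular string, where each " and each backslash is backslash-escaped. Both literals denote the same pattern text.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapingDemo
{
    // Verbatim string: a literal " is written as "", backslashes are left alone.
    public const string Verbatim = @"<a.target=""_blank""\shref=""(?<url>[^""]*)";

    // Regular string: a literal " is written as \", and \s must be written as \\s.
    public const string Regular = "<a.target=\"_blank\"\\shref=\"(?<url>[^\"]*)";

    static void Main()
    {
        // Both literals contain the same characters, so the regex engine sees one pattern.
        Console.WriteLine(Verbatim == Regular);

        // Either form extracts the href value from a matching tag.
        Match m = Regex.Match("<a target=\"_blank\" href=\"#\">", Verbatim);
        Console.WriteLine(m.Groups["url"].Value);
    }
}
```

The verbatim form is usually easier to compare against what a regex tester shows, since only the quotes differ.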

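Next, the article body. With each URL in hand, fetch the HTML source with xNet again and analyze it. For example:

First write the regex in RegExBuilder, which gives: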

<article\sclass="article">(?<content>.+?)</article>

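Tick the Singleline option here. With Singleline the whole input is treated as a single line: the dot (.) also matches line breaks, so .+? can span an article body that runs over many lines. There are several other matching modes, and most can be understood from their names. For example:

IgnoreCase: case-insensitive matching

Multiline: multi-line mode, where ^ and $ match at the start and end of every line

RightToLeft: match from right to left

For the rest, see this fellow blogger's article: http://blog.csdn.net/qq_33729889/article/details/63035440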
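The effect of these options is easy to see on a tiny input. The sketch below uses a made-up three-line string and a hypothetical helper, Count, to show how the number of matches changes as options are added; none of these names come from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Counts the matches of a pattern in the input under the given options.
    public static int Count(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    static void Main()
    {
        // Made-up input: an upper-case paragraph, then a paragraph that spans two lines.
        string text = "<P>one</P>\n<p>two\nthree</p>";

        // No options: '.' stops at '\n' and <p> does not match <P>.
        Console.WriteLine(Count(text, @"<p>.+?</p>", RegexOptions.None));

        // Singleline: '.' also matches '\n', so the two-line paragraph is found.
        Console.WriteLine(Count(text, @"<p>.+?</p>", RegexOptions.Singleline));

        // Singleline + IgnoreCase: the upper-case paragraph is found as well.
        Console.WriteLine(Count(text, @"<p>.+?</p>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase));

        // Multiline: '^' matches at the start of every line, not only at the start of the input.
        Console.WriteLine(Count(text, @"^<", RegexOptions.Multiline));

        // RightToLeft: the search starts from the end, so the first match is the last <p>.
        Match last = Regex.Match(text, @"<p>",
            RegexOptions.RightToLeft | RegexOptions.IgnoreCase);
        Console.WriteLine(last.Index);
    }
}
```

Options combine with the | operator, which corresponds to ticking several boxes at once in a tester such as RegExBuilder.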

The logic code for fetching the content:

            foreach (var newItem in newsList)
            {
                using (var request = new xNet.HttpRequest())
                {
                    html = request.Get(newItem.Url).ToString();
                }

                Match m = Regex.Match(html, @"<article\sclass=""article"">(?<content>.+?)</article>", RegexOptions.Singleline);
                if (m.Success)
                {
                    newItem.Content = m.Groups["content"].Value;
                }
            }

That is a simple example of using regex to extract data from HTML.