Web crawler + HtmlAgilityPack + Windows service: crawling 200,000 blog posts from Cnblogs (博客園)

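1. Introduction

Recently I was working on a project at my company that needed some article-type data. I thought of using a web crawler to collect it from technical websites, and since the site I visit most often is Cnblogs, that is how this article came about.

Source code: CSDN download link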

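2. Preparation

The data crawled from Cnblogs has to be stored somewhere, and a database is the obvious choice. So we first create a database and then a table to hold the data. It is straightforward; the table layout is as follows.

BlogArticleId: auto-increment ID of the post; BlogTitle: post title; BlogUrl: post URL; BlogAuthor: post author; BlogTime: publication time; BlogMotto: the author's motto; BlogDepth: the depth at which the crawler reached the post; IsDeleted: whether the row is deleted.

With the table created, we start with a database helper class.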

    /// <summary>
    /// Database helper class
    /// </summary>
    public class MssqlHelper
    {
        #region Fields
        /// <summary>
        /// Database connection string
        /// </summary>
        private static string conn = "Data Source=.;Initial Catalog=Cnblogs;User ID=sa;Password=123";
        #endregion

        #region Write data into the DataTable
        public static void GetData(string title, string url, string author, string time, string motto, string depth, DataTable dt)
        {
            DataRow dr;

            dr = dt.NewRow();
            dr["BlogTitle"] = title;
            dr["BlogUrl"] = url;
            dr["BlogAuthor"] = author;
            dr["BlogTime"] = time;
            dr["BlogMotto"] = motto;
            dr["BlogDepth"] = depth;

            // Append the row dr to dt
            dt.Rows.Add(dr);
        }
        #endregion

        #region Insert data into the database
        /// <summary>
        /// Bulk-inserts the rows of dt into the database
        /// </summary>
        public static void InsertDb(DataTable dt)
        {
            try
            {
                using (System.Data.SqlClient.SqlBulkCopy copy = new System.Data.SqlClient.SqlBulkCopy(conn))
                {
                    // Name of the destination table
                    copy.DestinationTableName = "BlogArticle";

                    // Tell SqlBulkCopy which column of the in-memory table maps to which column of BlogArticle
                    copy.ColumnMappings.Add("BlogTitle", "BlogTitle");
                    copy.ColumnMappings.Add("BlogUrl", "BlogUrl");
                    copy.ColumnMappings.Add("BlogAuthor", "BlogAuthor");
                    copy.ColumnMappings.Add("BlogTime", "BlogTime");
                    copy.ColumnMappings.Add("BlogMotto", "BlogMotto");
                    copy.ColumnMappings.Add("BlogDepth", "BlogDepth");

                    // Insert all rows of dt into BlogArticle in a single batch
                    copy.WriteToServer(dt);
                    dt.Rows.Clear();
                }
            }
            catch (Exception)
            {
                // Note: the exception is swallowed here, so a failed batch is discarded silently
                dt.Rows.Clear();

            }
        }
        #endregion

    }
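GetData writes to columns by name, so the DataTable passed in must already contain them. The following is a minimal sketch of how such a table might be built; the class name BlogArticleTable and its Create method are hypothetical and not part of the original source, and every column is typed as string only because GetData receives strings.

```csharp
using System.Data;

public static class BlogArticleTable
{
    // Hypothetical helper (not in the original post): builds an empty
    // in-memory table whose column names match those used by GetData.
    public static DataTable Create()
    {
        var dt = new DataTable("BlogArticle");
        string[] names = { "BlogTitle", "BlogUrl", "BlogAuthor", "BlogTime", "BlogMotto", "BlogDepth" };

        foreach (string name in names)
        {
            // Assumption: all values are stored as strings, as in GetData.
            dt.Columns.Add(name, typeof(string));
        }

        return dt;
    }
}
```

Rows would then be appended with MssqlHelper.GetData(..., dt) and flushed with MssqlHelper.InsertDb(dt) once enough of them have accumulated.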

3. Logging

A simple log helper makes it easier to see what is going on. The code is as follows.

    /// <summary>
    /// Log helper class
    /// </summary>
    public class LogHelper
    {
        #region Write log
        // Append one line of text to the log file
        public static void WriteLog(string text)
        {
            //StreamWriter sw = new StreamWriter(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", true);
            // The using block closes the writer even if WriteLine throws
            using (StreamWriter sw = new StreamWriter("F:" + "\\log.txt", true))
            {
                sw.WriteLine(text);
            }
        }
        #endregion

    }
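Because the crawler can run several threads, two calls to WriteLog may try to open the log file at the same moment, and the second one can fail with an IOException. The following is a minimal sketch of one way to avoid that, assuming a simple lock is acceptable; the class SafeLog, its path parameter, and the timestamp prefix are illustrative additions, not part of the original code.

```csharp
using System;
using System.IO;

public static class SafeLog
{
    // Single lock object shared by all callers in the process.
    private static readonly object Gate = new object();

    // Hypothetical variant of WriteLog: the path is a parameter, and the
    // lock ensures only one thread has the file open at any time.
    public static void WriteLine(string path, string text)
    {
        lock (Gate)
        {
            using (var sw = new StreamWriter(path, true))
            {
                // Prefix each entry with a timestamp (an assumption, for readability).
                sw.WriteLine("{0:yyyy-MM-dd HH:mm:ss} {1}", DateTime.Now, text);
            }
        }
    }
}
```

The lock only protects writers inside one process; a second process writing to the same file would still need its own coordination.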

4. Crawler

My web crawler uses a third-party class library. Its code is shown below.

namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>
    /// The add url event handler.
    /// </summary>
    /// <param name="args">
    /// The args.
    /// </param>
    /// <returns>
    /// The <see cref="bool"/>.
    /// </returns>
    public delegate bool AddUrlEventHandler(AddUrlEventArgs args);

    /// <summary>
    /// The add url event args.
    /// </summary>
    public class AddUrlEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public int Depth { get; set; }

        /// <summary>
        /// Gets or sets the title.
        /// </summary>
        public string Title { get; set; }

        /// <summary>
        /// Gets or sets the url.
        /// </summary>
        public string Url { get; set; }

        #endregion
    }
}
AddUrlEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections;

    /// <summary>
    /// The bloom filter.
    /// </summary>
    /// <typeparam name="T">
    /// The generic type.
    /// </typeparam>
    public class BloomFilter<T>
    {
        #region Fields

        /// <summary>
        /// The get hash secondary.
        /// </summary>
        private readonly HashFunction getHashSecondary;

        /// <summary>
        /// The hash bits.
        /// </summary>
        private readonly BitArray hashBits;

        /// <summary>
        /// The hash function count.
        /// </summary>
        private readonly int hashFunctionCount;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="BloomFilter{T}"/> class.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        public BloomFilter(int capacity)
            : this(capacity, null)
        {
        }

        /// <summary>
        /// Initializes a new instance of the <see cref="BloomFilter{T}"/> class.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="errorRate">
        /// The error rate.
        /// </param>
        public BloomFilter(int capacity, float errorRate)
            : this(capacity, errorRate, null)
        {
        }

        /// <summary>
        /// Initializes a new instance of the <see cref="BloomFilter{T}"/> class.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="hashFunction">
        /// The hash function.
        /// </param>
        public BloomFilter(int capacity, HashFunction hashFunction)
            : this(capacity, BestErrorRate(capacity), hashFunction)
        {
        }

        /// <summary>
        /// Initializes a new instance of the <see cref="BloomFilter{T}"/> class.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="errorRate">
        /// The error rate.
        /// </param>
        /// <param name="hashFunction">
        /// The hash function.
        /// </param>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction)
            : this(capacity, errorRate, hashFunction, BestM(capacity, errorRate), BestK(capacity, errorRate))
        {
        }

        /// <summary>
        /// Initializes a new instance of the <see cref="BloomFilter{T}"/> class.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="errorRate">
        /// The error rate.
        /// </param>
        /// <param name="hashFunction">
        /// The hash function.
        /// </param>
        /// <param name="m">
        /// The m.
        /// </param>
        /// <param name="k">
        /// The k.
        /// </param>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction, int m, int k)
        {
            if (capacity < 1)
            {
                throw new ArgumentOutOfRangeException("capacity", capacity, "capacity must be > 0");
            }

            if (errorRate >= 1 || errorRate <= 0)
            {
                throw new ArgumentOutOfRangeException(
                    "errorRate", 
                    errorRate, 
                    string.Format("errorRate must be between 0 and 1, exclusive. Was {0}", errorRate));
            }

            if (m < 1)
            {
                throw new ArgumentOutOfRangeException(
                    "m",
                    string.Format(
                        "The provided capacity and errorRate values would result in an array of length > int.MaxValue. Please reduce either of these values. Capacity: {0}, Error rate: {1}", 
                        capacity, 
                        errorRate));
            }

            if (hashFunction == null)
            {
                if (typeof(T) == typeof(string))
                {
                    this.getHashSecondary = HashString;
                }
                else if (typeof(T) == typeof(int))
                {
                    this.getHashSecondary = HashInt32;
                }
                else
                {
                    throw new ArgumentNullException(
                        "hashFunction", 
                        "Please provide a hash function for your type T, when T is not a string or int.");
                }
            }
            else
            {
                this.getHashSecondary = hashFunction;
            }

            this.hashFunctionCount = k;
            this.hashBits = new BitArray(m);
        }

        #endregion

        #region Delegates

        /// <summary>
        /// The hash function.
        /// </summary>
        /// <param name="input">
        /// The input.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        public delegate int HashFunction(T input);

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the truthiness.
        /// </summary>
        public double Truthiness
        {
            get
            {
                return (double)this.TrueBits() / this.hashBits.Count;
            }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>
        /// The add.
        /// </summary>
        /// <param name="item">
        /// The item.
        /// </param>
        public void Add(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);

            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                this.hashBits[hash] = true;
            }
        }

        /// <summary>
        /// The contains.
        /// </summary>
        /// <param name="item">
        /// The item.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        public bool Contains(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);

            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                if (this.hashBits[hash] == false)
                {
                    return false;
                }
            }

            return true;
        }

        #endregion

        #region Methods

        /// <summary>
        /// The best error rate.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <returns>
        /// The <see cref="float"/>.
        /// </returns>
        private static float BestErrorRate(int capacity)
        {
            var c = (float)(1.0 / capacity);
            if (Math.Abs(c) > 0)
            {
                return c;
            }

            double y = int.MaxValue / (double)capacity;
            return (float)Math.Pow(0.6185, y);
        }

        /// <summary>
        /// The best k.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="errorRate">
        /// The error rate.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private static int BestK(int capacity, float errorRate)
        {
            return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
        }

        /// <summary>
        /// The best m.
        /// </summary>
        /// <param name="capacity">
        /// The capacity.
        /// </param>
        /// <param name="errorRate">
        /// The error rate.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private static int BestM(int capacity, float errorRate)
        {
            return (int)Math.Ceiling(capacity * Math.Log(errorRate, 1.0 / Math.Pow(2, Math.Log(2.0))));
        }

        /// <summary>
        /// The hash int 32.
        /// </summary>
        /// <param name="input">
        /// The input.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private static int HashInt32(T input)
        {
            unchecked
            {
                // Unbox through object: "input as uint?" is always null when T is int
                uint x = (uint)(int)(object)input;
                x = ~x + (x << 15);
                x = x ^ (x >> 12);
                x = x + (x << 2);
                x = x ^ (x >> 4);
                x = x * 2057;
                x = x ^ (x >> 16);

                return (int)x;
            }
        }

        /// <summary>
        /// The hash string.
        /// </summary>
        /// <param name="input">
        /// The input.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private static int HashString(T input)
        {
            var str = input as string;
            int hash = 0;

            if (str != null)
            {
                for (int i = 0; i < str.Length; i++)
                {
                    hash += str[i];
                    hash += hash << 10;
                    hash ^= hash >> 6;
                }

                hash += hash << 3;
                hash ^= hash >> 11;
                hash += hash << 15;
            }

            return hash;
        }

        /// <summary>
        /// The compute hash.
        /// </summary>
        /// <param name="primaryHash">
        /// The primary hash.
        /// </param>
        /// <param name="secondaryHash">
        /// The secondary hash.
        /// </param>
        /// <param name="i">
        /// The i.
        /// </param>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private int ComputeHash(int primaryHash, int secondaryHash, int i)
        {
            int resultingHash = (primaryHash + (i * secondaryHash)) % this.hashBits.Count;
            return Math.Abs(resultingHash);
        }

        /// <summary>
        /// The true bits.
        /// </summary>
        /// <returns>
        /// The <see cref="int"/>.
        /// </returns>
        private int TrueBits()
        {
            int output = 0;

            foreach (bool bit in this.hashBits)
            {
                if (bit)
                {
                    output++;
                }
            }

            return output;
        }

        #endregion
    }
}
BloomFilter.cs
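The BestM and BestK methods above correspond to the usual Bloom filter sizing formulas: for n items and a target false-positive rate p, the bit count is m = -n ln p / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The following is a small standalone sketch of the same arithmetic, written with double instead of float; the class name BloomSizing is hypothetical.

```csharp
using System;

public static class BloomSizing
{
    // Number of bits m for `capacity` items at false-positive rate `errorRate`:
    // m = ceil(-n * ln(p) / (ln 2)^2)
    public static int BestM(int capacity, double errorRate)
    {
        double ln2 = Math.Log(2.0);
        return (int)Math.Ceiling(-capacity * Math.Log(errorRate) / (ln2 * ln2));
    }

    // Number of hash functions k for m bits and `capacity` items:
    // k = round(ln 2 * m / n)
    public static int BestK(int capacity, int m)
    {
        return (int)Math.Round(Math.Log(2.0) * m / capacity);
    }
}
```

For 1,000 items at a 1% error rate this gives 9,586 bits and 7 hash functions, which is roughly 9.6 bits per item.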
namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>
    /// The crawl error event handler.
    /// </summary>
    /// <param name="args">
    /// The args.
    /// </param>
    public delegate void CrawlErrorEventHandler(CrawlErrorEventArgs args);

    /// <summary>
    /// The crawl error event args.
    /// </summary>
    public class CrawlErrorEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>
        /// Gets or sets the exception.
        /// </summary>
        public Exception Exception { get; set; }

        /// <summary>
        /// Gets or sets the url.
        /// </summary>
        public string Url { get; set; }

        #endregion
    }
}
CrawlErrorEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading;
    
    /// <summary>
    /// The crawl master.
    /// </summary>
    public class CrawlMaster
    {
        #region Constants

        /// <summary>
        /// The web url regular expressions.
        /// </summary>
        private const string WebUrlRegularExpressions = @"^(http|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

        #endregion

        #region Fields

        /// <summary>
        /// The cookie container.
        /// </summary>
        private readonly CookieContainer cookieContainer;

        /// <summary>
        /// The random.
        /// </summary>
        private readonly Random random;

        /// <summary>
        /// The thread status.
        /// </summary>
        private readonly bool[] threadStatus;

        /// <summary>
        /// The threads.
        /// </summary>
        private readonly Thread[] threads;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="CrawlMaster"/> class.
        /// </summary>
        /// <param name="settings">
        /// The settings.
        /// </param>
        public CrawlMaster(CrawlSettings settings)
        {
            this.cookieContainer = new CookieContainer();
            this.random = new Random();

            this.Settings = settings;
            this.threads = new Thread[settings.ThreadCount];
            this.threadStatus = new bool[settings.ThreadCount];
        }

        #endregion

        #region Public Events

        /// <summary>
        /// The add url event.
        /// </summary>
        public event AddUrlEventHandler AddUrlEvent;

        /// <summary>
        /// The crawl error event.
        /// </summary>
        public event CrawlErrorEventHandler CrawlErrorEvent;

        /// <summary>
        /// The data received event.
        /// </summary>
        public event DataReceivedEventHandler DataReceivedEvent;

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the settings.
        /// </summary>
        public CrawlSettings Settings { get; private set; }

        #endregion

        #region Public Methods and Operators

        /// <summary>
        /// The crawl.
        /// </summary>
        public void Crawl()
        {
            this.Initialize();

            for (int i = 0; i < this.threads.Length; i++)
            {
                this.threads[i].Start(i);
                this.threadStatus[i] = false;
            }
        }

        /// <summary>
        /// The stop.
        /// </summary>
        public void Stop()
        {
            foreach (Thread thread in this.threads)
            {
                thread.Abort();
            }
        }

        #endregion

        #region Methods

        /// <summary>
        /// The config request.
        /// </summary>
        /// <param name="request">
        /// The request.
        /// </param>
        private void ConfigRequest(HttpWebRequest request)
        {
            request.UserAgent = this.Settings.UserAgent;
            request.CookieContainer = this.cookieContainer;
            request.AllowAutoRedirect = true;
            request.MediaType = "text/html";
            request.Headers["Accept-Language"] = "zh-CN,zh;q=0.8";

            if (this.Settings.Timeout > 0)
            {
                request.Timeout = this.Settings.Timeout;
            }
        }

        /// <summary>
        /// The crawl process.
        /// </summary>
        /// <param name="threadIndex">
        /// The thread index.
        /// </param>
        private void CrawlProcess(object threadIndex)
        {
            var currentThreadIndex = (int)threadIndex;
            while (true)
            {
                // Based on the number of URLs in the queue and the number of idle threads, decide whether this thread sleeps or exits
                if (UrlQueue.Instance.Count == 0)
                {
                    this.threadStatus[currentThreadIndex] = true;
                    if (!this.threadStatus.Any(t => t == false))
                    {
                        break;
                    }

                    Thread.Sleep(2000);
                    continue;
                }

                this.threadStatus[currentThreadIndex] = false;

                if (UrlQueue.Instance.Count == 0)
                {
                    continue;
                }

                UrlInfo urlInfo = UrlQueue.Instance.DeQueue();

                HttpWebRequest request = null;
                HttpWebResponse response = null;

                try
                {
                    if (urlInfo == null)
                    {
                        continue;
                    }

                    // Automatic rate limiting: wait a random 1 to 5 seconds
                    if (this.Settings.AutoSpeedLimit)
                    {
                        int span = this.random.Next(1000, 5000);
                        Thread.Sleep(span);
                    }

                    // Create and configure the web request
                    request = WebRequest.Create(urlInfo.UrlString) as HttpWebRequest;

                    if (request != null)
                    {
                        this.ConfigRequest(request);
                        response = request.GetResponse() as HttpWebResponse;
                    }

                    if (response != null)
                    {
                        this.PersistenceCookie(response);

                        Stream stream = null;

                        // If the page is gzip-compressed, decompress the stream
                        if (response.ContentEncoding == "gzip")
                        {
                            Stream responseStream = response.GetResponseStream();
                            if (responseStream != null)
                            {
                                stream = new GZipStream(responseStream, CompressionMode.Decompress);
                            }
                        }
                        else
                        {
                            stream = response.GetResponseStream();
                        }

                        using (stream)
                        {
                            string html = this.ParseContent(stream, response.CharacterSet);

                            this.ParseLinks(urlInfo, html);

                            if (this.DataReceivedEvent != null)
                            {
                                this.DataReceivedEvent(
                                    new DataReceivedEventArgs
                                        {
                                            Url = urlInfo.UrlString, 
                                            Depth = urlInfo.Depth, 
                                            Html = html
                                        });
                            }

                            if (stream != null)
                            {
                                stream.Close();
                            }
                        }
                    }
                }
                catch (Exception exception)
                {
                    if (this.CrawlErrorEvent != null)
                    {
                        if (urlInfo != null)
                        {
                            this.CrawlErrorEvent(
                                new CrawlErrorEventArgs { Url = urlInfo.UrlString, Exception = exception });
                        }
                    }
                }
                finally
                {
                    if (request != null)
                    {
                        request.Abort();
                    }

                    if (response != null)
                    {
                        response.Close();
                    }
                }
            }
        }

        /// <summary>
        /// The initialize.
        /// </summary>
        private void Initialize()
        {
            if (this.Settings.SeedsAddress != null && this.Settings.SeedsAddress.Count > 0)
            {
                foreach (string seed in this.Settings.SeedsAddress)
                {
                    if (Regex.IsMatch(seed, WebUrlRegularExpressions, RegexOptions.IgnoreCase))
                    {
                        UrlQueue.Instance.EnQueue(new UrlInfo(seed) { Depth = 1 });
                    }
                }
            }

            for (int i = 0; i < this.Settings.ThreadCount; i++)
            {
                var threadStart = new ParameterizedThreadStart(this.CrawlProcess);

                this.threads[i] = new Thread(threadStart);
            }

            ServicePointManager.DefaultConnectionLimit = 256;
        }

        /// <summary>
        /// The is match regular.
        /// </summary>
        /// <param name="url">
        /// The url.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        private bool IsMatchRegular(string url)
        {
            bool result = false;

            if (this.Settings.RegularFilterExpressions != null && this.Settings.RegularFilterExpressions.Count > 0)
            {
                if (
                    this.Settings.RegularFilterExpressions.Any(
                        pattern => Regex.IsMatch(url, pattern, RegexOptions.IgnoreCase)))
                {
                    result = true;
                }
            }
            else
            {
                result = true;
            }

            return result;
        }

        /// <summary>
        /// The parse content.
        /// </summary>
        /// <param name="stream">
        /// The stream.
        /// </param>
        /// <param name="characterSet">
        /// The character set.
        /// </param>
        /// <returns>
        /// The <see cref="string"/>.
        /// </returns>
        private string ParseContent(Stream stream, string characterSet)
        {
            var memoryStream = new MemoryStream();
            stream.CopyTo(memoryStream);

            byte[] buffer = memoryStream.ToArray();

            Encoding encode = Encoding.ASCII;
            string html = encode.GetString(buffer);

            string localCharacterSet = characterSet;

            Match match = Regex.Match(html, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase);
            if (match.Success)
            {
                localCharacterSet = match.Groups[2].Value;

                var stringBuilder = new StringBuilder();
                foreach (char item in localCharacterSet)
                {
                    if (item == ' ')
                    {
                        break;
                    }

                    if (item != '\"')
                    {
                        stringBuilder.Append(item);
                    }
                }

                localCharacterSet = stringBuilder.ToString();
            }

            if (string.IsNullOrEmpty(localCharacterSet))
            {
                localCharacterSet = characterSet;
            }

            if (!string.IsNullOrEmpty(localCharacterSet))
            {
                encode = Encoding.GetEncoding(localCharacterSet);
            }

            memoryStream.Close();

            return encode.GetString(buffer);
        }

        /// <summary>
        /// The parse links.
        /// </summary>
        /// <param name="urlInfo">
        /// The url info.
        /// </param>
        /// <param name="html">
        /// The html.
        /// </param>
        private void ParseLinks(UrlInfo urlInfo, string html)
        {
            if (this.Settings.Depth > 0 && urlInfo.Depth >= this.Settings.Depth)
            {
                return;
            }

            var urlDictionary = new Dictionary<string, string>();

            Match match = Regex.Match(html, "(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");
            while (match.Success)
            {
                // Use the href as the key
                string urlKey = match.Groups[1].Value;

                // Use the link text as the value
                string urlValue = Regex.Replace(match.Groups[2].Value, "(?i)<.*?>", string.Empty);

                urlDictionary[urlKey] = urlValue;
                match = match.NextMatch();
            }

            foreach (var item in urlDictionary)
            {
                string href = item.Key;
                string text = item.Value;

                if (!string.IsNullOrEmpty(href))
                {
                    bool canBeAdd = true;

                    if (this.Settings.EscapeLinks != null && this.Settings.EscapeLinks.Count > 0)
                    {
                        if (this.Settings.EscapeLinks.Any(suffix => href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (this.Settings.HrefKeywords != null && this.Settings.HrefKeywords.Count > 0)
                    {
                        if (!this.Settings.HrefKeywords.Any(href.Contains))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (canBeAdd)
                    {
                        string url = href.Replace("%3f", "?")
                            .Replace("%3d", "=")
                            .Replace("%2f", "/")
                            .Replace("&amp;", "&");

                        if (string.IsNullOrEmpty(url) || url.StartsWith("#")
                            || url.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                            || url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
                        {
                            continue;
                        }

                        var baseUri = new Uri(urlInfo.UrlString);
                        Uri currentUri = url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                                             ? new Uri(url)
                                             : new Uri(baseUri, url);

                        url = currentUri.AbsoluteUri;

                        if (this.Settings.LockHost)
                        {
                            // Strip the first host label and compare the rest; if equal, treat both as the same site
                            // For example: mail.pzcast.com and www.pzcast.com
                            if (baseUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b)
                                != currentUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b))
                            {
                                continue;
                            }
                        }

                        if (!this.IsMatchRegular(url))
                        {
                            continue;
                        }

                        var addUrlEventArgs = new AddUrlEventArgs { Title = text, Depth = urlInfo.Depth + 1, Url = url };
                        if (this.AddUrlEvent != null && !this.AddUrlEvent(addUrlEventArgs))
                        {
                            continue;
                        }

                        UrlQueue.Instance.EnQueue(new UrlInfo(url) { Depth = urlInfo.Depth + 1 });
                    }
                }
            }
        }

        /// <summary>
        /// The persistence cookie.
        /// </summary>
        /// <param name="response">
        /// The response.
        /// </param>
        private void PersistenceCookie(HttpWebResponse response)
        {
            if (!this.Settings.KeepCookie)
            {
                return;
            }

            string cookies = response.Headers["Set-Cookie"];
            if (!string.IsNullOrEmpty(cookies))
            {
                var cookieUri =
                    new Uri(
                        string.Format(
                            "{0}://{1}:{2}/", 
                            response.ResponseUri.Scheme, 
                            response.ResponseUri.Host, 
                            response.ResponseUri.Port));

                this.cookieContainer.SetCookies(cookieUri, cookies);
            }
        }

        #endregion
    }
}
CrawlMaster.cs
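
The `LockHost` check in the listing above compares two hosts after dropping their first label. As a minimal sketch of that idea in isolation (the `HostLock` class and its method names are hypothetical, not part of Feng.SimpleCrawler), it could be written as:

```csharp
using System;
using System.Linq;

// Hypothetical helper that isolates the "same site" heuristic.
public static class HostLock
{
    // Drops the first DNS label ("www", "mail", ...) and returns the rest.
    // A single-label host such as "localhost" is returned unchanged.
    public static string StripFirstLabel(string host)
    {
        string[] labels = host.Split('.');
        return labels.Length > 1 ? string.Join(".", labels.Skip(1)) : host;
    }

    // True when both URLs share the same host once the first label is removed.
    public static bool IsSameSite(Uri a, Uri b)
    {
        return string.Equals(
            StripFirstLabel(a.Host),
            StripFirstLabel(b.Host),
            StringComparison.OrdinalIgnoreCase);
    }
}
```

Note that the heuristic assumes every host carries a subdomain: a bare host such as `cnblogs.com` is reduced to `com`, which would compare equal to any other bare `.com` host.
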
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;

    /// <summary>
    /// The crawl settings.
    /// </summary>
    [Serializable]
    public class CrawlSettings
    {
        #region Fields

        /// <summary>
        /// The depth.
        /// </summary>
        private byte depth = 3;

        /// <summary>
        /// The lock host.
        /// </summary>
        private bool lockHost = true;

        /// <summary>
        /// The thread count.
        /// </summary>
        private byte threadCount = 1;

        /// <summary>
        /// The timeout.
        /// </summary>
        private int timeout = 15000;

        /// <summary>
        /// The user agent.
        /// </summary>
        private string userAgent = 
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="CrawlSettings"/> class.
        /// </summary>
        public CrawlSettings()
        {
            this.AutoSpeedLimit = false;
            this.EscapeLinks = new List<string>();
            this.KeepCookie = true;
            this.HrefKeywords = new List<string>();
            this.LockHost = true;
            this.RegularFilterExpressions = new List<string>();
            this.SeedsAddress = new List<string>();
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets or sets a value indicating whether auto speed limit.
        /// </summary>
        public bool AutoSpeedLimit { get; set; }

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public byte Depth
        {
            get
            {
                return this.depth;
            }

            set
            {
                this.depth = value;
            }
        }

        /// <summary>
        /// Gets the escape links.
        /// </summary>
        public List<string> EscapeLinks { get; private set; }

        /// <summary>
        /// Gets or sets a value indicating whether keep cookie.
        /// </summary>
        public bool KeepCookie { get; set; }

        /// <summary>
        /// Gets the href keywords.
        /// </summary>
        public List<string> HrefKeywords { get; private set; }

        /// <summary>
        /// Gets or sets a value indicating whether lock host.
        /// </summary>
        public bool LockHost
        {
            get
            {
                return this.lockHost;
            }

            set
            {
                this.lockHost = value;
            }
        }

        /// <summary>
        /// Gets the regular filter expressions.
        /// </summary>
        public List<string> RegularFilterExpressions { get; private set; }

        /// <summary>
        /// Gets the seeds address.
        /// </summary>
        public List<string> SeedsAddress { get; private set; }

        /// <summary>
        /// Gets or sets the thread count.
        /// </summary>
        public byte ThreadCount
        {
            get
            {
                return this.threadCount;
            }

            set
            {
                this.threadCount = value;
            }
        }

        /// <summary>
        /// Gets or sets the timeout.
        /// </summary>
        public int Timeout
        {
            get
            {
                return this.timeout;
            }

            set
            {
                this.timeout = value;
            }
        }

        /// <summary>
        /// Gets or sets the user agent.
        /// </summary>
        public string UserAgent
        {
            get
            {
                return this.userAgent;
            }

            set
            {
                this.userAgent = value;
            }
        }

        #endregion
    }
}
CrawlSettings.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The crawl status.
    /// </summary>
    public enum CrawlStatus
    {
        /// <summary>
        /// The completed.
        /// </summary>
        Completed = 1, 

        /// <summary>
        /// The never been.
        /// </summary>
        NeverBeen = 2
    }
}
CrawlStatus.cs
namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>
    /// The data received event handler.
    /// </summary>
    /// <param name="args">
    /// The args.
    /// </param>
    public delegate void DataReceivedEventHandler(DataReceivedEventArgs args);

    /// <summary>
    /// The data received event args.
    /// </summary>
    public class DataReceivedEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public int Depth { get; set; }

        /// <summary>
        /// Gets or sets the html.
        /// </summary>
        public string Html { get; set; }

        /// <summary>
        /// Gets or sets the url.
        /// </summary>
        public string Url { get; set; }

        #endregion
    }
}
DataReceivedEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System.Collections.Generic;
    using System.Threading;

    /// <summary>
    /// The security queue.
    /// </summary>
    /// <typeparam name="T">
    /// Any type.
    /// </typeparam>
    public abstract class SecurityQueue<T>
        where T : class
    {
        #region Fields

        /// <summary>
        /// The inner queue.
        /// </summary>
        protected readonly Queue<T> InnerQueue = new Queue<T>();

        /// <summary>
        /// The sync object.
        /// </summary>
        protected readonly object SyncObject = new object();

        /// <summary>
        /// The auto reset event.
        /// </summary>
        private readonly AutoResetEvent autoResetEvent;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="SecurityQueue{T}"/> class.
        /// </summary>
        protected SecurityQueue()
        {
            this.autoResetEvent = new AutoResetEvent(false);
        }

        #endregion

        #region Delegates

        /// <summary>
        /// The before en queue event handler.
        /// </summary>
        /// <param name="target">
        /// The target.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        public delegate bool BeforeEnQueueEventHandler(T target);

        #endregion

        #region Public Events

        /// <summary>
        /// The before en queue event.
        /// </summary>
        public event BeforeEnQueueEventHandler BeforeEnQueueEvent;

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the auto reset event.
        /// </summary>
        public AutoResetEvent AutoResetEvent
        {
            get
            {
                return this.autoResetEvent;
            }
        }

        /// <summary>
        /// Gets the count.
        /// </summary>
        public int Count
        {
            get
            {
                lock (this.SyncObject)
                {
                    return this.InnerQueue.Count;
                }
            }
        }

        /// <summary>
        /// Gets a value indicating whether has value.
        /// </summary>
        public bool HasValue
        {
            get
            {
                return this.Count != 0;
            }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>
        /// The de queue.
        /// </summary>
        /// <returns>
        /// The <see cref="T"/>.
        /// </returns>
        public T DeQueue()
        {
            lock (this.SyncObject)
            {
                if (this.InnerQueue.Count > 0)
                {
                    return this.InnerQueue.Dequeue();
                }

                return default(T);
            }
        }

        /// <summary>
        /// The en queue.
        /// </summary>
        /// <param name="target">
        /// The target.
        /// </param>
        public void EnQueue(T target)
        {
            lock (this.SyncObject)
            {
                if (this.BeforeEnQueueEvent != null)
                {
                    if (this.BeforeEnQueueEvent(target))
                    {
                        this.InnerQueue.Enqueue(target);
                    }
                }
                else
                {
                    this.InnerQueue.Enqueue(target);
                }

                this.AutoResetEvent.Set();
            }
        }

        #endregion
    }
}
SecurityQueue.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The url info.
    /// </summary>
    public class UrlInfo
    {
        #region Fields

        /// <summary>
        /// The url.
        /// </summary>
        private readonly string url;

        #endregion

        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="UrlInfo"/> class.
        /// </summary>
        /// <param name="urlString">
        /// The url string.
        /// </param>
        public UrlInfo(string urlString)
        {
            this.url = urlString;
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets or sets the depth.
        /// </summary>
        public int Depth { get; set; }

        /// <summary>
        /// Gets the url string.
        /// </summary>
        public string UrlString
        {
            get
            {
                return this.url;
            }
        }

        /// <summary>
        /// Gets or sets the status.
        /// </summary>
        public CrawlStatus Status { get; set; }

        #endregion
    }
}
UrlInfo.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The url queue.
    /// </summary>
    public class UrlQueue : SecurityQueue<UrlInfo>
    {
        #region Constructors and Destructors

        /// <summary>
        /// Prevents a default instance of the <see cref="UrlQueue"/> class from being created.
        /// </summary>
        private UrlQueue()
        {
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the instance.
        /// </summary>
        public static UrlQueue Instance
        {
            get
            {
                return Nested.Inner;
            }
        }

        #endregion

        /// <summary>
        /// The nested.
        /// </summary>
        private static class Nested
        {
            #region Static Fields

            /// <summary>
            /// The inner.
            /// </summary>
            internal static readonly UrlQueue Inner = new UrlQueue();

            #endregion
        }
    }
}
UrlQueue.cs

5. Create the Windows service

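With the preparation done, we finally get to the main point. A console application is not a reliable host for a long-running job, and crawling posts from cnblogs has to keep running steadily for a long time, so I turned to a Windows service. Create the Windows service; the code is as follows.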

using System;
using System.Data;
using System.ServiceProcess;
using Feng.SimpleCrawler;
using Feng.DbHelper;
using Feng.Log;
using HtmlAgilityPack;

namespace Feng.Demo
{
    /// <summary>
    /// Windows service
    /// </summary>
    partial class FengCnblogsService : ServiceBase
    {

        #region Constructor
        /// <summary>
        /// Constructor
        /// </summary>
        public FengCnblogsService()
        {
            InitializeComponent();
        } 
        #endregion

        #region Fields

        /// <summary>
        /// Crawler settings
        /// </summary>
        private static readonly CrawlSettings Settings = new CrawlSettings();

        /// <summary>
        /// Temporary in-memory table that buffers rows before they are written to the database
        /// </summary>
        private static DataTable dt = new DataTable();

        /// <summary>
        /// About the filter, see: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html
        /// </summary>
        private static BloomFilter<string> filter;

        #endregion

        #region Start the service
        /// <summary>
        /// TODO: add code here to start the service.
        /// </summary>
        /// <param name="args"></param>
        protected override void OnStart(string[] args)
        {

            ProcessStart();
        } 
        #endregion

        #region Stop the service
        /// <summary>
        /// TODO: add code here to perform the shutdown work needed to stop the service.
        /// </summary>
        protected override void OnStop()
        {

        } 
        #endregion

        #region Start processing
        /// <summary>
        /// Start processing
        /// </summary>
        private void ProcessStart()
        {

            dt.Columns.Add("BlogTitle", typeof(string));
            dt.Columns.Add("BlogUrl", typeof(string));
            dt.Columns.Add("BlogAuthor", typeof(string));
            dt.Columns.Add("BlogTime", typeof(string));
            dt.Columns.Add("BlogMotto", typeof(string));
            dt.Columns.Add("BlogDepth", typeof(string));

            filter = new BloomFilter<string>(200000);
            const string CityName = "";
            #region Seed addresses
            // Set the seed addresses
            Settings.SeedsAddress.Add(string.Format("http://www.cnblogs.com/{0}", CityName));

            Settings.SeedsAddress.Add("http://www.cnblogs.com/artech");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wuhuacong/");


            Settings.SeedsAddress.Add("http://www.cnblogs.com/dudu/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/guomingfeng/");

            Settings.SeedsAddress.Add("http://www.cnblogs.com/daxnet/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/fenglingyi");

            Settings.SeedsAddress.Add("http://www.cnblogs.com/ahthw/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wangweimutou/");
            #endregion

            #region URL keywords
            Settings.HrefKeywords.Add("a/");
            Settings.HrefKeywords.Add("b/");
            Settings.HrefKeywords.Add("c/");
            Settings.HrefKeywords.Add("d/");

            Settings.HrefKeywords.Add("e/");
            Settings.HrefKeywords.Add("f/");
            Settings.HrefKeywords.Add("g/");
            Settings.HrefKeywords.Add("h/");


            Settings.HrefKeywords.Add("i/");
            Settings.HrefKeywords.Add("j/");
            Settings.HrefKeywords.Add("k/");
            Settings.HrefKeywords.Add("l/");


            Settings.HrefKeywords.Add("m/");
            Settings.HrefKeywords.Add("n/");
            Settings.HrefKeywords.Add("o/");
            Settings.HrefKeywords.Add("p/");

            Settings.HrefKeywords.Add("q/");
            Settings.HrefKeywords.Add("r/");
            Settings.HrefKeywords.Add("s/");
            Settings.HrefKeywords.Add("t/");


            Settings.HrefKeywords.Add("u/");
            Settings.HrefKeywords.Add("v/");
            Settings.HrefKeywords.Add("w/");
            Settings.HrefKeywords.Add("x/");

            Settings.HrefKeywords.Add("y/");
            Settings.HrefKeywords.Add("z/");
            #endregion


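            // Number of crawler threads
            Settings.ThreadCount = 1;

            // Crawl depth
            Settings.Depth = 55;

            // Links to skip while crawling, matched by suffix; more than one can be added
            Settings.EscapeLinks.Add("http://www.oschina.net/");

            // Automatic speed limit (a random 1 to 5 second pause between requests); disabled here
            Settings.AutoSpeedLimit = false;

            // Whether to lock the host: with the subdomain stripped, equal domains count as the same site; disabled here
            Settings.LockHost = false;

            // Only follow URLs on www.cnblogs.com. The dots are escaped so they match literally,
            // and "cnblogs" is written as a literal word rather than a character class.
            Settings.RegularFilterExpressions.Add(@"http://www\.cnblogs\.com/");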
            var master = new CrawlMaster(Settings);
            master.AddUrlEvent += MasterAddUrlEvent;
            master.DataReceivedEvent += MasterDataReceivedEvent;

            master.Crawl();

        }
        
        #endregion

        #region Print URL
        /// <summary>
        /// The master add url event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        private static bool MasterAddUrlEvent(AddUrlEventArgs args)
        {
            if (!filter.Contains(args.Url))
            {
                filter.Add(args.Url);
                Console.WriteLine(args.Url);

                if (dt.Rows.Count > 200)
                {
                    MssqlHelper.InsertDb(dt);
                    dt.Rows.Clear();
                }

                return true;
            }

            return false; // Returning false means the URL is not added to the queue
        }
        #endregion

        #region Parse HTML
        /// <summary>
        /// The master data received event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        private static void MasterDataReceivedEvent(SimpleCrawler.DataReceivedEventArgs args)
        {
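            // Parse the page here. HtmlAgilityPack (an HTML parsing library) is used below;
            // regular expressions or hand-written string parsing would also work.

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(args.Html);

            // SelectSingleNode returns null when nothing matches, so check before reading InnerText.
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
            HtmlNode node2 = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
            if (node == null || node2 == null)
            {
                // No title or publish date: treat the page as not being a post and skip it.
                return;
            }

            string title = node.InnerText;
            string time = node2.InnerText;

            // Author and motto depend on the blog's theme, so fall back to an empty string.
            HtmlNode node3 = doc.DocumentNode.SelectSingleNode("//*[@id='topics']/div/div[3]/a[1]");
            string author = node3 == null ? string.Empty : node3.InnerText;

            HtmlNode node6 = doc.DocumentNode.SelectSingleNode("//*[@id='blogTitle']/h2");
            string motto = node6 == null ? string.Empty : node6.InnerText;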

            MssqlHelper.GetData(title, args.Url, author, time, motto, args.Depth.ToString(), dt);

            LogHelper.WriteLog(title);
            LogHelper.WriteLog(args.Url);
            LogHelper.WriteLog(author);
            LogHelper.WriteLog(time);
            LogHelper.WriteLog(motto == "" ? "null" : motto);
            LogHelper.WriteLog(args.Depth + "&dt.Rows.Count=" + dt.Rows.Count);

            // Write to the database whenever more than 100 rows have accumulated; adjust the threshold to suit your needs
            if (dt.Rows.Count > 100)
            {
                MssqlHelper.InsertDb(dt);
                dt.Rows.Clear();
            }

        }
        #endregion


    }
}
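
The service above relies on a `BloomFilter<string>` to avoid queueing the same URL twice. To illustrate the idea only, here is a minimal sketch of a Bloom filter for strings. The class name `SimpleBloomFilter`, its constructor parameters, and the choice of FNV-1a with double hashing are all my assumptions; this is not the `BloomFilter<T>` class the service uses.

```csharp
using System;
using System.Collections;

// Hypothetical minimal Bloom filter; not thread-safe (BitArray is not).
public sealed class SimpleBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException("bitCount");
        if (hashCount <= 0) throw new ArgumentOutOfRangeException("hashCount");
        this.bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    // Sets the k bits that belong to the item.
    public void Add(string item)
    {
        uint h1 = Fnv1a(item, 2166136261u);
        uint h2 = Fnv1a(item, 0x9747b28cu) | 1u; // odd step so probes differ
        for (int i = 0; i < this.hashCount; i++)
        {
            this.bits[this.IndexAt(h1, h2, i)] = true;
        }
    }

    // False means "definitely never added"; true means "probably added".
    public bool Contains(string item)
    {
        uint h1 = Fnv1a(item, 2166136261u);
        uint h2 = Fnv1a(item, 0x9747b28cu) | 1u;
        for (int i = 0; i < this.hashCount; i++)
        {
            if (!this.bits[this.IndexAt(h1, h2, i)])
            {
                return false;
            }
        }

        return true;
    }

    // Double hashing: index_i = (h1 + i * h2) mod bitCount.
    private int IndexAt(uint h1, uint h2, int i)
    {
        ulong combined = h1 + ((ulong)i * h2);
        return (int)(combined % (ulong)this.bits.Length);
    }

    // FNV-1a over UTF-16 code units; the seed selects one of two hash variants.
    private static uint Fnv1a(string text, uint seed)
    {
        uint hash = seed;
        foreach (char c in text)
        {
            unchecked
            {
                hash ^= c;
                hash *= 16777619u;
            }
        }

        return hash;
    }
}
```

A Bloom filter never reports an added item as missing, but it can report an unseen URL as present, in which case that URL is simply not crawled. With about 2 million bits (`1 << 21`), 3 hashes, and 200,000 URLs, the usual estimate `(1 - e^(-kn/m))^k` gives a false-positive rate of roughly 1.5%.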

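Here the crawler fetches posts from cnblogs, and we use the third-party library HtmlAgilityPack to parse out the fields we need: post title, author, URL, and so on. We can also set some properties of the service itself.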
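
Once the fields are parsed, the service collects them in a `DataTable` and writes them to the database in batches. The listing shares one static table between two callbacks and flushes it at two different thresholds, so here is a hedged sketch of how the buffering could be wrapped in one place. `RowBuffer` is a hypothetical helper of mine, not part of the original source; it assumes the flush callback takes a `DataTable`, as `MssqlHelper.InsertDb` does.

```csharp
using System;
using System.Data;

// Hypothetical helper: buffers rows and hands each full batch to a callback.
public sealed class RowBuffer
{
    private readonly object sync = new object();
    private readonly DataTable table;
    private readonly int batchSize;
    private readonly Action<DataTable> flush;

    public RowBuffer(DataTable schema, int batchSize, Action<DataTable> flush)
    {
        this.table = schema.Clone(); // copies the column layout, not the rows
        this.batchSize = batchSize;
        this.flush = flush;
    }

    // Adds one row (values in column order) and flushes when the batch is full.
    public void Add(params object[] values)
    {
        lock (this.sync)
        {
            this.table.Rows.Add(values);
            if (this.table.Rows.Count >= this.batchSize)
            {
                this.FlushLocked();
            }
        }
    }

    // Writes out whatever is buffered, for example when the service stops.
    public void Flush()
    {
        lock (this.sync)
        {
            this.FlushLocked();
        }
    }

    // Caller must hold the lock. Rows are cleared only after the callback returns.
    private void FlushLocked()
    {
        if (this.table.Rows.Count == 0)
        {
            return;
        }

        this.flush(this.table);
        this.table.Rows.Clear();
    }
}
```

In the service this might be created as `new RowBuffer(dt, 100, MssqlHelper.InsertDb)`, with `OnStop` calling `Flush()` so that a partial batch is not lost when the service shuts down. The lock matters once `ThreadCount` is raised above 1, because `DataTable` is not safe for concurrent writes.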

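The crawler needs a few parameters: the seed addresses, the URL keywords, the crawl depth, and so on. Once these are set, all that remains is to install the Windows service, and we are done.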
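
Two of these parameters are easy to get subtly wrong, so here is a small sketch under my own assumptions (`CrawlConfigSketch` and its members are hypothetical names, not part of the crawler). It builds the 26 single-letter URL keywords in a loop and shows a host filter with the dots escaped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for building crawler settings.
public static class CrawlConfigSketch
{
    // Escaped and anchored: only URLs that start with http://www.cnblogs.com/ match.
    public const string CnblogsPattern = @"^http://www\.cnblogs\.com/";

    // Builds "a/", "b/", ..., "z/" instead of 26 hand-written Add calls.
    public static List<string> LetterKeywords()
    {
        return Enumerable.Range('a', 26)
            .Select(code => (char)code + "/")
            .ToList();
    }

    // True when the URL passes the host filter (case-insensitive).
    public static bool IsCnblogsUrl(string url)
    {
        return Regex.IsMatch(url, CnblogsPattern, RegexOptions.IgnoreCase);
    }
}
```

The keywords could then be applied with `Settings.HrefKeywords.AddRange(CrawlConfigSketch.LetterKeywords())`. On the filter: in a regular expression, square brackets define a character class and an unescaped dot matches any character, so a pattern such as `http://([w]{3}.)+[cnblogs]+.com/` also accepts `http://www.bbc.com/`, because `bbc` consists only of letters from the class.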

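6. Install the Windows service

Here we install the Windows service with the tool that ships with Visual Studio.

After a successful installation, open the Windows Services console and the newly installed service appears in the list.

We can also open the log file to see the fields parsed from the crawled posts, as shown in the figure below.

Now check the database; my service has been running for a day...

If you found this article useful, please recommend it. My abilities are limited, so if anything in it is wrong, corrections are welcome. If you need the source code, leave your email address...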
