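Let's skip the small talk and get straight to the requirement.
Our company's website needs to scrape articles from other sites. The task was not originally assigned to me, but a colleague spent an entire afternoon on it without getting anywhere. I had just joined the company and wanted to prove myself, so I took the job over. I had done this kind of thing before and assumed it would be easy. Once I started, though, I was stumped: the string that came back from the HTTP request was garbled. After a lot of searching on Baidu (Google kept failing to load), I finally found the cause. The page I was scraping is served compressed, so what I received was the compressed body, and it has to be decompressed before decoding. Without that step the result is garbage no matter which encoding you pick. Here is the code: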
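// Requires: using System.Text;
// Maps the charset reported by the response to an Encoding.
// The charset may arrive in a different case (e.g. "UTF-8"), so normalize it first.
public Encoding GetEncoding(string characterSet)
{
    switch ((characterSet ?? "").Trim().ToLowerInvariant())
    {
        case "gb2312": return Encoding.GetEncoding("gb2312");
        case "utf-8": return Encoding.UTF8;
        default: return Encoding.Default;
    }
}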
// Requires: using System.IO; using System.IO.Compression; using System.Net;
public string HttpGet(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    req.Accept = "*/*";
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";

    using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
    {
        // Wrap the body in a decompressing stream when the server reports compression.
        string contentEncoding = (response.ContentEncoding ?? "").ToLowerInvariant();
        Stream stream = response.GetResponseStream();
        if (contentEncoding.Contains("gzip"))
        {
            stream = new GZipStream(stream, CompressionMode.Decompress);
        }
        else if (contentEncoding.Contains("deflate"))
        {
            stream = new DeflateStream(stream, CompressionMode.Decompress);
        }

        // Disposing the reader also disposes the underlying stream.
        using (StreamReader reader = new StreamReader(stream, GetEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
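To see why the undecompressed body looks like garbage, here is a minimal in-memory sketch that needs no network. The class name `GzipDemo`, its `Compress` and `Decompress` helpers, and the sample page string are made up for illustration; they are not part of the scraper above. It compresses a small page the way a server sending `Content-Encoding: gzip` would, then compares decoding the raw bytes with decoding after decompression.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Hypothetical helper: gzip-compress text, standing in for a compressed response body.
    public static byte[] Compress(string text, Encoding encoding)
    {
        byte[] raw = encoding.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Hypothetical helper: decompress first, then decode with the given encoding.
    public static string Decompress(byte[] body, Encoding encoding)
    {
        using (var gzip = new GZipStream(new MemoryStream(body), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string page = "<html><body>你好, world</body></html>";
        byte[] body = Compress(page, Encoding.UTF8);

        // Decoding the compressed bytes directly never yields the original text.
        string garbled = Encoding.UTF8.GetString(body);
        Console.WriteLine(garbled == page);   // prints False

        // Decompressing before decoding restores the page.
        string restored = Decompress(body, Encoding.UTF8);
        Console.WriteLine(restored == page);  // prints True
    }
}
```

As a possible alternative to wrapping the stream by hand, `HttpWebRequest` also has an `AutomaticDecompression` property that accepts `DecompressionMethods.GZip | DecompressionMethods.Deflate`; I have not tested it against the site scraped here.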
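Calling HttpGet returns the page source. With the source in hand, the next step is to parse the HTML with a very handy tool, HtmlAgilityPack. It doesn't matter if you can't write regular expressions; this library is a lifesaver. The boss never has to worry about my regexes again.
As for how to use it, there are plenty of detailed articles on cnblogs, so I won't repeat them here.
The following code scrapes the article list from the cnblogs home page: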
// Requires: using HtmlAgilityPack;
string html = HttpGet("http://www.cnblogs.com/");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Get the article list. SelectNodes returns null when nothing matches.
var artlist = doc.DocumentNode.SelectNodes("//div[@class='post_item']");
if (artlist != null)
{
    foreach (var item in artlist)
    {
        // A relative XPath (".//") searches only inside the current post item,
        // so there is no need to load each item into a new HtmlDocument.
        var html_a = item.SelectSingleNode(".//a[@class='titlelnk']");
        if (html_a == null) continue;
        Response.Write(string.Format("Title: {0}, link: {1}<br>", html_a.InnerText, html_a.GetAttributeValue("href", "")));
    }
}
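The result is shown in the screenshot:
That's a wrap.
This was written in a hurry and I'm not much of a writer, so if anything is unclear, feel free to complain in the comments. Venting is good for your health.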