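There is a Python book called <<Python 網絡數據採集>> (Web Scraping with Python).
I plan to implement all of its examples with .NET Core and third-party libraries.
This is part one, which mainly uses AngleSharp: https://anglesharp.github.io/
(The section numbers in this article correspond to those in the book.)
In Python, an HTTP request is sent like this, using urllib from the Python standard library:
In .NET Core you can use HttpClient. The equivalent C# code is: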
    public static async Task<string> SendRequestWithHttpClientAsync()
    {
        var client = new HttpClient();
        HttpResponseMessage response = await client.GetAsync("http://pythonscraping.com/pages/page1.html");
        // Throws HttpRequestException for non-success status codes.
        response.EnsureSuccessStatusCode();
        var responseBody = await response.Content.ReadAsStringAsync();
        Console.WriteLine(responseBody);
        return responseBody;
    }
Or, more briefly:
    var client = new HttpClient();
    var responseBody = await client.GetStringAsync("http://pythonscraping.com/pages/page1.html");
    Console.WriteLine(responseBody);
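The result is as follows:
In Python you can parse HTML source with libraries such as BeautifulSoup or MechanicalSoup.
In .NET Core you can work with HTML documents using libraries such as AngleSharp, Html Agility Pack, or DotnetSpider (developed in China; it also supports element extraction).
I start with AngleSharp here. Its parser follows the W3C specifications and can parse HTML, MathML, XML, SVG and CSS. It supports .NET Standard 1.0.
Install it from NuGet: https://www.nuget.org/packages/AngleSharp/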
Install-Package AngleSharp
Or with the dotnet CLI:
dotnet add package AngleSharp
The following example (1.2.2) prints the content of the page's h1 element.
The Python code from the book:
And the .NET Core C# code:
    public static async Task ReadWithAngleSharpAsync()
    {
        var htmlSourceCode = await SendRequestWithHttpClientAsync();
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(htmlSourceCode);

        Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("h1").OuterHtml}");
        Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("html > body > h1").OuterHtml}");
    }
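AngleSharp first needs an HtmlParser, which can be reused. You then parse the HTML source with parser.Parse() or its asynchronous counterpart parser.ParseAsync().
The parse result is an IHtmlDocument containing the parsed DOM. The DOM maps onto AngleSharp's classes as follows:
This diagram is from a somewhat older version, and the DOM model in newer versions differs slightly, but the idea is the same.
AngleSharp has many features, but the most important one is that it supports QuerySelector() and QuerySelectorAll(), just like the DOM methods querySelector() and querySelectorAll().
In the example above, the HTML structure is roughly as follows:
So for the returned IHtmlDocument object, document, calling document.QuerySelector("h1").OuterHtml returns the OuterHtml of the h1 element. document.QuerySelector("html > body > h1").OuterHtml gives the same result, because standard CSS selectors are fully supported.
QuerySelector() returns one element or none (null), much like LINQ's FirstOrDefault().
The output is:
After sending an HTTP request, errors may occur: the page may not exist (or the request itself may fail), the server may not exist, and so on.
In these cases the server may answer with an HTTP error such as 404 or 500. With the calls used here (GetStringAsync, or GetAsync followed by EnsureSuccessStatusCode), HttpClient reports all of them as an HttpRequestException, which can be handled like this: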
    public static async Task ResponseWithErrorsAsync()
    {
        try
        {
            var client = new HttpClient();
            var responseBody = await client.GetStringAsync("http://notexistwebsite");
            Console.WriteLine(responseBody);
        }
        catch (HttpRequestException e)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine("\nException Caught!");
            Console.WriteLine("Message :{0} ", e.Message);
        }
    }
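To see the error path without depending on a real site, here is a minimal sketch that plugs a stub message handler into HttpClient. FixedStatusHandler and SafeFetch are hypothetical names introduced only for this illustration, and the URL in the usage note is a placeholder that is never contacted.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double: answers every request with a fixed status code,
// so the example never touches the network.
public sealed class FixedStatusHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;

    public FixedStatusHandler(HttpStatusCode status) => _status = status;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(_status)
        {
            Content = new StringContent("<html><body><h1>stub</h1></body></html>"),
            RequestMessage = request
        };
        return Task.FromResult(response);
    }
}

public static class SafeFetch
{
    // Returns the response body, or null when GetStringAsync throws
    // HttpRequestException (for example on a 404 or 500 response).
    public static async Task<string> TryGetStringAsync(HttpClient client, string uri)
    {
        try
        {
            return await client.GetStringAsync(uri);
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Request failed: {e.Message}");
            return null;
        }
    }
}
```

With new HttpClient(new FixedStatusHandler(HttpStatusCode.NotFound)), a call to SafeFetch.TryGetStringAsync(client, "http://example.test/") goes through the catch branch and returns null. With HttpStatusCode.OK it returns the stub HTML.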
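But even if the page is fetched successfully, its content may not be what we expect, and exceptions can still be thrown. For example, if the tag you are looking for does not exist, QuerySelector() returns null, and accessing a property of that tag then throws a NullReferenceException.
So in this situation you can either catch NullReferenceException or check for null in code: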
    public static async Task ReadNonExistTagAsync()
    {
        var htmlSourceCode = await SendRequestWithHttpClientAsync();
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(htmlSourceCode);

        var nonExistTag = document.QuerySelector("h8");
        Console.WriteLine(nonExistTag);
        Console.WriteLine($"nonExistTag is null: {nonExistTag is null}");

        try
        {
            Console.WriteLine(nonExistTag.QuerySelector("p").OuterHtml);
        }
        catch (NullReferenceException)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine("Tag was not found");
        }
    }
A complete example:
    public static async Task RunAllAsync()
    {
        Console.ForegroundColor = ConsoleColor.Red;

        // Local function: fetches the page and returns the text of the requested tag,
        // or null when the request fails or the tag does not exist.
        async Task<string> GetTitleAsync(string uri)
        {
            var httpClient = new HttpClient();
            try
            {
                var responseHtml = await httpClient.GetStringAsync(uri);
                var parser = new HtmlParser();
                var document = await parser.ParseAsync(responseHtml);
                var tagContent = document.QuerySelector("body > h8").TextContent;
                return tagContent;
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine($"{nameof(HttpRequestException)}:");
                Console.WriteLine("Message :{0} ", e.Message);
                return null;
            }
            catch (NullReferenceException)
            {
                Console.WriteLine($"{nameof(NullReferenceException)}:");
                Console.WriteLine("Tag was not found");
                return null;
            }
        }

        var title = await GetTitleAsync("http://www.pythonscraping.com/pages/page1.html");
        if (string.IsNullOrWhiteSpace(title))
        {
            Console.WriteLine("Title was not found");
        }
        else
        {
            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine(title);
        }
    }
First, I wrapped the part that requests a URL and returns the HTML source into a method so that it can be reused:
    public static async Task<string> GetHtmlSourceCodeAsync(string uri)
    {
        var httpClient = new HttpClient();
        try
        {
            var htmlSource = await httpClient.GetStringAsync(uri);
            return htmlSource;
        }
        catch (HttpRequestException e)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine($"{nameof(HttpRequestException)}: {e.Message}");
            return null;
        }
    }
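CSS is a blessing for web scrapers. Elements like the following two (a span with class green and a span with class red) may appear many times on a page:
We can use AngleSharp's QuerySelectorAll() method to find every matching element; the matches are returned as a collection.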
    public static async Task FindGreenClassAsync()
    {
        const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
        var html = await GetHtmlSourceCodeAsync(url);
        if (!string.IsNullOrWhiteSpace(html))
        {
            var parser = new HtmlParser();
            var document = await parser.ParseAsync(html);

            // "span.green" selects span elements that carry the class green.
            // ("span > .green" would select children of a span instead.)
            var nameList = document.QuerySelectorAll("span.green");

            Console.WriteLine("Green names are:");
            Console.ForegroundColor = ConsoleColor.Green;
            foreach (var item in nameList)
            {
                Console.WriteLine(item.TextContent);
            }
        }
        else
        {
            Console.WriteLine("No html source code returned.");
        }
    }
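Very simple, and identical to the standard DOM operations.
If you only need an element's text, use its TextContent property.
Another example:
1. Find all h1, h2, h3, h4, h5 and h6 elements on the page.
2. Find the span elements whose class is green or red.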
    public static async Task FindByAttributeAsync()
    {
        const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
        var html = await GetHtmlSourceCodeAsync(url);
        if (!string.IsNullOrWhiteSpace(html))
        {
            var parser = new HtmlParser();
            var document = await parser.ParseAsync(html);

            var headers = document.QuerySelectorAll("*")
                .Where(x => new[] { "h1", "h2", "h3", "h4", "h5", "h6" }.Contains(x.TagName.ToLower()));
            Console.WriteLine("Headers are:");
            PrintItemsText(headers);

            // Lower-case the tag name before comparing, as in the headers query above.
            var greenAndRed = document.All
                .Where(x => x.TagName.ToLower() == "span"
                    && (x.ClassList.Contains("green") || x.ClassList.Contains("red")));
            Console.WriteLine("Green and Red spans are:");
            PrintItemsText(greenAndRed);

            var thePrinces = document.QuerySelectorAll("*").Where(x => x.TextContent == "the prince");
            Console.WriteLine(thePrinces.Count());
        }
        else
        {
            Console.WriteLine("No html source code returned.");
        }

        // Local function: prints the text of each element.
        void PrintItemsText(IEnumerable<IElement> elements)
        {
            foreach (var item in elements)
            {
                Console.WriteLine(item.TextContent);
            }
        }
    }
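Here we can see that the result of QuerySelectorAll() can be filtered with LINQ's Where method, which is very powerful.
The TagName property is the element's tag name.
There is also document.All: the All property is the collection of every element in the document, and it supports LINQ as well.
(The method also uses a local function.)
With both CSS selectors and LINQ available, extracting elements becomes much easier.
A page can be structured like this:
There are a few concepts here:
A child tag is exactly one level below its parent, whereas a descendant tag is any tag at any level below the parent.
tr is a child of table; tr, th, td and img are all descendants of table.
With AngleSharp, use the .Children property to get child tags, and a CSS selector to get descendant tags.
The .PreviousElementSibling property gives the previous sibling tag, and .NextElementSibling gives the next one.
The .ParentElement property is the parent tag.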
    public static async Task FindDescendantAsync()
    {
        const string url = "http://www.pythonscraping.com/pages/page3.html";
        var html = await GetHtmlSourceCodeAsync(url);
        if (!string.IsNullOrWhiteSpace(html))
        {
            var parser = new HtmlParser();
            var document = await parser.ParseAsync(html);

            // Direct children of the table body.
            var tableChildren = document.QuerySelector("table#giftList > tbody").Children;
            Console.WriteLine("Table's children are:");
            foreach (var child in tableChildren)
            {
                Console.WriteLine(child.LocalName);
            }

            // Every element at any depth below the table body.
            var descendants = document.QuerySelectorAll("table#giftList > tbody *");
            Console.WriteLine("Table's descendants are:");
            foreach (var item in descendants)
            {
                Console.WriteLine(item.LocalName);
            }

            // The next sibling of each row (null for the last row).
            var siblings = document.QuerySelectorAll("table#giftList > tbody > tr")
                .Select(x => x.NextElementSibling);
            Console.WriteLine("Rows' next siblings are:");
            foreach (var item in siblings)
            {
                Console.WriteLine(item?.LocalName);
            }

            // From the image, go up to its parent and then to the parent's previous sibling.
            var parentSibling = document.All
                .SingleOrDefault(x => x.HasAttribute("src") && x.GetAttribute("src") == "../img/gifts/img1.jpg")
                ?.ParentElement.PreviousElementSibling;
            if (parentSibling != null)
            {
                Console.WriteLine($"Parent's previous sibling is: {parentSibling.TextContent}");
            }
        }
        else
        {
            Console.WriteLine("No html source code returned.");
        }
    }
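Result:
"If you have a problem and decide to solve it with regular expressions, you now have two problems."
Here is a site for testing regular expressions: https://www.regexpal.com/
At present AngleSharp lets you find elements with CSS selectors and filter them with LINQ, and of course you can also bring in regular expressions in various ways for more complex searches.
I won't introduce regular expressions here; let's go straight to an example.
I want to find all images on the page whose src value starts with ../img/gifts/img, is followed by a number, and ends with .jpg.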
    public static async Task FindByRegexAsync()
    {
        const string url = "http://www.pythonscraping.com/pages/page3.html";
        var html = await GetHtmlSourceCodeAsync(url);
        if (!string.IsNullOrWhiteSpace(html))
        {
            var parser = new HtmlParser();
            var document = await parser.ParseAsync(html);

            var images = document.QuerySelectorAll("img")
                .Where(x => x.HasAttribute("src")
                    && Regex.Match(x.Attributes["src"].Value, @"\.\.\/img\/gifts/img.*\.jpg").Success);
            foreach (var item in images)
            {
                Console.WriteLine(item.Attributes["src"].Value);
            }

            var elementsWith2Attributes = document.All.Where(x => x.Attributes.Length == 2);
            foreach (var item in elementsWith2Attributes)
            {
                Console.WriteLine(item.LocalName);
                foreach (var attr in item.Attributes)
                {
                    Console.WriteLine($"\t{attr.Name} - {attr.Value}");
                }
            }
        }
        else
        {
            Console.WriteLine("No html source code returned.");
        }
    }
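There is really nothing difficult here.
But the example does show that you can check whether an element has an attribute with HasAttribute("xxx"), access attributes through the .Attributes indexer, and read an attribute's value with .Attributes["xxx"].Value.
If you don't know regular expressions, I believe a bit more LINQ filtering code can achieve much the same result.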
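The pattern in FindByRegexAsync uses .* between img and .jpg, so it accepts any characters there, not only digits. As a sketch of a stricter variant, assuming the goal really is "digits only", the filter below anchors the pattern and uses \d+. GiftImageFilter is a hypothetical helper name, and it works on plain src strings so it can be tried without downloading the page.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GiftImageFilter
{
    // Anchored pattern: "../img/gifts/img", one or more digits, then ".jpg".
    private static readonly Regex GiftImage =
        new Regex(@"^\.\./img/gifts/img\d+\.jpg$");

    // Keeps only the src values that match the pattern above.
    public static List<string> Filter(IEnumerable<string> srcValues) =>
        srcValues.Where(src => GiftImage.IsMatch(src)).ToList();
}
```

In the article's code the same Regex could be applied to x.Attributes["src"].Value inside the Where clause.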
Here are a few applied examples; I'll just paste the code.
Print all hyperlink addresses in a page:
    public static async Task TraversingASingleDomainAsync()
    {
        var httpClient = new HttpClient();
        var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

        var parser = new HtmlParser();
        var document = await parser.ParseAsync(htmlSource);

        var links = document.QuerySelectorAll("a");
        foreach (var link in links)
        {
            Console.WriteLine(link.Attributes["href"]?.Value);
        }
    }
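Find the hyperlinks that satisfy the following conditions (in the code below: inside div#bodyContent, with an href that starts with /wiki/ and contains no colon):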
    public static async Task FindSpecificLinksAsync()
    {
        var httpClient = new HttpClient();
        var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

        var parser = new HtmlParser();
        var document = await parser.ParseAsync(htmlSource);

        var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href")
                && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
        foreach (var link in links)
        {
            Console.WriteLine(link.Attributes["href"]?.Value);
        }
    }
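To make the pattern ^(/wiki/)((?!:).)*$ easier to read, here is a small sketch that applies it to a few plain href strings. The prefix /wiki/ must come first, and the negative lookahead (?!:) rejects any later character that is a colon, which excludes namespace pages such as Category: or File:. WikiLinks is a hypothetical helper name and the hrefs in the usage note are made-up samples.

```csharp
using System.Text.RegularExpressions;

public static class WikiLinks
{
    // Same pattern as in FindSpecificLinksAsync.
    private static readonly Regex ArticleLink = new Regex(@"^(/wiki/)((?!:).)*$");

    // True for hrefs that start with /wiki/ and contain no colon afterwards.
    public static bool IsArticleLink(string href) =>
        href != null && ArticleLink.IsMatch(href);
}
```

Under this pattern, WikiLinks.IsArticleLink("/wiki/Kevin_Bacon") is true, while "/wiki/Category:American_actors", "/w/index.php" and "#cite_note-1" are all rejected.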
Pick a random link on the page, then fetch that page with the same method and repeat, until you stop the program yourself:
    private static async Task<IEnumerable<IElement>> GetLinksAsync(string uri)
    {
        var httpClient = new HttpClient();
        var htmlSource = await httpClient.GetStringAsync($"http://en.wikipedia.org{uri}");

        var parser = new HtmlParser();
        var document = await parser.ParseAsync(htmlSource);

        var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href")
                && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
        return links;
    }

    public static async Task GetRandomNestedLinksAsync()
    {
        var random = new Random();
        var links = (await GetLinksAsync("/wiki/Kevin_Bacon")).ToList();
        while (links.Any())
        {
            var newArticle = links[random.Next(0, links.Count)].Attributes["href"].Value;
            Console.WriteLine(newArticle);
            links = (await GetLinksAsync(newArticle)).ToList();
        }
    }
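First, a few concepts:
Surface web: the part of the internet that search engines can crawl directly.
The opposite of the surface web is the deep web: about 90% of the internet is deep web.
Darknet / dark web / dark internet: an entirely different beast. It is also built on existing network infrastructure, but uses the Tor client with a new protocol that runs on top of HTTP, providing a secure channel for exchanging information. It can be scraped too, but that is beyond the scope of the book.
Compared with the dark web, the deep web is relatively easy to scrape.
Two benefits of crawling an entire site:
Because of the size and depth of a site, many of the collected hyperlinks are likely to be duplicates, so we need to deduplicate them. A set-type collection can be used for that: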
    private static readonly HashSet<string> LinkSet = new HashSet<string>();
    private static readonly HttpClient HttpClient = new HttpClient();
    private static readonly HtmlParser Parser = new HtmlParser();

    public static async Task GetUniqueLinksAsync(string uri = "")
    {
        var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
        var document = await Parser.ParseAsync(htmlSource);

        var links = document.QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href")
                && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success);

        foreach (var link in links)
        {
            if (!LinkSet.Contains(link.Attributes["href"].Value))
            {
                var newPage = link.Attributes["href"].Value;
                Console.WriteLine(newPage);
                LinkSet.Add(newPage);
                await GetUniqueLinksAsync(newPage);
            }
        }
    }
(Watch the recursion depth, otherwise the program can sometimes crash.)
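One way to avoid unbounded recursion is to replace it with a queue and an explicit page limit. The sketch below is an alternative under stated assumptions, not the article's method: BoundedCrawler, getLinksAsync and maxPages are hypothetical names, and getLinksAsync stands in for the HttpClient plus HtmlParser code that returns the hrefs of one page.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedCrawler
{
    // Visits pages breadth-first, starting at `start`, and stops after
    // `maxPages` pages or when no unseen links remain.
    public static async Task<List<string>> CrawlAsync(
        string start,
        Func<string, Task<IEnumerable<string>>> getLinksAsync,
        int maxPages)
    {
        var seen = new HashSet<string> { start };
        var queue = new Queue<string>();
        queue.Enqueue(start);
        var visited = new List<string>();

        while (queue.Count > 0 && visited.Count < maxPages)
        {
            var page = queue.Dequeue();
            visited.Add(page);

            foreach (var link in await getLinksAsync(page))
            {
                // HashSet<T>.Add returns false for duplicates,
                // so no separate Contains check is needed.
                if (seen.Add(link))
                {
                    queue.Enqueue(link);
                }
            }
        }

        return visited;
    }
}
```

Because the pending pages live in a Queue rather than on the call stack, the depth of the site no longer affects stack usage, and maxPages gives the crawl a definite end.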
This example is more complete: it also collects some text from each page and handles exceptions:
    private static readonly HashSet<string> LinkSet = new HashSet<string>();
    private static readonly HttpClient HttpClient = new HttpClient();
    private static readonly HtmlParser Parser = new HtmlParser();

    public static async Task GetLinksWithInfoAsync(string uri = "")
    {
        var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
        var document = await Parser.ParseAsync(htmlSource);

        try
        {
            var title = document.QuerySelector("h1").TextContent;
            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine(title);

            var contentElement = document.QuerySelector("#mw-content-text").QuerySelectorAll("p").FirstOrDefault();
            if (contentElement != null)
            {
                Console.WriteLine(contentElement.TextContent);
            }

            var alink = document.QuerySelector("#ca-edit").QuerySelectorAll("span a")
                .SingleOrDefault(x => x.HasAttribute("href"))?.Attributes["href"].Value;
            Console.WriteLine(alink);
        }
        catch (NullReferenceException)
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine("Cannot find the tag!");
        }

        var links = document.QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href")
                && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success)
            .ToList();

        foreach (var link in links)
        {
            if (!LinkSet.Contains(link.Attributes["href"].Value))
            {
                var newPage = link.Attributes["href"].Value;
                Console.WriteLine(newPage);
                LinkSet.Add(newPage);
                await GetLinksWithInfoAsync(newPage);
            }
        }
    }
First example: follow random external links:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using AngleSharp.Parser.Html;

    namespace WebScrapingWithDotNetCore.Chapter03
    {
        public class CrawlingAcrossInternet
        {
            private static readonly Random Random = new Random();
            private static readonly HttpClient HttpClient = new HttpClient();
            private static readonly HashSet<string> InternalLinks = new HashSet<string>();
            private static readonly HashSet<string> ExternalLinks = new HashSet<string>();
            private static readonly HtmlParser Parser = new HtmlParser();

            public static async Task FollowExternalOnlyAsync(string startingSite)
            {
                var externalLink = await GetRandomExternalLinkAsync(startingSite);
                if (externalLink != null)
                {
                    Console.WriteLine($"External Links is: {externalLink}");
                    await FollowExternalOnlyAsync(externalLink);
                }
                else
                {
                    Console.WriteLine("Random External link is null, Crawling terminated.");
                }
            }

            private static async Task<string> GetRandomExternalLinkAsync(string startingPage)
            {
                try
                {
                    var htmlSource = await HttpClient.GetStringAsync(startingPage);
                    var externalLinks = (await GetExternalLinksAsync(htmlSource, SplitAddress(startingPage)[0])).ToList();
                    if (externalLinks.Any())
                    {
                        return externalLinks[Random.Next(0, externalLinks.Count)];
                    }

                    // No external link on this page: try a random internal page instead.
                    var internalLinks = (await GetInternalLinksAsync(htmlSource, startingPage)).ToList();
                    if (internalLinks.Any())
                    {
                        return await GetRandomExternalLinkAsync(internalLinks[Random.Next(0, internalLinks.Count)]);
                    }

                    return null;
                }
                catch (HttpRequestException e)
                {
                    Console.WriteLine($"Error requesting: {e.Message}");
                    return null;
                }
            }

            private static string[] SplitAddress(string address)
            {
                var addressParts = address.Replace("http://", "").Replace("https://", "").Split("/");
                return addressParts;
            }

            private static async Task<IEnumerable<string>> GetInternalLinksAsync(string htmlSource, string includeUrl)
            {
                var document = await Parser.ParseAsync(htmlSource);
                var links = document.QuerySelectorAll("a")
                    .Where(x => x.HasAttribute("href")
                        && Regex.Match(x.Attributes["href"].Value, $@"^(/|.*{includeUrl})").Success)
                    .Select(x => x.Attributes["href"].Value);

                foreach (var link in links)
                {
                    if (!string.IsNullOrEmpty(link) && !InternalLinks.Contains(link))
                    {
                        InternalLinks.Add(link);
                    }
                }

                return InternalLinks;
            }

            private static async Task<IEnumerable<string>> GetExternalLinksAsync(string htmlSource, string excludeUrl)
            {
                var document = await Parser.ParseAsync(htmlSource);
                var links = document.QuerySelectorAll("a")
                    .Where(x => x.HasAttribute("href")
                        && Regex.Match(x.Attributes["href"].Value, $@"^(http|www)((?!{excludeUrl}).)*$").Success)
                    .Select(x => x.Attributes["href"].Value);

                foreach (var link in links)
                {
                    if (!string.IsNullOrEmpty(link) && !ExternalLinks.Contains(link))
                    {
                        ExternalLinks.Add(link);
                    }
                }

                return ExternalLinks;
            }

            // Collect every external link reachable from a site.
            private static readonly HashSet<string> AllExternalLinks = new HashSet<string>();
            private static readonly HashSet<string> AllInternalLinks = new HashSet<string>();

            public static async Task GetAllExternalLinksAsync(string siteUrl)
            {
                try
                {
                    var htmlSource = await HttpClient.GetStringAsync(siteUrl);
                    var internalLinks = await GetInternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
                    var externalLinks = await GetExternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);

                    foreach (var link in externalLinks)
                    {
                        if (!AllExternalLinks.Contains(link))
                        {
                            AllExternalLinks.Add(link);
                            Console.WriteLine(link);
                        }
                    }

                    foreach (var link in internalLinks)
                    {
                        if (!AllInternalLinks.Contains(link))
                        {
                            Console.WriteLine($"The link is: {link}");
                            AllInternalLinks.Add(link);
                            await GetAllExternalLinksAsync(link);
                        }
                    }
                }
                catch (HttpRequestException e)
                {
                    Console.WriteLine(e);
                    Console.WriteLine($"Request error: {e.Message}");
                }
            }
        }
    }
The program has bugs; feel free to fix them.
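One plausible source of trouble, stated here as an assumption rather than a diagnosis, is that internal links such as /about are relative, and HttpClient.GetStringAsync cannot request a relative address unless a BaseAddress is set. The URL is also interpolated into the regular expressions without escaping. A possible direction is to let System.Uri resolve each href against the page URL and compare hosts. LinkClassifier and the example.com addresses in the usage note are hypothetical.

```csharp
using System;

public static class LinkClassifier
{
    // Resolves href against the page URL. Returns null for empty hrefs
    // and for anything that is not an http or https link (mailto:, javascript:, ...).
    public static Uri Resolve(Uri pageUri, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;
        if (!Uri.TryCreate(pageUri, href, out var absolute)) return null;

        return absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps
            ? absolute
            : null;
    }

    // A link counts as internal when it points at the same host as the page.
    public static bool IsInternal(Uri pageUri, Uri link) =>
        string.Equals(pageUri.Host, link.Host, StringComparison.OrdinalIgnoreCase);
}
```

For a page at http://example.com/pages/a.html, Resolve turns /about into http://example.com/about, which can be passed to HttpClient directly. Note that this simple host comparison treats www.example.com and example.com as different sites.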
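That's it for part one, which mainly used AngleSharp. AngleSharp can do much more than this and is very powerful; see its documentation for details.
Since the next part of the book uses Python's Scrapy, my next article should probably use DotNetSpider, a library developed in China.
The code for this project is at: https://github.com/solenovex/Web-Scraping-With-.NET-Core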