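Recently a project required scraping information from an old site and importing it into a new system. Nobody maintains the old system any more and its data is fairly scattered, while the data we need is actually presented more uniformly on the web pages, so the plan is to extract it by issuing network requests and parsing the pages. Two years ago, at about this time of year, I seem to have done the very same thing. Funny how fate works.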
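In information gathering, the most troublesome part is usually breaking down the different pages and extracting the data, because page design and structure vary enormously. Some pages also have to be requested in a roundabout way (ajax, iframe, and so on), which makes data extraction the most time-consuming and painful step: you have to write a lot of logic code to string the whole flow together. I vaguely remember thinking about this problem in July 2015, two years ago at about this time. Back then I introduced a type, CommonExtractor, to solve it. The overall definition looked like this: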
public class CommonExtractor
{
    public CommonExtractor(PageProcessConfig config)
    {
        PageProcessConfig = config;
    }

    protected PageProcessConfig PageProcessConfig;

    public virtual void Extract(CrawledHtmlDocument document)
    {
        // Skip documents whose URL matches none of the configured patterns
        if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
            return;
        var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
        ExtractData(node, PageProcessConfig);
    }

    protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
    {
        var data = new Dictionary<string, ExtractionResult>();
        foreach (var config in blockConfig.DataExtractionConfigs)
        {
            if (node == null) continue;
            /* Prefix the XPath with '.' so the current node is used as the context */
            var selectedNodes = node.Node.SelectNodes("." + config.XPath);
            var result = new ExtractionResult(config, node.FromUrl);
            if (selectedNodes != null && selectedNodes.Any())
            {
                foreach (var sNode in selectedNodes)
                {
                    if (config.Attribute != null)
                        result.Fill(sNode.Attributes[config.Attribute].Value);
                    else
                        result.Fill(sNode.InnerText);
                }
                data[config.Key] = result;
            }
            else
            {
                data[config.Key] = null;
            }
        }
        if (DataExtracted != null)
        {
            var args = new DataExtractedEventArgs(data, node.FromUrl);
            DataExtracted(this, args);
        }
        return data;
    }

    public EventHandler<DataExtractedEventArgs> DataExtracted;
}
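The code is a bit messy (Abot was being used for crawling at the time), but the intent is clear enough: extract useful information from an HTML file, with a configuration specifying how to extract it. The main problem with this approach is that it cannot cope with complex structures. Handling a specific structure requires a new configuration and a new flow, and that new flow is not very reusable.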
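To cope with real-world complexity, the most basic processing must be kept simple. Taking inspiration from the earlier code, what we really want from data extraction is:
From this comes the most basic interface definition: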
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    object Process(object source);
}
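With the interface defined above, a sufficiently large implementation of IContentProcessor could extract data from any HTML page. That would make it less and less reusable and harder and harder to maintain, so we would rather keep each implementation small. The smaller an implementation is, though, the less it can do, so to meet complex real-world requirements these interfaces must be composable. The interface therefore needs a new element: sub-processors.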
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    object Process(object source);

    /// <summary>
    /// Order of this processor; smaller values run first
    /// </summary>
    int Order { get; }

    /// <summary>
    /// Sub-processors
    /// </summary>
    IList<IContentProcessor> SubProcessors { get; }
}
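With this in place the Processors can cooperate. Their nesting, together with the Order property, determines the order in which they run. The whole flow also behaves like a pipeline: the result of one Processor can serve as the source for the next Processor.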
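The article does not show the base class that drives this pipeline, so the following is only a minimal sketch of how such chaining might work. `IPipelineStep` and `DelegateStep` are hypothetical names invented for illustration, and the assumption is that a step runs its sub-steps in ascending `Order` before applying its own transformation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's interface, trimmed to what the sketch needs.
public interface IPipelineStep
{
    int Order { get; }
    IList<IPipelineStep> SubSteps { get; }
    object Process(object source);
}

// A step that pipes the source through its sub-steps, then applies its own transformation.
public class DelegateStep : IPipelineStep
{
    private readonly Func<object, object> _transform;

    public DelegateStep(int order, Func<object, object> transform)
    {
        Order = order;
        _transform = transform;
    }

    public int Order { get; }

    public IList<IPipelineStep> SubSteps { get; } = new List<IPipelineStep>();

    public object Process(object source)
    {
        // Sub-steps run in ascending Order; each output becomes the next input.
        var current = source;
        foreach (var step in SubSteps.OrderBy(s => s.Order))
            current = step.Process(current);
        return _transform(current);
    }
}
```

Because `OrderBy` sorts by `Order` rather than by insertion position, sub-steps can be added in any order, which is convenient when they come from a configuration file.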
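This makes the processing flow composable, but so far the results are not, because complex structures still cannot be handled. To solve this, IContentCollector is introduced. It inherits from IContentProcessor but adds one extra requirement: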
public interface IContentCollector : IContentProcessor
{
    /// <summary>
    /// Key under which the value gathered by this collector is stored
    /// </summary>
    string Key { get; }
}
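This interface requires a Key that identifies the result. With it, a Dictionary<string,object> can hold a complex structure, because the value of a dictionary entry can itself be a Dictionary<string,object>. If JSON is then used for serialization, it is very easy to deserialize the result into a complex class.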
As for why this interface inherits from IContentProcessor: it keeps the node types uniform, which makes it easy to build the whole processing flow from configuration.
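As a rough illustration of the dictionary-to-class round trip described above, here is a minimal sketch. It uses `System.Text.Json` only to stay dependency-free (the article itself uses Json.NET's `JsonConvert`), and the `Article` and `Author` classes and the `CollectorResultSketch` helper are hypothetical names made up for the example.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical target model; the property names match the collector keys.
public class Author
{
    public string Name { get; set; }
}

public class Article
{
    public string Title { get; set; }
    public Author Author { get; set; }
}

public static class CollectorResultSketch
{
    // Serializes the nested dictionary to JSON, then reads it back as a typed object.
    public static Article ToArticle(Dictionary<string, object> collected)
    {
        var json = JsonSerializer.Serialize(collected);
        return JsonSerializer.Deserialize<Article>(json);
    }
}
```

A nested `Dictionary<string, object>` stored under the key `Author` becomes the nested `Author` object, so the shape of the collector tree decides the shape of the class.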
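As the design above shows, the whole processing flow is really a tree with a very regular structure. That makes configuration feasible. A ContentProcessorOptions type is used here to describe the type of each Processor node and the information needed to initialize it. It is defined as follows: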
public class ContentProcessorOptions
{
    /// <summary>
    /// Parameters used to construct the Processor
    /// </summary>
    public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();

    /// <summary>
    /// Type information of the Processor
    /// </summary>
    public string ProcessorType { get; set; }

    /// <summary>
    /// A single sub-Processor type, used to initialize Children quickly and reduce nesting
    /// </summary>
    public string SubProcessorType { get; set; }

    /// <summary>
    /// Child configurations
    /// </summary>
    public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
}
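The SubProcessorType property in the options is a shortcut for initializing a ContentCollector that has only one child processing node. It reduces the nesting depth of the configuration and keeps the configuration file readable. The following method shows how a Processor is initialized from a ContentProcessorOptions. It uses reflection, but since initialization does not happen often, this is not much of a problem.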
public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
{
    Type instanceType = null;
    try
    {
        instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
    }
    catch
    {
        // Fall back to searching the loaded assemblies by full type name
        foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            if (assembly.IsDynamic) continue;
            instanceType = assembly.GetExportedTypes()
                .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
            if (instanceType != null) break;
        }
    }
    if (instanceType == null) return null;

    var instance = Activator.CreateInstance(instanceType);

    // Copy each configured value onto the property of the same name
    foreach (var property in contentProcessorOptions.Properties)
    {
        var instanceProperty = instance.GetType().GetProperty(property.Key);
        if (instanceProperty == null) continue;
        var propertyType = instanceProperty.PropertyType;
        var sourceValue = property.Value.ToString();
        var dValue = sourceValue.Convert(propertyType);
        instanceProperty.SetValue(instance, dValue);
    }

    var processorInstance = (IContentProcessor) instance;

    // Shortcut: build the single sub-processor from the same Properties
    if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
    {
        var quickOptions = new ContentProcessorOptions
        {
            ProcessorType = contentProcessorOptions.SubProcessorType,
            Properties = contentProcessorOptions.Properties
        };
        var quickProcessor = BuildContentProcessor(quickOptions);
        processorInstance.SubProcessors.Add(quickProcessor);
    }

    foreach (var processorOption in contentProcessorOptions.Children)
    {
        var processor = BuildContentProcessor(processorOption);
        processorInstance.SubProcessors.Add(processor);
    }
    return processorInstance;
}
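The `Convert(propertyType)` extension called in the method above is not shown in the article. A minimal sketch of what such a helper might look like follows; `ConvertTo` and `StringConversionSketch` are hypothetical names, and the handling of enums and nullable types is an assumption based on the property types that appear in the configurations (`Order`, `NodeIndex`, `ValueProviderType`).

```csharp
using System;
using System.Globalization;

public static class StringConversionSketch
{
    // Converts a configuration string to the given property type (illustrative only).
    public static object ConvertTo(this string value, Type targetType)
    {
        // Unwrap Nullable<T> so that "3" can be assigned to an int? property.
        var underlying = Nullable.GetUnderlyingType(targetType);
        if (underlying != null)
        {
            if (string.IsNullOrWhiteSpace(value)) return null;
            targetType = underlying;
        }

        if (targetType == typeof(string)) return value;

        // Enum members are given by name in the configuration, e.g. "InnerText".
        if (targetType.IsEnum) return Enum.Parse(targetType, value, true);

        // Covers int, bool, double and the other IConvertible primitives.
        return Convert.ChangeType(value, targetType, CultureInfo.InvariantCulture);
    }
}
```

Going through strings keeps the builder simple, because every configuration value can be treated as text regardless of how the configuration source typed it.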
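An example makes the issue clear. Suppose n p tags are extracted from an HTML document and returned as a string[], which is passed as the source to the next processing node. That node handles each string correctly, but if it too returns a string[] for every string, each of those arrays should be joined by a Connector. Otherwise the result becomes a two-dimensional, three-dimensional, or even higher-dimensional array, and the logic of every node turns complicated and unpredictable. Collections therefore need to converge to a single dimension.
Because of the .NET Core configuration system in use here, a child item of a Dictionary<string,object> cannot be set to a collection.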
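The one-dimension rule for collections described above can be sketched roughly as follows. This is not the article's BaseContentProcessor, which is not shown; `DimensionSketch`, `ProcessEach`, and `Collapse` are hypothetical names, and the sketch assumes that a collection produced for a single element is joined into one string with the connector.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class DimensionSketch
{
    // Applies perElement to every item of source and keeps the result one-dimensional.
    public static object ProcessEach(object source, Func<object, object> perElement, string connector)
    {
        // A string is enumerable too, but it is treated as a single element.
        if (source is string || !(source is IEnumerable items))
            return Collapse(perElement(source), connector);

        return items.Cast<object>()
            .Select(item => Collapse(perElement(item), connector))
            .ToList();
    }

    // Joins a nested collection into one string; leaves scalar values untouched.
    private static object Collapse(object value, string connector)
    {
        if (value is string || !(value is IEnumerable nested)) return value;
        return string.Join(connector, nested.Cast<object>().Select(o => o?.ToString()));
    }
}
```

With this shape, a node that receives a list always hands a flat list to the next node, however many values it extracts per element.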
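This processor downloads a piece of HTML text from the network and passes it to the next processor as the source. The request URL can be specified directly, or the source passed on from the previous node can be used as the URL. The implementation is as follows: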
public class HttpRequestContentProcessor : BaseContentProcessor
{
    public bool UseUrlWhenSourceIsNull { get; set; } = true;

    public string Url { get; set; }

    public bool IgnoreBadUri { get; set; }

    protected override object ProcessElement(object element)
    {
        if (element == null) return null;
        // Reject the element when it is NOT a well-formed absolute URI
        if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
        {
            if (IgnoreBadUri) return null;
            throw new FormatException($"The requested URL '{element}' is not well formed");
        }
        return DownloadHtml(element.ToString());
    }

    public override object Process(object source)
    {
        // With no incoming source, fall back to the configured Url
        if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
            return DownloadHtml(Url);
        return base.Process(source);
    }

    private static async Task<string> DownloadHtmlAsync(string url)
    {
        using (var client = new HttpClient())
        {
            var result = await client.GetAsync(url);
            var html = await result.Content.ReadAsStringAsync();
            return html;
        }
    }

    private string DownloadHtml(string url)
    {
        return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
    }
}
The test is as follows:
[TestMethod]
public void HttpRequestContentProcessorTest()
{
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
    var result = processor.Process(null);
    Assert.IsTrue(result.ToString().Contains("baidu"));
}
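This processor takes an XPath and uses it to select the specified information. ValueProviderType and ValueProviderKey specify how the data is read from a node. The implementation is as follows: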
public class XpathContentProcessor : BaseContentProcessor
{
    /// <summary>
    /// XPath of the elements to select
    /// </summary>
    public string Xpath { get; set; }

    /// <summary>
    /// Key for the value provider
    /// </summary>
    public string ValueProviderKey { get; set; }

    /// <summary>
    /// Type of the value provider
    /// </summary>
    public XpathNodeValueProviderType ValueProviderType { get; set; }

    /// <summary>
    /// Index of the node
    /// </summary>
    public int? NodeIndex { get; set; }

    /// <summary>
    /// Separator used when several node values are joined into one string
    /// </summary>
    public string ResultConnector { get; set; } = Constants.DefaultResultConnector;

    public override object Process(object source)
    {
        var result = base.Process(source);
        return DeterminAndReturn(result);
    }

    protected override object ProcessElement(object element)
    {
        var result = base.ProcessElement(element);
        if (result == null) return null;
        var str = result.ToString();
        return ProcessWithXpath(str, Xpath, false);
    }

    protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
    {
        if (documentText == null) return null;
        var document = new HtmlDocument();
        document.LoadHtml(documentText);
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return null;

        if (returnArray && nodes.Count > 1)
        {
            // Return every non-empty node value as a list
            var result = new List<string>();
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    result.Add(nodeResult);
                }
            }
            return result;
        }
        else
        {
            // Join every non-empty node value into a single string
            var result = string.Empty;
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    if (result.IsNullOrWhiteSpace())
                        result = nodeResult;
                    else
                        result = $"{result}{ResultConnector}{nodeResult}";
                }
            }
            return result;
        }
    }
}
Combining this Processor with the previous Processor, let's grab the title of the Baidu home page:
[TestMethod]
public void XpathContentProcessorTest()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
    xpathProcessor.SubProcessors.Add(processor);
    var result = xpathProcessor.Process(null);
    Assert.AreEqual("百度一下,你就知道", result.ToString());
}
The main job of a Collector is to handle complex output models. A Collector for a complex data structure is implemented as follows:
public class ComplexContentCollector : BaseContentCollector
{
    /// <summary>
    /// ComplexContentCollector needs its child extractors to provide a Key,
    /// so sub-processors that are plain Processors are ignored
    /// </summary>
    /// <param name="source"></param>
    /// <returns></returns>
    protected override object ProcessElement(object source)
    {
        var result = new Dictionary<string, object>();
        foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
        {
            result[contentCollector.Key] = contentCollector.Process(source);
        }
        return result;
    }
}
The corresponding test is as follows:
[TestMethod]
public void ComplexContentCollectorTest2()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var xpathProcessor2 = new XpathContentProcessor
    {
        Xpath = "//p[@id=\"cp\"]",
        ValueProviderType = XpathNodeValueProviderType.InnerText,
        Order = 1
    };
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
    var complexCollector = new ComplexContentCollector();
    var baseCollector = new BaseContentCollector();
    baseCollector.SubProcessors.Add(processor);
    baseCollector.SubProcessors.Add(complexCollector);

    var titleCollector = new BaseContentCollector{Key = "Title"};
    titleCollector.SubProcessors.Add(xpathProcessor);

    var footerCollector = new BaseContentCollector {Key = "Footer"};
    footerCollector.SubProcessors.Add(xpathProcessor2);
    footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});

    complexCollector.SubProcessors.Add(titleCollector);
    complexCollector.SubProcessors.Add(footerCollector);

    var result = (Dictionary<string,object>)baseCollector.Process(null);
    Assert.AreEqual("百度一下,你就知道", result["Title"]);
    Assert.AreEqual("©2014 Baidu 使用百度前必讀 京ICP證030173號", result["Footer"]);
}
Now, the following code is used for testing:
public void RunConfig(string section)
{
    var builder = new ConfigurationBuilder()
        .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
        .AddJsonFile("appsettings1.json");
    var configurationRoot = builder.Build();
    var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
    var processor = Helper.BuildContentProcessor(options);
    var result = processor.Process(null);
    var json = JsonConvert.SerializeObject(result);
    System.Console.WriteLine(json);
}
The configuration used:
"newsListOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": { "Url": "https://www.cnblogs.com/news/", "Order": "0" }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class=\"post_item\"]",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": { "Order": "2" },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class=\"article_comment\"]",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": { "RegexPartten": "[0-9]+", "Order": "1" }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class=\"digg\"]//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    }
  ]
},
The result obtained:
[
  {
    "Url": "//news.cnblogs.com/n/574269/",
    "CommentCount": "1",
    "LikeCount": "3",
    "Title": "劉強東:京東13年了,真正懂咱們的人仍是不多"
  },
  {
    "Url": "//news.cnblogs.com/n/574267/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "聯想也開始大談人工智能,不過它最迫切的目標是賣更多PC"
  },
  {
    "Url": "//news.cnblogs.com/n/574266/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "除了小米1幾乎都支持 - 小米MIUI9升級機型一覽"
  },
  ...
]
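This involves computation and collection operations, and the collection elements are dictionaries, so two new Processors are needed: one for filtering and one for mapping.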
public class ListItemPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }

    /// <summary>
    /// Full name of the type the values are compared as
    /// </summary>
    public string OperatorTypeFullName { get; set; }

    /// <summary>
    /// Value to compare against
    /// </summary>
    public string OperatorValue { get; set; }

    /// <summary>
    /// Index
    /// </summary>
    public int Index { get; set; }

    /// <summary>
    /// Mode
    /// </summary>
    public ListItemPickMode PickMode { get; set; }

    /// <summary>
    /// Operator
    /// </summary>
    public ListItemPickOperator PickOperator { get; set; }

    public override object Process(object source)
    {
        var preResult = base.Process(source);
        if (!Helper.IsEnumerableExceptString(preResult))
        {
            // Check and cast the same object (the original tested source but cast preResult)
            if (preResult is Dictionary<string, object> dictionary)
                return dictionary[Key];
            return preResult;
        }
        return Pick((IEnumerable) preResult);
    }

    private object Pick(IEnumerable source)
    {
        var objCollection = source.Cast<object>().ToList();
        if (objCollection.Count == 0) return objCollection;

        // Map every item to the comparable value it is ordered by
        var item = objCollection[0];
        var compareDictionary = new Dictionary<object, IComparable>();
        if (item is IDictionary)
        {
            foreach (Dictionary<string, object> dic in objCollection)
            {
                var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(dic, key);
            }
        }
        else
        {
            foreach (var objItem in objCollection)
            {
                var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(objItem, key);
            }
        }

        IEnumerable<object> result;
        switch (PickOperator)
        {
            case ListItemPickOperator.OrderDesc:
                result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                break;
            default:
                throw new NotSupportedException();
        }

        switch (PickMode)
        {
            case ListItemPickMode.First:
                return result.FirstOrDefault();
            case ListItemPickMode.Last:
                return result.LastOrDefault();
            case ListItemPickMode.Index:
                return result.Skip(Index - 1).Take(1).FirstOrDefault();
            default:
                throw new NotImplementedException();
        }
    }

    private Type ResolveType(string typeName)
    {
        if (typeName == typeof(Int32).FullName) return typeof(Int32);
        throw new NotSupportedException();
    }

    public enum ListItemPickMode
    {
        First,
        Last,
        Index
    }

    public enum ListItemPickOperator
    {
        LittleThan,
        GreaterThan,
        Order,
        OrderDesc
    }
}
Quite a lot of reflection is used here, but performance is not a concern for now.
public class DictionaryPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }

    protected override object ProcessElement(object element)
    {
        if (element is IDictionary)
        {
            return (element as IDictionary)[Key];
        }
        return element;
    }
}
This Processor picks a single entry out of a dictionary.
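Taken together, the two Processors amount to "order the rows by one key, take the first, then read one field from it". As a point of comparison only, the same idea can be sketched directly with LINQ. `PickSketch` and `PickFieldOfMax` are hypothetical names, and the sketch assumes the ordering values are integers stored as strings, as in the CommentCount field of the sample output.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PickSketch
{
    // Returns the resultKey field of the row whose orderKey value is largest, or null for no rows.
    public static object PickFieldOfMax(
        IEnumerable<Dictionary<string, object>> rows, string orderKey, string resultKey)
    {
        var top = rows
            .OrderByDescending(row => int.Parse(row[orderKey].ToString()))
            .FirstOrDefault();
        return top?[resultKey];
    }
}
```

The Processor version trades this brevity for the ability to express the same steps in a configuration file.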
The configuration used:
"mostCommentsOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": { "Url": "https://www.cnblogs.com/news/", "Order": "0" }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class=\"post_item\"]",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": { "Order": "2" },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class=\"article_comment\"]",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": { "RegexPartten": "[0-9]+", "Order": "1" }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class=\"digg\"]//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class=\"titlelnk\"]",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.ListItemPickContentProcessor",
      "Properties": {
        "OperatorTypeFullName": "System.Int32",
        "Key": "CommentCount",
        "PickMode": "First",
        "PickOperator": "OrderDesc",
        "Order": "4"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.DictionaryPickContentProcessor",
      "Properties": { "Order": "5", "Key": "Url" }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.FormatterContentProcessor",
      "Properties": { "Formatter": "https:{0}", "Order": "6" }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": { "Order": "7" }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@id=\"news_content\"]//p[2]",
        "Order": "8",
        "ValueProviderType": "InnerHtml",
        "OutputToArray": false
      }
    }
  ]
}
The result obtained:
昨日,京東忽然通知平臺商戶,將關閉每天快遞服務接口。這意味着京東平臺上的商戶之後不能再用每天快遞發貨了。
Then there is the question of Processor scheduling (depth-first, breadth-first, and so on). Writing code is still a lot of fun, isn't it?