Writing a Configurable Web Page Information Extraction Component

Introduction

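A recent project required scraping information from an old site and importing it into a new system. Nobody maintains the old system any more and its data is scattered, yet the data to extract is presented much more uniformly on its web pages, so the plan is to issue HTTP requests and parse the returned pages. About this time two years ago I seem to have done the very same thing; fate is a funny thing.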

Initial idea

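When collecting information, the most troublesome part is usually decomposing the different pages and extracting the data, because page design and structure vary enormously. On top of that, some pages can only be reached in a roundabout way (ajax, iframes and so on), which makes data extraction the most time-consuming and painful step, because you have to write a lot of logic to string the whole flow together. I vaguely remember thinking about this problem in July 2015, about this time two years ago. Back then I introduced a type called CommonExtractor to solve it. The overall definition looked like this: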

public class CommonExtractor
    {
        public CommonExtractor(PageProcessConfig config)
        {
            PageProcessConfig = config;
        }

        protected PageProcessConfig PageProcessConfig;

        public virtual void Extract(CrawledHtmlDocument document)
        {
            if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
                return;
            var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
            ExtractData(node, PageProcessConfig);
        }

        protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
        {

            var data = new Dictionary<string, ExtractionResult>();
            foreach (var config in blockConfig.DataExtractionConfigs)
            {
                if (node == null)
                    continue;
                /* Prefix the XPath with '.' so the current node is used as the context */
                var selectedNodes = node.Node.SelectNodes("." + config.XPath);
                var result = new ExtractionResult(config, node.FromUrl);
                if (selectedNodes != null && selectedNodes.Any())
                {
                    foreach (var sNode in selectedNodes)
                    {
                        if (config.Attribute != null)
                            result.Fill(sNode.Attributes[config.Attribute].Value);
                        else
                            result.Fill(sNode.InnerText);
                    }
                    data[config.Key] = result;
                }
                else { data[config.Key] = null; }
            }

            if (DataExtracted != null)
            {
                var args = new DataExtractedEventArgs(data, node.FromUrl);
                DataExtracted(this, args);
            }

            return data;
        }

        public EventHandler<DataExtractedEventArgs> DataExtracted;
    }

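The code is a bit messy (I was using Abot for crawling at the time), but the intent is clear enough: extract useful information from an HTML file, with a configuration specifying how to extract it. The main problem with this approach is that it cannot cope with complex structures. Handling a particular structure requires new configuration and a new flow, and that new flow is not very reusable.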

Design

A simple start

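To cope with real-world complexity, the most basic operation has to be designed to be simple. Taking inspiration from the earlier code, what we actually want from data extraction is: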

  • We give the program an HTML document
  • The program returns a value to us

This gives the most basic interface definition:

public interface IContentProcessor
    {
        /// <summary>
        /// Processes the content
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        object Process(object source);
    }

Composability

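With the interface defined above, a sufficiently large implementation of IContentProcessor could extract data from any HTML page, but that would make it less and less reusable and harder and harder to maintain. So we would rather keep each implementation small. The smaller an implementation is, though, the less it does, so to meet complex real-world requirements these interfaces must be composable. The interface therefore gains new elements: an execution order and sub-processors.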

public interface IContentProcessor
    {
        /// <summary>
        /// Processes the content
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        object Process(object source);

        /// <summary>
        /// Order of this processor; lower values run first
        /// </summary>
        int Order { get; }

        /// <summary>
        /// Sub-processors
        /// </summary>
        IList<IContentProcessor> SubProcessors { get; }
    }

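With this in place, Processors can cooperate. Their nesting relationship and the Order property together determine the execution order. The whole flow also behaves like a pipeline: the result of one Processor can serve as the source for the next.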
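The article does not show the base class that drives this pipeline, so the following is only a minimal sketch of one way the nesting and Order could be interpreted. SketchProcessor and the three toy processors are hypothetical names, not the author's BaseContentProcessor. The sketch assumes that sub-processors run first in ascending Order, each feeding the next, and that collections (but not strings) are processed element by element.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public interface IContentProcessor
{
    object Process(object source);
    int Order { get; }
    IList<IContentProcessor> SubProcessors { get; }
}

// Hypothetical base class, assumed for illustration only.
public abstract class SketchProcessor : IContentProcessor
{
    public int Order { get; set; }

    public IList<IContentProcessor> SubProcessors { get; } = new List<IContentProcessor>();

    public virtual object Process(object source)
    {
        // Run the sub-processors in ascending Order; each result feeds the next one.
        var current = source;
        foreach (var sub in SubProcessors.OrderBy(p => p.Order))
            current = sub.Process(current);

        // Apply this node's own logic; collections are mapped element by element.
        if (current is IEnumerable items && !(current is string))
            return items.Cast<object>().Select(e => ProcessElement(e)).ToList();
        return ProcessElement(current);
    }

    protected abstract object ProcessElement(object element);
}

// Toy processors (hypothetical) that make the data flow visible.
public sealed class SplitProcessor : SketchProcessor
{
    protected override object ProcessElement(object element) =>
        element?.ToString().Split(',').ToList();
}

public sealed class TrimProcessor : SketchProcessor
{
    protected override object ProcessElement(object element) =>
        element?.ToString().Trim();
}

public sealed class UpperProcessor : SketchProcessor
{
    protected override object ProcessElement(object element) =>
        element?.ToString().ToUpperInvariant();
}
```

Under these assumptions, an UpperProcessor whose SubProcessors are a SplitProcessor (Order 0) and a TrimProcessor (Order 1) splits the input first, trims each piece, and only then upper-cases each piece, regardless of the order in which the sub-processors were added.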

Composability of results

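Although the processing flow is now composable, the results are still not, so complex structures cannot be handled yet. To solve this, IContentCollector is introduced. It inherits from IContentProcessor but adds one extra requirement: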

public interface IContentCollector : IContentProcessor
    {
        /// <summary>
        /// Key under which the value gathered by this collector is stored
        /// </summary>
        string Key { get; }
    }

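The interface requires a Key that identifies the result. That lets us manage a complex structure with a Dictionary<string,object>. Because the value of a dictionary entry can itself be a Dictionary<string,object>, the result can easily be serialized to JSON and then deserialized into a complex class.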
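A minimal sketch of that round trip: a nested dictionary, as a tree of collectors might produce it, is serialized to JSON and read back as a typed object. The Article and Author classes and the key names are made up for illustration. The sketch uses System.Text.Json so that it needs no extra package; the article itself uses Json.NET (JsonConvert), which should work the same way for this case.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical target model; property names must match the collector keys.
public class Author
{
    public string Name { get; set; }
}

public class Article
{
    public string Title { get; set; }
    public Author Author { get; set; }
}

public static class CollectedResultSketch
{
    // Converts the dictionary built by the collectors into a typed object
    // by going through JSON.
    public static Article ToArticle(Dictionary<string, object> collected)
    {
        string json = JsonSerializer.Serialize(collected);
        return JsonSerializer.Deserialize<Article>(json);
    }
}
```

Calling ToArticle with a dictionary that holds a "Title" string and an "Author" entry whose value is another dictionary with a "Name" key yields an Article with its nested Author populated.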

As for why this interface inherits from IContentProcessor: it keeps the node types uniform, which makes it easy to build the whole processing flow from configuration.

Configuration

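As the design above shows, the whole processing flow is really a tree with a very regular structure. That makes configuration feasible. A ContentProcessorOptions type represents each Processor node's type and the initialization data it needs. It is defined as follows: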

public class ContentProcessorOptions
    {
        /// <summary>
        /// Property values used to construct the Processor
        /// </summary>
        public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();

        /// <summary>
        /// Type name of the Processor
        /// </summary>
        public string ProcessorType { get; set; }

        /// <summary>
        /// Specifies a single sub-Processor as a shortcut for populating Children, which reduces nesting.
        /// </summary>
        public string SubProcessorType { get; set; }

        /// <summary>
        /// Child configurations
        /// </summary>
        public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
    }

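The SubProcessorType property in the options quickly initializes a ContentCollector that has only one child processing node. This reduces the nesting depth of the configuration and makes the configuration file clearer. The following method shows how a Processor is initialized from a ContentProcessorOptions. It uses reflection, but since initialization does not happen often, that is not much of a problem.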

public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
        {
            Type instanceType = null;
            try
            {
                instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
            }
            catch
            {
                foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
                {
                    if (assembly.IsDynamic) continue;
                    instanceType = assembly.GetExportedTypes()
                        .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
                    if (instanceType != null) break;
                }
            }

            if (instanceType == null) return null;

            var instance = Activator.CreateInstance(instanceType);
            foreach (var property in contentProcessorOptions.Properties)
            {
                var instanceProperty = instance.GetType().GetProperty(property.Key);
                if (instanceProperty == null) continue;
                var propertyType = instanceProperty.PropertyType;
                var sourceValue = property.Value.ToString();
                var dValue = sourceValue.Convert(propertyType);
                instanceProperty.SetValue(instance, dValue);
            }
            var processorInstance = (IContentProcessor) instance;
            if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
            {
                var quickOptions = new ContentProcessorOptions
                {
                    ProcessorType = contentProcessorOptions.SubProcessorType,
                    Properties = contentProcessorOptions.Properties
                };
                var quickProcessor = BuildContentProcessor(quickOptions);
                processorInstance.SubProcessors.Add(quickProcessor);
            }
            foreach (var processorOption in contentProcessorOptions.Children)
            {
                var processor = BuildContentProcessor(processorOption);
                processorInstance.SubProcessors.Add(processor);
            }
            return processorInstance;
        }
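The method above relies on a string extension, Convert(Type), that the article does not show. The following is a minimal sketch of what such a helper might look like, assuming it only has to handle the kinds of properties seen in this article (strings, numbers, booleans, enums and nullable value types). It is named ConvertTo here to make clear that it is a hypothetical stand-in, not the author's code.

```csharp
using System;
using System.Globalization;

public static class StringConvertSketch
{
    // Converts a configuration string to the type of the target property.
    public static object ConvertTo(this string value, Type targetType)
    {
        if (targetType == null) throw new ArgumentNullException(nameof(targetType));

        // Nullable<T>: blank input becomes null, otherwise convert to T.
        var underlying = Nullable.GetUnderlyingType(targetType);
        if (underlying != null)
        {
            if (string.IsNullOrWhiteSpace(value)) return null;
            targetType = underlying;
        }

        if (targetType == typeof(string)) return value;

        // Enum properties such as ValueProviderType are written by name in the config.
        if (targetType.IsEnum) return Enum.Parse(targetType, value, ignoreCase: true);

        // Primitives such as int and bool.
        return System.Convert.ChangeType(value, targetType, CultureInfo.InvariantCulture);
    }
}
```

With a helper like this, "1" becomes the int Order, "true" becomes a bool, and "OuterHtml" becomes the matching enum member, which is why every value under Properties in the configuration can be written as a plain string.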

A few constraints

Collections need to converge

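An example illustrates the problem. Suppose n p tags are extracted from an HTML document, a string[] is returned, and it is passed as the source to the next processing node. The next node processes each string correctly, but if that node also returns a string[] for each string, that string[] should be joined into a single string by a connector. Otherwise the result becomes an array of two, three or even more dimensions, and the logic of every node becomes more complex and unpredictable. So collections need to converge to one dimension.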
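The convergence rule can be sketched as a small helper that keeps the result one-dimensional: whenever processing a single element yields a collection, that collection is joined with a connector string. ConvergeSketch and its method are hypothetical names used only for this illustration.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class ConvergeSketch
{
    // Flattens per-element results to one dimension: an inner collection
    // is joined into a single string instead of becoming a nested array.
    public static List<string> Converge(IEnumerable<object> results, string connector)
    {
        var flat = new List<string>();
        foreach (var result in results)
        {
            if (result == null) continue;

            if (result is IEnumerable inner && !(result is string))
            {
                var parts = inner.Cast<object>()
                    .Where(item => item != null)
                    .Select(item => item.ToString());
                flat.Add(string.Join(connector, parts));
            }
            else
            {
                flat.Add(result.ToString());
            }
        }
        return flat;
    }
}
```

For instance, a first element that produced the two strings "a" and "b" and a second element that produced "c" converge, with ";" as the connector, to a list of two strings, "a;b" and "c".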

Properties in the configuration file do not support complex structures

Because the .NET Core configuration system is used at present, an entry in a Dictionary<string,object> cannot be set to a collection.

Some implementations

Implementing and testing Processors

HttpRequestContentProcessor

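This processor downloads a piece of HTML text from the network and passes the text on as the source for the next processor. It can either request a URL specified on the processor itself or use the source passed in from the previous node as the URL. The implementation is as follows: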

public class HttpRequestContentProcessor : BaseContentProcessor
    {
        public bool UseUrlWhenSourceIsNull { get; set; } = true;

        public string Url { get; set; }

        public bool IgnoreBadUri { get; set; }

        protected override object ProcessElement(object element)
        {
            if (element == null) return null;
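            // Reject the element only when it is NOT a well-formed absolute URI.
            if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
            {
                if (IgnoreBadUri) return null;
                throw new FormatException($"The requested address {element} is not a well-formed absolute URI");
            }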
            return DownloadHtml(element.ToString());
        }

        public override object Process(object source)
        {
            if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
                return DownloadHtml(Url);
            return base.Process(source);
        }

        private static async Task<string> DownloadHtmlAsync(string url)
        {
            using (var client = new HttpClient())
            {
                var result = await client.GetAsync(url);
                var html = await result.Content.ReadAsStringAsync();
                return html;
            }
        }

        private string DownloadHtml(string url)
        {
            return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
        }
    }
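AsyncHelper.Synchronize is not shown in the article. One common way to write such a helper, given here only as a sketch under that assumption, is to run the asynchronous function on the thread pool and block until it completes. Running it via Task.Run means the continuation does not need the caller's synchronization context, which is the usual source of deadlocks when blocking on async code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the AsyncHelper used by HttpRequestContentProcessor.
public static class AsyncHelperSketch
{
    // Runs an async function to completion and returns its result synchronously.
    public static T Synchronize<T>(Func<Task<T>> asyncFunc)
    {
        if (asyncFunc == null) throw new ArgumentNullException(nameof(asyncFunc));

        // GetAwaiter().GetResult() rethrows the original exception
        // instead of wrapping it in an AggregateException.
        return Task.Run(() => asyncFunc()).GetAwaiter().GetResult();
    }
}
```

Blocking like this keeps the Process method synchronous at the cost of tying up a thread per request, which seems acceptable for a small extraction job.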

The test is as follows:

[TestMethod]
        public void HttpRequestContentProcessorTest()
        {
            var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
            var result = processor.Process(null);
            Assert.IsTrue(result.ToString().Contains("baidu"));
        }

XpathContentProcessor

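This processor takes an XPath expression and retrieves the specified information. ValueProviderType and ValueProviderKey specify how to read data from a node. The implementation is as follows: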

public class XpathContentProcessor : BaseContentProcessor
    {
        /// <summary>
        /// XPath of the elements to select
        /// </summary>
        public string Xpath { get; set; }

        /// <summary>
        /// Key used by the value provider
        /// </summary>
        public string ValueProviderKey { get; set; }

        /// <summary>
        /// Type of the value provider
        /// </summary>
        public XpathNodeValueProviderType ValueProviderType { get; set; }

        /// <summary>
        /// Index of the node
        /// </summary>
        public int? NodeIndex { get; set; }

        /// <summary>
        /// Connector used to join the values of multiple matched nodes
        /// </summary>
        public string ResultConnector { get; set; } = Constants.DefaultResultConnector;

        public override object Process(object source)
        {
            var result = base.Process(source);
            return DeterminAndReturn(result);
        }

        protected override object ProcessElement(object element)
        {
            var result = base.ProcessElement(element);
            if (result == null) return null;

            var str = result.ToString();
            
            return ProcessWithXpath(str, Xpath, false);
        }

        protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
        {
            if (documentText == null) return null;

            var document = new HtmlDocument();
            document.LoadHtml(documentText);
            var nodes = document.DocumentNode.SelectNodes(xpath);

            if (nodes == null)
                return null;

            if (returnArray && nodes.Count > 1)
            {
                var result = new List<string>();
                foreach (var node in nodes)
                {
                    var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                    if (!nodeResult.IsNullOrWhiteSpace())
                    {
                        result.Add(nodeResult);
                    }
                }
                return result;
            }
            else
            {
                var result = string.Empty;
                foreach (var node in nodes)
                {
                    var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                    if (!nodeResult.IsNullOrWhiteSpace())
                    {
                        if (result.IsNullOrWhiteSpace()) result = nodeResult;
                        else result = $"{result}{ResultConnector}{nodeResult}";
                    }
                }
                return result;
            }
        }
    }

Combining this Processor with the previous one, let's grab the title of the Baidu home page:

[TestMethod]
        public void XpathContentProcessorTest()
        {
            var xpathProcessor = new XpathContentProcessor
            {
                Xpath = "//title",
                ValueProviderType = XpathNodeValueProviderType.InnerText
            };
            var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
            xpathProcessor.SubProcessors.Add(processor);

            var result = xpathProcessor.Process(null);
            Assert.AreEqual("百度一下,你就知道", result.ToString());
        }

Implementing and testing Collectors

The main purpose of a Collector is to solve the problem of complex output models. A Collector for a complex data structure is implemented as follows:

public class ComplexContentCollector : BaseContentCollector
    {
        /// <summary>
        /// The Complex Content Collector needs its child extractors to provide a Key, so plain Processors are ignored
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        protected override object ProcessElement(object source)
        {
            var result = new Dictionary<string, object>();

            foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
            {
                result[contentCollector.Key] = contentCollector.Process(source);
            }

            return result;
        }
    }

The corresponding test is as follows:

[TestMethod]
        public void ComplexContentCollectorTest2()
        {
            var xpathProcessor = new XpathContentProcessor
            {
                Xpath = "//title",
                ValueProviderType = XpathNodeValueProviderType.InnerText
            };

            var xpathProcessor2 = new XpathContentProcessor
            {
                Xpath = "//p[@id=\"cp\"]",
                ValueProviderType = XpathNodeValueProviderType.InnerText,
                Order = 1
            };
            var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
            var complexCollector = new ComplexContentCollector();
            var baseCollector = new BaseContentCollector();

            baseCollector.SubProcessors.Add(processor);
            baseCollector.SubProcessors.Add(complexCollector);
            
            var titleCollector = new BaseContentCollector{Key = "Title"};
            titleCollector.SubProcessors.Add(xpathProcessor);
            var footerCollector = new BaseContentCollector {Key = "Footer"};
            footerCollector.SubProcessors.Add(xpathProcessor2);
            footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});

            complexCollector.SubProcessors.Add(titleCollector);
            complexCollector.SubProcessors.Add(footerCollector);

            var result = (Dictionary<string,object>)baseCollector.Process(null);
            Assert.AreEqual("百度一下,你就知道", result["Title"]);
            Assert.AreEqual("©2014 Baidu 使用百度前必讀 京ICP證030173號", result["Footer"]);

        }

Using configuration to handle slightly more complex cases

Now test with the following code:

public void RunConfig(string section)
        {
            var builder = new ConfigurationBuilder()
                .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
                .AddJsonFile("appsettings1.json");
            var configurationRoot = builder.Build();

            var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
            var processor = Helper.BuildContentProcessor(options);

            var result = processor.Process(null);
            var json = JsonConvert.SerializeObject(result);
            System.Console.WriteLine(json);
        }

Scraping the titles of the cnblogs news list

Configuration used:

"newsListOptions": {
    "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
    "Properties": {},
    "Children": [
      {
        "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
        "Properties": {
          "Url": "https://www.cnblogs.com/news/",
          "Order": "0"
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
        "Properties": {
          "Xpath": "//div[@class=\"post_item\"]",
          "Order": "1",
          "ValueProviderType": "OuterHtml",
          "OutputToArray": true
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
        "Properties": {
          "Order": "2"
        },
        "Children": [
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Url",
              "ValueProviderType": "Attribute",
              "ValueProviderKey": "href"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//span[@class=\"article_comment\"]",
              "Key": "CommentCount",
              "ValueProviderType": "InnerText",
              "Order": "0"
            },
            "Children": [
              {
                "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
                "Properties": {
                  "RegexPartten": "[0-9]+",
                  "Order": "1"
                }
              }
            ]
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//*[@class=\"digg\"]//span",
              "Key": "LikeCount",
              "ValueProviderType": "InnerText"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Title",
              "ValueProviderType": "InnerText"
            }
          }
        ]
      }
    ]
  },

Result obtained:

[
        {
            "Url": "//news.cnblogs.com/n/574269/",
            "CommentCount": "1",
            "LikeCount": "3",
            "Title": "劉強東:京東13年了,真正懂咱們的人仍是不多"
        },
        {
            "Url": "//news.cnblogs.com/n/574267/",
            "CommentCount": "0",
            "LikeCount": "0",
            "Title": "聯想也開始大談人工智能,不過它最迫切的目標是賣更多PC"
        },
        {
            "Url": "//news.cnblogs.com/n/574266/",
            "CommentCount": "0",
            "LikeCount": "0",
            "Title": "除了小米1幾乎都支持 - 小米MIUI9升級機型一覽"
        },
        ...
]

Getting the details of the most-commented news item in the list

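This involves computation and collection operations, and the collection elements are dictionaries, so two new Processors are needed: one for filtering and one for mapping.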

public class ListItemPickContentProcessor : BaseContentProcessor
    {
        public string Key { get; set; }

        /// <summary>
        /// Full name of the type used for the comparison
        /// </summary>
        public string OperatorTypeFullName { get; set; }

        /// <summary>
        /// Value to compare against
        /// </summary>
        public string OperatorValue { get; set; }

        /// <summary>
        /// Position of the item to pick (1-based, used with ListItemPickMode.Index)
        /// </summary>
        public int Index { get; set; }

        /// <summary>
        /// Pick mode
        /// </summary>
        public ListItemPickMode PickMode { get; set; }

        /// <summary>
        /// Operator
        /// </summary>
        public ListItemPickOperator PickOperator { get; set; }

        public override object Process(object source)
        {
            var preResult = base.Process(source);

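            if (!Helper.IsEnumerableExceptString(preResult))
            {
                // A single dictionary: return the value stored under Key.
                if (preResult is Dictionary<string, object> dictionary)
                    return dictionary[Key];
                return preResult;
            }

            // Pick from the processed result rather than from the raw source.
            return Pick((IEnumerable) preResult);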
        }

        private object Pick(IEnumerable source)
        {
            var objCollection = source.Cast<object>().ToList();
            if (objCollection.Count == 0)
                return objCollection;
            var item = objCollection[0];
            var compareDictionary = new Dictionary<object, IComparable>();
            if (item is IDictionary)
            {

                foreach (Dictionary<string, object> dic in objCollection)
                {
                    var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                    compareDictionary.Add(dic, key);
                }
            }
            else
            {
                foreach (var objItem in objCollection)
                {
                    var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                    compareDictionary.Add(objItem, key);
                }
            }

            IEnumerable<object> result;

            switch (PickOperator)
            {
                case ListItemPickOperator.OrderDesc:
                    result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                    break;
                default: throw new NotSupportedException();
            }

            switch (PickMode)
            {
                case ListItemPickMode.First:
                    return result.FirstOrDefault();
                case ListItemPickMode.Last:
                    return result.LastOrDefault();
                case ListItemPickMode.Index:
                    return result.Skip(Index - 1).Take(1).FirstOrDefault();
                default:
                    throw new NotImplementedException();
            }
        }

        private Type ResolveType(string typeName)
        {
            if (typeName == typeof(Int32).FullName)
                return typeof(Int32);
            throw new NotSupportedException();
        }

        public enum ListItemPickMode
        {
            First,
            Last,
            Index
        }

        public enum ListItemPickOperator
        {
            LittleThan,
            GreaterThan,
            Order,
            OrderDesc
        }
    }

This uses a fair amount of reflection, but performance is not a concern for now.

public class DictionaryPickContentProcessor : BaseContentProcessor
    {
        public string Key { get; set; }

        protected override object ProcessElement(object element)
        {
            if (element is IDictionary)
            {
                return (element as IDictionary)[Key];
            }
            return element;
        }
    }

This Processor extracts a single entry from a dictionary.

Configuration used:

"mostCommentsOptions": {
    "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
    "Properties": {},
    "Children": [
      {
        "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
        "Properties": {
          "Url": "https://www.cnblogs.com/news/",
          "Order": "0"
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
        "Properties": {
          "Xpath": "//div[@class=\"post_item\"]",
          "Order": "1",
          "ValueProviderType": "OuterHtml",
          "OutputToArray": true
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
        "Properties": {
          "Order": "2"
        },
        "Children": [
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Url",
              "ValueProviderType": "Attribute",
              "ValueProviderKey": "href"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//span[@class=\"article_comment\"]",
              "Key": "CommentCount",
              "ValueProviderType": "InnerText",
              "Order": "0"
            },
            "Children": [
              {
                "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
                "Properties": {
                  "RegexPartten": "[0-9]+",
                  "Order": "1"
                }
              }
            ]
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//*[@class=\"digg\"]//span",
              "Key": "LikeCount",
              "ValueProviderType": "InnerText"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
            "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//a[@class=\"titlelnk\"]",
              "Key": "Title",
              "ValueProviderType": "InnerText"
            }
          }
        ]
      },
      {
        "ProcessorType":"IC.Robot.ContentProcessor.ListItemPickContentProcessor",
        "Properties":{
          "OperatorTypeFullName":"System.Int32",
          "Key":"CommentCount",
          "PickMode":"First",
          "PickOperator":"OrderDesc",
          "Order":"4"
        }
      },
      {
        "ProcessorType":"IC.Robot.ContentProcessor.DictionaryPickContentProcessor",
        "Properties":{
          "Order":"5",
          "Key":"Url"
        }
      },
      {
        "ProcessorType":"IC.Robot.ContentProcessor.FormatterContentProcessor",
        "Properties":{
          "Formatter":"https:{0}",
          "Order":"6"
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
        "Properties": {
          "Order": "7"
        }
      },
      {
        "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
        "Properties": {
          "Xpath": "//div[@id=\"news_content\"]//p[2]",
          "Order": "8",
          "ValueProviderType": "InnerHtml",
          "OutputToArray": false
        }
      }
    ]
  }

Result obtained:

  昨日,京東忽然通知平臺商戶,將關閉每天快遞服務接口。這意味着京東平臺上的商戶之後不能再用每天快遞發貨了。
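The configuration above refers to two processors whose code is not shown in the article: RegexMatchContentProcessor (property RegexPartten) and FormatterContentProcessor (property Formatter). The following is only a guess at their per-element logic, written as two standalone helper functions with hypothetical names. It assumes that the first returns the first regex match and the second applies string.Format with the element as argument {0}.

```csharp
using System.Text.RegularExpressions;

public static class MissingProcessorsSketch
{
    // Assumed behavior of RegexMatchContentProcessor: return the first match,
    // or null when the input is null or nothing matches.
    public static string RegexMatch(string input, string pattern)
    {
        if (input == null) return null;
        var match = Regex.Match(input, pattern);
        return match.Success ? match.Value : null;
    }

    // Assumed behavior of FormatterContentProcessor: substitute the element
    // into the {0} placeholder of the formatter string.
    public static string Format(string formatter, object element)
    {
        if (element == null) return null;
        return string.Format(formatter, element);
    }
}
```

Under these assumptions, the pattern "[0-9]+" reduces a comment label to its number, and the formatter "https:{0}" turns the protocol-relative link "//news.cnblogs.com/n/574269/" into an absolute URL that the following HttpRequestContentProcessor can request.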

Possible improvements

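  • A GUI is needed for editing the configuration; the current configuration format is not at all user-friendly
  • A scheduler should be introduced to handle Processor scheduling (depth-first, breadth-first and so on)
  • Constraints on the dependencies between processors (for example, the convergence of items) should be enforced at the code level, so that they guide the configuration better
  • The rules are not yet consistent enough, for example when a returned collection should be constrained and when it should not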

Writing code is still fun, isn't it?
