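Continuing from the previous post: the shop information for each JD SKU is loaded asynchronously, so the crawler from the previous post cannot pick it up directly. Does that mean we have to crawl all of the SKU data again from scratch? Re-crawling would make the historical data unusable. So instead, let's look at the JD pages to see whether they expose a suitable API.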
Install Fiddler and open it.
In Chrome, visit: http://list.jd.com/list.html?cat=1315,1343,9719
In Fiddler, go through the captured requests one by one until you find the API we want.
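After analyzing the returned data, we can first write the definition of the data object. Note that the Expression values are now JsonPath query expressions, so Type must be set to SelectorType.JsonPath. Also note that this is an update-type crawler: the data it collects is written back into the original table, so we must say what the primary key is by adding a primary key definition to the data class.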
    [EntityTable("test", "jd_sku", EntityTable.Monday, Primary = "Sku", UpdateColumns = new[] { "ShopId" })]
    [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
    class ProductUpdater : SpiderEntity
    {
        [PropertyDefine(Expression = "$.pid", Type = SelectorType.JsonPath, Length = 25)]
        public string Sku { get; set; }

        [PropertyDefine(Expression = "$.shopId", Type = SelectorType.JsonPath)]
        public int ShopId { get; set; }
    }
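To make the selectors concrete, here is a small sketch of what "$.[*]", "$.pid" and "$.shopId" pick out of the response. It does not use DotnetSpider at all; it simply walks the JSON array with System.Text.Json. The names SelectorSketch and Extract and the sample payload in the usage note are made up for illustration, and the real response may carry more fields or different value types.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class SelectorSketch
{
    // Mirrors the selectors on ProductUpdater:
    //   "$.[*]"    -> every element of the top-level array
    //   "$.pid"    -> the element's "pid" field (stored as Sku)
    //   "$.shopId" -> the element's "shopId" field (stored as ShopId)
    public static List<(string Sku, int ShopId)> Extract(string json)
    {
        var rows = new List<(string Sku, int ShopId)>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                JsonElement pid = item.GetProperty("pid");
                // Assumption: pid may arrive as a number or as a string.
                string sku = pid.ValueKind == JsonValueKind.String
                    ? pid.GetString()
                    : pid.GetRawText();
                // Assumption: shopId is a JSON number.
                int shopId = item.GetProperty("shopId").GetInt32();
                rows.Add((sku, shopId));
            }
        }
        return rows;
    }
}
```

Calling SelectorSketch.Extract on a hypothetical payload such as [{"pid":1001,"shopId":42}] yields one row per array element, and each row corresponds to one record that would be updated by its Sku primary key.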
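Because the returned data is wrapped in json(...) padding, it has to be trimmed before parsing. The framework provides a PageHandler interface, and we have implemented many commonly used handlers for pre-processing the downloaded content before it is parsed. The PrepareStartUrls interface is used to fetch the start URLs from a data source, so the URLs do not have to be written directly in the code. The complete code is as follows: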
    public class JdShopDetailSpider : EntitySpider
    {
        public JdShopDetailSpider() : base("JdShopDetailSpider", new Site())
        {
        }

        protected override void MyInit(params string[] arguments)
        {
            Identity = Identity ?? Guid.NewGuid().ToString();

            // Strip the json( ... ); wrapper so that only the JSON body is parsed.
            Downloader.AddAfterDownloadCompleteHandler(new SubContentHandler
            {
                StartPart = "json(",
                EndPart = ");",
                StartOffset = 5,
                EndOffset = 0
            });

            // Build the start URLs from this week's rows that still lack shop information.
            AddStartUrlBuilder(new DbStartUrlBuilder(
                Database.MySql,
                "Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;",
                $"SELECT * FROM test.jd_sku_{DateTimeUtils.MondayOfCurrentWeek.ToString("yyyy_MM_dd")} WHERE ShopName is null or ShopId is null or ShopId = 0 order by sku",
                new[] { "sku" },
                "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json"));

            // Write the results back to MySQL.
            AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
            AddEntityType(typeof(ProductUpdater));
        }

        // Update-type entity: rows are matched by Sku and only ShopId is updated.
        [EntityTable("test", "jd_sku", EntityTable.Monday, Primary = "Sku", UpdateColumns = new[] { "ShopId" })]
        [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
        class ProductUpdater : SpiderEntity
        {
            [PropertyDefine(Expression = "$.pid", Type = SelectorType.JsonPath, Length = 25)]
            public string Sku { get; set; }

            [PropertyDefine(Expression = "$.shopId", Type = SelectorType.JsonPath)]
            public int ShopId { get; set; }
        }
    }
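The two steps that happen around each request can be sketched in plain C#, independent of the framework: filling the SKU into the URL template, and trimming the json( ... ); padding from the response. This is only an illustration of the idea. PreParseSketch, BuildUrl and Unwrap are hypothetical names, and how SubContentHandler actually interprets StartOffset and EndOffset is not shown here.

```csharp
using System;

public static class PreParseSketch
{
    // Fill one SKU into the "{0}" placeholder of the URL template.
    public static string BuildUrl(string template, string sku)
    {
        return string.Format(template, sku);
    }

    // Return the text between the first startPart and the last endPart.
    // If the wrapper is not found, the input is returned unchanged.
    public static string Unwrap(string raw, string startPart, string endPart)
    {
        int start = raw.IndexOf(startPart, StringComparison.Ordinal);
        int end = raw.LastIndexOf(endPart, StringComparison.Ordinal);
        if (start < 0 || end < start + startPart.Length)
        {
            return raw;
        }
        int from = start + startPart.Length;
        return raw.Substring(from, end - from);
    }
}
```

With the template from the spider above, BuildUrl produces a URL whose pidList parameter holds the SKU, and Unwrap(body, "json(", ");") leaves just the JSON array that the JsonPath selectors then run against.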
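https://github.com/zlzforever/DotnetSpider (stars are much appreciated)
This post was written some time ago, and changes to the framework sometimes outpace updates to the code shown here, so please refer to the sample crawlers in the DotnetSpider.Sample project.
QQ group: 477731655
Email: zlzforever@163.com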