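A recent requirement at work was to scrape product information from Tmall. The overall flow is as follows:
First, modify the backend ad exchange platform code to parse a URL out of the creative material uploaded by Alibaba. The URL looks like this: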
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D
It is clearly URL-encoded, so the first thing to do is decode it. An online decoder is available at:
http://tool.chinaz.com/Tools/urlencode.aspx
After decoding, we get:
https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content={"items":[{"images":["https://asearch.alicdn.com/bao/uploaded//i4/22356367/TB2PMQinN6I8KJjy0FgXXXXzVXa_!!0-saturn_solar.jpg"],"itemid":"7664169349","shorttitle":"乒乓球拍 無線專屬"}]}
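What we need from it is "itemid":"7664169349".
We then visit https://detail.tmall.com/item.htm?id=7664169349, which opens the product detail page.
This is the page whose information we need to scrape. The ad exchange platform puts the parsed ItemId into nsq; the crawler reads the ItemId from nsq, builds the URL, scrapes the key information from the page, and sends it to Kafka. Hive and ES then consume that information from Kafka so it can be queried.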
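The hand-off from the queue to the crawler can be sketched as below. This is only an illustration: the queueMsg type and the JSON message body are assumptions (the actual wire format in nsq is not shown in this post), and no queue client is involved.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueMsg is a hypothetical, trimmed-down message shape; only the item ID is used.
type queueMsg struct {
	ItemID int64 `json:"item_id"`
}

// itemURLFromMsg decodes one message body and builds the detail-page URL from it.
func itemURLFromMsg(body []byte) (string, error) {
	var m queueMsg
	if err := json.Unmarshal(body, &m); err != nil {
		return "", fmt.Errorf("decode message: %w", err)
	}
	if m.ItemID <= 0 {
		return "", fmt.Errorf("message has no item_id")
	}
	return fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", m.ItemID), nil
}

func main() {
	u, err := itemURLFromMsg([]byte(`{"item_id": 7664169349}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

Rejecting a missing or zero item_id early keeps a malformed message from turning into a request for a meaningless URL.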
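Step 1
Step 1 is to extract the ItemId. The ad exchange platform gives us the URL to parse; we then decode the URL in code and pull out the ItemId value. The project is written in Go, so Go is used for the examples here. It would be even simpler in Python, and the principle is the same.
For URL parsing, see:
https://gobyexample.com/url-parsing
For JSON serialization and deserialization, see:
http://www.javashuo.com/article/p-torllrqb-dn.html
Here is my code: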
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/url"
	"strconv"
)

// Struct fields must be exported (capitalized) for encoding/json to fill them.
// Matching against the JSON keys "images", "itemid" and "shorttitle" is case-insensitive.
type item struct {
	Images     []string
	ItemId     string
	ShortTitle string
}

func main() {
	urlstring := "https://handycam.alicdn.com/slideshow/26/7ef5aed1e3c39843e8feac816a436ecf.mp4?content=%7B%22items%22%3A%5B%7B%22images%22%3A%5B%22https%3A%2F%2Fasearch.alicdn.com%2Fbao%2Fuploaded%2F%2Fi4%2F22356367%2FTB2PMQinN6I8KJjy0FgXXXXzVXa_%21%210-saturn_solar.jpg%22%5D%2C%22itemid%22%3A%227664169349%22%2C%22shorttitle%22%3A%22%E4%B9%92%E4%B9%93%E7%90%83%E6%8B%8D%20%E6%97%A0%E7%BA%BF%E4%B8%93%E5%B1%9E%22%7D%5D%7D"

	// Parse first and decode afterwards: Query() percent-decodes each value,
	// so an '&' or '=' inside the JSON cannot be mistaken for a separator.
	parsed, err := url.Parse(urlstring)
	if err != nil {
		log.Fatalf("parse url: %v", err)
	}
	content := parsed.Query().Get("content")
	fmt.Println(content)

	m := make(map[string][]item)
	if err := json.Unmarshal([]byte(content), &m); err != nil {
		log.Fatalf("unmarshal content: %v", err)
	}
	fmt.Println("m:", m)
	if len(m["items"]) == 0 {
		log.Fatal("content has no items")
	}
	itemValue := m["items"][0]
	fmt.Println(itemValue.ItemId)

	// Convert to int64.
	i, err := strconv.ParseInt(itemValue.ItemId, 10, 64)
	if err != nil {
		log.Fatalf("parse item id: %v", err)
	}
	fmt.Printf("%T, %v\n", i, i)
}
Running it prints the ItemId value we need.
Step 2
Step 2 is to build the URL and fetch the page content.
How do we fetch a web page in Go? Here is a simple demo.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	website := "http://www.future.org.cn"

	resp, err := http.Get(website)
	if err != nil {
		fmt.Println("Cannot connect the server:", err)
		return
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Cannot read from connected http server:", err)
		return
	}
	fmt.Println("HTML content:", string(body))
}
After fetching the page, however, a problem shows up: the Chinese text is garbled.
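A quick way to confirm that the garbling is an encoding mismatch is to check whether the fetched bytes are valid UTF-8, which is what Go strings are normally assumed to hold. The sketch below uses only the standard library; the looksLikeUTF8 helper and the sample bytes are illustrative and not part of the crawler.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLikeUTF8 reports whether body is well-formed UTF-8.
// A false result suggests the page uses another charset, such as GBK.
func looksLikeUTF8(body []byte) bool {
	return utf8.Valid(body)
}

func main() {
	gbkBytes := []byte{0xD6, 0xD0, 0xCE, 0xC4} // the word "中文" encoded as GBK
	utf8Bytes := []byte("中文")                  // the same word as UTF-8

	fmt.Println("GBK sample valid UTF-8:  ", looksLikeUTF8(gbkBytes))
	fmt.Println("UTF-8 sample valid UTF-8:", looksLikeUTF8(utf8Bytes))
}
```

A valid result does not prove the text is UTF-8 (pure ASCII passes too), so this is a diagnostic hint rather than charset detection.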
To fix the garbled Chinese:
Install iconv-go:
go get github.com/djimenez/iconv-go
You can fetch the content first and convert it afterwards, for example:
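// convFromGbk converts a GBK-encoded string to UTF-8.
// If the converter cannot be created or the conversion fails, the input is returned unchanged.
func convFromGbk(s string) string {
	gbkConvert, err := iconv.NewConverter("gbk", "utf-8")
	if err != nil {
		return s
	}
	defer gbkConvert.Close()

	res, err := gbkConvert.ConvertString(s)
	if err != nil {
		return s
	}
	return res
}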
Alternatively, wrap the Reader so the content is converted as it is read:
req, err := http.NewRequest("GET", url, nil)
if err != nil {
	return nil, err
}
req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
rsp, err := j.client.Do(req)
if err != nil {
	return nil, err
}
// Transcode from gb2312 to UTF-8 while reading.
utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
//if body, err := ioutil.ReadAll(utfBody); err == nil {
//	fmt.Println("HTML content:", string(body))
//}
Once the page has been fetched it needs to be parsed. XPath is used here.
For how to use XPath, see:
http://www.w3school.com.cn/xpath/xpath_axes.asp
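It is very simple and quick to pick up.
The fetched content is HTML, so you only need to extract the parts you want. In other words, this is just HTML parsing.
The next difficulty is that the static page we fetched contains most of the information but not the price, because the price is loaded dynamically.
Let's look at the requests the page makes.
Opening the price request and pasting its URL into the browser fails with a 403. There is no need to worry: it only takes a Referer header on the request.
In code: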
referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
req.Header.Set("Referer", referer)
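To see the effect of the header in isolation, here is a minimal sketch that runs a request through a local handler instead of the real endpoint. The refererGate handler is a hypothetical stand-in: it assumes the server rejects requests whose Referer does not point at detail.tmall.com, which matches the 403 observed above but is not a description of the real server logic.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// refererGate is a hypothetical stand-in for the price endpoint: it answers
// 403 unless the request carries a detail.tmall.com Referer.
func refererGate(w http.ResponseWriter, r *http.Request) {
	if !strings.HasPrefix(r.Referer(), "https://detail.tmall.com/") {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	fmt.Fprint(w, "ok")
}

// statusWithReferer builds a GET request, optionally sets Referer, and runs
// it through the handler in-process (no network is involved).
func statusWithReferer(referer string) int {
	req := httptest.NewRequest(http.MethodGet, "/core/initItemDetail.htm", nil)
	if referer != "" {
		req.Header.Set("Referer", referer)
	}
	rec := httptest.NewRecorder()
	refererGate(rec, req)
	return rec.Code
}

func main() {
	var itemID int64 = 7664169349
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)

	fmt.Println("without Referer:", statusWithReferer(""))
	fmt.Println("with Referer:   ", statusWithReferer(referer))
}
```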
Looking at the response, we find that it is JSON.
Formatted (using http://tool.oschina.net/codeformat/json):
{ "defaultModel": { "bannerDO": { "success": true }, "deliveryDO": { "areaId": 110100, "deliveryAddress": "浙江金華", "deliverySkuMap": { "6310159781": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快遞: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金華", "type": 0 } ], "default": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快遞: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金華", "type": 0 } ], "6310159797": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快遞: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金華", "type": 0 } ], "3280089025135": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快遞: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金華", "type": 0 } ], "3280089025136": [ { "arrivalNextDay": false, "arrivalThisDay": false, "forceMocked": false, "postage": "快遞: 0.00 ", "postageFree": false, "skuDeliveryAddress": "浙江金華", "type": 0 } ] }, "destination": "北京市", "success": true }, "detailPageTipsDO": { "crowdType": 0, "hasCoupon": true, "hideIcons": false, "jhs99": false, "minicartSurprise": 0, "onlyShowOnePrice": false, "priceDisplayType": 4, "primaryPicIcons": [ ], "prime": false, "showCuntaoIcon": false, "showDou11Style": false, "showDou11SugPromPrice": false, "showDou12CornerIcon": false, "showDuo11Stage": 0, "showJuIcon": false, "showMaskedDou11SugPrice": false, "success": true, "trueDuo11Prom": false }, "doubleEleven2014": { "doubleElevenItem": false, "halfOffItem": false, "showAtmosphere": false, "showRightRecommendedArea": false, "step": 0, "success": true }, "extendedData": { }, "extras": { }, "gatewayDO": { "changeLocationGateway": { "queryDelivery": true, "queryProm": false }, "success": true, "trade": { "addToBuyNow": { }, "addToCart": { } } }, "inventoryDO": { "hidden": false, "icTotalQuantity": 225, "skuQuantity": { "3280089025136": { "quantity": 71, 
"totalQuantity": 71, "type": 1 }, "6310159781": { "quantity": 33, "totalQuantity": 33, "type": 1 }, "6310159797": { "quantity": 44, "totalQuantity": 44, "type": 1 }, "3280089025135": { "quantity": 77, "totalQuantity": 77, "type": 1 } }, "success": true, "totalQuantity": 225, "type": 1 }, "itemPriceResultDO": { "areaId": 110100, "duo11Item": false, "duo11Stage": 0, "extraPromShowRealPrice": false, "halfOffItem": false, "hasDPromotion": false, "hasMobileProm": false, "hasTmallappProm": false, "hiddenNonBuyPrice": false, "hideMeal": false, "priceInfo": { "6310159781": { "areaSold": true, "onlyShowOnePrice": false, "price": "178.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "75.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促銷", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "6310159797": { "areaSold": true, "onlyShowOnePrice": false, "price": "178.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "75.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促銷", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "3280089025135": { "areaSold": true, "onlyShowOnePrice": false, "price": "168.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": 
false, "price": "68.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促銷", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 }, "3280089025136": { "areaSold": true, "onlyShowOnePrice": false, "price": "168.00", "promotionList": [ { "amountPromLimit": 0, "amountRestriction": "", "basePriceType": "IcPrice", "canBuyCouponNum": 0, "endTime": 1561651200000, "extraPromTextType": 0, "extraPromType": 0, "limitProm": false, "postageFree": false, "price": "68.00", "promType": "normal", "start": false, "startTime": 1546267717000, "status": 2, "tfCartSupport": false, "tmallCartSupport": false, "type": "火爆促銷", "unLogBrandMember": false, "unLogShopVip": false, "unLogTbvip": false } ], "sortOrder": 0 } }, "queryProm": false, "success": true, "successCall": true, "tmallShopProm": [ ] }, "memberRightDO": { "activityType": 0, "level": 0, "postageFree": false, "shopMember": false, "success": true, "time": 1, "value": 0.5 }, "miscDO": { "bucketId": 15, "city": "北京", "cityId": 110100, "debug": { }, "hasCoupon": false, "region": "東城區", "regionId": 110101, "rn": "fa015e69c6a4ca4bb559805d670557e7", "smartBannerFlag": "top", "success": true, "supportCartRecommend": false, "systemTime": "1555232632711", "town": "東華門街道", "townId": 110101001 }, "regionalizedData": { "success": true }, "sellCountDO": { "sellCount": "5", "success": true }, "servicePromise": { "has3CPromise": false, "servicePromiseList": [ { "description": "商品支持正品保障服務", "displayText": "正品保證", "icon": "無", "link": "//www.tmall.com/wow/portal/act/bzj", "rank": -1 }, { "description": "極速退款是爲誠信會員提供的退款退貨流程的專享特權,額度是根據每一個用戶當前的信譽評級狀況而定", "displayText": "極速退款", "icon": "//img.alicdn.com/bao/album/sys/icon/discount.gif", "link": "//vip.tmall.com/vip/privilege.htm?spm=3.1000588.0.141.2a0ae8&priv=speed", "rank": -1 }, { "description": "賣家爲您購買的商品投保退貨運費險(保單生效如下單顯示爲準)", "displayText": "贈運費險", "icon": 
"//img.alicdn.com/bao/album/sys/icon/discount.gif", "link": "//service.tmall.com/support/tmall/knowledge-1121473.htm?spm=0.0.0.0.asbDA1", "rank": -1 }, { "description": "七天無理由退換", "displayText": "七天無理由退換", "icon": "//img.alicdn.com/tps/i3/T1Vyl6FCBlXXaSQP_X-16-16.png", "link": "//pages.tmall.com/wow/seller/act/seven-day", "rank": -1 } ], "show": true, "success": true, "titleInformation": [ ] }, "soldAreaDataDO": { "currentAreaEnable": true, "success": true, "useNewRegionalSales": true }, "tradeResult": { "cartEnable": true, "cartType": 2, "miniTmallCartEnable": true, "startTime": 1554812946000, "success": true, "tradeEnable": true }, "userInfoDO": { "activeStatus": 0, "companyPurchaseUser": false, "loginMember": false, "loginUserType": "buyer", "success": true, "userId": 0 } }, "isSuccess": true }
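The JSON is very large, and parsing every field would be tedious. We only need the price information, that is, priceInfo, so we want an XPath-like way to query it. JSONPath is used here.
Reference: https://github.com/DarrenChanChenChi/jsonpath
Its usage is much the same as XPath.
Then we simply extract the fields we want.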
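The idea of pulling one value out of the large response can be sketched with the standard library alone. This is not the JSONPath library's API: stripJSONP, lookup and priceOf are hypothetical helpers. The sketch assumes the body arrives wrapped in a callback such as setMdskip(...), as the callback parameter in the request URL suggests, and the sample body is a heavily trimmed version of the JSON above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// stripJSONP returns the text between the first "(" and the last ")".
func stripJSONP(body string) (string, error) {
	left := strings.Index(body, "(")
	right := strings.LastIndex(body, ")")
	if left < 0 || right <= left {
		return "", errors.New("no JSONP wrapper found")
	}
	return body[left+1 : right], nil
}

// lookup walks nested JSON objects key by key; ok is false on any miss.
func lookup(data interface{}, path ...string) (value interface{}, ok bool) {
	cur := data
	for _, key := range path {
		obj, isObj := cur.(map[string]interface{})
		if !isObj {
			return nil, false
		}
		if cur, ok = obj[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// priceOf extracts the list price string of one SKU from a wrapped response body.
func priceOf(body, sku string) (string, error) {
	raw, err := stripJSONP(body)
	if err != nil {
		return "", err
	}
	var data interface{}
	if err := json.Unmarshal([]byte(raw), &data); err != nil {
		return "", err
	}
	v, ok := lookup(data, "defaultModel", "itemPriceResultDO", "priceInfo", sku, "price")
	if !ok {
		return "", fmt.Errorf("no price for sku %s", sku)
	}
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("price for sku %s is not a string", sku)
	}
	return s, nil
}

func main() {
	body := `setMdskip({"defaultModel":{"itemPriceResultDO":{"priceInfo":{"6310159781":{"price":"178.00"}}}}})`
	price, err := priceOf(body, "6310159781")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(price)
}
```

Using the last closing parenthesis rather than the first matters as soon as any string value inside the JSON contains a ")".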
common.go:
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"strings"
	"time"

	"github.com/djimenez/iconv-go"
	"gopkg.in/xmlpath.v2"
)

// Msg describes one crawl task.
type Msg struct {
	AdID     int64  `json:"ad_id"`
	SourceID int64  `json:"source_id"`
	Source   string `json:"source"`
	ItemID   int64  `json:"item_id"`
	URL      string `json:"url"`
	UID      int64  `json:"uid"`
	DID      int64  `json:"did"`
}

// convFromGbk converts a GBK-encoded string to UTF-8.
// If the converter cannot be created or the conversion fails, the input is returned unchanged.
func convFromGbk(s string) string {
	gbkConvert, err := iconv.NewConverter("gbk", "utf-8")
	if err != nil {
		return s
	}
	defer gbkConvert.Close()

	res, err := gbkConvert.ConvertString(s)
	if err != nil {
		return s
	}
	return res
}

// newHTTPClient returns a client with a 1.5s dial timeout and a 1.5s overall timeout.
func newHTTPClient() *http.Client {
	client := &http.Client{
		Transport: &http.Transport{
			Dial: func(netw, addr string) (net.Conn, error) {
				return net.DialTimeout(netw, addr, 1500*time.Millisecond)
			},
			MaxIdleConnsPerHost: 200,
		},
		Timeout: 1500 * time.Millisecond,
	}
	return client
}

// parseNode returns the first non-empty match of xpath, or "" if there is none.
func parseNode(node *xmlpath.Node, xpath string) string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		// The original called fmt.Errorf here and discarded the result, so nothing was reported.
		fmt.Println("compile xpath:", err)
		return ""
	}
	it := path.Iter(node)
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			//return convFromGbk(s)
			return s
		}
	}
	return ""
}

// parseNodeForAll returns every non-empty match of xpath.
func parseNodeForAll(node *xmlpath.Node, xpath string) []string {
	path, err := xmlpath.Compile(xpath)
	if err != nil {
		fmt.Println("compile xpath:", err)
		return nil
	}
	it := path.Iter(node)
	elements := []string{}
	for it.Next() {
		s := strings.TrimSpace(it.Node().String())
		if len(s) != 0 {
			elements = append(elements, s)
		}
	}
	return elements
}

// percent returns true with a probability of pct percent.
func percent(pct int) bool {
	if pct < 0 || pct > 100 {
		return false
	}
	return pct > rand.Intn(100)
}
ali_spider.go:
package main

import (
	"code.byted.org/gopkg/logs"
	"encoding/json"
	"fmt"
	"github.com/djimenez/iconv-go"
	"github.com/ngaut/logging"
	"github.com/oliveagle/jsonpath"
	"gopkg.in/xmlpath.v2"
	"io/ioutil"
	"math/rand"
	"net/http"
	"strconv"
	"strings"
)

const itemURLPatternAli = "https://detail.tmall.com/item.htm?id=%d"

const priceURLPatternAli = "https://mdskip.taobao.com/core/initItemDetail.htm?isUseInventoryCenter=false&cartEnable=true&service3C=false&isApparel=true&isSecKill=false&tmallBuySupport=true&isAreaSell=false&tryBeforeBuy=false&offlineShop=false&itemId=%d&showShopProm=false&isPurchaseMallPage=false&itemGmtModified=1555201252000&isRegionLevel=false&household=false&sellerPreview=false&queryMemberRight=true&addressLevel=2&isForbidBuyItem=false&callback=setMdskip&timestamp=1555210888509&isg=bBQF1SmIvk4dQ8UGBOCNIZNDTp7T7IRAguWjmN99i_5Qy1Y_p8_OlZkxNev6Vj5RsG8p46-P7M29-etfw&isg2=BPPzr6M1qyiTZGdgYB4puOBagvEXdGgbstRSkqWQUpJJpBNGLPrUOlF1XpTvBN_i"
var ualist = []string{ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)", "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)", "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)", "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0", "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20", "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36", }
type AliSpider struct {
	client *http.Client
}

func NewAliSpider() *AliSpider {
	return &AliSpider{
		client: newHTTPClient(),
	}
}

// loadPage fetches url and parses the gb2312 HTML into an xmlpath node tree.
func (j *AliSpider) loadPage(url string) (*xmlpath.Node, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()

	// Transcode from gb2312 to UTF-8 while reading.
	utfBody, err := iconv.NewReader(rsp.Body, "gb2312", "utf-8")
	if err != nil {
		return nil, err
	}
	//if body, err := ioutil.ReadAll(utfBody); err == nil {
	//	fmt.Println("HTML content:", string(body))
	//}
	return xmlpath.ParseHTML(utfBody)
}

// parsePrice returns, for each SKU ID, the list price ("price") and the
// first promotion price ("promotion_price").
func (j *AliSpider) parsePrice(itemID int64) (map[string]map[string]float64, error) {
	priceURL := fmt.Sprintf(priceURLPatternAli, itemID)
	req, err := http.NewRequest("GET", priceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", ualist[rand.Intn(len(ualist))])
	// Without a Referer pointing at the detail page the request gets a 403.
	referer := fmt.Sprintf("https://detail.tmall.com/item.htm?id=%d", itemID)
	req.Header.Set("Referer", referer)
	rsp, err := j.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer rsp.Body.Close()

	priceInfoRaw, err := ioutil.ReadAll(rsp.Body)
	if err != nil {
		return nil, err
	}
	jsonStr := convFromGbk(string(priceInfoRaw))

	// The body is wrapped in a callback (callback=setMdskip); keep what lies
	// between the first "(" and the last ")".
	leftIndex := strings.Index(jsonStr, "(") + 1
	rightIndex := strings.LastIndex(jsonStr, ")")
	if leftIndex == 0 || rightIndex < leftIndex {
		return nil, fmt.Errorf("unexpected price response for item %d", itemID)
	}
	var jsonData interface{}
	if err := json.Unmarshal([]byte(jsonStr[leftIndex:rightIndex]), &jsonData); err != nil {
		return nil, err
	}

	skuQuantity, err := jsonpath.JsonPathLookup(jsonData, "$.defaultModel.inventoryDO.skuQuantity")
	if err != nil {
		logs.Info("json path is err, err is %v", err)
		return nil, err
	}
	skuQuantityMap, ok := skuQuantity.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("skuQuantity has unexpected type %T", skuQuantity)
	}

	itemPriceResultMap := map[string]map[string]float64{}
	for skuQuantityId := range skuQuantityMap {
		jpathPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.price", skuQuantityId)
		jpathPromotionPrice := fmt.Sprintf("$.defaultModel.itemPriceResultDO.priceInfo.%s.promotionList[0].price", skuQuantityId)
		price, err := jsonpath.JsonPathLookup(jsonData, jpathPrice)
		if err != nil {
			logs.Info("jpathPrice is err, err is %v", err)
		}
		promotionPrice, err := jsonpath.JsonPathLookup(jsonData, jpathPromotionPrice)
		if err != nil {
			logs.Info("jpathPromotionPrice is err, err is %v", err)
		}
		// Checked assertions: a failed lookup leaves the string empty and the price at 0.
		priceStr, _ := price.(string)
		promotionPriceStr, _ := promotionPrice.(string)

		// Allocate a fresh inner map for every SKU. Reusing one map across
		// iterations would make every SKU report the prices of the last one.
		detail := map[string]float64{}
		detail["price"], _ = strconv.ParseFloat(priceStr, 64)
		detail["promotion_price"], _ = strconv.ParseFloat(promotionPriceStr, 64)
		itemPriceResultMap[skuQuantityId] = detail
	}
	return itemPriceResultMap, nil
}

func (j *AliSpider) Parse(msg *Msg) (map[string]interface{}, error) {
	defer func() {
		if r := recover(); r != nil {
			logging.Errorf("parse msg %v, error %v", *msg, r)
		}
	}()
	itemURL := fmt.Sprintf(itemURLPatternAli, msg.ItemID)
	node, err := j.loadPage(itemURL)
	if err != nil {
		return nil, err
	}
	//metricsClient.EmitCounter("jd_spider", 1, "", map[string]string{"step": "parse"})
	name := parseNode(node, "//h1[@data-spm]")

	// Product attributes, for example:
	//   Product name: Newman
	//   Brand: Newman
	//   Model: EX16
	//   Features: sleep monitoring, step counting, waterproof
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		split := strings.Split(detail, ":")
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")

	// Shop scores: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	var describe, service, logistics float64
	if len(shopinfos) >= 3 {
		describe, _ = strconv.ParseFloat(shopinfos[0], 64)
		service, _ = strconv.ParseFloat(shopinfos[1], 64)
		logistics, _ = strconv.ParseFloat(shopinfos[2], 64)
	}

	// Prices (one entry per SKU; price is the list price, promotion_price the promotional price):
	// map[4023134073248:map[price:3299.00 promotion_price:3299.00]
	//     4023134073249:map[price:3299.00 promotion_price:3299.00]
	//     4200326178501:map[promotion_price:3299.00 price:3299.00]
	//     4023134073246:map[price:3299.00 promotion_price:3299.00]
	//     4023134073247:map[price:3299.00 promotion_price:3299.00]
	//     4023134073245:map[price:3299.00 promotion_price:3299.00]
	//     4023134073250:map[price:3299.00 promotion_price:3299.00]]
	itemPriceResultMap, err := j.parsePrice(msg.ItemID)
	if err != nil {
		logs.Info("parse price is err, err is %v", err)
	}

	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["source_id"] = msg.SourceID
	res["id"] = msg.ItemID
	res["ad_id"] = msg.AdID
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["uid"] = msg.UID
	res["did"] = msg.DID
	res["item_price"] = itemPriceResultMap

	// Validate a few fields that must be present.
	if res["name"] == "" && res["shopname"] == "" {
		return nil, fmt.Errorf("invalid html page %s", itemURL)
	}
	return res, nil
}
ali_spider_test.go:
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"testing"
)

func TestName(t *testing.T) {
	//conf, err := ssconf.LoadSsConfFile(confFile)
	//if err != nil {
	//	panic(err)
	//}
	aliSpider := NewAliSpider()
	//554867117919 585758506034
	var itemId int64 = 7664169349
	itemURL := fmt.Sprintf(itemURLPatternAli, itemId)
	node, err := aliSpider.loadPage(itemURL)
	if err != nil {
		// Stop here: node is nil and the XPath calls below would fail.
		t.Fatalf("load page %s: %v", itemURL, err)
	}
	//fmt.Println(node)
	name := parseNode(node, "//h1[@data-spm]")

	// Product attributes, one "key: value" pair per <li>.
	details := parseNodeForAll(node, "//ul[@id=\"J_AttrUL\"]/li")
	detailsMap := make(map[string]string, len(details))
	for _, detail := range details {
		split := strings.Split(detail, ":")
		if len(split) > 1 {
			detailsMap[split[0]] = strings.TrimSpace(split[1])
		}
	}
	shopname := parseNode(node, "//a[@class=\"slogo-shopname\"]")

	// Shop scores: description, service, logistics.
	shopinfos := parseNodeForAll(node, "//span[@class=\"shopdsr-score-con\"]")
	if len(shopinfos) < 3 {
		t.Fatalf("expected 3 shop scores, got %d", len(shopinfos))
	}
	describe, _ := strconv.ParseFloat(shopinfos[0], 64)
	service, _ := strconv.ParseFloat(shopinfos[1], 64)
	logistics, _ := strconv.ParseFloat(shopinfos[2], 64)

	// Prices (one entry per SKU; price is the list price, promotion_price the promotional price).
	itemPriceResultMap, err := aliSpider.parsePrice(itemId)
	if err != nil {
		t.Fatalf("parse price: %v", err)
	}

	res := map[string]interface{}{}
	res["source"] = "Ali"
	res["url"] = itemURL
	res["name"] = name
	res["details"] = detailsMap
	res["shopname"] = shopname
	res["describe"] = describe
	res["service"] = service
	res["logistics"] = logistics
	res["item_price"] = itemPriceResultMap

	bytes, err := json.Marshal(res)
	if err != nil {
		t.Fatalf("marshal result: %v", err)
	}
	fmt.Println(string(bytes))
}
Output:
{"describe":4.9,"details":{"上市時間":"2014年冬季","乒乓底板材質":"其餘","品牌":"Palio/拍里奧","型號":"TNT-1","層數":"9層","拍柄重量":"頭沉柄輕","是否商場同款":"是","系列":"拍里奧TNT-1","貨號":"TNT-1","顏色分類":"TNT-1直拍(短柄)1只+贈送:1海綿護邊【7木+2碳】 TNT-1橫拍(長柄)1只+贈送:1海綿護邊【7木+2碳】 新TNT直拍(短柄)1只+贈送:1海綿護邊【5木+2碳】 新TNT橫拍(長柄)1只+贈送:1海綿護邊【5木+2碳】"},"item_price":{"3280089025135":{"price":168,"promotion_price":68},"3280089025136":{"price":168,"promotion_price":68},"6310159781":{"price":168,"promotion_price":68},"6310159797":{"price":168,"promotion_price":68}},"logistics":4.8,"name":"正品 拍里奧乒乓球底板新TNT-1碳素快攻弧圈乒乓球拍底板球拍球板","service":4.8,"shopname":"璽源運動專營店","source":"Ali","url":"https://detail.tmall.com/item.htm?id=7664169349"}
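In this output all four SKUs report a price of 168 and a promotion_price of 68, while the JSON response shown earlier lists 178.00 and 75.00 for SKUs 6310159781 and 6310159797. One pattern that produces exactly this symptom in Go is a single inner map shared across loop iterations: maps are reference types, so every outer key ends up pointing at the same inner map and shows whatever was written last. The sketch below, with hypothetical helper names and a fixed SKU order for determinism, contrasts the two variants:

```go
package main

import "fmt"

// sharedInner reuses ONE inner map for every SKU, so all entries alias it.
func sharedInner(skus []string, prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	inner := map[string]float64{}
	for _, sku := range skus {
		inner["price"] = prices[sku]
		out[sku] = inner // every key points at the same map
	}
	return out
}

// freshInner allocates a new inner map per SKU, so entries stay independent.
func freshInner(skus []string, prices map[string]float64) map[string]map[string]float64 {
	out := map[string]map[string]float64{}
	for _, sku := range skus {
		out[sku] = map[string]float64{"price": prices[sku]}
	}
	return out
}

func main() {
	skus := []string{"6310159781", "3280089025135"}
	prices := map[string]float64{"6310159781": 178, "3280089025135": 168}

	fmt.Println("shared:", sharedInner(skus, prices))
	fmt.Println("fresh: ", freshInner(skus, prices))
}
```

With the shared variant the value reported for every SKU depends on which SKU the loop visited last, and when ranging over a map that order is not fixed.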