Scraping Dianping (大众点评) review data: the anti-scraping measures are CSS text mapping and font-library obfuscation
What anti-scraping measures does Dianping use?
IP bans, account bans, font-library obfuscation, CSS text mapping, and slider CAPTCHAs.
(Figure: the slider CAPTCHA, which appears once your request rate gets too high.)
(Figure: the notice shown when a shop page is gone or the account has been banned.)
Analyzing Dianping's CSS text mapping:
Step 1: open a review page and hit Inspect.
In the rendered HTML, some Chinese characters are replaced by letter codes; for example, the character 大 is represented by the class htgj9.
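To make that concrete, here is a tiny hypothetical fragment in the markup shape that the regexes later in this post expect (real pages differ in detail), plus the one-liner that pulls the class codes out:

# -*- coding: utf-8 -*-
import re

# hypothetical review fragment; per the text above, class htgj9 stands for 大
snippet = u'这家店真<svgmtsi class="htgj9"></svgmtsi>气'
classes = re.findall(r'<svgmtsi class="(.*?)"></svgmtsi>', snippet)
print(classes)  # ['htgj9'] -- each class must be resolved to one hidden character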
Step 2: find the CSS text-mapping relationship.
First, locate the URL that begins with http://s3plus.meituan.net in the page source.
Then visit http://s3plus.meituan.net/v1/mss_0a06a471f9514fc79c981b5466f56b91/svgtextcss/ffe70b009694f9067f7e6dd07c1f6286.css and find the mapping values for each class inside the CSS file.
Step 3: in that same CSS file, find the URLs ending in .svg.
The CSS usually references three .svg files; two of them are useless and only one is the right one.
Step 4: open the correct .svg file and work out how the coordinates map to characters.
That completes the analysis of the CSS-mapping flow.
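A minimal sketch of steps 2-4 in code, assuming only the requests library. The regex patterns are the same ones the full code later in this post uses; the helper names get_svg_candidates and get_offsets are made up for illustration:

# -*- coding: utf-8 -*-
import re
import requests

def get_svg_candidates(css_url):
    # Fetch the mapping CSS and list the candidate .svg files it references.
    css_html = requests.get(css_url).text
    svg_urls = re.findall(r'background-image: url\((.*?)\);background-repeat: no-repeat;', css_html)
    # Usually three URLs come back; only one holds the review glyphs.
    return ['http:' + u if u.startswith('//') else u for u in svg_urls]

def get_offsets(css_html, class_name):
    # A rule looks like .htgj9{background:-182.0px -974.0px;} (offsets made up);
    # the two numbers locate the character inside the svg sprite.
    m = re.search(class_name + r'\{background:-(\d+)\.0px -(\d+)\.0px;\}', css_html)
    return int(m.group(1)), int(m.group(2))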
Analyzing Dianping's font-library obfuscation:
Step 1: if the inspected text shows 口 placeholder characters, the page is using font-library obfuscation.
Step 2: find the .woff file.
Step 3: open the .woff file in a font-viewing tool.
Step 4: screenshot the glyph table and run it through an online OCR tool. The OCR output contains errors and needs manual correction, but it is still much faster than matching every glyph by hand.
Online OCR site: https://zhcn.109876543210.com/
That completes the font-library analysis.
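For the font-library case, the glyph names inside the .woff are exactly what goes into the data_dict used by the code below (keys like 'unie02f'). A minimal sketch, assuming the third-party fontTools package (pip install fonttools) and a hypothetical local copy of the file:

# -*- coding: utf-8 -*-
from fontTools.ttLib import TTFont

font = TTFont('dianping.woff')     # hypothetical filename for the downloaded woff
for name in font.getGlyphOrder():  # glyph names look like 'unie02f'
    if name.startswith('uni'):
        print(name)                # pair each name with the OCR result to fill data_dict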
Here is the code:
#-*-coding:utf-8-*-
# Scrape Dazhong Dianping reviews (Python 2)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from contest import *  # author's helper module: requests session, re, time, datetime, user_agents, replace, replace_tag, insert_data

# Font-library mapping table; fill in entries such as 'unie02f': u'...' by hand
data_dict = {
}

def filter_time(timestr):
    # Normalize the review timestamp to "YYYY-MM-DD HH:MM"
    try:
        timestr = re.search('(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2})', timestr).group(1)
    except Exception, e:
        print e
    return timestr

# Step 1: extract the css url from the page
def get_css_url(html):
    regex = re.compile(r'(s3plus\.meituan\.net.*?)\"')
    css_url = re.search(regex, html).group(0)
    css_url = 'http://' + css_url
    return css_url

# Font-library case: replace each <svgmtsi> entity with its data_dict value
def content_replace(content):
    content_list = re.findall('<svgmtsi class="review">(.*?);</svgmtsi>', content)
    content_list_l = []
    for item in content_list:
        item = item.replace("&#x", "uni")
        content_list_l.append(data_dict[item])
    content_list_l = content_list_l + ["</div>"]
    content_end_list = content.split('<svgmtsi')
    content_end = []
    j = 0
    for i in content_end_list:
        content_end.append(i + content_list_l[j])
        j = j + 1
    content_end_str = ''.join(content_end)

    def replace_review(newline):
        newline = str(newline).replace('</div>', "").replace(' ', "")
        re_comment = re.compile('class="review">[^>]*</svgmtsi>')
        newlines = re_comment.sub('', newline)
        newlines = newlines.replace('class="review">', '').replace('</svgmtsi>', '')
        return newlines

    content_end_str = replace_review(content_end_str)
    return content_end_str

def dzdp_conent_spider(item, cookies):
    addr = item['addr']
    shop_id = addr.split('/')[-1]
    print shop_id
    for page in range(1, 3):
        url = "http://www.dianping.com/shop/" + shop_id + "/review_all/p" + str(page) + "?queryType=sortType&queryVal=latest"
        print url
        headers = {
            "Host": "www.dianping.com",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": user_agents,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Referer": "http://www.dianping.com/shop/508453/review_all?queryType=sortType&&queryVal=latest",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cookie": cookies,
        }

        # Request wrapper with retries
        def requests_download(request_max=101):
            result_html = ""
            result_status_code = ""
            try:
                # proxies = get_proxies()
                result = session.get(url=url, headers=headers, verify=False, timeout=20)
                result_html = result.content
                result_status_code = result.status_code
                if result_status_code != 200:
                    result = session.get(url=url, headers=headers, verify=False, timeout=20)
                    result_html = result.content
                    result_status_code = result.status_code
            except Exception as e:
                if request_max > 0:
                    if result_status_code != 200:
                        time.sleep(2)
                        return requests_download(request_max - 1)
            return result_html

        # call requests_download
        a = 2
        result = requests_download(request_max=11)
        result_replace = replace(result)
        result_replace = result_replace.replace(' ', '')  # strip spaces between tags
        if '暂无点评' in result_replace or '抱歉!页面无法访问......' in result_replace:
            a = 1
        else:
            result_replaces = re.findall('<div class="reviews-items">(.*?)<div class="bottom-area clearfix">', result_replace)[0]
        if a == 1:
            pass
        else:
            resultList = re.findall('<li>.*?<a class="dper-photo-aside(.*?)>投诉</a>.*?</li>', result_replaces)
            for data in resultList:
                data = str(data)
                try:
                    username = re.findall('data-click-name="用户名.*?data-click-title="文字".*?>(.*?)<img class=".*?" src=', data)[0]
                except:
                    username = re.findall('data-click-name="用户名.*?data-click-title="文字".*?>(.*?)<div class="review-rank">', data)[0]
                userid = re.findall(' data-user-id="(.*?)".*?data-click-name="用户头像', data)[0]
                headimg = re.findall('<img data-lazyload="(.*?)<div class="main-review">', data)[0]
                try:
                    comment_star = re.findall('<span class="sml-rank-stars sml-str(.*?) star">.*?<span class="score">', data)[0]
                except:
                    try:
                        comment_star = re.findall('<span class="sml-rank-stars sml-str(.*?) star">.*?</div>.*?<div class="review-truncated-words">', data)[0]
                    except:
                        comment_star = re.findall('<span class="sml-rank-stars sml-str(.*?) star"></span>.*?<div class="review-words">', data)[0]
                if '展开评论' in data:
                    content = re.findall('<div class="review-words Hide">(.*?)<div class="less-words">', data)[0]
                else:
                    try:
                        content = re.findall('<div class="review-words">(.*?)<div class="review-pictures">', data)[0]
                    except:
                        content = re.findall('<div class="review-words">(.*?)<div class="misc-info clearfix">', data)[0]
                comment_time = re.findall('<span class="time">(.*?)<span class="shop">', data)[0]
                website = "大众点评"
                pingtai = re.findall('<h1 class="shop-name">(.*?)</h1>.*?<div class="rank-info">', result_replace)[0]
                username = replace_tag(username)
                userid = replace_tag(userid)
                headimg = replace_tag(headimg)
                comment_star = replace_tag(comment_star)
                comment_star = comment_star.replace('0', "")  # strip the trailing 0, e.g. 'sml-str40' -> '4' stars
                comment_time = replace_tag(comment_time)
                website = replace_tag(website)
                pingtai = replace_tag(pingtai)
                # Two anti-scraping cases: font-library obfuscation and css mapping.
                # For the font-library case, call content_replace and fill data_dict
                # with mappings in the form 'unie02f': 'value':
                # content = content_replace(content)
                # If the svg uses <text x="0" y=, import svg2word from svgutil_3;
                # if it uses <path id=".*?" d="M0(.*?)H600, import svg2word from svgutil_4
                from svgutil_4 import svg2word
                css_url = get_css_url(result_replace)
                if '</svgmtsi>' in content:
                    content = svg2word(content, css_url)
                content = content.replace('<br/>', "")
                print content
                crawl_time = str(datetime.now().strftime('%Y-%m-%d %H:%M'))
                comment_time = comment_time[:10] + " " + comment_time[10:]
                comment_time = filter_time(comment_time)
                content = replace_tag(content)
                result_dict = {
                    "username": username,
                    "headimg": headimg,
                    "comment_start": comment_star,
                    "content": content,
                    "comment_time": comment_time,
                    "website": website,
                    "pingtai": pingtai,
                    "crawl_time": crawl_time,
                }
                # Insert into MySQL
                dbName = "TM_commentinfo_shanghaikeji"
                insert_data(dbName, result_dict)

if __name__ == "__main__":
    cookies = "ctu=e14b301a513cb5e6cb4368ec1a6ef38e098827bd2b05c3a6a03ff7d0ead834f3; _lxsdk_cuid=16c4081aba9c8-018c6bfcb5c785-5f1d3a17-13c680-16c4081aba9c8; _lxsdk=16c4081aba9c8-018c6bfcb5c785-5f1d3a17-13c680-16c4081aba9c8; _hc.v=0d426222-8f95-4389-68ab-d202b3b51e9b.1564450337; __utma=1.1294718334.1564531914.1564531914.1564531914.1; __utmz=1.1564531914.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); cy=2; cye=beijing; s_ViewType=10; Hm_lvt_e6f449471d3527d58c46e24efb4c343e=1566184043; aburl=1; Hm_lvt_dbeeb675516927da776beeb1d9802bd4=1566463021; _dp.ac.v=236e82c8-26e1-4496-95f0-9d8b7ba8ca1e; dper=a7fba89f38fb1047d3d06b33821d73e96c141c23a8a6a4a39e746932f07c92067950e08465aaede7532242b58ae779a0dacc3a24f475f1b7b95c4b8cff4b1299e360f5cdab6d77cb939a78f478c0b4e73b6ef56f3deeff682210e5c0fbb428f2; ll=7fd06e815b796be3df069dec7836c3df; ua=%E9%B9%8F; uamo=13683227400; _lxsdk_s=16ce04e6382-9de-106-562%7C%7C155"
    result_list = [
        {"addr": "http://www.dianping.com/shop/22289267"},
        {"addr": "http://www.dianping.com/shop/17951321"},
    ]
    for item in result_list[1:]:
        print "Scraping list index:", result_list.index(item)
        dzdp_conent_spider(item, cookies)
The svg2word method in svgutil_3:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from contest import *

# Resolve <svgmtsi> tags to Chinese characters (the <text x="0" y= svg layout)
def svg2word(content, css_url):
    print "css_url", css_url
    print "content", content
    # fetch css_html via css_url
    css_url = css_url.replace('"', "")
    headers = {
        "Host": "s3plus.meituan.net",
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "If-None-Match": 'd26baaeee5bc2306d2c5d6262c3deba2',
        "If-Modified-Since": "Tue, 21 May 2019 13:30:03 GMT",
    }
    resp = requests.get(url=css_url)
    css_html = resp.content.decode('utf-8')
    # extract svg_url from css_html
    svg_url = re.findall('background-image: url\((.*?)\);background-repeat: no-repeat;', css_html)
    print "svg_url", svg_url
    svg_url = svg_url[0]
    print "svg_url", svg_url
    if svg_url.startswith('//'):
        svg_url = 'http:' + svg_url
    # fetch svg_html
    resp = requests.get(svg_url, verify=False).text
    svg_html = resp.replace("\n", "")
    cssls = re.findall('<svgmtsi class="(.*?)"></svgmtsi>', content)  # all obfuscated tags in the review
    print cssls
    replace_content_list = []
    for charcb in cssls:
        # resolve each tag to its character:
        # pull this class's x/y background offsets out of the css file
        css = re.search(charcb + "\{background:-(\d+)\.0px -(\d+)\.0px;\}", css_html).groups()
        x_px = css[0]  # x offset
        y_px = css[1]  # y offset
        # column: each glyph is 14px wide
        x_num = int(x_px) / 14 + 1
        # y values of each <text> row in the svg
        text_y_list = re.findall('<text x="0" y="(.*?)">.*?</text>', svg_html)
        # the rows of characters themselves
        text_list = re.findall('<text x="0" y=".*?">(.*?)</text>', svg_html)
        y_num_list = []
        for i in text_y_list:
            if int(i) > int(y_px):
                y_num_list.append(i)
        y_num = text_y_list.index(y_num_list[0])  # row: first <text> whose y exceeds the offset
        # the character at that row and column
        replace_chinese = text_list[int(y_num)][x_num - 1]
        replace_content_list.append(replace_chinese)
    content_list = content.split('</svgmtsi>')
    replace_content_list.append('<div>')
    return_content = []
    for idx, ii in enumerate(content_list):  # enumerate, so duplicate segments still map correctly
        return_content.append(ii + replace_content_list[idx])
    return_content_str = ''.join(return_content)

    def replace_tag_br(newline):
        newline = str(newline).replace('<br/>', '').replace(' ', '')
        re_comment = re.compile('<[^>]*>')
        newlines = re_comment.sub('', newline)
        return newlines

    return_content_str = replace_tag_br(return_content_str)
    return return_content_str
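To make the coordinate arithmetic above concrete, here is a worked example with made-up numbers (only the 14px glyph width comes from the code):

x_px, y_px = 294, 974            # offsets from a hypothetical rule .xxx{background:-294.0px -974.0px;}
x_num = x_px // 14 + 1           # column: 294 // 14 + 1 = 22
text_y_list = [959, 1000, 1041]  # hypothetical y values of the svg text rows
y_num = text_y_list.index([y for y in text_y_list if y > y_px][0])  # first row whose y exceeds 974 -> row 1
# the hidden character is then text_list[y_num][x_num - 1], i.e. row 1, column 22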
The svg2word method in svgutil_4:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from contest import *

# Resolve <svgmtsi> tags to Chinese characters (the <path id=... d="M0 ... H600" svg layout)
def svg2word(content, css_url):
    css_url = css_url.replace('"', '')
    print "css_url", css_url
    print "content", content
    resp = requests.get(url=css_url)
    css_html = resp.content.decode('utf-8')
    # extract svg_url from css_html
    print "css_html", css_html
    svg_url = re.findall('background-image: url\((.*?)\);background-repeat: no-repeat;', css_html)
    print "svg_url", svg_url
    svg_url = svg_url[1]
    # print "svg_url", svg_url
    if svg_url.startswith('//'):
        svg_url = 'http:' + svg_url
    # fetch svg_html
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    try:
        resp = requests.get(svg_url, headers=headers, verify=False).text
    except:
        resp = requests.get(svg_url, headers=headers, verify=False).text  # one retry
    svg_html = resp.replace("\n", "")
    content = content.replace(";", '')
    cssls = re.findall('<svgmtsi class="(.*?)"></svgmtsi>', content)  # all obfuscated tags in the review
    replace_content_list = []
    for charcb in cssls:
        # pull this class's x/y background offsets out of the css file
        css = re.search(charcb + "\{background:-(\d+)\.0px -(\d+)\.0px;\}", css_html).groups()
        x_px = css[0]  # x offset
        y_px = css[1]  # y offset
        # column: each glyph is 14px wide
        x_num = int(x_px) / 14 + 1
        # y value of each <path> row
        text_y_list = re.findall('<path id=".*?" d="M0(.*?)H600"/>', svg_html)
        # the rows of characters themselves
        text_list = re.findall('<textPath xlink:href="#.*?" textLength=".*?">(.*?)</textPath>', svg_html)
        y_num_list = []
        for i in text_y_list:
            if int(i) > int(y_px):
                y_num_list.append(i)
        y_num = text_y_list.index(y_num_list[0])  # row: first path whose y exceeds the offset
        # the character at that row and column
        replace_chinese = text_list[int(y_num)][x_num - 1]
        replace_content_list.append(replace_chinese)
    content_list = content.split('</svgmtsi>')
    replace_content_list.append('<div>')
    return_content = []
    for idx, ii in enumerate(content_list):  # enumerate, so duplicate segments still map correctly
        return_content.append(ii + replace_content_list[idx])
    return_content_str = ''.join(return_content)

    def replace_tag_br(newline):
        newline = str(newline).replace('<br/>', '').replace(' ', '')
        re_comment = re.compile('<[^>]*>')
        newlines = re_comment.sub('', newline)
        return newlines

    return_content_str = replace_tag_br(return_content_str)
    return return_content_str
Notes:
# The spider faces two cases: font-library obfuscation and css mapping.
# For the font-library case, call content_replace and fill data_dict with mappings in the form 'unie02f': 'value':
# content = content_replace(content)
# If the svg uses <text x="0" y=, import svg2word from svgutil_3.
# If the svg uses <path id=".*?" d="M0(.*?)H600, import svg2word from svgutil_4.
# When scraping for real, throttle your request rate: Dianping's blocking is aggressive.
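The two svg formats in the notes above can also be told apart automatically. A small sketch, assuming the svgutil_3/svgutil_4 modules above are importable; pick_svg2word is a hypothetical helper:

# -*- coding: utf-8 -*-
import requests

def pick_svg2word(svg_url):
    # Peek at the svg and return the matching svg2word implementation
    svg_html = requests.get(svg_url, verify=False).text
    if '<text x="0" y=' in svg_html:
        from svgutil_3 import svg2word   # <text x="0" y= layout
    else:
        from svgutil_4 import svg2word   # <path id=... d="M0...H600 layout
    return svg2word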
If you need the code, leave your email address in the comments below.
This article is not finished yet.