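This article aims to scrape the following data for Midea (美的) electric water heaters on JD.com (京東商城): product name, price, review count, positive, negative and neutral review rates, review tags, review text, review time, nickname, and purchase time.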
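import requests
from bs4 import BeautifulSoup
import json
import re
import pymysql
import random
from multiprocessing import Pool

# Connect to the local MySQL server. Recent PyMySQL releases only accept keyword arguments here.
connection = pymysql.connect(host='localhost', user='root', password='0102003', database='spider')
cursor = connection.cursor()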
First, import the required packages and connect to the local MySQL database so that the scraped data can be stored there.
MAX_RETRIES = 5  # upper bound on attempts per URL; assigning requests.DEFAULT_RETRIES has no effect

uas = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
    'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
]


def get_html(url):
    """
    Fetch a page and return its body as text.
    :param url: URL of the page
    :return: html
    """
    for attempt in range(1, MAX_RETRIES + 1):
        # Only the User-Agent is sent. The 'authority', 'method', 'path' and 'scheme' entries shown in
        # browser dev tools are HTTP/2 pseudo-headers, not regular request headers, so they are omitted.
        head = {'User-Agent': random.choice(uas)}
        try:
            # headers must be passed by keyword; the second positional argument of requests.get is params
            r = requests.get(url, headers=head, timeout=10)
            r.raise_for_status()  # turn 502 and other error statuses into exceptions
            return r.text
        except requests.RequestException as e:
            print(attempt, e)
    raise RuntimeError('failed to fetch {} after {} attempts'.format(url, MAX_RETRIES))
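The get_html function sends the request and returns the response body. It picks a user-agent at random with random.choice() from the random module, and the try-except block catches exceptions and prints them. Because 502 errors occurred in practice, a failed request is retried.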
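The retry idea can be tried without touching the network. The following is a minimal sketch rather than the article's code: `fetch_with_retry`, its `fetch` parameter and the `flaky` stand-in are hypothetical names introduced here, and capping the number of attempts is an assumption.

```python
import time


def fetch_with_retry(fetch, url, attempts=5, delay=0.0):
    """Call fetch(url) until it succeeds or `attempts` tries are used up.

    `fetch` is any callable that returns the page text or raises on failure
    (a hypothetical stand-in for a real HTTP call).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:  # e.g. a 502 response turned into an exception
            last_error = e
            print('attempt {} failed: {}'.format(attempt, e))
            time.sleep(delay)
    raise RuntimeError('giving up on {}'.format(url)) from last_error


calls = {'n': 0}


def flaky(url):
    """Fake fetcher: fails twice with a 502-style error, then succeeds."""
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('502 Bad Gateway')
    return '<html>ok</html>'


html = fetch_with_retry(flaky, 'https://item.jd.com/1106432.html')
```

With requests, `fetch` could be a small function that calls `requests.get(url, headers=...)` followed by `raise_for_status()`, so that a 502 response raises and triggers another attempt.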
def get_detail_urls():
    """
    Collect the detail-page URL and name of every Midea water heater.
    :return: list of (url, name) tuples
    """
    detail_url_list = []
    for i in range(1, 14):
        list_url = 'https://list.jd.com/list.html?cat=737,13297,13690&ev=exbrand%5F12380&page={}&' \
                   'sort=sort_totalsales15_desc&trans=1&JL=6_0_0#J_main'.format(i)
        soup = BeautifulSoup(get_html(list_url), 'lxml')
        for j in range(1, len(soup.select('#plist > ul > li')) + 1):
            link = '#plist > ul > li:nth-child({}) > div > div.p-name > a'.format(j)
            url = 'https:' + soup.select(link)[0].attrs['href']
            # Read the name from the j-th item too; a fixed nth-child(1) gives every product the first name.
            name = soup.select(link + ' > em')[0].get_text().strip()
            detail_url_list.append((url, name))
    return detail_url_list
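get_detail_urls() was originally meant to collect links to the product detail pages so that the data could be read from those pages. However, most of the detail-page data is loaded by JavaScript and cannot be read directly from the HTML, and fetching the detail pages also requires a simulated login. So get_detail_urls() actually collects each product's name and URL. The useful part of the URL is the product id, which lets the scraper skip the detail pages altogether.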
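As a small illustration of how the product id is taken from a detail-page URL, here is a sketch. The sample URL shape is an assumption and `product_id_from_url` is a hypothetical helper name; the split expression and the price endpoint are the ones used later in get_info.

```python
def product_id_from_url(url):
    """Return the part of the last path segment before the first dot."""
    return url.split('/')[-1].split('.')[0]


sample = 'https://item.jd.com/1106432.html'  # assumed URL shape
sku = product_id_from_url(sample)

# The id is all that is needed to build the data URLs, for example the price URL.
price_url = 'https://p.3.cn/prices/mgets?skuIds=J_' + sku
```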
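def save_to_sql(l):
    # Pass the values as parameters so the driver escapes them;
    # formatting a tuple into the SQL text breaks on quotes inside review text.
    sql = "insert into media values ({})".format(', '.join(['%s'] * len(l)))
    try:
        cursor.execute(sql, l)
        connection.commit()
    except Exception as e:
        print(e)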
save_to_sql inserts one row of data into the MySQL database.
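Review text often contains quotes, which is why passing values as query parameters tends to be safer than formatting them into the SQL string. The sketch below illustrates this with the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere; the `demo` table, its columns and `save_row` are hypothetical. Note that sqlite3 uses `?` placeholders, whereas pymysql uses `%s`.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(':memory:')
conn.execute('create table demo (name text, price real, content text)')


def save_row(conn, row):
    """Insert one row, letting the driver handle quoting of each value."""
    placeholders = ', '.join(['?'] * len(row))
    conn.execute('insert into demo values ({})'.format(placeholders), row)
    conn.commit()


# The review text contains both kinds of quotes and a semicolon.
row = ['Midea heater', 1299.0, "It's \"great\"; works well"]
save_row(conn, row)
stored = conn.execute('select name, price, content from demo').fetchone()
```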
def get_info(url, title):
    sku_id = url.split('/')[-1].split('.')[0]  # product id
    price_url = 'https://p.3.cn/prices/mgets?skuIds=J_' + sku_id
    comment_info_url = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}'.format(sku_id)
    comment_tag_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&' \
                      'productId={}&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'.format(sku_id)
    comment_info = json.loads(get_html(comment_info_url)).get('CommentsCount')[0]
    jsonp = re.compile(r".*?\((.*)\)")  # captures the JSON inside fetchJSON_comment98( ... )
    comment_tag = json.loads(jsonp.findall(get_html(comment_tag_url))[0])
    tag = []  # review tags
    for i in comment_tag.get('hotCommentTagStatistics'):
        tag.append(i.get('name') + "(" + str(i.get('count')) + ")")
    tag = str(tag)
    price = float(json.loads(get_html(price_url))[0].get('p'))  # price
    CommentCount = comment_info.get('CommentCount')  # total number of reviews
    GoodRate = comment_info.get('GoodRate')  # positive-review rate
    PoorRate = comment_info.get('PoorRate')  # negative-review rate
    GeneralRate = comment_info.get('GeneralRate')  # neutral-review rate
    name = title
    for i in range(100):
        comment_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&' \
                      'productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'.format(sku_id, i)
        try:
            comments = json.loads(jsonp.findall(get_html(comment_url))[0]).get('comments')
        except Exception as e:
            print(sku_id, i)
            print(e)
            break
        if comments is None:  # no review list in this response; move on to the next page
            print(sku_id, i)
            continue
        for c in comments:
            com = c.get('content')
            comment_time = c.get('creationTime')
            bought_time = c.get('referenceTime')
            nickname = c.get('nickname')
            row = [name, price, CommentCount, GoodRate, GeneralRate, PoorRate, tag, com, comment_time,
                   nickname, bought_time]
            save_to_sql(row)
        if len(comments) < 10:  # a short page is the last page
            break
get_info collects all the information for one water heater model.
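First, the product id is extracted from the URL and used to request the price, reviews and other data. price_url is built from the product id and returns JSON containing the product's price. json.loads() decodes that JSON into Python objects (here a list of dictionaries), from which the price is read.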
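To make the decoding step concrete, here is a minimal sketch. The sample string is made up for illustration and only approximates the shape the code relies on (a list whose first element has a `p` field); the real response may carry other fields.

```python
import json

# Assumed response shape; only the "p" key is taken from the article's code.
sample = '[{"id": "J_1106432", "p": "1299.00"}]'

data = json.loads(sample)        # a list of dictionaries
price = float(data[0].get('p'))  # the price arrives as a string, so convert it
```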
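comment_info_url returns JSON with aggregate review statistics for the product (such as the review count and the positive-review rate). The same information can also be obtained later from comment_tag_url.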
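comment_tag_url is simply the first page of comment_url. JD.com keeps at most 100 pages of reviews per product, with at most ten reviews per page, so the code loops over the page number and builds a URL for each page. The response is not standard JSON, so a regular expression is needed to extract the JSON part. When a product has few reviews and the page number runs past the last page, the loop exits. Within each page the code loops over the reviews and reads each one; likewise, when a page holds fewer than ten reviews, the loop exits. The required fields are put in a list and inserted into the MySQL database.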
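The two ideas in this paragraph, unwrapping the callback-wrapped response and stopping at a short page, can be sketched offline. Everything below runs against a fake endpoint: `fake_page`, `unwrap_jsonp` and `collect_reviews` are hypothetical names, and the payload contains only the `comments` and `content` fields that the article's code reads.

```python
import json
import re

# Lazily skip to the first "(", then capture everything up to the last ")".
JSONP = re.compile(r'.*?\((.*)\)', re.S)


def unwrap_jsonp(text):
    """Return the JSON document wrapped in callback( ... )."""
    return json.loads(JSONP.findall(text)[0])


def fake_page(page, total=23, page_size=10):
    """Hypothetical stand-in for the review endpoint: returns a JSONP string."""
    start = page * page_size
    stop = min(start + page_size, total)
    items = [{'content': 'review {}'.format(n)} for n in range(start, stop)]
    return 'fetchJSON_comment98(' + json.dumps({'comments': items}) + ');'


def collect_reviews(fetch_page, max_pages=100, page_size=10):
    """Walk the pages in order and stop once a page is empty or short."""
    reviews = []
    for page in range(max_pages):
        comments = unwrap_jsonp(fetch_page(page)).get('comments')
        if not comments:
            break  # ran past the last page
        reviews.extend(c.get('content') for c in comments)
        if len(comments) < page_size:
            break  # a short page is the last one
    return reviews


reviews = collect_reviews(fake_page)
```

With 23 fake reviews the loop requests three pages (10, 10 and 3 items) and then stops, instead of walking all 100 page numbers.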
def job(z):
    return get_info(z[0], z[1])


def main():
    urls = get_detail_urls()
    print(len(urls))
    pool = Pool()
    pool.map(job, urls)
    pool.close()
    pool.join()
The main function only has to tie these pieces together; multiprocessing's Pool is used to scrape with multiple processes. The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import json
import re
import pymysql
import random
from multiprocessing import Pool

# Connect to the local MySQL server. Recent PyMySQL releases only accept keyword arguments here.
connection = pymysql.connect(host='localhost', user='root', password='0102003', database='spider')
cursor = connection.cursor()

MAX_RETRIES = 5  # upper bound on attempts per URL; assigning requests.DEFAULT_RETRIES has no effect

uas = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
    'Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3',
    'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24',
]


def get_html(url):
    """
    Fetch a page and return its body as text.
    :param url: URL of the page
    :return: html
    """
    for attempt in range(1, MAX_RETRIES + 1):
        # Only the User-Agent is sent. The 'authority', 'method', 'path' and 'scheme' entries shown in
        # browser dev tools are HTTP/2 pseudo-headers, not regular request headers, so they are omitted.
        head = {'User-Agent': random.choice(uas)}
        try:
            # headers must be passed by keyword; the second positional argument of requests.get is params
            r = requests.get(url, headers=head, timeout=10)
            r.raise_for_status()  # turn 502 and other error statuses into exceptions
            return r.text
        except requests.RequestException as e:
            print(attempt, e)
    raise RuntimeError('failed to fetch {} after {} attempts'.format(url, MAX_RETRIES))


def get_detail_urls():
    """
    Collect the detail-page URL and name of every Midea water heater.
    :return: list of (url, name) tuples
    """
    detail_url_list = []
    for i in range(1, 14):
        list_url = 'https://list.jd.com/list.html?cat=737,13297,13690&ev=exbrand%5F12380&page={}&' \
                   'sort=sort_totalsales15_desc&trans=1&JL=6_0_0#J_main'.format(i)
        soup = BeautifulSoup(get_html(list_url), 'lxml')
        for j in range(1, len(soup.select('#plist > ul > li')) + 1):
            link = '#plist > ul > li:nth-child({}) > div > div.p-name > a'.format(j)
            url = 'https:' + soup.select(link)[0].attrs['href']
            # Read the name from the j-th item too; a fixed nth-child(1) gives every product the first name.
            name = soup.select(link + ' > em')[0].get_text().strip()
            detail_url_list.append((url, name))
    return detail_url_list


def save_to_sql(l):
    # Pass the values as parameters so the driver escapes them;
    # formatting a tuple into the SQL text breaks on quotes inside review text.
    sql = "insert into media values ({})".format(', '.join(['%s'] * len(l)))
    try:
        cursor.execute(sql, l)
        connection.commit()
    except Exception as e:
        print(e)


def get_info(url, title):
    sku_id = url.split('/')[-1].split('.')[0]  # product id
    price_url = 'https://p.3.cn/prices/mgets?skuIds=J_' + sku_id
    comment_info_url = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}'.format(sku_id)
    comment_tag_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&' \
                      'productId={}&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'.format(sku_id)
    comment_info = json.loads(get_html(comment_info_url)).get('CommentsCount')[0]
    jsonp = re.compile(r".*?\((.*)\)")  # captures the JSON inside fetchJSON_comment98( ... )
    comment_tag = json.loads(jsonp.findall(get_html(comment_tag_url))[0])
    tag = []  # review tags
    for i in comment_tag.get('hotCommentTagStatistics'):
        tag.append(i.get('name') + "(" + str(i.get('count')) + ")")
    tag = str(tag)
    price = float(json.loads(get_html(price_url))[0].get('p'))  # price
    CommentCount = comment_info.get('CommentCount')  # total number of reviews
    GoodRate = comment_info.get('GoodRate')  # positive-review rate
    PoorRate = comment_info.get('PoorRate')  # negative-review rate
    GeneralRate = comment_info.get('GeneralRate')  # neutral-review rate
    name = title
    for i in range(100):
        comment_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&' \
                      'productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'.format(sku_id, i)
        try:
            comments = json.loads(jsonp.findall(get_html(comment_url))[0]).get('comments')
        except Exception as e:
            print(sku_id, i)
            print(e)
            break
        if comments is None:  # no review list in this response; move on to the next page
            print(sku_id, i)
            continue
        for c in comments:
            com = c.get('content')
            comment_time = c.get('creationTime')
            bought_time = c.get('referenceTime')
            nickname = c.get('nickname')
            row = [name, price, CommentCount, GoodRate, GeneralRate, PoorRate, tag, com, comment_time,
                   nickname, bought_time]
            save_to_sql(row)
        if len(comments) < 10:  # a short page is the last page
            break


def job(z):
    return get_info(z[0], z[1])


def main():
    urls = get_detail_urls()
    print(len(urls))
    pool = Pool()
    pool.map(job, urls)
    pool.close()
    pool.join()


if __name__ == '__main__':
    main()