Python version: Python 2.7
Workflow for crawling Zhihu:
I. Analysis
When you visit the Zhihu homepage (https://www.zhihu.com) without being logged in, you are redirected to https://www.zhihu.com/signup?next=%2F.
So crawling Zhihu starts with logging in. During login, watch which page the POST or GET request is sent to; a packet-capture tool can be used to find the address that the username/password form data is submitted to.
1. Using a packet-capture tool, the page that the username and password are submitted to is a POST request; inspection shows the form data goes to 'https://www.zhihu.com/api/v3/oauth/sign_in'.
2. Capturing that login request shows that the POST body sent to the server contains not only the username and password, but also form fields such as timestamp, lang, client_id and signature. Each field needs to be understood, and the way to do that is to watch how its value changes across repeated logins.
3. After logging in several times, only timestamp and signature change between logins; the other values stay the same.
四、經過js發現 signature字段的值是有多個字段組合加密而成,其實timetamp時間戳是核心,每次根據時間的變化,生成不一樣的signature值。瀏覽器
5. Since computing signature is fairly involved, the timestamp and signature from a successful browser login are simply copied into the request data before logging in (see the hedged sketch after this list).
6. With the form data filled in, sending the POST request produced a "missing captcha ticket" (capsion_ticket) error. Analysis shows the captcha ticket is a credential required in order to request the captcha,
and the capture contains two captcha-related requests: the first returns
{'show_captcha': true}
and the second returns something like {'img_base64': 'Rfadausifpoauerfae'}.
7. {'show_captcha': true} is the key signal that a captcha is required; the capture also shows that the Set-Cookie header of the first captcha request carries the capsion_ticket.
8. Simulating the login again then raised an 'ERR_xxx_AUTH_TOKEN' error, which appears when the captcha image is requested with the captcha ticket. The capture shows a header of the form Authorization: oauth ce30dasjfsdjhfkiswdnf, so this header is added to the request headers.
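For reference, here is a minimal sketch of how such a signature is commonly produced. It assumes the signature is an HMAC-SHA1 over the concatenation grant_type + client_id + source + timestamp; the secret key, the exact field order, and the make_signature helper name are all assumptions that must be confirmed against Zhihu's own JavaScript before use.

# Hedged sketch only: key and field order are assumptions taken from the pattern
# described above, not verified against the site's current JavaScript.
import hmac
import time
from hashlib import sha1

def make_signature(client_id, secret_key):
    # Millisecond timestamp, matching the 'timestamp' form field.
    timestamp = str(int(time.time() * 1000))
    # Assumed concatenation order: grant_type + client_id + source + timestamp.
    message = 'password' + client_id + 'com.zhihu.web' + timestamp
    signature = hmac.new(secret_key, message, sha1).hexdigest()
    return timestamp, signature

If this matches what the site's JS actually does, the returned pair would replace the hard-coded timestamp and signature values used in the spider below.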
Captcha handling:
- Zhihu uses two kinds of captcha: an English image captcha, and one where you click the upside-down Chinese characters. When a captcha is required for login, the data is sent to one of these two endpoints:
倒立文字驗證碼:https://www.zhihu.com/api/v3/oauth/captcha?lang=cn
英文圖片驗證碼:https://www.zhihu.com/api/v3/oauth/captcha?lang=en
- The English captcha contains four letters and can be recognised with an online service such as YunDaMa (云打码).
- For the inverted-character captcha, each character occupies a certain range of the image. When you click the captcha during login, a pixel coordinate (x, y) is obtained and sent to https://www.zhihu.com/api/v3/oauth/captcha?lang=cn. For example, if the inverted characters are the third and the fifth, each has an acceptable coordinate range, and submitting any point inside those ranges lets the login proceed.
- Only the inverted-character captcha is handled here (a compact sketch of the captcha request flow follows this list).
- Only the first page of questions and their answers is crawled.
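As a summary of the captcha traffic described above, here is a compact sketch of the three calls involved, written with the requests library purely for illustration (the Scrapy spider below performs the same steps). The header values are copied from the spider, and the click coordinates are placeholders.

# Sketch of the captcha endpoint flow, not a drop-in implementation.
import base64
import json
import requests

CAPTCHA_URL = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=cn'
headers = {'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
           'User-Agent': 'Mozilla/5.0'}

session = requests.Session()

# 1. GET: does this login need a captcha? (the response's Set-Cookie carries capsion_ticket)
need = session.get(CAPTCHA_URL, headers=headers).json()['show_captcha']

if need:
    # 2. PUT: fetch the captcha image as base64 and save it for manual inspection
    img_base64 = session.put(CAPTCHA_URL, headers=headers).json()['img_base64']
    with open('captcha.gif', 'wb') as f:
        f.write(base64.b64decode(img_base64))

    # 3. POST: submit the click coordinates of the inverted characters (placeholder point here)
    points = {"img_size": [200, 44], "input_points": [[20, 27]]}
    session.post(CAPTCHA_URL, data={'input_text': json.dumps(points)}, headers=headers)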
II. Creating the Scrapy project
scrapy startproject ZhiHuSpider
scrapy genspider zhihu zhihu.com
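After these two commands the project layout should look roughly like the following. The utils package shown here is not generated by Scrapy; it is the hand-made package described at the end of the post, and placing it next to scrapy.cfg is one way (an assumption, not the only option) to keep the import in items.py resolvable.

ZhiHuSpider/
    scrapy.cfg
    utils/                  # added by hand (see the end of the post)
        __init__.py
        common.py
    ZhiHuSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            zhihu.py        # generated by scrapy genspider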
III. Code
The code in zhihu.py is as follows:
# -*- coding: utf-8 -*-
import base64
import json
import urlparse
import re
from datetime import datetime

import scrapy
from scrapy.loader import ItemLoader

from ..items import ZhiHuQuestionItem, ZhiHuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset={}&sort_by=default"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Referer': 'https://www.zhihu.com',
        'HOST': 'www.zhihu.com',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    # Pixel coordinates (x, y) of each of the 7 character positions in the captcha image.
    points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]

    def start_requests(self):
        """
        Override the parent class's start_requests() so the spider starts from the login (captcha) URL.
        """
        yield scrapy.Request(
            url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
            callback=self.captcha,
            headers=self.headers,
        )

    def captcha(self, response):
        show_captcha = json.loads(response.body)['show_captcha']
        if show_captcha:
            print u'Captcha required'
            yield scrapy.Request(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                method='PUT',
                headers=self.headers,
                callback=self.shi_bie
            )
        else:
            print u'No captcha required'
            # Log in directly.
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                'grant_type': 'password',
                'timestamp': '1515391742289',
                'source': 'com.zhihu.web',
                'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                'username': '',  # account name
                'password': '',  # password
                'captcha': '',
                'lang': 'cn',
                'ref_source': 'homepage',
                'utm_source': ''
            }

            yield scrapy.FormRequest(
                url=post_url,
                headers=self.headers,
                formdata=post_data,
                callback=self.index_page
            )

    def shi_bie(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except Exception, e:
            print 'Failed to get img_base64, reason: %s' % e
        else:
            print 'Got the base64-encoded captcha image'
            # Decode the base64 image and save it locally.
            img = img.encode('utf-8')
            img_data = base64.b64decode(img)
            with open('zhihu_captcha.GIF', 'wb') as f:
                f.write(img_data)

            captcha = raw_input('Enter the positions of the inverted characters: ')
            if len(captcha) == 2:
                # Two inverted characters.
                first_char = int(captcha[0]) - 1   # list index of the first character
                second_char = int(captcha[1]) - 1  # list index of the second character
                captcha = '{"img_size":[200,44],"input_points":[%s,%s]}' % (
                    self.points_list[first_char], self.points_list[second_char])
            else:
                # Only one inverted character.
                first_char = int(captcha[0]) - 1
                captcha = '{"img_size":[200,44],"input_points":[%s]}' % (
                    self.points_list[first_char])

            data = {
                'input_text': captcha
            }
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                headers=self.headers,
                formdata=data,
                callback=self.get_result
            )

    def get_result(self, response):
        try:
            yan_zheng_result = json.loads(response.body)['success']
        except Exception, e:
            print 'The captcha POST request failed, reason: {}'.format(e)
        else:
            if yan_zheng_result:
                print u'Captcha verified'
                post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
                post_data = {
                    'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                    'grant_type': 'password',
                    'timestamp': '1515391742289',
                    'source': 'com.zhihu.web',
                    'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                    'username': '',  # account name
                    'password': '',  # password
                    'captcha': '',
                    'lang': 'cn',
                    'ref_source': 'homepage',
                    'utm_source': ''
                }
                # The values above need to be taken from a packet capture.

                yield scrapy.FormRequest(
                    url=post_url,
                    headers=self.headers,
                    formdata=post_data,
                    callback=self.index_page
                )
            else:
                print u'Wrong captcha!'

    def index_page(self, response):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers
            )

    def parse(self, response):
        """
        Extract the URLs of all questions on the home page and follow them to crawl the detail pages.
        """
        # /question/19618276/answer/267334062
        all_urls = response.xpath('//a[@data-za-detail-view-element_name="Title"]/@href').extract()
        all_urls = [urlparse.urljoin(response.url, url) for url in all_urls]
        for url in all_urls:
            # https://www.zhihu.com/question/19618276/answer/267334062
            # Extract both the detail URL and the question ID.
            result = re.search('(.*zhihu.com/question/(\d+))', url)
            if result:
                detail_url = result.group(1)
                question_id = result.group(2)
                # Hand the detail URL to the downloader.
                yield scrapy.Request(
                    url=detail_url,
                    headers=self.headers,
                    callback=self.parse_detail_question,
                    meta={
                        'question_id': question_id,
                    }
                )

                # While requesting the detail URL, also request the answers for this question ID.
                # Questions and answers have independent URLs; the answers come from a JSON API
                # that can be requested directly, without going through the question page.
                yield scrapy.Request(
                    # Parameters: question ID and offset. Offset defaults to 0, i.e. start from the first answer.
                    url=self.start_answer_url.format(question_id, 0),
                    headers=self.headers,
                    callback=self.parse_detail_answer,
                    meta={
                        'question_id': question_id
                    }
                )

                # Stop after the first question (simple demo).
                break

    def parse_detail_question(self, response):
        """
        Parse the question data on the detail page: title, description, view count, follower count, etc.
        """
        item_loader = ItemLoader(item=ZhiHuQuestionItem(), response=response)
        item_loader.add_value('question_id', response.meta['question_id'])
        item_loader.add_xpath('question_title', '//div[@class="QuestionHeader"]//h1/text()')
        item_loader.add_xpath('question_topic', '//div[@class="QuestionHeader-topics"]//div[@class="Popover"]/div/text()')
        # Some questions have no description.
        item_loader.add_xpath('question_content', '//span[@class="RichText"]/text()')
        item_loader.add_xpath('question_watch_num', '//button[contains(@class, "NumberBoard-item")]//strong/text()')
        item_loader.add_xpath('question_click_num', '//div[@class="NumberBoard-item"]//strong/text()')
        item_loader.add_xpath('question_answer_num', '//h4[@class="List-headerText"]/span/text()')
        item_loader.add_xpath('question_comment_num', '//div[@class="QuestionHeader-Comment"]/button/text()')
        item_loader.add_value('question_url', response.url)
        item_loader.add_value('question_crawl_time', datetime.now())

        question_item = item_loader.load_item()
        yield question_item

    def parse_detail_answer(self, response):
        """
        Parse all answers for a given question ID.
        """
        answer_dict = json.loads(response.body)
        is_end = answer_dict['paging']['is_end']
        next_url = answer_dict['paging']['next']

        for answer in answer_dict['data']:
            answer_item = ZhiHuAnswerItem()
            answer_item['answer_id'] = answer['id']
            answer_item['answer_question_id'] = answer['question']['id']
            answer_item['answer_author_id'] = answer['author']['id']
            answer_item['answer_url'] = answer['url']
            answer_item['answer_comment_num'] = answer['comment_count']
            answer_item['answer_praise_num'] = answer['voteup_count']
            answer_item['answer_create_time'] = answer['created_time']
            answer_item['answer_content'] = answer['content']
            answer_item['answer_crawl_time'] = datetime.now()
            answer_item['answer_update_time'] = answer['updated_time']

            yield answer_item

        # If is_end is False there is another page of answers.
        if not is_end:
            yield scrapy.Request(
                url=next_url,
                headers=self.headers,
                callback=self.parse_detail_answer
            )
The code in items.py is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from datetime import datetime

import scrapy

from utils.common import extract_num


class ZhihuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class ZhiHuQuestionItem(scrapy.Item):
    question_id = scrapy.Field()           # question ID
    question_title = scrapy.Field()        # question title
    question_topic = scrapy.Field()        # question topics
    question_content = scrapy.Field()      # question description
    question_watch_num = scrapy.Field()    # number of followers
    question_click_num = scrapy.Field()    # number of views
    question_answer_num = scrapy.Field()   # total number of answers
    question_comment_num = scrapy.Field()  # number of comments
    question_crawl_time = scrapy.Field()   # crawl time
    question_url = scrapy.Field()          # question detail URL

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_question(question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE question_id=VALUES(question_id),question_title=VALUES(question_title),question_topic=VALUES(question_topic),question_content=VALUES(question_content),question_watch_num=VALUES(question_watch_num),question_click_num=VALUES(question_click_num),question_answer_num=VALUES(question_answer_num),question_comment_num=VALUES(question_comment_num),question_crawl_time=VALUES(question_crawl_time),question_url=VALUES(question_url)"

        # Normalise the field values.
        question_id = str(self['question_id'][0])
        question_title = ''.join(self['question_title'])
        question_topic = ",".join(self['question_topic'])

        try:
            question_content = ''.join(self['question_content'])
        except Exception, e:
            question_content = 'question_content is empty'

        question_watch_num = ''.join(self['question_watch_num']).replace(',', '')
        question_watch_num = extract_num(question_watch_num)

        question_click_num = ''.join(self['question_click_num']).replace(',', '')
        question_click_num = extract_num(question_click_num)

        # e.g. '86 回答'
        question_answer_num = ''.join(self['question_answer_num'])
        question_answer_num = extract_num(question_answer_num)

        # e.g. '100 條評論'
        question_comment_num = ''.join(self['question_comment_num'])
        question_comment_num = extract_num(question_comment_num)

        question_crawl_time = self['question_crawl_time'][0]
        question_url = self['question_url'][0]

        args_tuple = (question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url)

        return insert_sql, args_tuple


class ZhiHuAnswerItem(scrapy.Item):
    answer_id = scrapy.Field()           # answer ID (primary key of the zhihu_answer table)
    answer_question_id = scrapy.Field()  # question ID (primary key of the zhihu_question table)
    answer_author_id = scrapy.Field()    # ID of the answering user
    answer_url = scrapy.Field()          # answer URL
    answer_comment_num = scrapy.Field()  # total comments on the answer
    answer_praise_num = scrapy.Field()   # total upvotes on the answer
    answer_create_time = scrapy.Field()  # answer creation time
    answer_content = scrapy.Field()      # answer content
    answer_update_time = scrapy.Field()  # answer update time

    answer_crawl_time = scrapy.Field()   # crawl time

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_answer(answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE answer_id=VALUES(answer_id),answer_question_id=VALUES(answer_question_id),answer_author_id=VALUES(answer_author_id),answer_url=VALUES(answer_url),answer_comment_num=VALUES(answer_comment_num),answer_praise_num=VALUES(answer_praise_num),answer_create_time=VALUES(answer_create_time),answer_content=VALUES(answer_content),answer_update_time=VALUES(answer_update_time),answer_crawl_time=VALUES(answer_crawl_time)"

        # Prepare the values held in the item.
        # fromtimestamp(timestamp) converts a Unix timestamp into a datetime.
        answer_id = self['answer_id']
        answer_question_id = self['answer_question_id']
        answer_author_id = self['answer_author_id']
        answer_url = self['answer_url']
        answer_comment_num = self['answer_comment_num']
        answer_praise_num = self['answer_praise_num']
        answer_content = self['answer_content']
        answer_create_time = datetime.fromtimestamp(self['answer_create_time'])
        answer_update_time = datetime.fromtimestamp(self['answer_update_time'])
        answer_crawl_time = self['answer_crawl_time']

        args_tuple = (answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time)

        return insert_sql, args_tuple
The code in pipelines.py is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

# Asynchronous database writes. execute() and commit() insert data synchronously; Scrapy's parsing
# is asynchronous and multi-threaded and therefore much faster than the database writes, so with a
# large amount of data the inserts cannot keep up, the writes back up, and the database can stall
# or data can be lost. Using a connection pool avoids blocking the crawl.


class ZhihuspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class MySQLTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        args = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset=settings['MYSQL_CHARSET'],
            cursorclass=MySQLdb.cursors.DictCursor
        )
        # Create a connection pool.
        # Argument 1: the driver used to connect to MySQL
        # Argument 2: the connection parameters (host, port, user, ...)
        dbpool = adbapi.ConnectionPool('MySQLdb', **args)
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction() hands the insert over to one of the threads in the pool,
        # so the insert runs asynchronously.
        query = self.dbpool.runInteraction(self.insert, item)
        # addErrback() is called when the asynchronous write fails.
        query.addErrback(self.handle_error, item)
        # Return the item so any later pipelines still receive it.
        return item

    def handle_error(self, failure, item):
        print u'Insert failed, reason: {}, item: {}'.format(failure, item)

    def insert(self, cursor, item):
        # With several tables, the parse time of each table's data is unpredictable, so questions
        # and answers will not necessarily reach the pipeline at the same time. The SQL therefore
        # cannot be hard-coded here for every table; instead each Item class builds its own
        # insert statement via get_insert_sql().
        insert_sql, args = item.get_insert_sql()
        cursor.execute(insert_sql, args)
The code in settings.py is as follows:
# -*- coding: utf-8 -*-

# Scrapy settings for ZhiHuSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ZhiHuSpider'

SPIDER_MODULES = ['ZhiHuSpider.spiders']
NEWSPIDER_MODULE = 'ZhiHuSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ZhiHuSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ZhiHuSpider.middlewares.ZhihuspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ZhiHuSpider.middlewares.ZhihuspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'ZhiHuSpider.pipelines.ZhihuspiderPipeline': 300,
    'ZhiHuSpider.pipelines.MySQLTwistedPipeline': 1,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MySQL connection settings (fill in your own values)
MYSQL_HOST = 'localhost'  # MySQL server host
MYSQL_DB = ''             # database name
MYSQL_USER = ''           # database user
MYSQL_PASSWD = ''         # database password
MYSQL_CHARSET = 'utf8'
In addition, a small utility package (utils, containing common.py) was created to clean up the numeric fields of the items. It is imported in items.py with from utils.common import extract_num. The code is as follows:
import re


def extract_num(value):
    """Return the first integer found in value, e.g. '86 回答' -> 86."""
    result = re.search(re.compile('(\d+)'), value)
    if result is None:
        # No digits found (e.g. the field was empty); fall back to 0.
        return 0
    res = int(result.group(1))
    return res
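Once the account credentials in zhihu.py and the MySQL settings in settings.py are filled in, and zhihu_question / zhihu_answer tables matching the insert statements exist in the database, the spider is started from the project directory in the usual way:

scrapy crawl zhihu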