Day 5: Extracting Zhihu question and answer fields and storing them in the database

 

Corresponding GitHub repo: Zhihu and Lagou css

 

Summary:
1. Scrapy's Request class supports a cookies attribute. To make the spider send cookies with its requests, you can override the Spider's start_requests method. start_requests() returns the initial requests for the spider's start sites, playing the same role as start_urls: the requests it returns take the place of the ones generated from start_urls.
 
2. json.loads converts a JSON string into Python objects, and json.dumps converts Python objects back into a JSON string (a quick example follows below).
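For instance, a minimal round trip through the two functions (the sample JSON mirrors the paging structure used later in parse_answer):

import json

# JSON string -> Python dict
ans_json = json.loads('{"paging": {"is_end": false, "next": "https://..."}}')
print(ans_json["paging"]["is_end"])   # False

# Python dict -> JSON string
print(json.dumps(ans_json))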
 
 
 
I. Simulated login with Selenium
After entering the project directory, run:
scrapy genspider zhihu_sel www.zhihu.com
 
1. Log in to Zhihu with Selenium and capture the cookies
import time
import pickle
import scrapy
from selenium import webdriver

def start_requests(self):
    browser = webdriver.Chrome()
    browser.get('https://www.zhihu.com/signin')
    input1 = browser.find_element_by_css_selector("input[name=username]")
    input1.send_keys('xx')
    input2 = browser.find_element_by_css_selector("input[name=password]")
    input2.send_keys('xx')
    button = browser.find_element_by_class_name('SignFlow-submitButton')
    button.click()
    time.sleep(10)
    Cookies = browser.get_cookies()
    print(Cookies)
    cookie_dict = {}
    for cookie in Cookies:
        # Save each cookie to a local file so it can be reused later
        f = open('./cookie/' + cookie['name'] + '.zhihu', 'wb')
        pickle.dump(cookie, f)
        f.close()
        cookie_dict[cookie['name']] = cookie['value']
    browser.close()
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict, headers=self.headers)]

When Scrapy is to read the local cookie files, the last line above is required (a sketch of the reading side follows below). Also add the following to zhihu_sel.py so that every subsequent request automatically carries the cookie information:
custom_settings = {
    "COOKIES_ENABLED": True,
    "DOWNLOAD_DELAY": 1.5,
}
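On later runs, the browser step can be skipped by reading the pickled cookies back. A minimal sketch, assuming the ./cookie directory written above already holds the *.zhihu files (this is not part of the original snippet):

import os
import pickle
import scrapy

def start_requests(self):
    cookie_dict = {}
    for name in os.listdir('./cookie'):
        if name.endswith('.zhihu'):
            with open(os.path.join('./cookie', name), 'rb') as f:
                cookie = pickle.load(f)
                cookie_dict[cookie['name']] = cookie['value']
    return [scrapy.Request(url=self.start_urls[0], dont_filter=True,
                           cookies=cookie_dict, headers=self.headers)]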

 

 
Notes:
1) Only Chrome 60 and the matching ChromeDriver work; other versions raise a "grant type" error.
2) The address www.zhihu.com/signin is a clean page that is easy to log in through; the addresses found via web search are fiddly and the login fails.
3) In testing, the ./cookie directory used to save the cookie files must be created with mkdir cookie after entering the virtual environment in cmd; created any other way, the files cannot be saved into the cookie directory.
 
 
2. Adjust the main.py used for debugging: comment out jobbole and add the Zhihu entry.
 
 
3. Using requests to simulate the Zhihu login yourself: see the zhihu_login_requests.py file in the course materials.
Save the cookies locally so the next login can simply read the local cookie file (a sketch of the save/load part follows below).
Note:
1) CSRF protection: a random token is added to the username/password session information and stored encrypted in session_value; the session kept in the database consists of session_key, session_value and an expiry time.
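A minimal sketch of the cookie save/reuse part only, using requests.Session with a standard-library cookie jar; the actual login request in zhihu_login_requests.py (form fields, xsrf token, captcha) is specific to Zhihu and is not reproduced here, and the file name cookies.txt is an assumption:

import os
import requests
from http import cookiejar

session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename="cookies.txt")

def load_cookies():
    # Reuse a previous login by loading the local cookie file, if it exists
    if os.path.exists("cookies.txt"):
        session.cookies.load(ignore_discard=True)
        return True
    return False

def save_cookies():
    # Call this after a successful login so the next run can skip logging in
    session.cookies.save(ignore_discard=True)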
 
 
 
 
II. Analysing Zhihu and designing the tables
 
1. Adding a user_agent in scrapy shell
Sometimes a request without a user_agent header cannot retrieve the page content, so run the following in the cmd virtual environment:
scrapy shell -s USER_AGENT="<UA header string>" https://www.zhihu.com/question/34659999
Then run code such as the snippet below to write the page content to a file of your choice.
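Inside the scrapy shell, something like this dumps the downloaded page to a local file (the file name is arbitrary):

with open("zhihu_question.html", "w", encoding="utf-8") as f:
    f.write(response.text)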
 
 
2. Install the JsonView plugin to view JSON responses in a structured way; it needs the webcontent folder loaded from the unzipped package.
Then double-click the ajax URL to view it.
 
 
3. Designing the question and answer tables

Question table
Note: the question table has no create_time or update_time fields, so the item does not have to include them; however, both can be found in the answers' ajax data, so they can be added to the question table later.

Answer table

Note: an answer can be anonymous, so author_id is nullable.
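The original table designs were shown as screenshots; as an illustrative sketch only, assuming the column set used by the insert statements later in this post (column types and lengths are guesses):

CREATE TABLE zhihu_question (
    zhihu_id        BIGINT NOT NULL PRIMARY KEY,
    topics          VARCHAR(255),
    url             VARCHAR(300) NOT NULL,
    title           VARCHAR(200) NOT NULL,
    content         LONGTEXT,
    answer_num      INT,
    comments_num    INT,
    watch_user_num  INT,
    click_num       INT,
    crawl_time      DATETIME NOT NULL
);

CREATE TABLE zhihu_answer (
    zhihu_id     BIGINT NOT NULL PRIMARY KEY,
    url          VARCHAR(300) NOT NULL,
    question_id  BIGINT NOT NULL,
    author_id    VARCHAR(100),          -- nullable: answers can be anonymous
    author_name  VARCHAR(100),
    content      LONGTEXT NOT NULL,
    comments_num INT,
    create_time  DATETIME,
    update_time  DATETIME,
    crawl_time   DATETIME NOT NULL
);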
 
 
 
 
 
 
 
III. Extracting the question with an Item Loader
 
1. By default Scrapy crawls in depth-first order (a settings sketch for switching to breadth-first follows below).
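If breadth-first order is preferred instead, the Scrapy FAQ describes swapping the scheduler queues via settings; a minimal sketch for settings.py:

# Crawl breadth-first instead of the default depth-first (per the Scrapy FAQ)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"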
 
2. In the F12 developer tools, the Request Headers usually contain the following two entries.
 
3. Measures to avoid getting banned
1) Add request headers
headers = {
    # HOST is the domain name being accessed, https://blog.csdn.net/zhangqi_gsts/article/details/50775341
    "HOST": "www.zhihu.com",
    # Referer tells which page the request came from and can be used to block hotlinking, https://blog.csdn.net/shenqueying/article/details/79426884
    "Referer": "https://www.zhihu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/60.0.3112.113 Safari/537.36"
}

 

 
2) Cookie settings and download delay (cookies would normally be disabled to avoid bans, but they stay enabled here because the Zhihu login depends on them)
# Key point: avoid getting banned
custom_settings = {
    "COOKIES_ENABLED": True,
    "DOWNLOAD_DELAY": 1
}

 

 
 
 
3. Writing the parse function
After the Selenium login, parse is the first callback to run.
def parse(self, response):
    """
    Extract all urls from the html page and follow them for further crawling.
    If an extracted url has the form /question/xxx, download it and go straight to the parsing function.
    """
    all_urls = response.css("a::attr(href)").extract()
    all_urls = [parse.urljoin(response.url, url) for url in all_urls]
    # Filter every url with a lambda: keep it if the result is True, drop it if False
    all_urls = filter(lambda x: True if x.startswith("https") else False, all_urls)
    for url in all_urls:
        # Both the question urls and the concrete answer urls have to match; the "or" is the
        # parenthesised (/|$), because a bare question url has nothing after the id
        match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", url)
        if match_obj:
            # A question-related page was matched: download it and hand it to the extraction function
            request_url = match_obj.group(1)
            yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
        else:
            # Comment out the branch below to make debugging easier
            pass
            # Not a question page: keep following it
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)
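A quick check of that regex in isolation shows what group(1) and group(2) capture (the url is just an example):

import re

# Same pattern as in parse, written as a raw string here
pattern = r"(.*zhihu.com/question/(\d+))(/|$).*"
m = re.match(pattern, "https://www.zhihu.com/question/34659999/answer/123456")
print(m.group(1))   # https://www.zhihu.com/question/34659999  -> request_url
print(m.group(2))   # 34659999                                 -> question_id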

 

 
Notes:
1) Join the relative addresses into absolute ones, then keep only the urls that start with https
from urllib import parse

url1 = response.css("a::attr(href)").extract()
url2 = [parse.urljoin(response.url, url) for url in url1]

 

 
2) x is each value in url2; if x starts with https the lambda returns True and the url is kept in url3
url3 = filter(lambda x: True if x.startswith("https") else False, url2)

If that feels hard to read, it can be written out as:
url_list = []
for url in url2:
    if url.startswith("https"):
        url_list.append(url)

 

 
 
 
4. Supplement: filter and map

filter() filters a sequence, dropping the elements that do not satisfy the condition, and returns an iterator; use list() to convert the result to a list.
It takes two arguments, a function and a sequence. Each element of the sequence is passed to the function, which returns True or False; the elements for which it returns True make up the result.
def is_odd(n):
    return n % 2 == 1

tmplist = filter(is_odd, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
newlist = list(tmplist)
print(newlist)   # [1, 3, 5, 7, 9]

 

 
map() takes a function f and one or more lists and applies f to the elements in turn; in Python 3 it returns an iterator, so wrap it in list() to get a list back.
# One list
>>> list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))
[1, 4, 9, 16, 25]
# Two lists: the elements at the same position are added together
>>> list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
[3, 7, 11, 15, 19]

 

 
 
 
5. Referring to the database fields defined above, define the question and answer items in items.py
# Zhihu question item
class ZhihuQuestionItem(scrapy.Item):
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()   # needed because parse_question loads url_object_id below
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()
    click_num = scrapy.Field()
    crawl_time = scrapy.Field()
    crawl_update_time = scrapy.Field()

# Zhihu answer item
class ZhihuAnswerItem(scrapy.Item):
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    author_name = scrapy.Field()   # needed because parse_answer fills author_name below
    content = scrapy.Field()
    praise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()
    crawl_update_time = scrapy.Field()

 

 
 
 
 
6. Write the parse_question function in zhihu_sel.py to extract the question fields from the page.

The following command can be used to test whether the Zhihu field extraction works:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36" https://www.zhihu.com/question/57195513

For example, a way to extract a single answer:
# response.css(".QuestionAnswers-answers .List-item:nth-child(1) .RichContent-inner span::text").extract()
 
def parse_question(self, response):
    # Handle the new page version; the new version has the unique class QuestionHeader-title for the title, the old one does not
    if "QuestionHeader-title" in response.text:
        match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if match_obj:
            # group(2) captures the (\d+) part
            question_id = int(match_obj.group(2))
        # Use the ItemLoader shipped with Scrapy to keep the code concise; instantiate it first
        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_value("zhihu_id", question_id)
        item_loader.add_css("title", "h1.QuestionHeader-title::text")
        # Example for a single answer's content, for reference; the content selector below grabs all answers
        # response.css(".QuestionAnswers-answers .List-item:nth-child(1) .RichContent-inner span::text").extract()
        item_loader.add_css("content", ".QuestionAnswers-answers")
        item_loader.add_css("topics", ".QuestionHeader-topics .Tag.QuestionTopic .Popover div::text")
        item_loader.add_css("answer_num", ".List-headerText span::text")
        item_loader.add_css("comments_num", ".QuestionHeader-Comment button::text")
        # watch_user_num here contains both the follow count and the view count; they are split apart during data cleaning
        item_loader.add_css("watch_user_num", ".NumberBoard-itemValue ::text")
        item_loader.add_value("url", response.url)
        question_item = item_loader.load_item()

    # Request the backend answer api for this question
    yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
    yield question_item

 

 
Notes
1) When debugging, set a breakpoint at the line if "QuestionHeader-title" in response.text: and step through with F6; otherwise a database error is raised.

2) Extracting a field with an xpath "or"
Sometimes the title comes in two formats on different pages, e.g. inside an <a> tag or inside a <span> tag.
A css selector such as response.css(".zh-question-title a::text") then no longer covers both cases, so an "or" expression is needed, which xpath supports:
item_loader.add_xpath("title", "//*[@id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")

 

 

 
7. Writing the answer handler parse_answer
The answer content is loaded via ajax; analysing the page shows there is an api endpoint that can be called directly, as in the next and previous addresses below.
 
7.1 First define a variable in zhihu_sel.py to issue the initial request for a question's answers; note that the variable parts of the url have to be replaced with {0}, {1} and {2} (an example of the substitution follows after the url).
start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={1}&offset={2}&sort_by=default"
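For instance, the way parse_question fills the three placeholders for the first page of answers (the question id here is just an example):

# {0} = question id, {1} = limit (answers per page), {2} = offset
first_page = start_answer_url.format(34659999, 20, 0)
# -> https://www.zhihu.com/api/v4/questions/34659999/answers?include=...&limit=20&offset=0&sort_by=default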
 
def parse_answer(self, response):
    # json.loads converts the json string into Python objects
    ans_json = json.loads(response.text)
    # Whether there are more pages, and the url of the next page: this is the paging info seen in Preview during page analysis
    is_end = ans_json["paging"]["is_end"]
    next_url = ans_json["paging"]["next"]
    # Extract the answer fields
    for answer in ans_json["data"]:
        answer_item = ZhihuAnswerItem()
        # answer_item["url_object_id"] = get_md5(url=answer["url"])
        answer_item["zhihu_id"] = answer["id"]
        answer_item["question_id"] = answer["question"]["id"]
        # An answer may be anonymous; in that case the author field has no id, so store None
        answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
        answer_item["author_name"] = answer["author"]["name"] if "name" in answer["author"] else None
        answer_item["content"] = answer["content"] if "content" in answer else None
        answer_item["praise_num"] = answer["voteup_count"]
        answer_item["comments_num"] = answer["comment_count"]
        answer_item["url"] = "https://www.zhihu.com/question/{0}/answer/{1}".format(answer["question"]["id"], answer["id"])
        answer_item["create_time"] = answer["created_time"]
        answer_item["update_time"] = answer["updated_time"]
        answer_item["crawl_time"] = datetime.now()
        yield answer_item

    # If this is not the last page, request the next page
    if not is_end:
        yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

 

 
 
 
IV. Saving the data to the database

Method 1: run a different mysql statement depending on the item class
def do_insert(self, cursor, item):
    if item.__class__.__name__ == "JobBoleArticleItem":
        insert_sql = """..."""
The code above obtains the name of the class the current item belongs to.
This approach hard-codes the item names, which makes later changes awkward.
 
 
Method 2: put the sql statement for each item into its class in items.py, wrapped in a function.

1. Add the following to ZhihuQuestionItem, the Zhihu question item:
def get_insert_sql(self):
    # sql statement for inserting into the zhihu_question table
    insert_sql = """
        insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
            watch_user_num, click_num, crawl_time)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num),
            comments_num=VALUES(comments_num), watch_user_num=VALUES(watch_user_num), click_num=VALUES(click_num)
    """
    # scrapy.Field values are lists
    zhihu_id = self["zhihu_id"][0]
    topics = ",".join(self["topics"])
    url = self["url"][0]
    title = "".join(self["title"])
    content = "".join(self["content"])
    # extract_num is the get_nums helper defined earlier, just promoted to a reusable function
    answer_num = extract_num("".join(self["answer_num"]))
    comments_num = extract_num("".join(self["comments_num"]))
    # The follow count and the view count come out together and contain comma separators, so they are handled separately
    if len(self["watch_user_num"]) == 2:
        watch_user_num_click = self["watch_user_num"]
        watch_user_num = extract_num_include_dot(watch_user_num_click[0])
        click_num = extract_num_include_dot(watch_user_num_click[1])
    else:
        watch_user_num_click = self["watch_user_num"]
        watch_user_num = extract_num_include_dot(watch_user_num_click[0])
        click_num = 0
    # The datetime has to be converted to a string
    crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
    # The order must match the columns in the sql statement
    params = (zhihu_id, topics, url, title, content, answer_num, comments_num,
              watch_user_num, click_num, crawl_time)
    return insert_sql, params

 

Note:
The view count is extracted with a dedicated helper, extract_num_include_dot, defined in the utils.common module (a sketch of both helpers follows below).
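The course materials ship these helpers; as a rough sketch of what they are assumed to do (the real utils/common.py may differ):

import re

def extract_num(text):
    # Pull the first plain integer out of text such as "1234 个回答"
    match_re = re.match(r".*?(\d+).*", text)
    return int(match_re.group(1)) if match_re else 0

def extract_num_include_dot(text):
    # Counts such as the follow/view numbers are rendered with separators, e.g. "1,399";
    # strip the separators first, then extract the integer
    cleaned = text.replace(",", "")
    match_re = re.match(r".*?(\d+).*", cleaned)
    return int(match_re.group(1)) if match_re else 0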
 
 
 
2. Adapting MysqlTwistedPipeline
def do_insert(self, cursor, item):
    # Build the sql statement that matches the item type and insert it into mysql
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)
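For context, a minimal sketch of the pipeline this method lives in, assuming pymysql and Twisted's adbapi, with the connection settings (MYSQL_HOST and so on) assumed to be defined in settings.py; everything except do_insert is illustrative:

import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # Connection parameters are assumed to live in settings.py
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run do_insert on the thread pool so inserts do not block the crawl
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # Log database errors instead of silently swallowing them
        print(failure)

    def do_insert(self, cursor, item):
        # Build the sql statement that matches the item type and insert it into mysql
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)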

 

 
3. Add the insert code to ZhihuAnswerItem(scrapy.Item)
def get_insert_sql(self):
    # sql statement for inserting into the zhihu_answer table
    insert_sql = """
        insert into zhihu_answer(zhihu_id, url, question_id, author_id, author_name, content,
            comments_num, create_time, update_time, crawl_time)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE content=VALUES(content), comments_num=VALUES(comments_num),
            update_time=VALUES(update_time)
    """

    # Converting the int timestamp to a datetime needs fromtimestamp; converting that on to a string needs strftime
    create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
    update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
    params = (
        self["zhihu_id"], self["url"], self["question_id"], self["author_id"], self["author_name"],
        self["content"], self["comments_num"], create_time, update_time,
        self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
    )
    return insert_sql, params

 

 
Notes:
1) The praise count has an issue for now, so it is commented out.
2) What ON DUPLICATE KEY UPDATE is for: zhihu_id is used as the primary key, and on a re-crawl fields such as the praise count may change while the primary key stays the same, which would otherwise cause a primary-key conflict; adding this clause updates the listed columns instead of raising an error.