This article covers the basics of web data collection (scraping).
A crawler is simply an automated data-collection tool: tell it which data you want, hand it a URL, and it fetches the data for you. The underlying principle is straightforward: the crawler sends an HTTP request to the target server, the server returns a response, the crawler client receives that response and extracts the data from it, and then cleans and stores the data.
The screenshot below is from the Juejin booklet 基於 Python 實現微信公衆號爬蟲 and its summary of 《圖解HTTP》.
Python offers many ways to make HTTP requests, and third-party open-source libraries provide even richer functionality, so you never have to start from raw socket programming. For example, you can request a URL with the built-in urllib module.
Let's get hands-on first and write a test crawler:
from urllib.request import urlopen  # the request module of Python's urllib package exports urlopen

html = urlopen("http://jxdxsw.com/")  # urlopen opens and reads a remote object fetched over the network
print(html.read())
Then save this code as `scrapetest.py` and run the following command in a terminal:
python3 scrapetest.py
This prints the complete HTML source of the homepage at http://jxdxsw.com/.
Note (鯨魚): in Python 3.x, urllib is split into submodules:
- urllib.request
- urllib.error
- urllib.parse
- urllib.robotparser
urllib is part of the Python standard library and covers the everyday HTTP tasks (opening URLs, handling errors, parsing URLs, and so on).
See the official Python documentation for more details.
A more complete example:
import ssl
from urllib.request import Request
from urllib.request import urlopen

context = ssl._create_unverified_context()

# HTTP request
request = Request(url="http://jxdxsw.com",
                  method="GET",
                  headers={"Host": "jxdxsw.com"},
                  data=None)

# HTTP response
response = urlopen(request, context=context)
headers = response.info()    # response headers
content = response.read()    # response body
code = response.getcode()    # status code
The built-in urllib module is fairly low-level and verbose; for simple crawlers, consider the Requests library instead:
pip3 install requests
>>> r = requests.get("https://httpbin.org/ip") >>> r <Response [200]> # 響應對象 >>> r.status_code # 響應狀態碼 200 >>> r.content # 響應內容 '{\n "origin": "183.237.232.123"\n}\n'...
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
Anti-crawler mechanisms on the server often check whether the User-Agent request header looks like a real browser, so with Requests we usually set a UA to disguise the request as coming from a browser:
>>> url = 'https://httpbin.org/headers'
>>> headers = {'user-agent': 'Mozilla/5.0'}
>>> r = requests.get(url, headers=headers)
URLs often carry a long query string. For readability, Requests lets you pass the query parameters separately via the params argument instead of appending them to the URL. For example, to request http://httpbin.org/get?key=val:
>>> url = "http://httpbin.org/get" >>> r = requests.get(url, params={"key":"val"}) >>> r.url u'http://httpbin.org/get?key=val'
Cookies are the credential a web browser uses to stay logged in to a site. Although cookies are technically part of the request headers, Requests lets you pass them separately with the cookies parameter:
>>> s = requests.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})
>>> s.text
u'{\n  "cookies": {\n    "from-my": "browser"\n  }\n}\n'
When a request hits a very slow server and you don't want to wait indefinitely, set timeout (in seconds). If the connection has not succeeded within that time, the request is aborted:
r = requests.get('https://google.com', timeout=5)
Sending too many requests in a short period easily gets you flagged as a crawler, so we often use proxy IPs to mask the client's real IP:
import requests

proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080',
}
r = requests.get('http://www.kuaidaili.com/free/', proxies=proxies, timeout=2)
If you want to keep a login (session) state with the server without specifying cookies on every request, use a Session. The Session API is identical to that of the requests module itself.
import requests

s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r = s.get('http://httpbin.org/cookies')
print(r.text)  # '{"cookies": {"a": "c"}}'
Let's use Requests to build a simple crawler that fetches the follower list of a Zhihu column.
Taking the "一起學習爬蟲" (learn crawlers together) column as an example, open its follower list.
Use Chrome DevTools to find the request URL that loads the follower list:
https://zhuanlan.zhihu.com/ap...
Then we use Requests to imitate the browser and send the request to the server:
import json
import requests


class SimpleCrawler:
    init_url = "https://zhuanlan.zhihu.com/api/columns/pythoneer/followers"
    offset = 0

    def crawl(self, params=None):
        # a UA must be set, otherwise the Zhihu server rejects the request as invalid
        headers = {
            "Host": "zhuanlan.zhihu.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
        }
        response = requests.get(self.init_url, headers=headers, params=params)
        print(response.url)
        data = response.json()
        # 7000 is the total number of followers
        # page through to load more, calling crawl recursively
        while self.offset < 7000:
            self.parse(data)
            self.offset += 20
            params = {"limit": 20, "offset": self.offset}
            self.crawl(params)

    def parse(self, data):
        # append the records to a file as JSON lines
        with open("followers.json", "a", encoding="utf-8") as f:
            for item in data:
                f.write(json.dumps(item))
                f.write('\n')


if __name__ == '__main__':
    SimpleCrawler().crawl()
That is the simplest single-threaded, Requests-based crawler for a Zhihu column's follower list. Requests is very flexible: request headers, query parameters, and cookie information can all be passed directly to the request method, and if the response body is JSON you can call json() on the response to get a Python object.
BeautifulSoup is not part of the Python standard library and must be installed separately.
Installation and usage details:
I'm on Ubuntu, so the following few commands are enough:
sudo apt-get install python-bs4
sudo apt-get install python3-pip   # install the pip package manager for Python 3
pip3 install beautifulsoup4
Parsing the page with BeautifulSoup gives you a BeautifulSoup object, which can print the document as a properly indented structure:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://jxdxsw.com/")
bsobj = BeautifulSoup(html.read(), "html.parser")

print(bsobj.prettify())
print("----------------------------- divider ---------------------------")
print(bsobj.title)
print("----------------------------- divider ---------------------------")
print(bsobj.find_all('a'))
html = urlopen("http://jxdxsw.com/")
This line can fail in two main ways:
In the first case the server returns an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases urlopen raises an HTTPError, which we can handle like this:
from urllib.error import HTTPError
from urllib.request import urlopen

try:
    html = urlopen("http://jxdxsw.com/")
except HTTPError as e:
    print(e)
    # return a null value, break out, or fall back to another plan
else:
    # the program continues. Note: if you already returned or broke out in the
    # exception handler above, this else block is unnecessary and will not run
    pass
In the second case the server itself cannot be reached (the link http://jxdxsw.com/ is down, or the URL is mistyped). urlopen may then hand you back a None object, similar to null in other languages (depending on the Python version it may instead raise a URLError).
# add a check to test whether the returned html is None
if html is None:
    print("URL is not found")
else:
    # the program continues
    pass
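Putting the two checks together, a minimal sketch using the same example URL might look like this:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_fetch(url):
    """Return the page content, or None if the request failed."""
    try:
        html = urlopen(url)
    except HTTPError as e:     # the server returned an error status (404, 500, ...)
        print("HTTP error:", e)
        return None
    except URLError as e:      # the server could not be reached at all
        print("Server not reachable:", e)
        return None
    if html is None:           # defensive check, mirroring the snippet above
        print("URL is not found")
        return None
    return html.read()


content = safe_fetch("http://jxdxsw.com/")
```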
Now let's build a crawler to scrape http://www.pythonscraping.com... .
On this page, the novel's dialogue is shown in red and the character names in green.
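The page URL above is truncated, so here is a minimal sketch of the extraction idea on a stand-in snippet of that page, assuming (as in the book's example) that names sit in green spans and dialogue in red spans:

```python
from bs4 import BeautifulSoup

# tiny stand-in for the page above (its real URL is truncated in the text)
html = """
<span class="red">"Well, Prince, so Genoa and Lucca are now just family estates..."</span>
<span class="green">Anna Pavlovna Scherer</span>
"""

bsobj = BeautifulSoup(html, "html.parser")
# assuming character names are wrapped in <span class="green">
for name in bsobj.find_all("span", {"class": "green"}):
    print(name.get_text())
```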
Scraping Tripadvisor with Requests + BeautifulSoup:
from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/Attractions-g294220-Activities-Nanjing_Jiangsu.html'
urls = ['https://www.tripadvisor.cn/Attractions-g294220-Activities-oa{}-Nanjing_Jiangsu.html#FILTERED_LIST'.format(str(i)) for i in range(30, 800, 30)]


def get_attraction(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    # print(soup)
    # When using CSS selectors with BeautifulSoup, child selectors must use
    # nth-of-type instead of nth-child
    # titles = soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(1) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
    titles = soup.select('a.poiTitle')
    # imgs = soup.select('img.photo_image')
    imgs = soup.select('img[width="200"]')

    # collect the fields into a dict
    for title, img in zip(titles, imgs):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
        }
        print(data)


for single_url in urls:
    get_attraction(single_url)
{'title': '夫子廟景區', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京夫子廟休閒街', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京1912街區', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '棲霞寺', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '夫子廟大成殿', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京毗盧寺', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
Careful readers will notice that every image has the same URL. That's because the real image addresses are not in the page's DOM; they are injected later by JavaScript. This is itself a form of anti-scraping, and we can work around it as follows:
Scrape the mobile version instead (provided its anti-scraping is not as strict):
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
}
url = 'https://www.tripadvisor.cn/Attractions-g294220-Activities-Nanjing_Jiangsu.html'

mb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(mb_data.text, 'html.parser')
imgs = soup.select('div.thumb.thumbLLR.soThumb > img')
for img in imgs:
    print(img.get('src'))
from bs4 import BeautifulSoup
import requests
import time

url = 'https://knewone.com/discover?page='


def get_page(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')

    for img, title, link in zip(imgs, titles, links):
        data = {
            'img': img.get('src'),
            'title': title.get('title'),
            'link': link.get('href')
        }
        print(data)


def get_more_pages(start, end):
    for one in range(start, end):
        get_page(url + str(one))
        time.sleep(2)


get_more_pages(1, 10)
http://scrapy-chs.readthedocs...
pip install Scrapy
scrapy startproject tutorial
A learning example:
https://github.com/scrapy/quo...
import re

line = 'jwxddxsw33'
if line == "jxdxsw33":
    print("yep")
else:
    print("no")

# ^ anchors the match at the start
regex_str = "^j.*"
if re.match(regex_str, line):
    print("yes")

# $ anchors the match at the end
regex_str1 = "^j.*3$"
if re.match(regex_str1, line):
    print("yes")

regex_str1 = "^j.3$"
if re.match(regex_str1, line):
    print("yes")

# greedy matching
regex_str2 = ".*(d.*w).*"
match_obj = re.match(regex_str2, line)
if match_obj:
    print(match_obj.group(1))

# non-greedy matching
# the ? makes the leading .* stop at the first d
regex_str3 = ".*?(d.*w).*"
match_obj = re.match(regex_str3, line)
if match_obj:
    print(match_obj.group(1))

# * means the preceding item repeats >= 0 times, + means >= 1 time
# ? switches to non-greedy mode
# because + requires at least one occurrence, .+ must match at least one character
line1 = 'jxxxxxxdxsssssswwwwjjjww123'
regex_str3 = ".*(w.+w).*"
match_obj = re.match(regex_str3, line1)
if match_obj:
    print(match_obj.group(1))

# {2} fixes the repeat count of the preceding item; {2,} means at least 2; {2,5} means 2 to 5 times
line2 = 'jxxxxxxdxsssssswwaawwjjjww123'
regex_str3 = ".*(w.{3}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = 'jxxxxxxdxsssssswwaawwjjjww123'
regex_str3 = ".*(w.{2}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = 'jxxxxxxdxsssssswbwaawwjjjww123'
regex_str3 = ".*(w.{5,}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

# | means "or"
line3 = 'jx123'
regex_str4 = "((jx|jxjx)123)"
match_obj = re.match(regex_str4, line3)
if match_obj:
    print(match_obj.group(1))
    print(match_obj.group(2))

# [] matches any single character listed inside the brackets
line4 = 'ixdxsw123'
regex_str4 = "([hijk]xdxsw123)"
match_obj = re.match(regex_str4, line4)
if match_obj:
    print(match_obj.group(1))

# [0-9]{9}: any digit from 0 to 9, repeated 9 times (a 9-digit number)
line5 = '15955224326'
regex_str5 = "(1[234567][0-9]{9})"
match_obj = re.match(regex_str5, line5)
if match_obj:
    print(match_obj.group(1))

# [^1]{9}: any character except 1, repeated 9 times
line6 = '15955224326'
regex_str6 = "(1[234567][^1]{9})"
match_obj = re.match(regex_str6, line6)
if match_obj:
    print(match_obj.group(1))

# [.*]{9}: inside brackets, . and * stand for the literal characters
line7 = '1.*59224326'
regex_str7 = "(1[.*][^1]{9})"
match_obj = re.match(regex_str7, line7)
if match_obj:
    print(match_obj.group(1))

# \s matches a whitespace character
line8 = '你 好'
regex_str8 = "(你\s好)"
match_obj = re.match(regex_str8, line8)
if match_obj:
    print(match_obj.group(1))

# \S matches anything that is not whitespace
line9 = '你真好'
regex_str9 = "(你\S好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

# \w matches a "word" character; unlike . it means [A-Za-z0-9_]
line9 = '你adsfs好'
regex_str9 = "(你\w\w\w\w\w好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

line10 = '你adsf_好'
regex_str10 = "(你\w\w\w\w\w好)"
match_obj = re.match(regex_str10, line10)
if match_obj:
    print(match_obj.group(1))

# \W (uppercase) matches anything NOT in [A-Za-z0-9_]
line11 = '你 好'
regex_str11 = "(你\W好)"
match_obj = re.match(regex_str11, line11)
if match_obj:
    print(match_obj.group(1))

# the unicode range [\u4E00-\u9FA5] matches Chinese characters
line12 = "鏡心的小樹屋"
regex_str12 = "([\u4E00-\u9FA5]+)"
match_obj = re.match(regex_str12, line12)
if match_obj:
    print(match_obj.group(1))

print("----- greedy matching -----")
line13 = 'reading in 鏡心的小樹屋'
regex_str13 = ".*([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

print("----- non-greedy matching -----")
line13 = 'reading in 鏡心的小樹屋'
regex_str13 = ".*?([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

# \d matches a digit
line14 = 'XXX出生於2011年'
regex_str14 = ".*(\d{4})年"
match_obj = re.match(regex_str14, line14)
if match_obj:
    print(match_obj.group(1))

regex_str15 = ".*?(\d+)年"
match_obj = re.match(regex_str15, line14)
if match_obj:
    print(match_obj.group(1))
Matching multiple ways of writing a birth date
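No code is shown for this one; a small sketch, with the accepted date formats assumed, could be:

```python
import re

# assumed sample formats: 2011年6月1日 / 2011-6-1 / 2011/06/01 / 2011年
lines = ['XXX出生於2011年6月1日', 'XXX出生於2011-6-1', 'XXX出生於2011/06/01', 'XXX出生於2011年']
regex_birth = r".*出生於(\d{4}([年/-]\d{1,2}([月/-]\d{1,2}日?)?)?)"
for line in lines:
    match_obj = re.match(regex_birth, line)
    if match_obj:
        print(match_obj.group(1))
```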
Matching email addresses
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
###
# Write a regular expression that validates email addresses.
# Version one should accept addresses such as:
# someone@gmail.com
# bill.gates@microsoft.com
###
import re

addr = 'someone@gmail.com'
addr2 = 'bill.gates@microsoft.com'


def is_valid_email(addr):
    if re.match(r'[a-zA-Z_\.]*@[a-zA-Z.]*', addr):
        return True
    else:
        return False


print(is_valid_email(addr))
print(is_valid_email(addr2))

# Version two should extract the display name from an address:
# <Tom Paris> tom@voyager.org => Tom Paris
# bob@example.com             => bob
addr3 = '<Tom Paris> tom@voyager.org'
addr4 = 'bob@example.com'


def name_of_email(addr):
    r = re.compile(r'^(<?)([\w\s]*)(>?)([\w\s]*)@([\w.]*)$')
    if not r.match(addr):
        return None
    else:
        m = r.match(addr)
        return m.group(2)


print(name_of_email(addr3))
print(name_of_email(addr4))
- Depth-first (recursive implementation): follow one path all the way to the deepest node, then backtrack.
- Breadth-first (queue implementation): traverse level by level — visit all the children before moving on to the grandchildren.

For these basic algorithms, see my earlier articles (a breadth-first crawl sketch follows the links):
- Data Structures and Algorithms: Binary Tree Algorithms
- Data Structures and Algorithms: Graphs and Graph Algorithms (Part 1)
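As a rough sketch of how the breadth-first strategy maps onto crawling, using the requests and BeautifulSoup tools from earlier (the max_pages cap and timeout are arbitrary choices):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def bfs_crawl(start_url, max_pages=50):
    """Breadth-first crawl: visit pages level by level using a FIFO queue."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()                  # FIFO: oldest discovered page first
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# A depth-first crawl would use a stack (queue.pop()) or recursion instead of popleft().
```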
scrapy startproject duitang
This automatically generates a project folder:
.
├── duitang            # the project's Python module; you will add your code here
│   ├── __init__.py
│   ├── items.py       # the project's item definitions
│   ├── middlewares.py
│   ├── pipelines.py   # the project's pipelines
│   ├── __pycache__
│   ├── settings.py    # the project's settings file
│   └── spiders        # the directory where the spider code lives
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg         # the project's configuration file
Next comes creating the spider, the file that implements the actual scraping logic. Scrapy provides a handy command-line tool for this: cd into the generated project folder and run it.
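The command itself is not shown above; Scrapy's genspider subcommand generates the spider skeleton, for example for the jobbole spider used later in this article:

```
scrapy genspider jobbole blog.jobbole.com
```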
Some tutorials worth studying:

- Scrapy Crawler Framework Tutorial (2): scraping the Douban movie Top 250
- Douban beauties
- Crawling detailed information on all Zhihu users with Scrapy and storing it in MongoDB (with video and source code)
- Building a Search Engine with a Distributed Scrapy Crawler — (2) crawling all articles on Jobbole
You can debug the selectors here with `scrapy shell <url>`.
`extract()`: serializes the matched nodes to unicode strings and returns them as a list.
Note the use of contains() in the XPath expressions here.
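As a quick check, a scrapy shell session against the article page used below might look like this (the xpath expressions mirror the spider code that follows):

```
scrapy shell http://blog.jobbole.com/110287/
>>> response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
>>> response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()
>>> response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()
```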
So under spiders/ we can write:
# ArticleSpider/ArticleSpider/spiders/jobbole.py
# -*- coding: utf-8 -*-
import re
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/110287/']

    def parse(self, response):
        # extract the article's fields (xpath version)
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_re = re.match(".*(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(".*(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)
        content = response.xpath("//div[@class='entry']").extract()[0]
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # drop entries that end with "評論" (comment counts)
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        print(tags)  # e.g. 職場,面試
        # print(create_date)
        pass
Run the spider and debug it:
scrapy crawl jobbole
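Launching from the command line makes it hard to hit breakpoints in an IDE; a common workaround (a sketch — this main.py entry file is an addition, not part of the generated project) is a tiny script at the project root that invokes the same command:

```python
# main.py, placed next to scrapy.cfg (hypothetical helper for IDE debugging)
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # make the project importable
execute(["scrapy", "crawl", "jobbole"])
```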
# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy.http import Request
from urllib import parse


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        """
        1. Extract the article URLs on the list page and hand them to scrapy to download,
           then pass each downloaded page to the detail parser for field extraction.
        2. Extract the next-page URL, hand it to scrapy to download, and feed the
           downloaded page back into parse().
        """
        # parse all article URLs on the list page and hand them to scrapy
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            # cover image URL
            # response.url + post_node
            # image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            url = parse.urljoin(response.url, post_url)
            request = Request(url, callback=self.parse_detail)
            yield request

        # extract the next page and hand it to scrapy to download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        print("--------")
        # extract the article's fields (xpath version)
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_re = re.match(".*(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(".*(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)
        content = response.xpath("//div[@class='entry']").extract()[0]
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # drop entries that end with "評論" (comment counts)
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        print(tags)  # e.g. 職場,面試
        # print(create_date)
        pass
Set a breakpoint at the end of the code and debug: you'll see that the scraped values have all been extracted.
Items are essentially a way to structure (serialize) the extracted data.
# ArticleSpider/ArticleSpider/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JobBoleArticleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()
Instantiate the item and fill in its values:
# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy.http import Request
from urllib import parse

from ArticleSpider.items import JobBoleArticleItem
# from ArticleSpider.utils.common import get_md5


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        """
        1. Extract the article URLs on the list page and hand them to scrapy to download,
           then pass each downloaded page to the detail parser for field extraction.
        2. Extract the next-page URL, hand it to scrapy to download, and feed the
           downloaded page back into parse().
        """
        # parse all article URLs on the list page and hand them to scrapy
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            # cover image URL
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            url = parse.urljoin(response.url, post_url)
            # post_url is the concrete article URL on each list page.
            # The Request below fetches the article detail page; the callback parses it
            # once the download finishes. Here the URL is already absolute and can be
            # used directly; if it were relative, urljoin(base, url) combines it with
            # response.url.
            request = Request(url, meta={"front_image_url": image_url}, callback=self.parse_detail)
            yield request

        # extract the next page and hand it to scrapy to download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        # instantiate the item
        article_item = JobBoleArticleItem()
        print("loading the item")
        front_image_url = response.meta.get("front_image_url", "")  # article cover image

        # extract the article's fields (xpath version)
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".", "").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        match_re = re.match(".*(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(".*(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        content = response.xpath("//div[@class='entry']").extract()[0]
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # drop entries that end with "評論" (comment counts)
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)

        # fill the instantiated item
        # article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        article_item["create_date"] = create_date
        article_item["front_image_url"] = [front_image_url]
        article_item["praise_nums"] = praise_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags
        article_item["content"] = content
        # print(tags)  # e.g. 職場,面試

        # the item is filled; yield it so it gets passed on to the pipelines
        yield article_item
items.py structures the data; for items to be passed on to a pipeline, the pipeline must be enabled in settings.py. Pipelines are mainly where data storage happens.
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}
Set a couple of breakpoints in pipelines.py and debug: you'll see that the item's values are exactly the data we extracted earlier and want to store.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exporters import JsonItemExporter


class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item


class JsonWithEncodingPipeline(object):
    # custom JSON file export
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()


class JsonExporterPipeline(object):
    # export a JSON file using the JsonItemExporter provided by scrapy
    def __init__(self):
        self.file = open('articleexport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item=item)
        return item


# image-processing pipeline
class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        for ok, value in results:
            image_file_path_ = value["path"]
            item["front_image_path"] = image_file_path_
        return item
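For these pipelines to run they must be registered in settings.py, and the image pipeline needs to know which item field holds the image URLs and where to store the downloads. A sketch (the priority numbers and the images directory are assumptions, and ImagesPipeline additionally requires Pillow to be installed):

```python
# settings.py (sketch)
import os

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticleImagePipeline': 1,    # lower number = runs earlier
    'ArticleSpider.pipelines.JsonExporterPipeline': 2,
}

IMAGES_URLS_FIELD = "front_image_url"                     # item field that holds image URLs
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')        # download directory for images
```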
# required on Ubuntu, otherwise the install fails with the error below
sudo apt-get install libmysqlclient-dev
# required on CentOS, otherwise the install fails with the error below
sudo yum install python-devel mysql-devel

pip3 install -i https://pypi.douban.com/simple/ mysqlclient
I also ran into this problem during installation:
Fix: a single command solves the "mysql_config not found" error.

pipeline.py:
import pymysql


class MysqlPipeline(object):
    def __init__(self):
        # get a database connection; for UTF-8 data the charset must be specified
        self.conn = pymysql.connect('127.0.0.1', 'root', 'wyc2016', 'article_spider',
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()  # get a cursor

    def process_item(self, item, spider):
        insert_sql = """INSERT INTO jobboleArticle(title, url, create_date, fav_nums)
                        VALUES (%s, %s, %s, %s)"""
        try:
            self.cursor.execute(insert_sql, (item["title"], item["url"],
                                             item["create_date"], item["fav_nums"]))
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes, not after every item
        self.conn.close()
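The pipeline assumes a jobboleArticle table already exists in the article_spider database; a minimal sketch of creating it (the column types are assumptions inferred from the INSERT statement):

```python
import pymysql

conn = pymysql.connect('127.0.0.1', 'root', 'wyc2016', 'article_spider', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS jobboleArticle (
        title       VARCHAR(200) NOT NULL,
        url         VARCHAR(300) NOT NULL,
        create_date VARCHAR(30),      -- stored as the raw string the spider extracts
        fav_nums    INT DEFAULT 0
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()
```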
We find that only three rows made it into the database: the code above inserts synchronously, and the spider parses faster than the database can insert, which causes blocking.
Let's rewrite it asynchronously:
import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host=settings["MYSQL_HOST"],
            database=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the MySQL insert asynchronously through twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # execute the actual insert
        insert_sql = """INSERT INTO jobboleArticle(title, url, create_date, fav_nums)
                        VALUES (%s, %s, %s, %s)"""
        cursor.execute(insert_sql, (item["title"], item["url"],
                                    item["create_date"], item["fav_nums"]))
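from_settings() pulls the connection parameters from settings.py, so entries like the following are needed there (the values are placeholders), plus an ITEM_PIPELINES entry pointing at MysqlTwistedPipeline:

```python
# settings.py (placeholder values)
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 1,
}
```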
Take WeChat Official Accounts as an example. The platform is closed: there is no public web entry point, and articles can only be received and read in the mobile client. So, to peek at the network requests behind an official account, we need the help of a packet-capture (proxy) tool.
The mainstream packet-capture tools include Charles, Fiddler, and mitmproxy.
First, make sure your phone and computer are on the same local network. If they are not, you can buy a portable Wi-Fi dongle and turn your computer into a minimal wireless router.
https://www.jianshu.com/p/be7...
Further reading (爬蟲之路):

- 《Python網絡數據採集》 (Web Scraping with Python)
- 基於 Python 實現微信公衆號爬蟲 (Juejin booklet)
- Debugging XPath with the Scrapy shell
- Building a Search Engine with a Distributed Scrapy Crawler (notes from the related imooc.com course)
- Python crawlers from beginner to giving up (3): basic usage of the urllib library
- ArticleSpider/ArticleSpider/spiders/jobbole.py
- scrapy/quotesbot — the official minimal example
- A detailed guide to connecting to MySQL with PyMySQL in Python 3
- Python 3.6 + pymysql: connecting to MySQL and simple CRUD operations
- Summary: common Python crawler tricks
- Reading a local txt file in Python and writing it into MySQL
- https://www.cnblogs.com/shaos...