A spider is a set of classes that define how a URL, or a family of URLs, gets crawled:
specifically, how the crawl is carried out and how structured data is extracted from the pages.
Put simply, it is where the scraping logic lives.

A spider generates the initial Requests for the first URLs and attaches a callback to each of them.
The first requests are defined in the start_requests() method, which by default builds a Request for every URL in the start_urls list.
The default callback is the parse method; a callback fires automatically once the download finishes and the response comes back.
Page content is parsed inside the callback,
usually with Scrapy's built-in Selectors,
although BeautifulSoup, lxml or other libraries can be used as well.
The callback parses the response and returns values,
which can be of four kinds: Request objects, dicts, Item objects, or None.
The returned Item objects are then persisted to a database or to files. A minimal sketch of this loop follows.
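A minimal sketch of that loop, using a hypothetical spider name and example.com as a stand-in start URL: parse() yields a dict (treated like an item) and new Requests, and Scrapy keeps calling it for every response.

import scrapy

class MinimalSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate the request/callback loop
    name = "minimal"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # 1. extract structured data and hand it to the item pipelines
        yield {"title": response.css("title::text").extract_first()}
        # 2. queue follow-up pages; parse() is called again for each response
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)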
Scrapy ships with five spider base classes:
#1. scrapy.spiders.Spider (scrapy.Spider is an alias for scrapy.spiders.Spider)
#2. scrapy.spiders.CrawlSpider
#3. scrapy.spiders.XMLFeedSpider
#4. scrapy.spiders.CSVFeedSpider
#5. scrapy.spiders.SitemapSpider
This is the simplest spider class; every other spider class (including any you define yourself) must inherit from it.
It provides no special functionality: only a default start_requests method that reads URLs from start_urls and sends the requests, with parse as the default callback.
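Its default start_requests is roughly equivalent to the sketch below (a simplification, not the literal Scrapy source):

import scrapy

class DefaultStartRequests(scrapy.Spider):
    name = "defaults_demo"  # hypothetical name
    start_urls = ["http://www.example.com/"]

    def start_requests(self):
        # what scrapy.Spider does for you when this method is not overridden:
        # one Request per start_urls entry, unfiltered, with parse() as the callback
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)

    def parse(self, response):
        pass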
A class that wraps scrapy.spiders.Spider one step further.
scrapy genspider -t crawl lagou www.lagou.com
When no template is named, the first template, basic, is used by default.
A spider generated from the basic template is based on the scrapy.spiders.Spider class.
crawl ------> scrapy.spiders.CrawlSpider
scrapy genspider --list
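On a stock Scrapy install the template list typically looks like this (the exact output may differ slightly between versions):

$ scrapy genspider --list
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed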
import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'                          # spider name, must be unique
    allowed_domains = ['www.amazon.cn']      # domains the spider is allowed to crawl
    start_urls = ['http://www.amazon.cn/']   # initial crawl URLs

    def parse(self, response):               # default callback, parses the response
        pass
#1. name = 'amazon'
    Defines the spider name. Scrapy locates the spider by this value, so it is required and must be unique (In Python 2 this must be ASCII only.)
#2. allowed_domains = ['www.amazon.cn']
    Defines the domains that may be crawled. If OffsiteMiddleware is enabled (it is by default), any domain not in this list, including its subdomains, will not be crawled.
    If the URL being crawled is https://www.example.com/1.html, add 'example.com' to the list.
#3. start_urls = ['http://www.amazon.cn/']
    If no URLs are specified elsewhere, the first requests are generated from this list.
#4. custom_settings
    A dict of configuration values that override the project-level settings while this spider runs.
    It must be defined as a class attribute, because the settings are loaded before the class is instantiated.
#5. settings
    self.settings['SETTING_NAME'] reads a value from settings.py; values defined in custom_settings still take precedence.
#6. logger
    A logger named after the spider, e.g. self.logger.debug('=============>%s' % self.settings['BOT_NAME'])
#7. crawler (good to know)
    This attribute is set by the from_crawler class method.
#8. from_crawler(crawler, *args, **kwargs) (good to know)
    You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.
#9. start_requests()
    Issues the first Requests and must return an iterable. Scrapy calls it exactly once, when the spider is opened.
    By default it builds Request(url, dont_filter=True) for every url in start_urls (for dont_filter see the custom dedup rules below).
    Override it if you want to change how crawling starts, for example to begin with a POST request:

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

#10. parse(response)
    The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.
#11. log(message[, level, component]) (good to know)
    Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.
#12. closed(reason)
    Triggered automatically when the spider finishes.
# -*- coding: utf-8 -*-
import scrapy
from Maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    # spider name
    name = 'maoyan'
    # domains allowed to be crawled
    allowed_domains = ['maoyan.com']
    offset = 0
    # initial URL
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        # base xpath: a list of selector objects, one per movie node
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # dd_list : [<element dd at xxx>, <...>]
        for dd in dd_list:
            # create the item object
            item = MaoyanItem()
            # dd.xpath('...') returns a list of selectors, e.g. [<selector xpath='' data='霸王別姬'>]
            # .extract() serializes every selector in that list into a unicode string
            # .extract_first() takes only the first string
            item['name'] = dd.xpath('./a/@title').extract_first().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').extract()[0].strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').extract()[0]
            yield item

        # not recommended: low efficiency
        self.offset += 10
        if self.offset <= 90:
            url = 'https://maoyan.com/board/4?offset={}'.format(str(self.offset))
            yield scrapy.Request(url=url, callback=self.parse)
# -*- coding: utf-8 -*-
import scrapy
from Maoyan.items import MaoyanItem


class MaoyanSpider(scrapy.Spider):
    # spider name
    name = 'maoyan2'
    # domains allowed to be crawled
    allowed_domains = ['maoyan.com']
    # initial URL
    start_urls = ['https://maoyan.com/board/4?offset=0']

    def parse(self, response):
        for offset in range(0, 91, 10):
            url = 'https://maoyan.com/board/4?offset={}'.format(str(offset))
            # hand the URL to the scheduler queue
            yield scrapy.Request(url=url, callback=self.parse_html)

    def parse_html(self, response):
        # base xpath: a list of selector objects, one per movie node
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        # dd_list : [<element dd at xxx>, <...>]
        for dd in dd_list:
            # create the item object
            item = MaoyanItem()
            item['name'] = dd.xpath('./a/@title').extract_first().strip()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').extract()[0].strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').extract()[0]
            yield item
import scrapy
from urllib.parse import urlencode


class AmazonSpider(scrapy.Spider):

    def __init__(self, keyword=None, *args, **kwargs):
        # receives the keyword passed in from the entrypoint file
        super(AmazonSpider, self).__init__(*args, **kwargs)
        self.keyword = keyword

    name = 'amazon'                          # must be unique
    allowed_domains = ['www.amazon.cn']      # allowed domains
    start_urls = ['http://www.amazon.cn/']   # used by default if no request URL is specified
    custom_settings = {                      # per-spider settings; anything missing falls back to the project settings
        "BOT_NAME": 'HAIYAN_AMAZON',
        'REQUSET_HEADERS': {},
    }

    def start_requests(self):
        url = 'https://www.amazon.cn/s/ref=nb_sb_noss_1/461-4093573-7508641?'
        url += urlencode({"field-keywords": self.keyword})
        print(url)
        yield scrapy.Request(
            url,
            callback=self.parse_index,   # the callback for this request
            dont_filter=True,            # skip deduplication (a custom dedup rule can also be used)
            # dont_filter=False,         # deduplicate
            # meta={'a': 1}              # meta is used e.g. when setting a proxy
        )
        # to test a custom dont_filter, just return duplicate results
url              the request URL
callback         the callback that will handle the response
method           the HTTP method, GET by default; set it (or use a FormRequest) when submitting a POST form
meta             a dict used to pass values from the request through to the response
headers          the request headers; a dict, so several key/value pairs can be given
body             the request body
cookies          the cookies to send, as a dict or a list of dicts if you want to set them yourself,
                 but Scrapy handles cookies automatically through a built-in middleware, so normally there is nothing to do here
priority         scheduling priority; higher values are scheduled first, default 0
dont_filter      defaults to False, which means the request goes through the duplicate filter; set it to True to bypass deduplication
errback          callback invoked when the request fails, e.g. on an error response such as 404/500 or a connection error
copy()           returns a copy of this request
replace()        returns a copy of this request with some attributes replaced
FormRequest      a Request subclass for submitting form data as a POST request
XmlRpcRequest    a Request subclass for making XML-RPC calls
(a hedged example that exercises most of the arguments above follows)
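A minimal sketch of constructing a Request with most of the arguments listed above, inside start_requests; the URL, header and cookie values are placeholders, not anything from the original post:

import scrapy

class RequestDemoSpider(scrapy.Spider):
    name = "request_demo"  # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            url="http://www.example.com/page",      # placeholder URL
            callback=self.parse,                    # which method handles the response
            method="GET",
            headers={"User-Agent": "Mozilla/5.0"},
            cookies={"sessionid": "xxx"},
            meta={"page": 1},                       # read back later as response.meta["page"]
            priority=10,                            # higher values are scheduled first
            dont_filter=True,                       # bypass the duplicate filter
            errback=self.on_error,                  # called if the request fails
        )

    def parse(self, response):
        self.logger.info("got %s with meta %s", response.url, response.meta.get("page"))

    def on_error(self, failure):
        self.logger.error(repr(failure))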
url              the URL of the page this response is for
status           the HTTP status code of the response, 200 by default
headers          the response headers returned by the server
body             the full raw content of the response page
request          the Request that produced this response, so the originating request can be reached from here
copy()           returns a copy of this response
replace()        returns a copy of this response with some attributes replaced
urljoin()        joins a (possibly relative) URL with the response URL
text             the decoded text content of the response
css() / xpath()  the CSS / XPath selector shortcuts
follow()         builds a follow-up Request from a URL, link or selector (a usage sketch of urljoin() and follow() follows this list)
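A small sketch of the attributes and methods above inside a callback; the a.next selector is a placeholder for whatever "next page" link the target site uses:

import scrapy

class ResponseDemoSpider(scrapy.Spider):
    name = "response_demo"  # hypothetical spider
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        self.logger.info("%s returned status %s", response.url, response.status)
        next_href = response.css("a.next::attr(href)").extract_first()  # placeholder selector
        if next_href:
            # urljoin() turns a relative href into an absolute URL
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
            # on a TextResponse the same thing can be done with follow(),
            # which accepts relative URLs and selectors directly:
            # yield response.follow(next_href, callback=self.parse)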
TextResponse     the subclass where most of the real work happens; it overrides a large number of Response methods
1 class TextResponse(Response): 2 3 _DEFAULT_ENCODING = 'ascii' 4 5 def __init__(self, *args, **kwargs): 6 self._encoding = kwargs.pop('encoding', None) 7 self._cached_benc = None 8 self._cached_ubody = None 9 self._cached_selector = None 10 super(TextResponse, self).__init__(*args, **kwargs) 11 12 def _set_url(self, url): 13 if isinstance(url, six.text_type): 14 if six.PY2 and self.encoding is None: 15 raise TypeError("Cannot convert unicode url - %s " 16 "has no encoding" % type(self).__name__) 17 self._url = to_native_str(url, self.encoding) 18 else: 19 super(TextResponse, self)._set_url(url) 20 21 def _set_body(self, body): 22 self._body = b'' # used by encoding detection 23 if isinstance(body, six.text_type): 24 if self._encoding is None: 25 raise TypeError('Cannot convert unicode body - %s has no encoding' % 26 type(self).__name__) 27 self._body = body.encode(self._encoding) 28 else: 29 super(TextResponse, self)._set_body(body) 30 31 def replace(self, *args, **kwargs): 32 kwargs.setdefault('encoding', self.encoding) 33 return Response.replace(self, *args, **kwargs) 34 35 @property 36 def encoding(self): 37 return self._declared_encoding() or self._body_inferred_encoding() 38 39 def _declared_encoding(self): 40 return self._encoding or self._headers_encoding() \ 41 or self._body_declared_encoding() 42 43 def body_as_unicode(self): 44 """Return body as unicode""" 45 return self.text 46 47 @property 48 def text(self): 49 """ Body as unicode """ 50 # access self.encoding before _cached_ubody to make sure 51 # _body_inferred_encoding is called 52 benc = self.encoding 53 if self._cached_ubody is None: 54 charset = 'charset=%s' % benc 55 self._cached_ubody = html_to_unicode(charset, self.body)[1] 56 return self._cached_ubody 57 58 def urljoin(self, url): 59 """Join this Response's url with a possible relative url to form an 60 absolute interpretation of the latter.""" 61 return urljoin(get_base_url(self), url) 62 63 @memoizemethod_noargs 64 def _headers_encoding(self): 65 content_type = self.headers.get(b'Content-Type', b'') 66 return http_content_type_encoding(to_native_str(content_type)) 67 68 def _body_inferred_encoding(self): 69 if self._cached_benc is None: 70 content_type = to_native_str(self.headers.get(b'Content-Type', b'')) 71 benc, ubody = html_to_unicode(content_type, self.body, 72 auto_detect_fun=self._auto_detect_fun, 73 default_encoding=self._DEFAULT_ENCODING) 74 self._cached_benc = benc 75 self._cached_ubody = ubody 76 return self._cached_benc 77 78 def _auto_detect_fun(self, text): 79 for enc in (self._DEFAULT_ENCODING, 'utf-8', 'cp1252'): 80 try: 81 text.decode(enc) 82 except UnicodeError: 83 continue 84 return resolve_encoding(enc) 85 86 @memoizemethod_noargs 87 def _body_declared_encoding(self): 88 return html_body_declared_encoding(self.body) 89 90 @property 91 def selector(self): 92 from scrapy.selector import Selector 93 if self._cached_selector is None: 94 self._cached_selector = Selector(self) 95 return self._cached_selector 96 97 def xpath(self, query, **kwargs): 98 return self.selector.xpath(query, **kwargs) 99 100 def css(self, query): 101 return self.selector.css(query) 102 103 def follow(self, url, callback=None, method='GET', headers=None, body=None, 104 cookies=None, meta=None, encoding=None, priority=0, 105 dont_filter=False, errback=None): 106 # type: (...) -> Request 107 """ 108 Return a :class:`~.Request` instance to follow a link ``url``. 
109 It accepts the same arguments as ``Request.__init__`` method, 110 but ``url`` can be not only an absolute URL, but also 111 112 * a relative URL; 113 * a scrapy.link.Link object (e.g. a link extractor result); 114 * an attribute Selector (not SelectorList) - e.g. 115 ``response.css('a::attr(href)')[0]`` or 116 ``response.xpath('//img/@src')[0]``. 117 * a Selector for ``<a>`` or ``<link>`` element, e.g. 118 ``response.css('a.my_link')[0]``. 119 120 See :ref:`response-follow-example` for usage examples. 121 """ 122 if isinstance(url, parsel.Selector): 123 url = _url_from_selector(url) 124 elif isinstance(url, parsel.SelectorList): 125 raise ValueError("SelectorList is not supported") 126 encoding = self.encoding if encoding is None else encoding 127 return super(TextResponse, self).follow(url, callback, 128 method=method, 129 headers=headers, 130 body=body, 131 cookies=cookies, 132 meta=meta, 133 encoding=encoding, 134 priority=priority, 135 dont_filter=dont_filter, 136 errback=errback 137 )
HtmlResponse     simply inherits from TextResponse without adding anything

In scrapy you cannot manage sessions the way requests does with session = requests.session().
Instead, yield scrapy.Request() handles the session/cookie state automatically once a request has been made
and carries it through all later requests; if the cookies need to be refreshed in the middle, another request has to be issued to update them.
A practical scenario: Zhihu's anti-crawling strategy returns 403 after repeated requests and demands a captcha login,
at which point you need to send a request again to refresh the cookie.
def login(self, response): response_text = response.text match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL) xsrf = '' if match_obj: xsrf = (match_obj.group(1)) if xsrf: post_url = "https://www.zhihu.com/login/phone_num" post_data = { "_xsrf": xsrf, "phone_num": "", "password": "", "captcha": "" } import time t = str(int(time.time() * 1000)) # 驗證碼請求地址 captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t) # 將數據傳遞到回調函數中在處理,且將cookie 也傳遞下去 yield scrapy.Request(captcha_url, headers=self.headers, meta={"post_data": post_data}, callback=self.login_after_captcha) def login_after_captcha(self, response): # 驗證碼保存本地 with open("captcha.jpg", "wb") as f: f.write(response.body) # 文件是保存在 body 中的, 不是text 中了 f.close() from PIL import Image try: # 打開看一眼驗證碼而後手動驗證 im = Image.open('captcha.jpg') im.show() im.close() except: pass # 手動輸入驗證碼... captcha = input("輸入驗證碼\n>") # 上一輪傳遞過來的數據 post_data = response.meta.get("post_data", {}) post_data["captcha"] = captcha post_url = "https://www.zhihu.com/login/phone_num" # 經過 FormRequest 發起請求 return [scrapy.FormRequest( url=post_url, formdata=post_data, headers=self.headers, callback=self.check_login )] def check_login(self, response): # 驗證服務器的返回數據判斷是否成功 text_json = json.loads(response.text) if "msg" in text_json and text_json["msg"] == "登陸成功": for url in self.start_urls: yield scrapy.Request(url, dont_filter=True, headers=self.headers)
Dedup rules should be shared across spiders: once one spider has crawled a URL, no other spider should crawl it again. The possible implementations are shown below.
#方法一: 1、新增類屬性 visited=set() #類屬性 2、回調函數parse方法內: def parse(self, response): if response.url in self.visited: return None ....... self.visited.add(response.url) #方法一改進:針對url可能過長,因此咱們存放url的hash值 def parse(self, response): url=md5(response.request.url) if url in self.visited: return None ....... self.visited.add(url) #方法二:Scrapy自帶去重功能 配置文件: DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默認的去重規則幫咱們去重,去重規則在內存中 DUPEFILTER_DEBUG = False JOBDIR = "保存範文記錄的日誌路徑,如:/root/" # 最終路徑爲 /root/requests.seen,去重規則放文件中 scrapy自帶去重規則默認爲RFPDupeFilter,只須要咱們指定 Request(...,dont_filter=False) ,若是dont_filter=True則告訴Scrapy這個URL不參與去重。 #方法三: 咱們也能夠仿照RFPDupeFilter自定義去重規則, from scrapy.dupefilter import RFPDupeFilter,看源碼,仿照BaseDupeFilter #步驟一:在項目目錄下自定義去重文件cumstomdupefilter.py ''' if hasattr("MyDupeFilter",from_settings): func = getattr("MyDupeFilter",from_settings) obj = func() else: return MyDupeFilter() ''' class MyDupeFilter(object): def __init__(self): self.visited = set() @classmethod def from_settings(cls, settings): '''讀取配置文件''' return cls() def request_seen(self, request): '''請求看過沒有,這個纔是去重規則該調用的方法''' if request.url in self.visited: return True self.visited.add(request.url) def open(self): # can return deferred '''打開的時候執行''' pass def close(self, reason): # can return a deferred pass def log(self, request, spider): # log that a request has been filtered '''日誌記錄''' pass #步驟二:配置文件settings.py # DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默認會去找這個類實現去重 #自定義去重規則 DUPEFILTER_CLASS = 'AMAZON.cumstomdupefilter.MyDupeFilter' # 源碼分析: from scrapy.core.scheduler import Scheduler 見Scheduler下的enqueue_request方法:self.df.request_seen(request)
Internal flow

How the scrapy engine pulls the start URLs out of the spider (a rough sketch follows these steps):
1. call start_requests and take its return value
2. v = iter(return value)
3.
   req1 = v.__next__()
   req2 = v.__next__()
   req3 = v.__next__()
   ...
4. all the requests are handed to the scheduler
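Conceptually (not the engine's actual code), the steps above amount to something like this runnable sketch, with a list standing in for the real scheduler:

import scrapy

class TinySpider(scrapy.Spider):
    name = "tiny"  # hypothetical spider
    start_urls = ["http://www.example.com/1", "http://www.example.com/2"]

scheduler_queue = []                      # stand-in for the real scheduler
spider = TinySpider()
requests = spider.start_requests()        # step 1: call start_requests
it = iter(requests)                       # step 2: wrap the return value in an iterator
for req in it:                            # step 3: repeatedly calls it.__next__()
    scheduler_queue.append(req)           # step 4: every request ends up in the scheduler
print(scheduler_queue)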
Custom example
import scrapy
from scrapy import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # approach 1:
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self.parse,
                method='POST'  # your own start_requests takes precedence, so the crawl can even start with POST requests
            )
        # approach 2:
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list
Approach 1
    Pick a proxy at random from a list and assign it each time; simple,
    but this has to be repeated for every single request, which duplicates a lot of code and is not ideal.
Approach 2
    Use a downloader middleware.
Setting the proxy through os.environ:
import os

import scrapy
from scrapy import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        os.environ['HTTPS_PROXY'] = "http://root:yangtuo@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
import scrapy
from scrapy import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self.parse,
                meta={'proxy': 'http://root:yangtuo@192.168.11.11:9999/'}
            )
import base64 import random from six.moves.urllib.parse import unquote try: from urllib2 import _parse_proxy except ImportError: from urllib.request import _parse_proxy from six.moves.urllib.parse import urlunparse from scrapy.utils.python import to_bytes class XdbProxyMiddleware(object): def _basic_auth_header(self, username, password): user_pass = to_bytes( '%s:%s' % (unquote(username), unquote(password)), encoding='latin-1') return base64.b64encode(user_pass).strip() def process_request(self, request, spider): PROXIES = [ "http://root:yangtuo@192.168.11.11:9999/", "http://root:yangtuo@192.168.11.12:9999/", "http://root:yangtuo@192.168.11.13:9999/", "http://root:yangtuo@192.168.11.14:9999/", "http://root:yangtuo@192.168.11.15:9999/", ] url = random.choice(PROXIES) orig_type = "" proxy_type, user, password, hostport = _parse_proxy(url) proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', '')) if user: creds = self._basic_auth_header(user, password) else: creds = None request.meta['proxy'] = proxy_url if creds: request.headers['Proxy-Authorization'] = b'Basic ' + creds
When to use
    Some URLs on a page are given relative to the current page,
    so a[@href] or a::attr() only yields an incomplete URL,
    and you need a way to join it with the current page's host.
Solution
    Joining a host-less URL extracted from a scrapy response:
    # 1. import parse from urllib
    # 2. call parse.urljoin() to do the join
    In the example, response.url supplies the current page URL, from which the host is taken;
    get_url is the host-less URL extracted from an element of the response.
from urllib import parse

url = parse.urljoin(response.url, get_url)
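Scrapy's Response offers the same thing directly: response.urljoin() uses response.url as the base (internally it makes the same urljoin call), so the urllib import can be skipped.

# equivalent to parse.urljoin(response.url, get_url)
url = response.urljoin(get_url)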
Setting meta lets certain values from one request travel along into its response, where the next request/callback can use them.
def parse1(self, response):
    item = ...
    yield scrapy.Request(
        url=url,
        meta={'item': item},
        callback=self.parse2
    )

def parse2(self, response):
    item = response.meta['item']
    ...
# the items module defines the following three fields:
import scrapy


class TextItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    image = scrapy.Field()


# in the spider:
class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']

    def parse1(self, response):
        '''
        keep in mind that an item behaves like a dict
        '''
        item = TextItem()
        for product in response.css('......'):
            item['title'] = product.css('......').extract_first()
            item['price'] = product.css('......').extract_first()
            url = product.css('......').extract_first()
            yield scrapy.Request(url=url,
                                 meta={'item': item},
                                 callback=self.parse2)
        '''
        Say we are scraping products on Taobao: the first level gives us the title and price,
        but we also want the product image behind the link we just extracted (url above).
        scrapy.Request(url) builds a Request object, and meta={'item': item} puts the item dict
        under the 'item' key of the meta dict, which is carried inside the Request to parse2().
        '''

    def parse2(self, response):
        item = response.meta['item']
        for product in response.css('......'):
            item['image'] = product.css('......').extract_first()
        # the response carries the meta dict from above, so this is the same item filled in parse1;
        # once the image url has been extracted, return the item and the scraping task is complete
        return item
# You may need to pass arguments to the spider from the command line, e.g. the initial url:
# scrapy crawl myspider -a category=electronics

# the arguments passed in from outside can be received in __init__
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

# note that every received argument is a string; for structured data use something like json.loads
Pausing and resuming a crawl needs its state persisted to files:
pick a directory in the launch command.
Different spiders must not share the same directory,
while the same spider reusing the same directory resumes from the state recorded there last time.
The crawl is interrupted from the terminal: Ctrl+C (press it once so Scrapy can shut down and save its state) or killing the process on Windows, or killing the main.py process on Linux;
so it cannot be interrupted from inside PyCharm, it has to be handled on the command line.
scrapy crawl lagou -s JOBDIR=job_info/001
The directory can also be set in settings.py,
which makes it a global setting:
JOBDIR="job_info/001"
or set it inside a single spider class:
custom_settings = { "JOBDIR": "job_info/001" }
But as noted above, PyCharm still cannot interrupt the crawl properly, so this is of limited use;
running it from the command line remains the way to go.
Scrapy has a lot of simple stats collection built in.
""" Scrapy extension for collecting scraping stats """ import pprint import logging logger = logging.getLogger(__name__) class StatsCollector(object): def __init__(self, crawler): self._dump = crawler.settings.getbool('STATS_DUMP') self._stats = {} def get_value(self, key, default=None, spider=None): return self._stats.get(key, default) def get_stats(self, spider=None): return self._stats def set_value(self, key, value, spider=None): self._stats[key] = value def set_stats(self, stats, spider=None): self._stats = stats def inc_value(self, key, count=1, start=0, spider=None): d = self._stats d[key] = d.setdefault(key, start) + count def max_value(self, key, value, spider=None): self._stats[key] = max(self._stats.setdefault(key, value), value) def min_value(self, key, value, spider=None): self._stats[key] = min(self._stats.setdefault(key, value), value) def clear_stats(self, spider=None): self._stats.clear() def open_spider(self, spider): pass def close_spider(self, spider, reason): if self._dump: logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats), extra={'spider': spider}) self._persist_stats(self._stats, spider) def _persist_stats(self, stats, spider): pass class MemoryStatsCollector(StatsCollector): def __init__(self, crawler): super(MemoryStatsCollector, self).__init__(crawler) self.spider_stats = {} def _persist_stats(self, stats, spider): self.spider_stats[spider.name] = stats class DummyStatsCollector(StatsCollector): def get_value(self, key, default=None, spider=None): return default def set_value(self, key, value, spider=None): pass def set_stats(self, stats, spider=None): pass def inc_value(self, key, count=1, start=0, spider=None): pass def max_value(self, key, value, spider=None): pass def min_value(self, key, value, spider=None): pass
Set up signal tracking, increment a counter, and collect the URLs of 404 pages in a container.
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

handle_httpstatus_list = [404]

def __init__(self, **kwargs):
    self.fail_urls = []

def parse(self, response):
    if response.status == 404:
        self.fail_urls.append(response.url)
        self.crawler.stats.inc_value("failed_url")
    ...
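To actually read the counter when the crawl ends, a handler can be connected to the spider_closed signal. A sketch building on the snippet above (the dispatcher import is the old-style scrapy.xlib.pydispatch API used here, which is deprecated in newer Scrapy versions):

import scrapy
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals


class FailureTrackingSpider(scrapy.Spider):
    name = "failure_demo"                 # hypothetical spider
    start_urls = ["http://www.example.com/"]
    handle_httpstatus_list = [404]        # let 404 responses reach the callback

    def __init__(self, **kwargs):
        super(FailureTrackingSpider, self).__init__(**kwargs)
        self.fail_urls = []
        # run handle_spider_closed when the spider_closed signal fires
        dispatcher.connect(self.handle_spider_closed, signals.spider_closed)

    def handle_spider_closed(self, spider, reason):
        # dump the collected 404 urls into the stats so they show up in the crawl summary
        self.crawler.stats.set_value("failed_urls", ",".join(self.fail_urls))

    def parse(self, response):
        if response.status == 404:
            self.fail_urls.append(response.url)
            self.crawler.stats.inc_value("failed_url")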
Official documentation: see the Scrapy CrawlSpider docs.
A more thoroughly wrapped spider class that offers a simpler way of working.
Spiders created with the crawl template are based on this class:
scrapy genspider -t crawl lagou www.lagou.com
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        return item
CrawlSpider is used almost exactly like Spider: it has the same name, allowed_domains and start_urls attributes with the same meaning as in Spider,
but it additionally provides a class attribute called rules.
▨ rules
    Each rule is built around the LinkExtractor helper.
▧ LinkExtractor
    The two pieces you touch most are:
▧ allow, the matching pattern (a regular expression), and
callback (on the Rule), given as a string: it names a method of the class, and there is no self to reference at that point.
class CrawlSpider(Spider): rules = () def __init__(self, *a, **kw): super(CrawlSpider, self).__init__(*a, **kw) self._compile_rules() def parse(self, response): return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True) def parse_start_url(self, response): return [] def process_results(self, response, results): return results def _build_request(self, rule, link): r = Request(url=link.url, callback=self._response_downloaded) r.meta.update(rule=rule, link_text=link.text) return r def _requests_to_follow(self, response): if not isinstance(response, HtmlResponse): return seen = set() for n, rule in enumerate(self._rules): links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen] if links and rule.process_links: links = rule.process_links(links) for link in links: seen.add(link) r = self._build_request(n, link) yield rule.process_request(r) def _response_downloaded(self, response): rule = self._rules[response.meta['rule']] return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow) def _parse_response(self, response, callback, cb_kwargs, follow=True): if callback: cb_res = callback(response, **cb_kwargs) or () cb_res = self.process_results(response, cb_res) for requests_or_item in iterate_spider_output(cb_res): yield requests_or_item if follow and self._follow_links: for request_or_item in self._requests_to_follow(response): yield request_or_item def _compile_rules(self): def get_method(method): if callable(method): return method elif isinstance(method, six.string_types): return getattr(self, method, None) self._rules = [copy.copy(r) for r in self.rules] for rule in self._rules: rule.callback = get_method(rule.callback) rule.process_links = get_method(rule.process_links) rule.process_request = get_method(rule.process_request) @classmethod def from_crawler(cls, crawler, *args, **kwargs): spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs) spider._follow_links = crawler.settings.getbool( 'CRAWLSPIDER_FOLLOW_LINKS', True) return spider def set_crawler(self, crawler): super(CrawlSpider, self).set_crawler(crawler) self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
CrawlSpider inherits from Spider but does not override start_requests, so the crawl still starts the way Spider starts it.
It does override parse, though: with plain Spider you override parse directly in the spider file, but with CrawlSpider you must not do that.
parse simply calls _parse_response,
which is where the real flow begins.
Since parse can no longer be overridden as the callback, CrawlSpider exposes another hook for the user to override instead:
parse_start_url, which is trivially simple.
The same code path also calls process_results, which is just as trivial.
If these two functions are left empty they do nothing on their own; they are hooks meant to be overridden by the user (a minimal sketch of doing so follows).
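A minimal sketch of using that hook: overriding parse_start_url to handle the responses for start_urls themselves. Everything here, including the rule pattern and the spider name, is illustrative only.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HookDemoSpider(CrawlSpider):
    name = "hook_demo"                           # hypothetical spider
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(allow=(r"item\.php",)), callback="parse_item"),
    )

    def parse_start_url(self, response):
        # called with the responses of start_urls; whatever is returned here
        # is handled exactly like callback output (the default just returns [])
        self.logger.info("start url fetched: %s", response.url)
        return []

    def parse_item(self, response):
        self.logger.info("item page: %s", response.url)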
The follow parameter defaults to True; execution then reaches a point that needs a _follow_links attribute.
That attribute is set in set_crawler, so it is driven by a settings key, CRAWLSPIDER_FOLLOW_LINKS.
It defaults to True; when it is set to False the function stops there, otherwise execution continues into _requests_to_follow.
That function keeps a seen set for deduplication and then iterates over _rules; so where does _rules come from?
Ask _compile_rules. Along the way process_links can also be applied;
tracing that method shows it is an attribute set in Rule's __init__, i.e. another hook where a custom series of operations can be plugged in.
Rule is exactly the class we instantiate ourselves when defining rules in the spider file.
After process_links has run, the links are added to the seen set and _build_request is called to wrap each one in a request.
Looking at _build_request, the callback it installs is not the user-defined one
but _response_downloaded,
whose return value is simply the result of _parse_response.
At the end of _parse_response, the results are yielded, the requests go out and the crawl carries on,
so the whole chain closes.
_compile_rules is what builds _rules; the code shows it is a shallow copy of rules.
So when is it called?
It runs automatically when the CrawlSpider is instantiated, walking over all the rules and resolving
callback, process_links and process_request for each of them. That completes the whole flow.
The source is fairly short; a small diagram makes the flow easier to see.
With the source flow understood, the real point is how to use it,
which revolves around the Rule and LinkExtractor classes,
the two classes that are instantiated directly in the spider file.
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
link_extractor     a LinkExtractor instance, like the one described below
callback           the callback to run for the matched pages
cb_kwargs          extra keyword arguments passed on to the callback
follow             whether the links matched by this rule should themselves be followed
process_links      a hook for your own pre-processing of the extracted links (a sketch follows this list)
process_request    a hook applied to each request; the default is a trivial identity function that returns its argument unchanged, and it can be overridden or replaced
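A sketch of plugging into the process_links hook; the filter below (dropping login links) is only an example of the kind of pre-processing it allows, and the name and pattern are hypothetical:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def drop_login_links(links):
    # hypothetical hook: discard extracted links we do not want to follow
    return [link for link in links if "login" not in link.url]


class ProcessLinksDemo(CrawlSpider):
    name = "process_links_demo"                  # hypothetical spider
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(allow=(r"jobs/\d+\.html",)),
             callback="parse_job",
             follow=True,
             process_links=drop_login_links),    # runs before the requests are built
    )

    def parse_job(self, response):
        self.logger.info("job page: %s", response.url)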
class LxmlLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True):
        ...
...
...
    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
allow              keep URLs that match; takes a regular expression, tuples of patterns are supported
deny               drop URLs that match
allow_domains      keep only URLs within these domains
deny_domains       drop URLs within these domains
restrict_xpaths    narrow the search further: only look for URLs inside the tags matched by these XPath expressions
restrict_css       narrow the search further: only look for URLs inside the tags matched by these CSS selectors
tags               which tags to look for URLs in (defaults to a and area); normally left alone
attrs              given the tags, which attribute holds the URL (defaults to href); normally left alone
The remaining parameters can be ignored (a sketch using restrict_css follows the rules example below).
rules = (
    Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
    Rule(LinkExtractor(allow=("gongsi/j\d+.html",)), follow=True),
    Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
)
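A sketch of restrict_css narrowing where links are extracted from; the .job-list container class is a placeholder for whatever wraps the links on the real page:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RestrictCssDemo(CrawlSpider):
    name = "restrict_css_demo"                       # hypothetical spider
    start_urls = ["http://www.example.com/"]
    rules = (
        # only <a>/<area> tags inside elements matching .job-list are considered
        Rule(LinkExtractor(allow=(r"jobs/\d+\.html",),
                           restrict_css=(".job-list",)),
             callback="parse_job",
             follow=True),
    )

    def parse_job(self, response):
        self.logger.info("job page: %s", response.url)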
With the basic approach, every item field needs its own xpath or css selection in every spider,
so once many pages are scraped, every callback repeats the same boilerplate, which is very inconvenient.
def parse_detail(self, response): article_item = JobBoleArticleItem() # 提取文章的具體字段 title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first("") create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·","").strip() praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0] fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0] match_re = re.match(".*?(\d+).*", fav_nums) if match_re: fav_nums = match_re.group(1) comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0] match_re = re.match(".*?(\d+).*", comment_nums) if match_re: comment_nums = match_re.group(1) content = response.xpath("//div[@class='entry']").extract()[0] tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract() tag_list = [element for element in tag_list if not element.strip().endswith("評論")] tags = ",".join(tag_list) # 經過css選擇器提取字段 front_image_url = response.meta.get("front_image_url", "") #文章封面圖 title = response.css(".entry-header h1::text").extract()[0] create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·","").strip() praise_nums = response.css(".vote-post-up h10::text").extract()[0] fav_nums = response.css(".bookmark-btn::text").extract()[0] match_re = re.match(".*?(\d+).*", fav_nums) if match_re: fav_nums = int(match_re.group(1)) else: fav_nums = 0 comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0] match_re = re.match(".*?(\d+).*", comment_nums) if match_re: comment_nums = int(match_re.group(1)) else: comment_nums = 0 content = response.css("div.entry").extract()[0] tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract() tag_list = [element for element in tag_list if not element.strip().endswith("評論")] tags = ",".join(tag_list) article_item["url_object_id"] = get_md5(response.url) article_item["title"] = title article_item["url"] = response.url try: create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date() except Exception as e: create_date = datetime.datetime.now().date() article_item["create_date"] = create_date article_item["front_image_url"] = [front_image_url] article_item["praise_nums"] = praise_nums article_item["comment_nums"] = comment_nums article_item["fav_nums"] = fav_nums article_item["tags"] = tags article_item["content"] = content yield article_item
def parse_detail(self, response):
    article_item = JobBoleArticleItem()

    # load the item through an item loader
    front_image_url = response.meta.get("front_image_url", "")  # article cover image
    item_loader = ArticleItemLoader(item=JobBoleArticleItem(), response=response)
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value("url", response.url)
    item_loader.add_value("url_object_id", get_md5(response.url))
    item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
    item_loader.add_value("front_image_url", [front_image_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    item_loader.add_css("content", "div.entry")

    article_item = item_loader.load_item()
    yield article_item
from scrapy.loader import ItemLoader
item_loader = ArticleItemLoader(item=JobBoleArticleItem(), response=response)
item         the instantiated item from the spider
response     the response the spider received
item_loader.add_css("title", ".entry-header h1::text")
item_loader.add_xpath("title", "//div[@class="entry-header"]/h1/text()")
For add_css / add_xpath:
    first argument     the item field name
    second argument    the css / xpath selection rule
    return value       a list
For add_value (shown below):
    first argument     the item field name
    second argument    a plain value, e.g. something taken from the response
    return value       a list
item_loader.add_value("url", response.url)
With these mappings in place the code becomes extremely clean and compact,
but every value comes back as a list, and field-specific post-processing cannot be done here either.
Such special handling therefore has to happen where the fields are defined, in items.py,
and the custom processing functions can then be reused across different items.
class XXXItem(scrapy.Item):
    xxx = scrapy.Field(
        input_processor=MapCompose(func1, func2),
        output_processor=TakeFirst()
    )
input_processor governs how the value of an item field is produced;
output_processor governs how the field's final result is output.
First, import the processor classes:
from scrapy.loader.processors import MapCompose, TakeFirst, Join
MapCompose
    Used as the input_processor to run your own functions over the field's values.
    Any number of functions can be passed; they are applied to each value in turn,
    and the functions run one after another, each receiving the previous function's result rather than overwriting it.
    Used as the output_processor instead, it transforms the final output values.
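A small sketch of that chaining with two hypothetical helper functions: each value goes through strip_value first, and its result is then handed to add_prefix.

import scrapy
from scrapy.loader.processors import MapCompose


def strip_value(value):
    return value.strip()


def add_prefix(value):
    return "demo-" + value        # hypothetical second step, sees strip_value's output


class DemoItem(scrapy.Item):
    title = scrapy.Field(
        # the functions run left to right over every value in the list
        input_processor=MapCompose(strip_value, add_prefix),
    )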
TakeFirst
    Takes no arguments; used as the output_processor it outputs only the first element,
    which solves the everything-is-a-list problem when a field is known to hold a single value.
    Adding it to every field is tedious when there are many of them, so a custom ItemLoader can be used instead:
    the default ItemLoader output_processor is Identity;
    subclassing it and swapping in TakeFirst makes every field default to the first element of its list,
    so single values no longer have to be wrapped in lists.
    The downside is that every field then yields a single value, so fields that really are multi-valued need Join.
from scrapy.loader import ItemLoader
class ArticleItemLoader(ItemLoader):
    # custom item loader
    default_output_processor = TakeFirst()
Join
    When the loader-wide take-first output is in force, Join restores multi-valued output for a field:
    it keeps all the values and joins them into a single string with the separator you choose.
tags = scrapy.Field(
    input_processor=MapCompose(remove_comment_tags),
    output_processor=Join(",")
)
item_loader.add_css("job_addr", ".work_addr")
-----------------------------------------
from w3lib.html import remove_tags
-----------------------------------------
job_addr = scrapy.Field(
    input_processor=MapCompose(remove_tags),
)
def date_convert(value):
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date
----------------------------------------------------------
create_date = scrapy.Field(
    input_processor=MapCompose(date_convert),
)
def get_nums(value):
    match_re = re.match(".*?(\d+).*", value)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0
    return nums
-------------------------------------------------
praise_nums = scrapy.Field(
    input_processor=MapCompose(get_nums)
)
comment_nums = scrapy.Field(
    input_processor=MapCompose(get_nums)
)
fav_nums = scrapy.Field(
    input_processor=MapCompose(get_nums)
)
#例一: import scrapy class MySpider(scrapy.Spider): name = 'example.com' allowed_domains = ['example.com'] start_urls = [ 'http://www.example.com/1.html', 'http://www.example.com/2.html', 'http://www.example.com/3.html', ] def parse(self, response): self.logger.info('A response from %s just arrived!', response.url) #例二:一個回調函數返回多個Requests和Items import scrapy class MySpider(scrapy.Spider): name = 'example.com' allowed_domains = ['example.com'] start_urls = [ 'http://www.example.com/1.html', 'http://www.example.com/2.html', 'http://www.example.com/3.html', ] def parse(self, response): for h3 in response.xpath('//h3').extract(): yield {"title": h3} for url in response.xpath('//a/@href').extract(): yield scrapy.Request(url, callback=self.parse) #例三:在start_requests()內直接指定起始爬取的urls,start_urls就沒有用了, import scrapy from myproject.items import MyItem class MySpider(scrapy.Spider): name = 'example.com' allowed_domains = ['example.com'] def start_requests(self): yield scrapy.Request('http://www.example.com/1.html', self.parse) yield scrapy.Request('http://www.example.com/2.html', self.parse) yield scrapy.Request('http://www.example.com/3.html', self.parse) def parse(self, response): for h3 in response.xpath('//h3').extract(): yield MyItem(title=h3) for url in response.xpath('//a/@href').extract(): yield scrapy.Request(url, callback=self.parse) #例四: # -*- coding: utf-8 -*- from urllib.parse import urlencode # from scrapy.dupefilter import RFPDupeFilter # from AMAZON.items import AmazonItem from AMAZON.items import AmazonItem ''' spiders會循環作下面幾件事 一、生成初始請求來爬取第一個urls,而且綁定一個回調函數 二、在回調函數中,解析response而且返回值 三、在回調函數中,解析頁面內容(可經過Scrapy自帶的Seletors或者BeautifuSoup等) 四、最後、針對返回的Items對象(就是你從返回結果中篩選出來本身想要的數據)將會被持久化到數據庫 Spiders總共提供了五種類: #一、scrapy.spiders.Spider #scrapy.Spider等同於scrapy.spiders.Spider #二、scrapy.spiders.CrawlSpider #三、scrapy.spiders.XMLFeedSpider #四、scrapy.spiders.CSVFeedSpider #五、scrapy.spiders.SitemapSpider ''' import scrapy class AmazonSpider(scrapy.Spider): def __init__(self,keyword=None,*args,**kwargs): #在entrypoint文件裏面傳進來的keyword,在這裏接收了 super(AmazonSpider,self).__init__(*args,**kwargs) self.keyword = keyword name = 'amazon' # 必須惟一 allowed_domains = ['www.amazon.cn'] # 容許域 start_urls = ['http://www.amazon.cn/'] # 若是你沒有指定發送的請求地址,會默認使用只一個 custom_settings = { # 自定製配置文件,本身設置了用本身的,沒有就找父類的 "BOT_NAME": 'HAIYAN_AMAZON', 'REQUSET_HEADERS': {}, } def start_requests(self): url = 'https://www.amazon.cn/s/ref=nb_sb_noss_1/461-4093573-7508641?' 
url+=urlencode({"field-keywords":self.keyword}) print(url) yield scrapy.Request( url, callback = self.parse_index, #指定回調函數 dont_filter = True, #不去重,這個也能夠本身定製 # dont_filter = False, #去重,這個也能夠本身定製 # meta={'a':1} #meta代理的時候會用 ) #若是要想測試自定義的dont_filter,可多返回結果重複的便可 def parse_index(self, response): '''獲取詳情頁和下一頁的連接''' detail_urls = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract() print(detail_urls) # print("%s 解析 %s",(response.url,(len(response.body)))) for detail_url in detail_urls: yield scrapy.Request( url=detail_url, callback=self.parse_detail #記得每次返回response的時候記得綁定一個回調函數 ) next_url = response.urljoin(response.xpath(response.xpath('//*[@id="pagnNextLink"]/@href').extract_first())) # 由於下一頁的url是不完整的,用urljoin就能夠吧路徑前綴拿到而且拼接 # print(next_url) yield scrapy.Request( url=next_url, callback=self.parse_index #由於下一頁也屬因而索引頁,讓去解析索引頁 ) def parse_detail(self,response): '''詳情頁解析''' name = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()#獲取name price = response.xpath('//*[@id="price"]//*[@class="a-size-medium a-color-price"]/text()').extract_first()#獲取價格 delivery_method=''.join(response.xpath('//*[@id="ddmMerchantMessage"]//text()').extract()) #獲取配送方式 print(name) print(price) print(delivery_method) #上面是篩選出本身想要的項 #必須返回一個Item對象,那麼這個item對象,是從item.py中來,和django中的model相似, # 可是這裏的item對象也可當作是一個字典,和字典的操做同樣 item = AmazonItem()# 實例化 item["name"] = name item["price"] = price item["delivery_method"] = delivery_method return item def close(spider, reason): print("結束啦")
Selectors are Scrapy's built-in component for parsing page content inside callbacks. The demo below covers:
#1 // versus /
#2 text
#3 extract and extract_first: pull the content out of selector objects
#4 attributes: prefix the attribute name with @ in xpath
#5 nested lookups
#6 setting a default value
#7 lookup by attribute
#8 fuzzy lookup by attribute
#9 regular expressions
#10 relative xpath
#11 xpath with variables
response.selector.css() response.selector.xpath() 可簡寫爲 response.css() response.xpath() #1 //與/
# // 子子孫孫 / 兒子 .// 當前往下找子子孫孫
response.xpath('//body/a/')# response.css('div a::text') >>> response.xpath('//body/a') #開頭的//表明從整篇文檔中尋找,body以後的/表明body的兒子 [] >>> response.xpath('//body//a') #開頭的//表明從整篇文檔中尋找,body以後的//表明body的子子孫孫 [<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href=" image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>] #2 text >>> response.xpath('//body//a/text()') >>> response.css('body a::text') #三、extract與extract_first:從selector對象中解出內容 >>> response.xpath('//div/a/text()').extract() ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 '] >>> response.css('div a::text').extract() ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 '] >>> response.xpath('//div/a/text()').extract_first() 'Name: My image 1 ' >>> response.css('div a::text').extract_first() 'Name: My image 1 ' #四、屬性:xpath的屬性加前綴@ >>> response.xpath('//div/a/@href').extract_first() 'image1.html' >>> response.css('div a::attr(href)').extract_first() 'image1.html' #四、嵌套查找 >>> response.xpath('//div').css('a').xpath('@href').extract_first() 'image1.html' #五、設置默認值 >>> response.xpath('//div[@id="xxx"]').extract_first(default="not found") 'not found' #四、按照屬性查找 response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract() response.css('#images a[@href="image3.html"]/text()').extract() #五、按照屬性模糊查找 response.xpath('//a[contains(@href,"image")]/@href').extract() response.css('a[href*="image"]::attr(href)').extract() response.xpath('//a[contains(@href,"image")]/img/@src').extract() response.css('a[href*="imag"] img::attr(src)').extract() response.xpath('//*[@href="image1.html"]') response.css('*[href="image1.html"]') #六、正則表達式 response.xpath('//a/text()').re(r'Name: (.*)') response.xpath('//a/text()').re_first(r'Name: (.*)') #七、xpath相對路徑 >>> res=response.xpath('//a[contains(@href,"3")]')[0] >>> res.xpath('img') [<Selector xpath='img' data='<img src="image3_thumb.jpg">'>] >>> res.xpath('./img') [<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>] >>> res.xpath('.//img') [<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>] >>> res.xpath('//img') #這就是從頭開始掃描 [<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpa th='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>] #八、帶變量的xpath >>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first() 'Name: My image 1 ' >>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #求有5個a標籤的div的id 'images'