These past few days I've been using the Scrapy framework to write crawlers again, and I realized I'd forgotten most of it. I did save bookmarks, but some things really do stick better when you write them down.
First of all, the official (and classic) development manual is a must:
https://doc.scrapy.org/en/latest/intro/tutorial.html
On the command line, cd into a suitable directory:
scrapy startproject tutorial
This creates a new project named tutorial, with the following structure:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
It will also print a hint about creating a new spider file for you; that's up to you.
For basic usage, see the official tutorial; I won't go into the details here.
As I understand it now, scrapy.cfg holds deployment-related settings; it's needed, for example, when deploying with scrapyd, and otherwise you don't have to touch it.
items.py is where you plan which data fields you need. For example, if the spider has to save 4 values:
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
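For context, here is a minimal sketch of filling and yielding such an item inside a spider callback; the selectors and values are placeholders, not anything from a real site:

def parse(self, response):
    item = Product()
    item['name'] = response.css('h1::text').extract_first()       # placeholder selector
    item['price'] = response.css('.price::text').extract_first()  # placeholder selector
    item['stock'] = 10
    item['last_updated'] = '2019-01-24'
    yield item  # handed to the engine and then to the item pipelines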
Middleware gets used a lot; for example, you may want to set a random User-Agent for every request.
The code is below. You need to prepare the UA list in settings.py (or load it in some other way):
USER_AGENTS = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
import random

# Random User-Agent middleware
class RandomUserAgentMiddleware(object):
    def __init__(self, agents):
        """Receive the list of agents passed in from from_crawler"""
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        """Read USER_AGENTS from settings and pass it to the constructor"""
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        """Modify the header of every request"""
        request.headers.setdefault('User-Agent', random.choice(self.agents))
Finally, it must be registered in settings.py, otherwise it won't take effect:
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 1,
}
Setting a proxy (even a rotating one) also goes here: you only need to specify it in meta and Scrapy takes care of the rest. Here is an example with a fixed proxy:
# Proxy server
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        """Set the proxy"""
        request.meta['proxy'] = 'http://115.226.140.24:44978'
As above, it must be registered in settings.py to take effect:
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 1,
    'tutorial.middlewares.ProxyMiddleware': 556,
}
Cookies can also be set here; it's essentially the same idea as setting request headers.
# Cookie refresh
class CookiesMiddleware(object):
    def get_cookies(self):
        # 1. Fetch from redis
        # 2. Or from an HTTP interface
        pass

    def process_request(self, request, spider):
        request.cookies = self.get_cookies()
Again, it must be registered in settings.py.
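For completeness, the registration would look like the earlier examples; the priority of 50 chosen here is an assumption (matching the priorities I list later in this post):

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 1,
    'tutorial.middlewares.CookiesMiddleware': 50,   # assumed priority
    'tutorial.middlewares.ProxyMiddleware': 556,
}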
pipelines.py is where you write how item objects are processed.
For example, if you need to store items in MySQL, you can write that here (a MySQL sketch follows after the settings snippet below). In this case I save the data by appending to a file:
class WenshuPipeline(object):
    def process_item(self, item, spider):
        date = item['date']
        # One file per date
        with open(date + '.txt', 'a', encoding='utf-8') as f:
            f.write(item['json_data'] + "\n")
        return item  # pass the item on to any later pipelines
Again, it must be registered in settings.py:
ITEM_PIPELINES = {
    'tutorial.pipelines.WenshuPipeline': 300,
}
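Since MySQL was mentioned above, here is a minimal sketch of what such a pipeline could look like, assuming pymysql is installed; the connection parameters, table and column names are all placeholders:

import pymysql

class MysqlPipeline(object):
    """Minimal sketch: insert each item into a (hypothetical) products table."""

    def open_spider(self, spider):
        # Placeholder connection parameters; read them from settings in real code
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='secret', db='tutorial',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = 'INSERT INTO products (name, price) VALUES (%s, %s)'
        self.cursor.execute(sql, (item.get('name'), item.get('price')))
        self.conn.commit()
        return item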
settings.py holds a lot of very important configuration. I'll note the ones I use most often and add more later.
Project name (BOT_NAME):
Usually generated automatically, so I'll skip it.
Obeying robots.txt:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # the generated project defaults to True
Retrying failed requests:
# Retry failed requests
RETRY_ENABLED = True  # default True
RETRY_TIMES = 3       # default 2
# RETRY_HTTP_CODES: which HTTP status codes trigger a retry.
# Defaults to 500, 502, 503, 504, 408; other problems such as connection
# timeouts are retried automatically as well.
Download timeout:
DOWNLOAD_TIMEOUT = 15  # default 180 seconds
# But lowering the download timeout may trigger errors like:
# TimeoutError: User timeout caused connection failure: Getting http://xxx.com. took longer than 15.0 seconds.
Whether cookies are enabled:
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
Whether the HTTP cache is enabled:
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Concurrent requests (a very important knob for controlling crawl speed):
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
Middleware registration, as mentioned earlier.
Logging:
import time  # needed at the top of settings.py for the LOG_FILE name below

# Logging settings
# Log file
LOG_FILE = BOT_NAME + '_' + time.strftime("%Y-%m-%d", time.localtime()) + '.log'
# Log level
LOG_LEVEL = 'INFO'
# Whether logging is enabled (defaults to True)
LOG_ENABLED = True
# Log encoding
LOG_ENCODING = 'utf-8'
# If True, all standard output (including errors) of the process is redirected
# into the log, e.g. print() calls in spider code. Defaults to False.
LOG_STDOUT = False
The spiders directory is where you write the spiders themselves: the start URLs, how to parse responses, and how to handle the results all live here.
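As a point of reference, here is a minimal spider sketch; the domain, selectors and field names are made up for illustration:

import scrapy

class TutorialSpider(scrapy.Spider):
    name = 'tutorial'
    start_urls = ['http://example.com/products']  # placeholder URL

    def parse(self, response):
        # Extract data with the selectors provided by TextResponse
        for row in response.css('div.product'):  # assumed selector
            yield {
                'name': row.css('h2::text').extract_first(),
                'price': row.css('.price::text').extract_first(),
            }
        # Follow pagination by yielding a new Request back to the engine
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)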
This is the request class provided by the Scrapy framework; the source is in scrapy.http.request.__init__. It is generally used for GET requests, and in a spider you hand it to the engine with yield.
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
        if errback is not None and not callable(errback):
            raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

    def _get_url(self):
        return self._url

    def _set_url(self, url):
        if not isinstance(url, six.string_types):
            raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)

        s = safe_url_string(url, self.encoding)
        self._url = escape_ajax(s)

        if ':' not in self._url:
            raise ValueError('Missing scheme in request url: %s' % self._url)

    url = property(_get_url, obsolete_setter(_set_url, 'url'))

    def _get_body(self):
        return self._body

    def _set_body(self, body):
        if body is None:
            self._body = b''
        else:
            self._body = to_bytes(body, self.encoding)

    body = property(_get_body, obsolete_setter(_set_body, 'body'))

    @property
    def encoding(self):
        return self._encoding

    def __str__(self):
        return "<%s %s>" % (self.method, self.url)

    __repr__ = __str__

    def copy(self):
        """Return a copy of this Request"""
        return self.replace()

    def replace(self, *args, **kwargs):
        """Create a new Request with the same attributes except for those
        given new values.
        """
        for x in ['url', 'method', 'headers', 'body', 'cookies', 'meta',
                  'encoding', 'priority', 'dont_filter', 'callback', 'errback']:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)
As you can see, it has everything we expect from a typical request: url, headers, cookies, callback and so on can all be set through the constructor.
There are also two important methods that are meant to be called from the outside: copy() and replace(). replace() copies all the constructor attributes but lets you supply new values for any of them.
Note that the copy is shallow: for meta, the constructor rebuilds the dict with dict(meta), so the new request gets a new top-level dict whose values still reference the same objects. copy() is simply replace() with no overrides.
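A quick sketch of that point about replace() and the shallow meta copy (the URL and keys are made up):

import scrapy

req = scrapy.Request('http://example.com', meta={'depth': 1, 'tags': []})
new_req = req.replace(dont_filter=True)            # same attributes, dont_filter overridden

print(new_req.meta is req.meta)                    # False: dict(meta) built a new dict
print(new_req.meta['tags'] is req.meta['tags'])    # True: the values inside are still shared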
Of course POST requests, i.e. form submission, are needed too. The source is in scrapy.http.request.form.FormRequest; this class inherits from scrapy.Request, so a lot gets reused.
class FormRequest(Request):

    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        if formdata and kwargs.get('method') is None:
            kwargs['method'] = 'POST'

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            items = formdata.items() if isinstance(formdata, dict) else formdata
            querystr = _urlencode(items, self.encoding)
            if self.method == 'POST':
                self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
                self._set_body(querystr)
            else:
                self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)

    @classmethod
    def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                      clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):

        kwargs.setdefault('encoding', response.encoding)

        if formcss is not None:
            from parsel.csstranslator import HTMLTranslator
            formxpath = HTMLTranslator().css_to_xpath(formcss)

        form = _get_form(response, formname, formid, formnumber, formxpath)
        formdata = _get_inputs(form, formdata, dont_click, clickdata, response)
        url = _get_form_url(form, kwargs.pop('url', None))
        method = kwargs.pop('method', form.method)
        return cls(url=url, method=method, formdata=formdata, **kwargs)


def _get_form_url(form, url):
    if url is None:
        action = form.get('action')
        if action is None:
            return form.base_url
        return urljoin(form.base_url, strip_html5_whitespace(action))
    return urljoin(form.base_url, url)


def _urlencode(seq, enc):
    values = [(to_bytes(k, enc), to_bytes(v, enc))
              for k, vs in seq
              for v in (vs if is_listlike(vs) else [vs])]
    return urlencode(values, doseq=1)


def _get_form(response, formname, formid, formnumber, formxpath):
    """Find the form element """
    root = create_root_node(response.text, lxml.html.HTMLParser,
                            base_url=get_base_url(response))
    forms = root.xpath('//form')
    if not forms:
        raise ValueError("No <form> element found in %s" % response)

    if formname is not None:
        f = root.xpath('//form[@name="%s"]' % formname)
        if f:
            return f[0]

    if formid is not None:
        f = root.xpath('//form[@id="%s"]' % formid)
        if f:
            return f[0]

    # Get form element from xpath, if not found, go up
    if formxpath is not None:
        nodes = root.xpath(formxpath)
        if nodes:
            el = nodes[0]
            while True:
                if el.tag == 'form':
                    return el
                el = el.getparent()
                if el is None:
                    break
        encoded = formxpath if six.PY3 else formxpath.encode('unicode_escape')
        raise ValueError('No <form> element found with %s' % encoded)

    # If we get here, it means that either formname was None
    # or invalid
    if formnumber is not None:
        try:
            form = forms[formnumber]
        except IndexError:
            raise IndexError("Form number %d not found in %s" % (formnumber, response))
        else:
            return form


def _get_inputs(form, formdata, dont_click, clickdata, response):
    try:
        formdata = dict(formdata or ())
    except (ValueError, TypeError):
        raise ValueError('formdata should be a dict or iterable of tuples')

    inputs = form.xpath('descendant::textarea'
                        '|descendant::select'
                        '|descendant::input[not(@type) or @type['
                        ' not(re:test(., "^(?:submit|image|reset)$", "i"))'
                        ' and (../@checked or'
                        '  not(re:test(., "^(?:checkbox|radio)$", "i")))]]',
                        namespaces={"re": "http://exslt.org/regular-expressions"})
    values = [(k, u'' if v is None else v)
              for k, v in (_value(e) for e in inputs)
              if k and k not in formdata]

    if not dont_click:
        clickable = _get_clickable(clickdata, form)
        if clickable and clickable[0] not in formdata and not clickable[0] is None:
            values.append(clickable)

    values.extend((k, v) for k, v in formdata.items() if v is not None)
    return values


def _value(ele):
    n = ele.name
    v = ele.value
    if ele.tag == 'select':
        return _select_value(ele, n, v)
    return n, v


def _select_value(ele, n, v):
    multiple = ele.multiple
    if v is None and not multiple:
        # Match browser behaviour on simple select tag without options selected
        # And for select tags wihout options
        o = ele.value_options
        return (n, o[0]) if o else (None, None)
    elif v is not None and multiple:
        # This is a workround to bug in lxml fixed 2.3.1
        # fix https://github.com/lxml/lxml/commit/57f49eed82068a20da3db8f1b18ae00c1bab8b12#L1L1139
        selected_options = ele.xpath('.//option[@selected]')
        v = [(o.get('value') or o.text or u'').strip() for o in selected_options]
    return n, v


def _get_clickable(clickdata, form):
    """
    Returns the clickable element specified in clickdata,
    if the latter is given. If not, it returns the first
    clickable element found
    """
    clickables = [
        el for el in form.xpath(
            'descendant::*[(self::input or self::button)'
            ' and re:test(@type, "^submit$", "i")]'
            '|descendant::button[not(@type)]',
            namespaces={"re": "http://exslt.org/regular-expressions"})
    ]
    if not clickables:
        return

    # If we don't have clickdata, we just use the first clickable element
    if clickdata is None:
        el = clickables[0]
        return (el.get('name'), el.get('value') or '')

    # If clickdata is given, we compare it to the clickable elements to find a
    # match. We first look to see if the number is specified in clickdata,
    # because that uniquely identifies the element
    nr = clickdata.get('nr', None)
    if nr is not None:
        try:
            el = list(form.inputs)[nr]
        except IndexError:
            pass
        else:
            return (el.get('name'), el.get('value') or '')

    # We didn't find it, so now we build an XPath expression out of the other
    # arguments, because they can be used as such
    xpath = u'.//*' + \
            u''.join(u'[@%s="%s"]' % c for c in six.iteritems(clickdata))
    el = form.xpath(xpath)
    if len(el) == 1:
        return (el[0].get('name'), el[0].get('value') or '')
    elif len(el) > 1:
        raise ValueError("Multiple elements found (%r) matching the criteria "
                         "in clickdata: %r" % (el, clickdata))
    else:
        raise ValueError('No clickable element matching clickdata: %r' % (clickdata,))
Obviously, though, almost all of these are _private helpers that aren't meant to be called directly, and a lot of the data needs to be urlencoded.
Why go through all this? Because in my own work I needed a middleware to dynamically change headers, cookies and formdata, and it took me quite a while to figure out how (a sketch follows below).
Take cookies as an example: after the spider yields a request, a long time may pass before it actually reaches the downloader, so to keep the cookies up to date you can refresh them in a middleware.
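Here is a rough sketch of what that can look like in a downloader middleware. get_fresh_cookies() and the form field names are hypothetical; for a POST body I simply re-urlencode a dict, following the same idea as _urlencode in the FormRequest source above, and the meta flag prevents the rescheduled request from being rebuilt again:

from urllib.parse import urlencode


class RefreshRequestMiddleware(object):
    def get_fresh_cookies(self):
        # Hypothetical: fetch the newest cookies from redis or an HTTP interface
        return {'session': 'latest-value'}

    def process_request(self, request, spider):
        # Headers and cookies can be changed in place on the request
        request.headers['User-Agent'] = 'Mozilla/5.0 (placeholder UA)'
        request.cookies = self.get_fresh_cookies()

        # For a POST form body, build a new urlencoded body and swap it in via replace()
        if request.method == 'POST' and not request.meta.get('body_refreshed'):
            formdata = {'pageNum': '1', 'token': 'new-token'}  # hypothetical fields
            new_req = request.replace(body=urlencode(formdata), dont_filter=True)
            new_req.meta['body_refreshed'] = True  # skip the rebuild when it comes around again
            return new_req                         # returning a Request reschedules it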
This is the response class provided by Scrapy; the source is scrapy.http.response.Response:
""" This module implements the Response class which is used to represent HTTP responses in Scrapy. See documentation in docs/topics/request-response.rst """ from six.moves.urllib.parse import urljoin from scrapy.http.request import Request from scrapy.http.headers import Headers from scrapy.link import Link from scrapy.utils.trackref import object_ref from scrapy.http.common import obsolete_setter from scrapy.exceptions import NotSupported class Response(object_ref): def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None): self.headers = Headers(headers or {}) self.status = int(status) self._set_body(body) self._set_url(url) self.request = request self.flags = [] if flags is None else list(flags) @property def meta(self): try: return self.request.meta except AttributeError: raise AttributeError( "Response.meta not available, this response " "is not tied to any request" ) def _get_url(self): return self._url def _set_url(self, url): if isinstance(url, str): self._url = url else: raise TypeError('%s url must be str, got %s:' % (type(self).__name__, type(url).__name__)) url = property(_get_url, obsolete_setter(_set_url, 'url')) def _get_body(self): return self._body def _set_body(self, body): if body is None: self._body = b'' elif not isinstance(body, bytes): raise TypeError( "Response body must be bytes. " "If you want to pass unicode body use TextResponse " "or HtmlResponse.") else: self._body = body body = property(_get_body, obsolete_setter(_set_body, 'body')) def __str__(self): return "<%d %s>" % (self.status, self.url) __repr__ = __str__ def copy(self): """Return a copy of this Response""" return self.replace() def replace(self, *args, **kwargs): """Create a new Response with the same attributes except for those given new values. """ for x in ['url', 'status', 'headers', 'body', 'request', 'flags']: kwargs.setdefault(x, getattr(self, x)) cls = kwargs.pop('cls', self.__class__) return cls(*args, **kwargs) def urljoin(self, url): """Join this Response's url with a possible relative url to form an absolute interpretation of the latter.""" return urljoin(self.url, url) @property def text(self): """For subclasses of TextResponse, this will return the body as text (unicode object in Python 2 and str in Python 3) """ raise AttributeError("Response content isn't text") def css(self, *a, **kw): """Shortcut method implemented only by responses whose content is text (subclasses of TextResponse). """ raise NotSupported("Response content isn't text") def xpath(self, *a, **kw): """Shortcut method implemented only by responses whose content is text (subclasses of TextResponse). """ raise NotSupported("Response content isn't text") def follow(self, url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None): # type: (...) -> Request """ Return a :class:`~.Request` instance to follow a link ``url``. It accepts the same arguments as ``Request.__init__`` method, but ``url`` can be a relative URL or a ``scrapy.link.Link`` object, not only an absolute URL. :class:`~.TextResponse` provides a :meth:`~.TextResponse.follow` method which supports selectors in addition to absolute/relative URLs and Link objects. """ if isinstance(url, Link): url = url.url url = self.urljoin(url) return Request(url, callback, method=method, headers=headers, body=body, cookies=cookies, meta=meta, encoding=encoding, priority=priority, dont_filter=dont_filter, errback=errback)
Clearly this class exposes a few useful methods. You can also see that the response holds a reference to its request and shares the same meta dict, which is why meta can be used to pass parameters between them.
There are also the various commonly used selectors, but as you can see, this is not the Response we usually work with: what we actually receive is its subclass scrapy.http.response.text.TextResponse.
It is the subclass that really implements the selectors and the text property.
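A small sketch of the meta-based parameter passing mentioned above, assuming these two methods sit inside a spider class; the URL and key names are made up:

def parse(self, response):
    # Attach data to the request's meta ...
    yield scrapy.Request('http://example.com/detail/1',  # placeholder URL
                         callback=self.parse_detail,
                         meta={'category': 'books'})

def parse_detail(self, response):
    # ... and read it back from response.meta (the very same dict as request.meta)
    category = response.meta['category']
    self.logger.info('category passed through meta: %s', category)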
This doesn't refer to one specific class but to middleware in general, usually written in middlewares.py. I'll mainly talk about custom middleware; the template looks like this:
from scrapy import signals


class TutorialDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
@classmethod
def from_crawler(cls, crawler):
This method is mainly used to pass arguments to the middleware's constructor; what it returns is the middleware instance.
def process_request(self, request, spider):
This method handles every request on its way from the engine to the downloader.
def process_response(self, request, response, spider):
This method handles every response on its way from the downloader back to the engine.
def process_exception(self, request, exception, spider):
This method handles exceptions raised by other middleware or by the download itself.
def spider_opened(self, spider):
I haven't used this one, ha, but clearly it gets called when the spider is opened.
OK, the overview above is really there so I can talk about the problems I ran into when writing middleware. Scrapy ships with a lot of built-in middleware, so if your methods or priorities aren't set up properly, the resulting bugs can be bizarre and hard to trace.
The default middlewares in DOWNLOADER_MIDDLEWARES_BASE almost all provide important functionality; knowing them, and digging a little deeper into their source, helps a lot in understanding how the HTTP requests are actually handled.
So here are the middleware methods and priorities I've used recently; they are tested and work (the corresponding settings registration follows the snippet):
# Methods from several separate middleware classes, shown together with their priorities
from scrapy.exceptions import IgnoreRequest  # needed in process_exception below


def process_request(self, request, spider):
    """Modify headers, priority 1"""
    request.headers.setdefault('User-Agent', random.choice(self.agents))


def process_request(self, request, spider):
    """Update cookies, priority 50"""
    request.cookies = self.get_cookies()


def process_request(self, request, spider):
    """Set the proxy, priority 60"""
    request.meta['proxy'] = 'http://' + self.proxys[self.num % PROXY_POOL_NUM]


def process_response(self, request, response, spider):
    """Check that the response is valid, priority 560"""
    html = response.body.decode()
    if response.status != 200 or html is None:
        print('Empty content')
        return request.replace(dont_filter=True)
    elif 'VisitRemind' in html:
        print('Hit a captcha')
        return request.replace(dont_filter=True)
    else:
        return response


def process_exception(self, request, exception, spider):
    """Handle exceptions, priority 560"""
    print(exception)
    meta = request.meta
    if 'req_error_times' in meta.keys():
        meta['req_error_times'] = meta['req_error_times'] + 1
    else:
        meta['req_error_times'] = 1
    print(meta['req_error_times'])
    if meta['req_error_times'] > MAX_ERROR_TIMES:
        raise IgnoreRequest
    return request.replace(dont_filter=True)
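For reference, the corresponding registration in settings.py would look something like the following; the class names here are assumptions based on the middlewares above:

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 1,   # headers
    'tutorial.middlewares.CookiesMiddleware': 50,          # cookies
    'tutorial.middlewares.ProxyMiddleware': 60,            # proxy
    'tutorial.middlewares.CheckResponseMiddleware': 560,   # response check / exceptions (assumed name)
}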
This is a very important part. When I first started I had no idea how to debug: because of how the framework wraps things, some errors look very strange, and sometimes you can't locate the offending code at all because the traceback points into the Scrapy source, and you're hardly going to modify the source... So, here's how I debug.
There are plenty of commands for debugging from the command line, but they always feel a bit crude.
If you're using an IDE, you can create a run.py in the project root, i.e. in the same directory as scrapy.cfg:
#!/usr/bin/env python
# -*- coding:utf-8 _*-
"""
@author: happy_code
@email:  happy_code@foxmail.com
@file:   run.py
@time:   2019/01/24
@desc:
"""
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute('scrapy crawl tutorial'.split())
That way you can run the project with a simple right-click, and after setting breakpoints you can right-click and debug it just like an ordinary project. Very convenient.
To run it from the command line, cd to the project root:
scrapy crawl tutorial
Lately I've been deploying with scrapyd and scrapy-client. Still learning, but it feels quite convenient; I'll write more when I have time.
I still have some open questions about middleware configuration, especially how the priorities are ordered,
about the source of components like the engine and the downloader,
and about how the asynchronous machinery works under the hood.
Keep at it!