回顧爬蟲

時間 2019-11-17

標籤回顧爬蟲欄目網絡爬蟲简体版

原文原文鏈接

會用到的點javascript

模塊

　　　　1 hashlib模塊--加密。php

　　　　　　update(string.encode('utf-8')) m.hexdigest()css

　　　　2 requests模塊html

　　　　　　https://blog.csdn.net/shanzhizi/article/details/50903748java

　　　　　　r = requests.get(url, params = {}, headers={},cookies = cookies，proxies = proxies)node

　　　　　　cookies, proxies 都是字典格式python

　　　　　　搜索的關鍵字就以字典的形式，放在params參數中mysql

param = {'wd':'火影'}
r = requests.get('https://www.baidu.com/s', params=param)
print(r.status_code)
print(r.url)


百度沒有防爬蟲措施，搜狗有。這裏用百度簡單的演示下。params就是get訪問時每一個？後面的xx=xx

View Code

　　　　　　r = requests.post(url, data = {},headers={} )　　react

　　　　　　headers = {git

　　　　　　　　　　'content-type':

　　　　　　　　　　'User-Agent':

　　　　　　　　　 'Referer':

　　　　　　　　　　'Cookie':

　　　　　　　　　　'Host':

　　　　　　　　　　　}

　　　　　　r.encoding = '' 自定義編碼，對文本內容進行解碼。和 r.text 好基友

　　　　　　r.text 字符串方式的響應體

　　　　　　r.content 字節方式的響應體

　　　　　　r.status_code

　　　　　　r.request 能夠得到請求的相關信息

 1 import requests
 2 
 3 url = 'https://www.cnblogs.com/654321cc/p/11013243.html'
 4 headers = {
 5     'User-Agent':'User-Agent',
 6 }
 7 r = requests.get(url=url,headers=headers)
 8 
 9 #獲取請求頭
10 print(r.request.headers)
11 #獲取響應頭
12 print(r.headers)
13 
14 
15 #獲取請求的cookie
16 print(r.request._cookies)
17 
18 #獲取響應的cookie
19 print(r.cookies)

View Code

　　　　　　r.headers 以字典對象存儲服務器響應頭。不知道何時回用到

　　　　　　r.cookies

　　　　　　r.history

 1 import requests
 2 
 3 url = 'https://www.cnblogs.com/654321cc/p/11013243.html'
 4 headers = {
 5     'User-Agent':'User-Agent',
 6 }
 7 r = requests.get(url=url,headers=headers)
 8 
 9 #獲取請求頭
10 print(r.request.headers)
11 #獲取響應頭
12 print(r.headers)
13 
14 
15 #獲取請求的cookie
16 print(r.request._cookies)
17 
18 #獲取響應的cookie
19 print(r.cookies)

View Code

何時會用到？
有的時候302跳轉。

302 Found 的定義
302 狀態碼錶示目標資源臨時移動到了另外一個 URI 上。因爲重定向是臨時發生的，因此客戶端在以後的請求中還應該使用本來的 URI。

服務器會在響應 Header 的 Location 字段中放上這個不一樣的 URI。瀏覽器可使用 Location 中的 URI 進行自動重定向。

能夠看到跳轉網頁以前的狀態碼。即有時候r.status_code 是唬人的。

能夠用一個參數，禁止自動跳轉。 allow_redirects 

r = requests.get('http://www.baidu.com/link?url=QeTRFOS7TuUQRppa0wlTJJr6FfIYI1DJprJukx4Qy0XnsDO_s9baoO8u1wvjxgqN', allow_redirects = False)
>>>r.status_code
302

View Code

　　　　　　r.headers 響應頭內容

　　　　　　r.request.headers 請求頭內容

　　　　　　假裝請求頭

headers = {'User-Agent': 'liquid'}
r = requests.get('http://www.zhidaow.com', headers = headers)
print r.request.headers['User-Agent']

View Code

　　　　　　會話對象

　　　　　　s = requests.Session()

　　　　　　s.get()

　　　　　　s.post()

會話對象讓你可以跨請求保持某些參數，最方便的是在同一個Session實例發出的全部請求之間保持cookies，且這些都是自動處理的，甚是方便。
import requests
 
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, compress',
           'Accept-Language': 'en-us;q=0.5,en;q=0.3',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
 
s = requests.Session()
s.headers.update(headers)
# s.auth = ('superuser', '123')
s.get('https://www.kuaipan.cn/account_login.htm')
 
_URL = 'http://www.kuaipan.cn/index.php'
s.post(_URL, params={'ac':'account', 'op':'login'},
       data={'username':'****@foxmail.com', 'userpwd':'********', 'isajax':'yes'})
r = s.get(_URL, params={'ac':'zone', 'op':'taskdetail'})
print(r.json())
s.get(_URL, params={'ac':'common', 'op':'usersign'})

View Code

　　　　　　超時與異常

　　　　　　timeout 參數

r = requests.get('https://m.hcomic1.com',timeout = 1)

View Code

　　　　3 json模塊--輕量級數據交換格式

　　　　　　文件 dump load

　　　　　　字符串 dumps loads

　　　　4 re模塊

　　　　　　re.S 表示「.」（不包含外側雙引號）的做用擴展到整個字符串，包括「\n」。. 默認匹配除換行符之外全部的字符，re.S 模式中，. 連換行符都會匹　　　　　　配。

　　　　　　re.I 表示忽略字符的大小寫

正則表達式中，「.」的做用是匹配除「\n」之外的任何字符，也就是說，它是在一行中進行匹配。這裏的「行」是以「\n」進行區分的。a字符串有每行的末尾有一個「\n」，不過它不可見。

若是不使用re.S參數，則只在每一行內進行匹配，若是一行沒有，就換下一行從新開始，不會跨行。而使用re.S參數之後，正則表達式會將這個字符串做爲一個總體，將「\n」當作一個普通的字符加入到這個字符串中，在總體中進行匹配。

View Code

import re

s = 'agejaoigeaojdnghaw2379273589hahjhgoiaioeg87t98w825tgha9e89aye835yyaghe9857ahge878ahsohga9e9q30gja9eu73hga9w7ga8w73hgna9geuahge9aoi753uajghe9as' \
    '8837t5hga8u83758uaga98973gh8e'
res1 = re.findall('\d{2,3}[a-zA-Z]{1,}?\d{2,3}',s)
# [ ] 字符集，只能取其中一個
# {m,n} 量詞，對前面一個字符重複m到n此
# 量詞後面加？，爲非貪婪匹配
# print(res1) ['589hahjhgoiaioeg87', '98w825', '89aye835', '857ahge878', '758uaga989']

res2 = re.search('(\d{2,3})[a-zA-Z]{1,}?(\d{2,3})',s)
# print(res2) res:<re.Match object; span=(25, 43), match='589hahjhgoiaioeg87'>

print(res2.group())   # match 和 search 只會返回第一個。
print(res2.group(1))  #The first parenthesized subgroup. 使用group傳參，正則匹配必須有響應的 ( )
print(res2.group(2))  #The second parenthesized subgroup.

View Code　　　

# res3 = re.finditer('\d{2,3}[a-zA-Z]{1,}?\d{2,3}',s)
# print(res3) # res3:<callable_iterator object at 0x000001DE04A9E048> 返回迭代對象，省內存

View Code

　　　　5 flask_httpauth模塊--認證模塊？

from flask_httpauth import HTTPBasicAuth
from flask_httpauth import HTTPDigestAuth

View Code

　　　　6 beautifulsoup模塊

　　　　　　Beautiful Soup 是一個能夠從HTML或XML文件中提取數據的Python庫.它可以經過你喜歡的轉換器實現慣用的文檔導航,查找,修改文檔的方　　式.Beautiful Soup會幫你節省數小時甚至數天的工做時間.

　　　　https://www.cnblogs.com/linhaifeng/articles/7783586.html#_label2

# beautifulsoup的使用經常使用t套路。經過find_all 獲取tag列表。這一步叫搜索文檔樹
# 對獲取到的tag列表，取想要的數據：文本，超連接等。這一步叫獲取tag的屬性，名稱，內容等。
# 通常這樣使用就足夠了

from bs4 import BeautifulSoup
import requests

URL = 'https://www.duodia.com/daxuexiaohua/'

def get_page(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r.text

content = get_page(URL)

soup = BeautifulSoup(content,'lxml')

# 1 搜索文檔樹 name tag名。class_ ,類名。class是關鍵字
a_s = soup(name='a',class_='thumbnail-container')

# print(type(a_s[0]))  # 注意 a_s[0]的數據類型  <class 'bs4.element.Tag'>
# 2 獲取tag的屬性，名稱，內容等。
for a in a_s:
    print(a.attrs['href'])
    print(a.text)
    print(a.name)

View Code

知識點

迭代器與生成器
- 　可能不那麼重要

異常處理

try:
    a = 1
    b = 's'
    c = a + b
except TypeError as e:
    # 可能 except as 比較好，能區分出什麼錯誤來
    print('TypeError %s' %e)
else:
    # 沒有異常纔會走
    print('else') 
finally:
    # 有沒有異常都會走
    print('finally')

View Code

多線程多進程

　　多線程用於IO密集型，如socket，爬蟲，web
　　多進程用於計算密集型，如金融分析

multiprocessing模塊和threading模塊

from threading import Thread

def foo():
    print(os.getpid())
    print('f00')
    time.sleep(2)
def bar():
    print(os.getpid())
    print('bar')
    time.sleep(5)

if __name__ == '__main__':
    t = time.time()
    t1 = Thread(target=foo)
    t2 = Thread(target=bar)
    t1.start()
    t2.start()
    t1.join()  # 線程的join起阻塞進程的做用，t1,t2線程跑完後，纔會回到主進程，執行print()語句
    t2.join()  #
    print('time cost {}'.format(time.time() - t))

View Code

進程的join也是如此。

multiprocessing模塊下的進程池的概念

Pool類這個有點厲害。　　

from multiprocessing import Pool
import time
import os
import random

def foo(n):
    time.sleep(random.random())
    return {'name':'foo','length':n}


def save(dic):
    f = open('a.txt','a',encoding='utf-8')
    f.write('name:{},length:{}\n'.format(dic['name'],dic['length']))
    f.close()

if __name__ == '__main__':

    n = os.cpu_count()

    pool = Pool(n)
    # print(p)   p: <multiprocessing.pool.Pool object at 0x000001DDE9D3E0B8>

    task_list = []

    for i in range(20):
        task = pool.apply_async(foo,args=(i,),callback=save)
        # print(task) task:<multiprocessing.pool.ApplyResult object at 0x0000026084D5AFD0>
        task_list.append(task)
    pool.close()
    pool.join()
    for task in task_list:
        print(task.get())



p = Pool()
task = p.apply_async(func=,args=,kwds=,callback=) 注意 task是什麼相似，異步添加任務。
    callback有且惟一參數是func的返回值，用的好的話，省不少事
p.close()
p.join()

task.get()

View Code

from concurrent.futures import ProcessPoolExecutor,ThreadPoolExecutor

import requests
def get(url):
    r=requests.get(url)

    return {'url':url,'text':r.text}
def parse(future):
    dic=future.result()          #future對象調用result方法取其值、
    f=open('db.text','a')
    date='url:%s\n'%len(dic['text'])
    f.write(date)
    f.close()
if __name__ == '__main__':
    executor=ThreadPoolExecutor()
    url_l = ['http://cn.bing.com/', 'http://www.cnblogs.com/wupeiqi/', 'http://www.cnblogs.com/654321cc/',
                 'https://www.cnblogs.com/', 'http://society.people.com.cn/n1/2017/1012/c1008-29581930.html',
                 'http://www.xilu.com/news/shaonianxinzangyou5gedong.html', ]

    for url in url_l:
        executor.submit(get,url).add_done_callback(parse)         #與Pool進程池回調函數接收的是A函數的返回值(對象ApplyResult.get()獲得的值)。
    executor.shutdown()                                           #這裏回調函數parse，接收的參數是submit生成的 Future對象。
    print('主')

View Code

狀態碼

框架

scrapy

新版本的scarpy，更新了get() 和 getall() 方法。大致至關於以前的 extract_first() extract()
當分析一個網頁時，能夠用 1）查看網頁源代碼 2 ）檢查 3）也能夠用scarpy shell 'url'。拿到一個response對象。我印象中這種方法能夠有一些便利。　　
- 當拿到response後。能夠對這個response對象經過CSS選擇
  - response.css('title') title是html標籤。拿到的是由一系列selector對象構成的selectorlist。 selector對象包含 xml/html元素。你能夠對selector對象進行進一步的獲取數據。
  - response.css('title').getall() 返回列表。獲取到的是 xml/html元素
    - ```
    ['<title>Quotes to Scrape</title>']
```
- response.css('title::text').getall() 獲取所篩選元素的文本值
- response.css('title::text')[0].getall() == response.css('title:text').get() 當你只想獲取第一個元素的值時。上面的方法都是可行的。看出看出 getall() 方法 selector對象和 selectorlist 均可以調用。
- 能夠對selector對象進行嵌套調用
  - >>> for quote in response.css("div.quote"): ... text = quote.css("span.text::text").get() ... author = quote.css("small.author::text").get() ... tags = quote.css("div.tags a.tag::text").getall() ... print(dict(text=text, author=author, tags=tags))
    
    View Code
- 除了selector對象的 getall()方法和 get() 方法，selector對象還支持正則表達式
  - response.css('title:text').re(r'\d') 返回的是列表。元素就是符合要求的文本值。樣式和 get() 是同樣的
- 強烈推薦一個CSS選擇工具
  - Selector Gadget
- view response 有點用處
  - scrapy shell url 和 scrapy view url 搭配使用。
  - 在你的默認瀏覽器中打開給定的URL，並以Scrapy spider獲取到的形式展示。有些時候spider獲取到的頁面和普通用戶看到的並不相同，一些動態加載的內容是看不到的，所以該命令能夠用來檢查spider所獲取到的頁面。
- CSS 和 XPath的差別
  - CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
  - XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
  - XPath 使用路徑表達式來選取 XML 文檔中的節點或節點集。節點是經過沿着路徑 (path) 或者步 (steps) 來選取的。
- 獲取節點的屬性值
  - 經過attrib方法 response.xpath('//title').attrib['href'] 只返回第一個匹配的元素的屬性值
  - response.xpath('//title/@href').get()
  - response.xpath('//title/@href').getall()

scrapy基本實例

crawling blogs, forums and other sites with pagination

常規操做

用到的訪問下一頁是

 next_page = response.urljoin(next_page)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

View Code

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

View Code

response.follow(next_page, callback=self.parse)

View Code

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:

View Code

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

View Code

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

View Code

稍微複雜的例子

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page, it will follow all the links to the authors pages calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.

Here we’re passing callbacks to response.follow as positional arguments to make the code shorter; it also works for scrapy.Request.

The parse_author callback defines a helper function to extract and cleanup the data from a CSS query and yields the Python dict with the author data.

View Code

spider

簡介

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

View Code

常見屬性

name
start_urls
- not necessary，能夠用 def start_requests() 來代替
custom_settings
logger
from_crawler ??

start_requests

Scrapy calls it only once, so it is safe to implement start_requests() as a generator

能夠override。當你POST，而不是get，應該是由於start_urls 默認都是 get。只能在這override了吧

If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

View Code

parse
- This method, as well as any other Request callback, must return an iterable of Requestand/or dicts or Item objects.

基本示例

注意self.logger，start_urls屬性和 start_requests()方法，Item（後面細說）

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

View Code

CrawlSpider

https://docs.scrapy.org/en/latest/topics/spiders.html#crawling-rules
這個是之後寫的模板。會方便許多，link都幫咱們提取好了。
只需建立spider的時候加上 -t crawl 參數
- ```
scrapy genspider -t crawl xx  xx.com
```
  View Code
new attribute
- rules
  - follow屬性設爲True，自動跟進下一頁。咱們只負責parse_item就能夠了，省了一半的功夫。好使，碉堡了。
- parse_start_url

示例

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item

View Code

Selectors

https://docs.scrapy.org/en/latest/topics/selectors.html#

簡介

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they 「select」 certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

View Code

Using text nodes in a condition

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
So, using the .//text() node-set won’t select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

View Code

Variables in XPath expressions

https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions

XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

View Code

Here’s another example, to find the 「id」 attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'

View Code

Here’s an example to match an element based on its normalized string-value:

>>> str_to_match = "Name: My image 3"
>>> selector.xpath('//a[normalize-space(.)=$match]',
...                match=str_to_match).get()
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>'

View Code

Here’s another example using a position range passed as two integers:

>>> start, stop = 2, 4
>>> selector.xpath('//a[position()>=$_from and position()<=$_to]',
...                _from=start, _to=stop).getall()
['<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>']

View Code

Named variables can be useful when strings need to be escaped for single or double quotes characters. The example below would be a bit tricky to get right (or legible) without a variable reference:

>>> html = u'''<html>
... <body>
...   <p>He said: "I don't know why, but I like mixing single and double quotes!"</p>
... </body>
... </html>'''
>>> selector = Selector(text=html)
>>>
>>> selector.xpath('//p[contains(., $mystring)]',
...                mystring='''He said: "I don't know''').get()
'<p>He said: "I don\'t know why, but I like mixing single and double quotes!"</p>'

View Code

Built-in Selectors reference
- Selector objects
  - attrib：Return the attributes dictionary for underlying element.
  - register_namespace ？
  - remove_namespaces ？
- SelectorList objects
  - attrib：Return the attributes dictionary for the first element. If the list is empty, return an empty dict.
  - ... ...
Selecting element attributes
- https://docs.scrapy.org/en/latest/topics/selectors.html#selecting-attributes
- 三種方法
  - xpath('//title/@href').getall()
  - css("a::attr(href)").getall()
  - [a.attrib['href'] for a in response.xpath('//a')] 即attrib['xx']內置方法

Items

簡介

The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.

View Code

metadata key？

Field objects are used to specify metadata for each field. For example, the serializer function for the last_updated field illustrated in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project too, for your own needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.

View Code

class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field()

item = Myitem({'name':'z','age':28,'salary':10000})

print(item.fields)
print(Myitem.fields)

===>
{'age': {}, 'name': {}, 'salary': {}}
{'age': {}, 'name': {}, 'salary': {}}



class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field(dd='geagd')

item = Myitem({'name':'z','age':28,'salary':10000})

print(item.fields)
print(Myitem.fields)

===>
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}

View Code

Item objects

Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is:fields

View Code

Field objects

The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.

View Code

Extending Items
- scrapy.Field(Product.fields['name'])

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.

View Code

Item Loader

簡介 --快速填充數據。應該是新的scrapy的新的功能。不用你在手動的extract data，而後給Item的field賦值了。我擦，省老多事了，牛逼。

在哪使用？spider 。是什麼？是類。參數是什麼？是item，response。

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.

View Code

簡單例子

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)            #首先實例化 
    #ItemLoader.default_item_class屬性控制默認實例化的類
    l.add_xpath('name', '//div[@class="product_name"]')   # 獲取值的三種方式
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock]')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()                        # 返回item populated with the data 
                                                         #previously extracted

View Code

To use an Item Loader, you must first instantiate it. You can either instantiate it with a dict-like object (e.g. Item or dict) or without one, in which case an Item is automatically instantiated in the Item Loader constructor using the Item class specified in the ItemLoader.default_item_class attribute.

Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to 「join」 those values later using a proper processing function.

Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock]')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()
By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in the page:

//div[@class="product_name"]
//div[@class="product_title"]
In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method. This is the data that will be assigned to the name field later.

Afterwards, similar calls are used for price and stock fields (the latter using a CSS selector with the add_css() method), and finally the last_update field is populated directly with a literal value (today) using a different method: add_value().

Finally, when all data is collected, the ItemLoader.load_item() method is called which actually returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.

View Code

input and output processors ??
- An Item Loader contains one input processor and one output processor for each (item) field.
- The result of the input processor is collected and kept inside the ItemLoader。
- The result of input processors will be appended to an internal list (in the Loader) containing the collected values (for that field).
- The result of the output processor is the final value that gets assigned to the item.（ The result of the output processor is the value assigned to the name field in the item.）
- It’s worth noticing that processors are just callable objects, which are called with the data to be parsed, and return a parsed value. So you can use any function as input or output processor. The only requirement is that they must accept one (and only one) positional argument, which will be an iterator.
- Both input and output processors must receive an iterator as their first argument. The output of those functions can be anything. The result of input processors will be appended to an internal list (in the Loader) containing the collected values (for that field). The result of the output processors is the value that will be finally assigned to the item.
- The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.
- https://docs.scrapy.org/en/latest/topics/loaders.html#input-and-output-processors

available build-in processors

https://docs.scrapy.org/en/latest/topics/loaders.html#topics-loaders-available-processors
Join
TakeFirst
Compose

MapCompose

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the MapCompose processor is typically used as input processor, since data is often extracted using the extract() method of selectors, which returns a list of unicode strings.

View Code

>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO, 'THIS', 'IS', 'SCRAPY']

View Code

declaring item loaders
- https://docs.scrapy.org/en/latest/topics/loaders.html#declaring-item-loaders

declaring input and out processors

https://docs.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors

對比

input and output processors declared in the Itemloader defination.優先級最高。KEYWORD Item Loader definition.

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):

    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)
    name_out = Join()

    price_in = MapCompose(unicode.strip)

View Code

use Item Field metadata to specify the input and output processors.優先級第二高。 KEYWORD:Item Filed metadata

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'&euro;', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}

View Code

reusing and extending ltem loaders
- https://docs.scrapy.org/en/latest/topics/loaders.html#topics-loaders-extending
item loader context
- https://docs.scrapy.org/en/latest/topics/loaders.html#item-loader-context
- The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader. It can be passed when declaring, instantiating or using Item Loader. They are used to modify the behaviour of the input/output processors.

nested loaders

不用也行
https://docs.scrapy.org/en/latest/topics/loaders.html#nested-loaders
nested_xpath() add_xpath('相對路徑')

loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()

View Code

scrapy shell

https://docs.scrapy.org/en/latest/topics/shell.html

設置 shell 工具爲 ipython　　

在scrapy.cfg中設置

[settings]
shell = bpython

Invoking the shell from spiders to inspect responses

from scrapy.shell import inspect_response

Here’s an example of how you would call it from your spider:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.
When you run the spider, you will get something similar to this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'
Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]
Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:

>>> view(response)
True
Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...

View Code

The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider
Once you get familiarized with the Scrapy shell, you’ll see that it’s an invaluable tool for developing and debugging your spiders. 能夠考慮下。
另外一種方法就是加log吧。 self.logger.info('xxx')

Item Pipeline

是幹什麼的

Typical uses of item pipelines are:

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database

View Code

是什麼

是個class

is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.

View Code

方法

process_item(self,item,spider)

This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components

View Code

from_crawler(cls,crawler)

If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.

View Code

例子

簡單例子

from scrapy.exceptions import DropItem

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

View Code

Write items to MongoDB

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

View Code

Duplicates filter

簡單的去重，注意領會精神。

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

View Code

在設置 ITEM_PIPELINES 配置中將pipeline類添加上。

Feed exports
- https://docs.scrapy.org/en/latest/topics/feed-exports.html
- 簡介
  - to be consumed by other systems.
- 用到的時候在細看吧

requests and responses

https://docs.scrapy.org/en/latest/topics/request-response.html

參數

meta

A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta).

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

View Code

link extractors

https://docs.scrapy.org/en/latest/topics/link-extractors.html
這個也是個技巧吧。能夠很便捷的找到連接

from scrapy.linkextractors import LinkExtractor

The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.

View Code

參數
- allow
- deny
- restrict_xpath
- process_value
- attrs

setting

designating the setting

設置環境變量，告訴scrapy使用的是哪一個配置，環境變量名是 SCRAPY_SETTINGS_MODULE

When you use Scrapy, you have to tell it which settings you’re using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE.

View Code

下面這句話不太懂？？是在說 SCRAPY_SETTINGS_MODULE 應該在setting中配置吧

The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings. Note that the settings module should be on the Python import search path.

View Code

populating the setting

優先級

1 Command line options (most precedence)
2 Settings per-spider
3 Project settings module
4 Default settings per-command
5 Default global settings (less precedence)

View Code

command line options

經過 -s / --set 指定

Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.

Example:

scrapy crawl myspider -s LOG_FILE=scrapy.log

View Code

settings per-spider

經過 custom_settings指定

Spiders (See the Spiders chapter for reference) can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

View Code

project settings module
- 經過setting.py 指定
Default settings per-command
Default global settings
- 位於 scrapy.settings.default_settings模塊

how to access settings

經過 self.settings 方法得到

In a spider, the settings are available through self.settings:

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

View Code

spider 初始化以前，使用這些settings，重寫類方法from_crawler()。這個 from_crawler 仍是至關有用的。插一句，settings取值建議使用官方API

Note

The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider’s __init__() method), you’ll need to override the from_crawler() method.

Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is passed to from_crawler method in extensions, middlewares and item pipelines:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
The settings object can be used like a dict (e.g., settings['LOG_ENABLED']), but it’s usually preferred to extract the setting in the format you need it to avoid type errors, using one of the methods provided by the Settings API.

View Code

built-in setting reference

不少，很是多。有幾個仍是挺有用的
https://docs.scrapy.org/en/latest/topics/settings.html#built-in-settings-reference

CONCURRENT_ITEMS

Default: 100

Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).

View Code

CONCURRENT_REQUESTS

Default: 16

The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.

View Code

CONCURRENT_REQUESTS_PER_DOMAIN

Default: 8

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.

View Code

CONCURRENT_REQUESTS_PER_IP

Default: 0

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.

This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.

View Code

DEFAULT_REQUEST_HEADERS

我覺的這個仍是至關有用的！！

Default:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They’re populated in the DefaultHeadersMiddleware.

View Code

DOWNLOADER_MIDDLEWARES

Default:: {}

A dict containing the downloader middlewares enabled in your project, and their orders.

View Code

DOWNLOAD_DELAY

Default: 0

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:

DOWNLOAD_DELAY = 0.25    # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.

When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced per ip address instead of per domain.

You can also change this setting per spider by setting download_delay spider attribute.

View Code

DOWNLOAD_HANDLERS

Default: {}

A dict containing the request downloader handlers enabled in your project.

View Code

DOWNLOAD_TIMEOUT

Default: 180

The amount of time (in secs) that the downloader will wait before timing out.

Note

This timeout can be set per spider using download_timeout spider attribute and per-request using download_timeout Request.meta key.

View Code

DOWNLOAD_MAXSIZE
DOWNLOAD_WARNSIZE

DUPEFILTER_CLASS

Default: 'scrapy.dupefilters.RFPDupeFilter'

The class used to detect and filter duplicate requests.

The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function. In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept scrapy Request object and return its fingerprint (a string).

You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.

View Code

EXTENSIONS

Default:: {}

A dict containing the extensions enabled in your project, and their orders.

View Code

ITEM_PIPELINES

Default: {}

A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower orders process before higher orders.

Example:

ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}

View Code

MEMDEBUG_ENABLED

Default: False

Whether to enable memory debugging.

View Code

MEMDEBUG_NOTIFY

和上面那個搭配的，發郵件，可能有用吧

Default: []

When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.

Example:

MEMDEBUG_NOTIFY = ['user@example.com']

View Code

RANDOMIZE_DOWNLOAD_DELAY

Default: True

If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.

This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.

The randomization policy is the same used by wget --random-wait option.

If DOWNLOAD_DELAY is zero (default) this option has no effect.

View Code

SPIDER_MIDDLEWARES

Default:: {}

A dict containing the spider middlewares enabled in your project, and their orders.

View Code

USER_AGENT

或者在這也能夠重寫，不須要再每一個spider中從新寫一遍

Default: "Scrapy/VERSION (+https://scrapy.org)"

The default User-Agent to use when crawling, unless overridden.

View Code

。。。還有不少

Exceptions

https://docs.scrapy.org/en/latest/topics/exceptions.html#module-scrapy.exceptions

built-in exceptions reference

DropItem

exception scrapy.exceptions.DropItem
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item Pipeline.

View Code

CloseSpider

exception scrapy.exceptions.CloseSpider(reason='cancelled')
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

Parameters:    reason (str) – the reason for closing
For example:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

View Code

built-in service

Logging

更多的用法須要看內置的logging模塊。我覺的自帶的已經夠用了。之後細看吧

how to log messages 的兩種方法

import logging
logging.warning("This is a warning")

View Code

import logging
logging.log(logging.WARNING, "This is a warning")

logging from spider

Scrapy provides a logger within each Spider instance, which can be accessed and used like this:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

View Code

That logger is created using the Spider’s name, but you can use any custom Python logger you want

import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)

View Code

logging configuration

Loggers on their own don’t manage how messages sent through them are displayed. For this task, different 「handlers」 can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the standard output, files, emails, etc.

View Code

上面這句話一樣適用於 logging 模塊

logging settings
- LOG_FILE
- LOG_SHORT_NAMES
- 等等，就是 built-in setting 的一些關於log的設置

Advanced customization (高級定製)

說實話，這個有點吊。 https://docs.scrapy.org/en/latest/topics/logging.html#advanced-customization

Because Scrapy uses stdlib logging module, you can customize logging using all features of stdlib logging.

For example, let’s say you’re scraping a website which returns many HTTP 404 and 500 responses, and you want to hide all messages like this:

2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring
response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code
is not handled or not allowed
The first thing to note is a logger name - it is in brackets: [scrapy.spidermiddlewares.httperror]. If you get just [scrapy] then LOG_SHORT_NAMES is likely set to True; set it to False and re-run the crawl.

Next, we can see that the message has INFO level. To hide it we should set logging level for scrapy.spidermiddlewares.httperror higher than INFO; next level after INFO is WARNING. It could be done e.g. in the spider’s __init__ method:

import logging
import scrapy


class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)
If you run this spider again then INFO messages from scrapy.spidermiddlewares.httperror logger will be gone.

View Code

這裏說的應該是將httperror的顯示log的等級設置爲WARNING，因此等級爲INFO的就不顯示了。

logger = logging.getLogger('scrapy.spidermiddlewares.httperror') 還能夠這樣用。name能夠不填。一般logger的名字咱們對應模塊名，如聊天模塊、數據庫模塊、驗證模塊等。
super().__init__(*args, **kwargs) 這裏不是綁定函數，因此不用傳self參數。連接 https://www.runoob.com/python/python-func-super.html

Stats Collection　　

數據收集 https://docs.scrapy.org/en/latest/topics/stats.html#stats-collection

還不知道有什麼用？網上說和 Signals 配合使用。？？？待定

class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

View Code

Sending e-mail
- https://docs.scrapy.org/en/latest/topics/email.html
- 有須要的時候在研究吧
- quick example
  - from scrapy.mail import MailSender mailer = MailSender()
    View Code
  - mailer = MailSender.from_settings(settings)
    View Code
telnet console
- pass https://docs.scrapy.org/en/latest/topics/telnetconsole.html
Web Service
- https://docs.scrapy.org/en/latest/topics/webservice.html

SOLVING SPECIFIC PROBLEMS
- Frequently Asked Questions
  - https://docs.scrapy.org/en/latest/faq.html#frequently-asked-questions
  - Does Scrapy work with HTTP proxies?
    - class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
  - How can I scrape an item with attributes in different pages
    - https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
  - How can I simulate a user login in my spider?
    - https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin
  - What is the recommended way to deploy a Scrapy crawler in production?
    - https://docs.scrapy.org/en/latest/topics/deploy.html#topics-deploy
  - Does Scrapy manage cookies automatically?
    - Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.
      
      View Code
  - How can I prevent my Scrapy bot from getting banned?
    - https://docs.scrapy.org/en/latest/topics/practices.html#bans
- Debugging Spiders
  - https://docs.scrapy.org/en/latest/topics/debug.html#debugging-spiders
  - 有四種方法
- Spiders Contracts
  - https://docs.scrapy.org/en/latest/topics/contracts.html#spiders-contracts
  - 這是個什麼玩意？？
- Common Practices
  - Running multiple spiders in the same process
    - https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
    - import scrapy from scrapy.crawler import CrawlerProcess class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2(scrapy.Spider): # Your second spider definition ... process = CrawlerProcess() process.crawl(MySpider1) process.crawl(MySpider2) process.start() # the script will block here until all crawling jobs are finished
      
      View Code
    - import scrapy from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2(scrapy.Spider): # Your second spider definition ... configure_logging() runner = CrawlerRunner() runner.crawl(MySpider1) runner.crawl(MySpider2) d = runner.join() d.addBoth(lambda _: reactor.stop()) reactor.run() # the script will block here until all crawling jobs are finished
      
      View Code
  - Avoiding getting banned
    - https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
    - 這個應該頗有用，實踐中試試看！！！
- Broad Crawl
  - https://docs.scrapy.org/en/latest/topics/broad-crawls.html
  - 也是挺有用的。
  - Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
    View Code
  - Use the right SCHEDULER_PRIORITY_QUEUE
    - Scrapy’s default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawl. It does not work well with crawling many different domains in parallel To apply the recommended priority queue use: SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
      
      View Code
  - Increase concurrency
  - Reduce log level
  - Disable cookies !! 不少次說起這個了
  - Reduce download timeout
  - Enable crawling of 「Ajax Crawlable Pages」
    - 重點研究下。有種預感頗有用。若是真的能夠自動找到ajax，那簡直是牛逼。
    - https://docs.scrapy.org/en/latest/topics/broad-crawls.html#enable-crawling-of-ajax-crawlable-pages
    - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#ajaxcrawl-middleware
  - Be mindful of memory leaks
    - https://docs.scrapy.org/en/latest/topics/leaks.html#topics-leaks
    - 不知道，反正見這個東西好幾回了，有機會研究下。
  - Using your browser’s Developer Tools for scraping
    - https://docs.scrapy.org/en/latest/topics/developer-tools.html#using-your-browser-s-developer-tools-for-scraping
    - 頗有用，這就是正常爬取網頁的分析過程。官方文檔用的是火狐瀏覽器，看了下，確實比chrome好一些，有必定的道理。
    - Caveats with inspecting the live browser DOM
      - 可能第一點會有用，禁用後，就是最原始的頁面，沒有動態加載的內容
      - Disable Javascript while inspecting the DOM looking for XPaths to be used in Scrapy (in the Developer Tools settings click Disable JavaScript) Never use full XPath paths, use relative and clever ones based on attributes (such as id, class, width, etc) or any identifying features like contains(@href, 'image'). Never include <tbody> elements in your XPath expressions unless you really know what you’re doing
        
        View Code
    - 總之,重中之重,沒事多看幾遍,注意領會精神!!!!!!!!!
  - Selecting dynamically-loaded content
    - 很是重要的。沒事多看看！！
    - headless browser
      - selenium
      - scrapy-selenium
    - Pre-rendering JavaScrip
      - https://docs.scrapy.org/en/latest/topics/dynamic-content.html#pre-rendering-javascript
      - scrapy-splash
        
        還不知道有什麼用，怎麼用。看它的介紹說明頗有用
        
        多是比價low的方法，可能相似是無頭瀏覽器的方法
    - Parsing JavaScript code
      - get the JavaScript code
        
        response.text
        
        在html元素中，使用selector
      - extract the desired data
        
        方法1
        
        regular expression
        
        json.loads
        
        方法2
        
        js2xml
        
        selector
    - Handling different response formats
      - HTML or XM
        
        selector
      - JSON
        
        json.loads()
      - JavaScript, or HTML with a <script/>
        
        https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-parsing-javascript
      - an image or another format based on images (e.g. PDF)
        
        OCR
        
        pytesseract
        
        tabula-py
      - SVG, or HTML with embedded SVG
        
        Selector
        
        OCR
  - Debugging memory leaks
    - https://docs.scrapy.org/en/latest/topics/leaks.html
    - 之後細看
  - Downloading and processing files and images
    - 這個提供了一個此場景的簡單方法，用也行，不用也行。用的好很省事
    - https://docs.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images
    - https://github.com/factsbenchmarks/360img
    - item 的 image_urls 字段是必須的。images 字段多是重寫 item_completed 方法會用到吧
    - 通常直接使用 ImagesPipeline，就能夠了。若是想改存的圖片的路徑，名稱，須要重寫 file_path 方法
    - 文件介紹了三個方法，除了上面的 file_path 方法，還有 get_medie_requests 方法和 item_completed 方法還不知道具體有什麼用。
    - 須要多配置幾個設置項
      - pipeline
      - 文件存儲的路徑
      - 文件過時時間
      - 圖片能夠存縮略圖
        
        IMAGES_THUMBS
      - 圖片能夠設置最小圖片的尺度大小，小於的就不要了
        
        IMAGES_MIN_HEIGHT
        
        IMAGES_MIN_WIDTH
  - deploying spider　　
    - https://docs.scrapy.org/en/latest/topics/deploy.html#deploying-spiders
  - autothrottle extension　　
    - https://docs.scrapy.org/en/latest/topics/autothrottle.html#autothrottle-extension
    - 給個人感受是更智能化的自動節流。每一個ip 有延遲訪問，延遲下載等。
  - benchmarking
    - 沒啥用
    - https://docs.scrapy.org/en/latest/topics/benchmarking.html#benchmarking
  - Jobs: pausing and resuming crawls
    - https://docs.scrapy.org/en/latest/topics/jobs.html#jobs-pausing-and-resuming-crawls
    - 感受有些用
    - Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.
    - Scrapy supports this functionality out of the box by providing the following facilities: a scheduler that persists scheduled requests on disk a duplicates filter that persists visited requests on disk an extension that keeps some spider state (key/value pairs) persistent between batches
      
      View Code

EXTENDING SCRAPY

Architecture overview
- 面試必問
- https://docs.scrapy.org/en/latest/topics/architecture.html#architecture-overview
Downloader Middleware
- https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#downloader-middleware
- Activating a downloader middleware
  - The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
    View Code
- Writing your own downloader middleware
  - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#writing-your-own-downloader-middleware
  - from_crawler
  - process_request
  - process_response
  - prcoess_exception
- Built-in downloader middleware reference
  - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#built-in-downloader-middleware-reference
  - CookiesMiddleware
    - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.cookies
  - DefaultHeadersMiddleware
  - DownloadTimeoutMiddleware
  - HttpAuthMiddleware
  - HttpCompressionMiddleware
  - HttpProxyMiddleware
    - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy
  - RedirectMiddleware
  - RetryMiddleware
    - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#retrymiddleware-settings
  - UserAgentMiddleware
  - AjaxCrawlMiddleware
    - https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.ajaxcrawl
Spider Middleware
- https://docs.scrapy.org/en/latest/topics/spider-middleware.html#spider-middleware
- Writing your own spider middleware
  - process_spider_input
  - process_spider_output
  - process_spider_exception
  - process_start_requests ？？
  - from_crawler
- Built-in spider middleware reference
  - HttpErrorMiddleware

Extensions

https://docs.scrapy.org/en/latest/topics/extensions.html#extensions
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
可能牛逼的人才會用到吧

示例和signal 搭配

Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:

a spider is opened
a spider is closed
a specific number of items are scraped
The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting.

Here is the code of such extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)

View Code

Core API
- https://docs.scrapy.org/en/latest/topics/api.html#core-api
- 用到的時候再說吧，東西不少

Signals

多是高端用法吧 https://docs.scrapy.org/en/latest/topics/signals.html#signals

Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not provided out of the box.

View Code

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]


    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider


    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)


    def parse(self, response):
        pass

View Code

Item Exporters
- https://docs.scrapy.org/en/latest/topics/exporters.html#module-scrapy.exporters
- 感受用不到吧，都是用數據庫吧

數據庫

mysql

版本
- 5.76
端口
- 3306
書籍推薦
- MySQL Cookbook
圖形化管理客戶端
- Navicat
- MySQL Workbench
增刪改查
- SELECT id,name FROM pages WHERE title LIKE "%dota%" ;
  - % 符號表示MySQL字符串通配符
- UPDATE pages SET name = "storm" WHERE id = 1;

與python整合

PyMySQL
下面的代碼和異常處理有效的結合起來了

import pymysql
import random
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
import datetime



conn = pymysql.connect(host='localhost',user='root',passwd='123',db='zuo',charset='utf8')

cur = conn.cursor()

cur.execute('SELECT title FROM pages WHERE id = 2 ')

print(cur.fetchone())

def store(title,content):
    cur.execute("INSERT INTO pages (title,content) VALUES (%s,%s)",(title,content))
    cur.connection.commit()

def getLinks(url):
    html = urlopen(url=url)
    bs = BeautifulSoup(html,'lxml')
    title = bs.find_all('title')
    content = bs.find_all('content')
    store(title,content)
    return bs.find_all('links')
    
links = getLinks('www.baidu.com')

try:
    while len(links) > 0:
        newLink = 'x'
        print(newLink)
        links = getLinks('url')
finally:
    cur.close()
    conn.close()

View Code

鏈接對象（conn）和 光標對象（cur）
鏈接/光標模式是數據庫編程中經常使用的模式。
鏈接模式除了要鏈接數據庫以外，還要發送數據庫信息，處理回滾操做（當一個查詢或一組查詢被中斷時，數據庫須要回到初始狀態，通常用事務控制手段實現狀態回滾），建立光標對象，等等。

而一個鏈接能夠有不少個光標，一個光標跟蹤一種狀態（state）信息，好比正在使用的是哪一個數據庫。若是你有多個數據庫，且須要向全部數據庫寫內容，就須要多個光標來進行處理。光標還會包含最後一次查詢執行的結果，經過調用光標函數，好比 cur.fetchone()，能夠獲取查詢結果。

第三方工具
- 數據清理
  - OpenRefine
    - http://openrefine.org/
- fiddler　　　　
JavaScript
- https://www.w3school.com.cn/js/index.asp
- JavaScript 是屬於 HTML 和 Web 的編程語言。
- 頁面加載會引用外部腳本，這相信你確定見識過了。一個頁面或許會引用多個腳本。
- 瞭解下函數，基本概念基本就能夠吧
- jQuery
  - jQuery是一個快速、簡潔的JavaScript框架。jQuery設計的宗旨是「write Less，Do More」，即倡導寫更少的代碼，作更多的事情。它封裝JavaScript經常使用的功能代碼，提供一種簡便的JavaScript設計模式，優化HTML文檔操做、事件處理、動畫設計和Ajax交互。
    View Code
- 請記住：一個網站使用JavaScript並不意味着全部傳統的網頁抓取工具都失效了。JavaScript的目標是生成HTML和CSS，而後被瀏覽器渲染，或者是經過HTML請求和響應與服務器動態通訊。一旦使用了Selenium，頁面上的HTML和CSS就能夠和其餘網站代碼同樣被讀取和解析。另外，JavaScript對於網絡爬蟲來講甚至會帶來一些好處，由於它做爲「瀏覽器的內容管理系統」，可能會想外界暴露有用的API，讓你能夠更直接的獲取數據。

AJAX

基本概念
- AJAX = Asynchronous JavaScript and XML（異步的 JavaScript 和 XML）
- AJAX 是一種在無需從新加載整個網頁的狀況下，可以更新部分網頁的技術。若是提交表單後，或者從Web服務器獲取信息時，網站的頁面不須要從新加載，那麼你訪問的網站極可能是用來Ajax技術。
- 與一些人的想法相反，Ajax並非一門語言，而是用來完成某個任務的一個技術，網站不須要使用單獨的網頁請求就能夠和Web服務器進行交互。須要注意的是，你不該該說「這個網站是用Ajax寫的」，正確的說法應該是「這個表單使用Ajax與Web服務器通訊」。
- ```
AJAX 是一種用於建立快速動態網頁的技術。

經過在後臺與服務器進行少許數據交換，AJAX 可使網頁實現異步更新。這意味着能夠在不從新加載整個網頁的狀況下，對網頁的某部分進行更新。

傳統的網頁（不使用 AJAX）若是須要更新內容，必需重載整個網頁面。
```
  View Code

簡單例子

div 部分用於顯示來自服務器的信息。當按鈕被點擊時，它負責調用名爲 loadXMLDoc() 的函數：

<html>
<body>

<div id="myDiv"><h3>Let AJAX change this text</h3></div>
<button type="button" onclick="loadXMLDoc()">Change Content</button>

</body>
</html>
接下來，在頁面的 head 部分添加一個 <script> 標籤。該標籤中包含了這個 loadXMLDoc() 函數：

<head>
<script type="text/javascript">
function loadXMLDoc()
{
.... AJAX script goes here ...
}
</script>
</head>

View Code

function是在 <script>（腳本）標籤內實現的，能夠放在<head>,<body>中，推薦放在body元素的底部，可改善顯示速度，由於腳本編譯會拖慢顯示。實際上放在<head>,<body>的任意位置都可。
觸發通常有 onclick，onkeyup等等
和JavaScript 有極爲密切的關係

AJAX XHR

XHR 指的是 XMLHttpRequest

XHR 建立對象

var xmlhttp;
if (window.XMLHttpRequest)
  {// code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp=new XMLHttpRequest();
  }
else
  {// code for IE6, IE5
  xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
  }

View Code

XHR請求

xmlhttp.open("GET","test1.txt",true);
xmlhttp.send();

View Code

XHR響應

如需得到來自服務器的響應，請使用 XMLHttpRequest 對象的 responseText 或 responseXML 屬性。

document.getElementById("myDiv").innerHTML=xmlhttp.responseText;

View Code

xmlDoc=xmlhttp.responseXML;
txt="";
x=xmlDoc.getElementsByTagName("ARTIST");
for (i=0;i<x.length;i++)
  {
  txt=txt + x[i].childNodes[0].nodeValue + "<br />";
  }
document.getElementById("myDiv").innerHTML=txt;

View Code

XHR readyState

https://www.w3school.com.cn/ajax/ajax_xmlhttprequest_onreadystatechange.asp

xmlhttp.onreadystatechange=function()
  {
  if (xmlhttp.readyState==4 && xmlhttp.status==200)
    {
    document.getElementById("myDiv").innerHTML=xmlhttp.responseText;
    }
  }

View Code

處理辦法
- 對於那些使用了Ajax和DHTML技術來改變和加載內容的頁面，用python解決這個問題只有兩種途徑
  - 直接從JavaScript代碼中抓取內容
  - 用python的第三方庫執行JavaScript，直接抓取你在瀏覽器裏看到的頁面

利用API抓取數據

利用API 能夠直接得到數據源。API 定義了容許一個軟件與另外一個軟件通訊的標準語法。這裏的API指的是 Web API。
API的響應一般是JSON（JavaScript Object Notation,JavaScript對象表示法）或者XML（eXtensible Markup Language，可擴展標記語言）格式。如今的JSON遠比XML流行，首先JSON文件一般比設計良好的XML文件小，另外一個緣由是Web技術的改變，如今服務器也會用一些JavaScript框架做爲API的發送和接收端。

下面這段話，說明了如今Ajax如此多的緣由。

因爲JavaScript框架變的愈來愈廣泛，不少HTML建立任務從原來的服務器處理變成了由瀏覽器處理。服務器可能給用戶瀏覽器發送一個硬編碼的HTML模板，可是還須要單獨的AJAX請求來加載內容。並將這些內容放在HTML模板中正確的位置。全部這些都發生在瀏覽器/客戶端上。
最初，上述機制對於網絡爬蟲是一個麻煩的問題。過去，爬蟲請求一個HTML頁面時，獲取的就是原封不動的HTML頁面，全部的內容都在HTML頁面上，而如今爬蟲得到的是一個不帶任何內容的HTML模板。
selenium能夠解決這個問題。

然而，因爲整個內容管理系統基本上已經移動的瀏覽器端，連最簡單的網站均可以激增到幾兆節的內容和十幾個HTTP請求。

此外，當時用selenium時，用戶不須要的「額外信息」也被加載了。加載側邊欄廣告，圖像，CSS，第三方的字體。這些內容看起來很好，可是當你編寫一個須要快速移動，抓取特定數據並儘量對Web服務器形成較小負擔的爬蟲時，這可能會加載比你實際所需多上百倍的數據。

可是 對於JavaScript，Ajax和現代化Web來講仍有一線但願：由於服務器再也不將數據處理成HTML格式， 因此它們一般做爲數據庫自己的一個弱包裝器。該弱包裝器簡單的從數據庫中抽取數據，並經過一個API將數據返回給頁面。

固然，這些API並未打算供除網頁自己之外的任何人或者任何事使用，所以，開發者並未這個鞋API提供文檔。

View Code

查找無文檔的API
- 須要作一些偵查工做。API調用有這幾個特徵。
  - 它們一般包含JSON或XML。你能夠利用搜索/過濾字段過濾請求列表
  - 利用GET請求，URL中會包含一個傳遞給它的參數。若是你要尋找一個返回搜索結果或者加載特定頁面數據的API調用，這將很是有用。只須要你使用的搜索詞，頁面ID或者其它的識別信息，過濾結果便可。
  - 它們一般是XHR類型的。
API 數據與其它數據源結合
- 若是你用API做爲惟一的數據源，那麼你最多就是複製別人數據庫的數據，並且這些數據基本上都是已經發表過的，真正有意思的事情，是以一個新穎的方式將兩個或多個數據源組合起來，或者把API做爲一種工具，從全新的角度對抓取到的數據進行數據解釋。（這就是數據分析嗎）

表單和登陸窗口抓取與requests模塊
- form表單和input標籤
  - <form action="form_action.asp" method="get"> <p>First name: <input type="text" name="fname" /></p> <p>Last name: <input type="text" name="lname" /></p> <input type="submit" value="Submit" /> </form>
    View Code
- 能夠用requests模塊的post方法模擬表單登陸。
  - form表單的action屬性就是post方法的url。
  - input標籤的name屬性就是params的key值。params是個字典。data是
  - r = requests.post('url',data=params)
    View Code
- 如何處理cookies
  - 大多數現代網站都是用cookies跟蹤用戶是否已登陸的狀態信息。一旦網站驗證你的登陸憑據，就會在你的瀏覽器上將其保存爲一個cookies，裏面一般包含一個由服務器生成的令牌，登陸有效時長和狀態跟蹤信息。
  - requests模塊的Session()方法
- HTTP基本接入驗證
  - 在發明cookies之前，處理網站登陸的一種經常使用方法是 HTTP基本接入認證（HTTP basic access authentication），你會時不時見到它們，尤爲是在一些安全性較高的網站或公司網站上。
  - requests庫有一個auth模塊，專門處理HTTP認證
    - import requests from requests.auth import AuthBase from requests.auth import HTTPBasicAuth auth = HTTPBasicAuth('zuo','password') r = requests.post('xx',auth=auth) print(r.text)
      
      View Code
    - 雖然這裏看着像一個普通的post請求，可是有一個HTTPBasicAuth對象做爲auth參數傳遞到了請求中。顯示的結果將是用戶名和密碼驗證成功的頁面。若是驗證失敗，就是一個拒絕接入頁面。
- 其它表單內容
  - 網頁表單敵網絡惡意機器人酷愛的網站切入點。你固然不但願機器人建立垃圾帳號，佔用昂貴的服務器資源，或者在博客上提交垃圾評論。所以，現代網站常常在HTML中採起不少安全措施，讓表單不能被快速穿越。
  - 驗證碼
  - 蜜罐（honey pot）
  - 隱藏字段（hidden field）
  - else　
避開抓取陷阱
- 修改請求頭
  - Host
  - Connection
  - Accept
  - User-Agent
    - 最重要。
    - 妙用：能夠假裝成手機訪問，就能夠看到一個更容易抓取的網站
  - Referer
  - Accept-Encoding
  - Accept-Language
    - 修改這個值，大型網站自動變成相關的語言，不用翻譯
- 用JavaScript處理Cookie
  - cookie是個雙刃劍。若是暴露了你的身份，即使從新鏈接網站，或者改變IP來假裝都是白費。有些網站cookie是不可或缺的。
  - selenium
    - delete_cookie()
    - add_cookie()
    - delete_all_cookies()
- 常見表單安全措施
  - 隱藏輸入字段值
    - 在HTML表單中，「隱藏」字段以字段的值對瀏覽器可見，可是對用戶不可見（除非看網頁源代碼）。隱含字段在短暫失寵以後找到了另外一個不錯的用處：阻止爬蟲自動提交表單。
      - 第一種是表單頁面上的一個字段能夠用服務器生成的隨機變量填充。若是提價時這個值再也不表單頁面上，服務器就有理由認爲他不是從原始頁面提交的，而是由網絡機器人直接提交到表單處理頁面的。
      - 第二種方式是蜜罐。表單包含一個具備普通名稱的隱含字段（設置蜜罐圈套），設計的不太好的網絡機器人每每無論這個字段是否是對用戶可見，直接填寫這個字段並向服務器提交，這樣就會中服務器的蜜罐圈套。填寫隱含字段的用戶有可能被網站封殺。
    - 檢查是否有隱藏字段，或者較大的隨機字符串變量
  - 避免蜜罐
    - 幾種方式隱藏
      - CSS display:none
      - type="hide"
    - 能夠用selenium的is_displayed()來區分可見元素與隱含元素
      - from selenium import webdriver from selenium.webdriver.remote.webelement import WebElement driver = webdriver.Chrome() driver.get('url') links = driver.find_element_by_tag_name('a') for link in links: if not links.is_displayed(): print('The link {} is a trap'.format(link.get_attribute('href'))) fields = driver.find_elements_by_tag_name('input') for field in fields: if not field.is_displayed(): print('Do not change value of {}'.format(field.get_attribute('name')))
        
        View Code
遠程抓取
- 爲何用遠程服務器
  - 避免IP地址被封殺。封殺IP地址是最後一步棋，不過是一種很是有效的方法。
- Tor代理服務器
- 遠程主機
  - 雲主機
動態Html
- 和Ajax，DHTML也是用於某一常見目的的一系列技術。DHTML是客戶端腳本改變頁面的HTML元素時，改變的HTML代碼,CSS語言，或者兩者兼而有之。
selenium
- Selenium是一個強大的網頁抓取工具，最初是爲網站自動化測試而開發的，近幾年，它還被普遍用於獲取精確的網站快照，由於網站能夠直接運行在瀏覽器中。Selenium可以讓瀏覽者自動加載網站，獲取須要的數據，甚至對網頁截屏，或者判斷網站上是否發生了某些操做。Selenium本身不帶瀏覽器，它須要與第三方瀏覽器集成才能運行。好比你在FireFox上運行selenium。雖然這樣能夠看的更清楚，可是我更喜歡讓程序在後臺靜靜的運行，因此我用一個叫PhantomJS的工具代替真是的瀏覽器。
- PhantomJS是一個無頭瀏覽器（headless browser）。它會把網站加載到內存並執行頁面上的JavaScript，可是他不會像用用戶展現網頁的圖形界面。把Selenium和PhantomJS結合在一塊兒，就能夠運行一個很是強大的網絡爬蟲來輕鬆處理cookies，JavaScript，header以及任何你須要作的事情。目前PhantomJS已經再也不更新了，Chrome和FireFox都有本身的無頭瀏覽器，並且新版本的selenium支持Chrome和FireFox的無頭瀏覽器，再也不支持PhantomJS了。
- 在selenium中一樣能夠用BeautifulSoup來解析
  - from selenium import webdriver ... pageSource = webdriver.page_source bs = BeautfulSoup(pageSource,'html.parser') print(bs.find(id='content').get_text())
    View Code
分佈式
實戰
最後的話
- 不管如今仍是未來，遇到一個網頁抓取項目時，你都應該問問本身如下幾個問題
  - 我須要回答或要解決的問題是什麼
  - 什麼數據能夠幫到我？它們都在哪裏
  - 網站是如何展現數據的？我能準確的識別網站代碼中包含這一信息的部分嗎
  - 如何定位這些數據並獲取它們？
  - 爲了讓數據更實用，應該作怎樣的處理和分析？
  - 怎樣才能讓抓取過程更好，更快，更穩定？
- 例如，首先用selenium獲取在亞馬遜圖書預覽頁面中經過Ajax加載的圖片，而後用Tesseract讀取圖片，識別裏面的文字。在「維基百科六度分離」問題中，先用正則表達式實現一個爬蟲，把維基百科詞條鏈接信息存儲到數據庫中。而後用有向圖算法尋找詞條「xx」與詞條「xx」之間的最短連接路徑。
- 在使用自動化技術抓取互聯網數據時，其實不多遇到徹底沒法解決的問題。記住一點就行：互聯網其實就是一個用戶界面不太友好的超級API。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。