Introduction to web crawlers, web page testing, and Java servlet testing frameworks

Date: 2019-11-20

Complete examples of scraping pages with Scrapy and storing them in MongoDB:

https://github.com/rmax/scrapy-redis

https://github.com/geekan/scrapy-examples # Multifarious Scrapy examples

https://github.com/DormyMo/scrappy # Scrapy best practices; this repository uses https://github.com/rmax/scrapy-redis, but not the latest version

https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/

https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/

https://github.com/sebdah/scrapy-mongodb

https://github.com/xiyouMc/WebHubBot

https://github.com/Chyroc/WechatSogou # WeChat official-account crawler interface based on Sogou WeChat search

https://github.com/gnemoug/distribute_crawler # A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster for storage, Redis for distribution, and Graphite for displaying crawler status

https://github.com/aivarsk/scrapy-proxies # Random proxy middleware for Scrapy

https://github.com/scrapinghub/portia # Scrapy with a visual interface

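HttpUnit: an integration testing tool focused on testing web applications. Its helper classes let testers interact with a server through Java classes and handle the server's responses as text or as DOM objects. HttpUnit also provides a simulated servlet container, so you can test a servlet's internal code without deploying the servlet.

Selenium WebDriver: a testing framework that drives a browser (the browser renders the HTML, so the JavaScript in the page is executed) and can also fetch DOM elements from the rendered page. Selenium has offered a headless driver built on HtmlUnit (HtmlUnitDriver); note that HtmlUnit is a different project from HttpUnit.

Jsoup: fetches only the static HTML text of a page and does not render it, so it does not execute any JavaScript in the page. If you only need to extract elements from the page markup, jsoup is sufficient.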
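The difference between static parsing and browser rendering can be sketched in Python with only the standard library's html.parser (an assumption made for illustration; the article names Jsoup, which is a Java library). The sample page and the H2Collector class are hypothetical. The parser collects the text of every h2 element in the markup; the heading that the inline script would create in a browser never shows up, because nothing executes the script.

```python
from html.parser import HTMLParser

# Hypothetical page: one heading is present in the markup, the other would
# only exist after a browser runs the inline script.
PAGE = """
<html><body>
  <h2>Static heading</h2>
  <div id="guess"></div>
  <script>
    document.getElementById('guess').innerHTML = '<h2>Dynamic heading</h2>';
  </script>
</body></html>
"""


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element found in the markup."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only while in_h2 is False.
        if self.in_h2:
            self.headings[-1] += data


def static_headings(html):
    """Return the h2 texts a static parser can see (no JavaScript is run)."""
    parser = H2Collector()
    parser.feed(html)
    return [heading.strip() for heading in parser.headings]


if __name__ == "__main__":
    print(static_headings(PAGE))
```

Only the heading that is literally in the markup is returned; getting the script-generated one requires a rendering step such as a browser or Splash.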

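Selendroid: a UI test automation framework for native Android applications. Tests are written with the Selenium 2 client API. Selendroid works on emulators and on real devices, and it can be integrated as a node into a Selenium Grid for scaling and parallel testing. Through Selenium you can also locate nodes, fill in forms, select elements, and perform other interactions. http://selendroid.io; https://github.com/selendroid/selendroid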

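If you are not scraping pages by driving a browser, Scrapy + BeautifulSoup4 is the recommended combination for crawling and parsing. Scrapy does not support JavaScript rendering natively; for that you need to install scrapy-splash separately (https://github.com/scrapy-plugins/scrapy-splash, "Scrapy+Splash for JavaScript integration").

For how to debug a Scrapy project with PyCharm, see my other article.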

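There is a more advanced setup: Scrapy + Selenium + berserkJS + BeautifulSoup4 can be pieced together into a dynamic crawler that fetches pages, renders them, and interacts with them automatically. I do not recommend it, because it is too hard to integrate; the scrapy-splash approach mentioned above is sufficient.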

Selenium also released a driver for Android, AndroidDriver, which may be worth a look. It seems to be no longer maintained, though I am not certain.

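Selenium does not ship with a browser; it has to be paired with a third-party one. Here PhantomJS is used in place of a real browser. There is also berserkJS, an improved tool based on PhantomJS.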

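PhantomJS is a WebKit-based, server-side tool with a JavaScript API. It provides full web support without needing a separate browser, is fast, and natively supports various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, screen capture, and headless testing.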

Combining Selenium with PhantomJS gives you a very powerful crawler that can handle cookies, JavaScript, headers, and anything else you need.

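Installation:

Selenium has a Python library that can be installed with pip. PhantomJS is a full-featured "headless" browser, not a Python library, so it is not installed the way other Python libraries are and cannot be installed with pip.

Some people ask why you would use the windowless PhantomJS instead of a regular browser. The answer: efficiency.

Install selenium and phantomjs: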

$ pip install selenium

Then download PhantomJS from http://phantomjs.org/download.html and read on to see how to use it:

How to use PhantomJS in Python:

https://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python

The easiest way to use PhantomJS in python is via Selenium. The simplest installation method is

Install NodeJS
Using Node's package manager install phantomjs: npm -g install phantomjs-prebuilt
install selenium (in your virtualenv, if you are using that)

After installation, you may use phantom as simple as:

from selenium import webdriver

driver = webdriver.PhantomJS() # or add to your PATH
driver.set_window_size(1024, 768) # optional
driver.get('https://google.com/')
driver.save_screenshot('screen.png') # save a screenshot to disk
sbtn = driver.find_element_by_css_selector('button.gbqfba')
sbtn.click()

If your system path environment variable isn't set correctly, you'll need to specify the exact path as an argument to webdriver.PhantomJS(). Replace this:

driver = webdriver.PhantomJS() # or add to your PATH

... with the following:

driver = webdriver.PhantomJS(executable_path='/usr/local/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs')
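Before passing a path to webdriver.PhantomJS, it can help to check where the binary actually is. The following is a minimal sketch using only the standard library; resolve_phantomjs is a hypothetical helper name, not part of Selenium. It prefers an explicit path if that file exists and otherwise searches the PATH environment variable.

```python
import os
import shutil


def resolve_phantomjs(explicit_path=None, name="phantomjs"):
    """Return a usable path to the PhantomJS binary, or None if not found.

    explicit_path wins when it points at an existing file; otherwise the
    PATH environment variable is searched with shutil.which().
    """
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    return shutil.which(name)


# Possible usage (assumes Selenium with PhantomJS support is installed):
#   path = resolve_phantomjs('/usr/local/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs')
#   if path is None:
#       raise RuntimeError("phantomjs binary not found")
#   driver = webdriver.PhantomJS(executable_path=path)
```

Failing early with a clear message is usually easier to debug than the exception Selenium raises when the executable cannot be started.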

My own example code using PhantomJS:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys

browser = webdriver.PhantomJS(executable_path='/home/hzh/hzh/soft/phantomjs/bin/phantomjs')
browser.set_window_size(1120, 720)
browser.get("https://baidu.com/")


browser.find_element_by_xpath(".//*[@id='kw']").send_keys("hzh")
# browser.find_element_by_xpath(".//*[@id='kw']").send_keys(Keys.ENTER)
browser.find_element_by_xpath(".//*[@id='su']").click()


delay = 5 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.XPATH, ".//*[@id='help']/a[1]")))
    print("success get page")
except TimeoutException:
    print("Loading took too much time!")

print(browser.current_url)
browser.save_screenshot('/home/hzh/screen.png')

browser.quit()
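The explicit wait used above boils down to polling a condition until it succeeds or a deadline passes. The sketch below illustrates that idea in plain Python; it is a simplified stand-in, not Selenium's implementation, and wait_until and WaitTimeout are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have elapsed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = {"n": 0}

    def page_ready():
        # Pretend the element appears on the third check.
        attempts["n"] += 1
        return "element" if attempts["n"] >= 3 else None

    print(wait_until(page_ready, timeout=2.0, poll_interval=0.1))
```

Polling with a deadline is generally preferable to a fixed time.sleep, because the script continues as soon as the page is ready.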

Changing the PhantomJS user agent:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys

def init_phantomjs_driver(*args, **kwargs):
    headers = { 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        "Accept-Encoding": "gzip",
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
        'Connection': 'keep-alive'
    }

    # Iterate over header name/value pairs (enumerate() would yield index/name pairs)
    for key, value in headers.items():
        webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.{}'.format(key)] = value

    webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.settings.userAgent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'

    # Forward service_args and any other arguments to the PhantomJS constructor
    driver = webdriver.PhantomJS(*args, executable_path='/home/hzh/hzh/soft/phantomjs/bin/phantomjs', **kwargs)
    driver.set_window_size(1120, 720)

    return driver


service_args = [
        '--proxy=127.0.0.1:9999',
        '--proxy-type=http',
        '--ignore-ssl-errors=true'
        ]
browser = init_phantomjs_driver(service_args=service_args)
browser.get("https://www.huobi.com/")


# browser.find_element_by_xpath(".//*[@id='kw']").send_keys("hzh")
# browser.find_element_by_xpath(".//*[@id='kw']").send_keys(Keys.ENTER)
# browser.find_element_by_xpath(".//*[@id='su']").click()


delay = 5 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.XPATH, ".//*[@id='doc_body']/div[7]/div/div[1]/div[1]")))
    print("success get page")
except TimeoutException:
    print("Loading took too much time!")

print(browser.current_url)
browser.save_screenshot('/home/hzh/screen.png')

browser.quit()
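When a header dict is mapped onto PhantomJS capability keys, the loop has to iterate over name/value pairs with headers.items(); enumerate(headers) would yield (index, name) pairs instead. The mapping can be checked in isolation with a small pure-Python sketch. build_phantomjs_capabilities is a hypothetical helper; the two key prefixes are the ones used in the listing above.

```python
def build_phantomjs_capabilities(headers, user_agent=None):
    """Map HTTP headers onto PhantomJS capability keys.

    Returns a new dict; nothing is sent to a browser here.
    """
    caps = {}
    for name, value in headers.items():
        caps["phantomjs.page.customHeaders.{}".format(name)] = value
    if user_agent is not None:
        caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


if __name__ == "__main__":
    sample_headers = {"Accept-Language": "zh-CN", "Connection": "keep-alive"}
    caps = build_phantomjs_capabilities(sample_headers, user_agent="ExampleAgent/1.0")
    for key in sorted(caps):
        print(key, "=", caps[key])
    # For comparison: enumerate() pairs an index with each header name.
    print(list(enumerate(sample_headers)))
```

The resulting entries could then be copied into webdriver.DesiredCapabilities.PHANTOMJS before the driver is created.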

Using Firefox with Selenium:

1. Download geckodriver
2. Copy geckodriver to /usr/local/bin

Then use it like this:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = '/home/hzh/hzh/soft/firefox'

profile = webdriver.FirefoxProfile('/home/hzh/.mozilla/firefox/f3dcxoyp.default')
profile.add_extension("/home/hzh/.mozilla/firefox/f3dcxoyp.default/extensions/xpath_finder@xpath_finder.com.xpi")
profile.add_extension("/home/hzh/.mozilla/firefox/f3dcxoyp.default/extensions/FireXPath@pierre.tholence.com.xpi")
profile.add_extension("/home/hzh/.mozilla/firefox/f3dcxoyp.default/extensions/firefinder@robertnyman.com.xpi")
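# Option 1: start Firefox with the profile and its extensions
driver = webdriver.Firefox(firefox_profile=profile, capabilities=firefox_capabilities)
# Option 2: start Firefox without a custom profile (use this instead of option 1)
# driver = webdriver.Firefox(capabilities=firefox_capabilities)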

Using multiple tabs in Firefox with Selenium (the first method is recommended; the second has not been verified):

1. You can do it like this:

import time
from selenium.webdriver.common.keys import Keys

# `browser` is an already-initialized Firefox WebDriver instance
test_link = browser.find_element_by_xpath(".//*[@id='doc_head']/div/div[3]/ul/li[1]/a")
# Save the handle of the opener window (a window handle, not to be confused with a tab)
main_window = browser.current_window_handle
time.sleep(2)

# Open the link in a new tab by sending keystrokes to the element
# Use Keys.CONTROL + Keys.SHIFT + Keys.RETURN to open the tab on top of the stack
test_link.send_keys(Keys.CONTROL + Keys.RETURN)  # Ctrl+Enter on a link opens it in a new tab
time.sleep(2)

# Get the list of window handles
tabs = browser.window_handles
print(len(tabs))
# Use the list of window handles to switch between windows
browser.switch_to.window(tabs[1])
test_link2 = browser.find_element_by_xpath(".//*[@id='doc_body']/div[4]/div[2]/div[1]/h2")
print(test_link2.text)
time.sleep(2)

# Switch back to the original window
browser.switch_to.window(main_window)

2. Or like this:

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import ui

browser = webdriver.Firefox()
browser.get('https://www.google.com?q=python#q=python')
first_result = ui.WebDriverWait(browser, 15).until(lambda browser: browser.find_element_by_class_name('rc'))
first_link = first_result.find_element_by_tag_name('a')

# Save the handle of the opener window (a window handle, not to be confused with a tab)
main_window = browser.current_window_handle

# Open the link in a new tab by sending keystrokes to the element
# Use Keys.CONTROL + Keys.SHIFT + Keys.RETURN to open the tab on top of the stack
first_link.send_keys(Keys.CONTROL + Keys.RETURN)  # Ctrl+Enter on a link opens it in a new tab

# Switch to the new tab, which we will assume is the next one on the right
browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.TAB)  # Ctrl+Tab in the first window moves to the next tab

# Put focus on the current window, which in fact puts focus on the currently visible tab
browser.switch_to.window(main_window)  # switch back to the first tab

# Do whatever you have to do on this page; here we just sleep
sleep(2)

# Close the current tab
browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')  # close this tab

# Put focus on the current window, which will be the window opener
browser.switch_to.window(main_window)

If Ctrl+W does not close the tab, you can do this:

# `elem` is the link element to open; `sleep` comes from the time module
curWindowHndl = browser.current_window_handle
elem.send_keys(Keys.CONTROL + Keys.ENTER)  # keyboard shortcut: open the link in a new tab
sleep(2)  # wait until the new tab finishes loading
browser.switch_to.window(browser.window_handles[1])  # assuming the new tab is at index 1
browser.close()  # closes the new tab
browser.switch_to.window(curWindowHndl)
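The snippets above assume that the new tab is at index 1 of window_handles. A slightly more defensive variant compares the handle lists taken before and after the tab is opened. This is a pure-Python sketch, and find_new_handle is a hypothetical helper name; the handles themselves are just strings.

```python
def find_new_handle(handles_before, handles_after):
    """Return the single handle present in handles_after but not in handles_before."""
    known = set(handles_before)
    new_handles = [h for h in handles_after if h not in known]
    if len(new_handles) != 1:
        raise ValueError("expected exactly one new handle, found %d" % len(new_handles))
    return new_handles[0]


# Possible usage with a WebDriver instance named `browser`:
#   before = browser.window_handles
#   elem.send_keys(Keys.CONTROL + Keys.ENTER)
#   browser.switch_to.window(find_new_handle(before, browser.window_handles))

if __name__ == "__main__":
    print(find_new_handle(["main"], ["main", "tab-2"]))
```

Raising an error when zero or several new handles appear makes a failed keystroke visible instead of silently switching to the wrong window.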

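Using scrapy-splash

http://scrapy-cookbook.readthedocs.io/zh_CN/latest/scrapy-12.html

Everything covered so far fetches static pages: you open a link and all of its content is present. Most pages on today's web are dynamic, however. On frequently visited sites such as JD.com and Taobao, product lists are generated by JavaScript and rendered via Ajax, so the page you get by downloading a link contains content that is loaded asynchronously, and the earlier approach cannot retrieve that content at all.

Rendering and manipulating pages with JavaScript is very common, and handling a JavaScript-heavy page is a frequent problem in Scrapy crawler development. This article explains how to use scrapy-splash in a Scrapy crawler to handle the JavaScript in a page.

Introduction to scrapy-splash

scrapy-splash uses Splash to integrate JavaScript with Scrapy, so that Scrapy can scrape dynamic pages.

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, written in Python on top of Twisted and Qt. So the first step is to install a Splash instance.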

Installing Docker

The official documentation recommends installing Splash as a Docker container, so you need to install Docker first.

Following the official installation guide, I chose to install on Ubuntu 12.04 LTS.

Upgrade the kernel; Docker requires kernel 3.13:

$ sudo apt-get update
$ sudo apt-get install linux-image-generic-lts-trusty
$ sudo reboot

Install the CA certificates:

$ sudo apt-get install apt-transport-https ca-certificates

Add the new GPG key:

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

Open /etc/apt/sources.list.d/docker.list (create it if it does not exist), delete any existing content, and add the following line:

deb https://apt.dockerproject.org/repo ubuntu-precise main

Update APT:

$ sudo apt-get update
$ sudo apt-get purge lxc-docker
$ apt-cache policy docker-engine

Install:

$ sudo apt-get install docker-engine

Start the Docker service:

$ sudo service docker start

Verify that it started successfully:

$ sudo docker run hello-world

This command downloads a test image and runs it in a container; it prints a message and then exits.

Installing Splash

Pull the image:

$ sudo docker pull scrapinghub/splash

Start the container:

$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash


Splash is now reachable at 0.0.0.0 on ports 8050 (HTTP), 8051 (HTTPS), and 5023 (telnet).
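Once the container is running, a rendered page can be requested through Splash's HTTP API. The sketch below only builds the request URL for the render.html endpoint with its url and wait parameters; the base address http://localhost:8050 and the helper name splash_render_url are assumptions for illustration, and no network request is made.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a render.html URL asking Splash to render page_url.

    `wait` is the number of seconds Splash waits after loading the page,
    which gives asynchronous content time to appear.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base.rstrip("/"), query)


if __name__ == "__main__":
    print(splash_render_url("http://www.jd.com/"))
```

Opening the printed URL in a browser or with an HTTP client is a quick way to confirm that Splash is reachable before wiring it into Scrapy.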

Installing scrapy-splash

Install it with pip:

$ pip install scrapy-splash

Configuring scrapy-splash

Add the following to settings.py, the configuration file of your Scrapy project:

SPLASH_URL = 'http://192.168.203.92:8050'

Add the Splash middlewares, also in settings.py, through DOWNLOADER_MIDDLEWARES, and change the priority of HttpCompressionMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


HttpProxyMiddleware has a default priority of 750, so the Splash middlewares (723 and 725) are placed ahead of it.
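The effect of these numbers can be checked with a few lines of plain Python. Scrapy calls each downloader middleware's process_request in increasing order of its number, so sorting by value shows the order a request passes through. The dict below repeats the settings from this section and adds HttpProxyMiddleware at its default of 750 for comparison; request_order is a hypothetical helper, and Scrapy itself is not imported.

```python
# Priorities from this section, plus Scrapy's default for HttpProxyMiddleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}


def request_order(middlewares):
    """Return middleware paths in the order their process_request runs."""
    return [path for path, _ in sorted(middlewares.items(), key=lambda item: item[1])]


if __name__ == "__main__":
    for path in request_order(DOWNLOADER_MIDDLEWARES):
        print(DOWNLOADER_MIDDLEWARES[path], path)
```

Responses travel through the same middlewares in the reverse order.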

Set Splash's own duplicate filter:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

If you use the HTTP cache together with Splash, you also need to specify a custom cache storage backend. scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

If you use a different cache storage, you need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.

Using scrapy-splash

SplashRequest

The simplest way to render a request is scrapy_splash.SplashRequest; this is usually the option you should choose:

 
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,

        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)
   

Alternatively, you can get the same effect by passing the splash key in the meta of a regular Scrapy request:

 
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,  # optional, default is False
        'magic_response': False,    # optional, default is True
    }
})
   

Notes on the Splash API: SplashRequest is a convenient utility for filling in request.meta['splash'].

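meta['splash']['args'] contains the arguments sent to Splash.
meta['splash']['endpoint'] specifies the Splash endpoint to use. The default is render.html for SplashRequest; a plain scrapy.Request with meta['splash'] defaults to render.json.
meta['splash']['splash_url'] overrides the Splash URL configured in settings.py.
meta['splash']['splash_headers'] lets you add or change the HTTP headers sent to the Splash server; note that this does not change the headers sent to the remote website.
meta['splash']['dont_send_headers']: set it to True if you do not want to pass headers to Splash.
meta['splash']['slot_policy'] lets you customize how concurrency and politeness are maintained for Splash requests.
meta['splash']['dont_process_response']: when set to True, SplashMiddleware does not convert the response into a custom scrapy.Response subclass. By default a SplashResponse subclass, such as SplashTextResponse, is returned.
meta['splash']['magic_response'] defaults to True; scrapy-splash then sets some attributes of the response automatically, such as response.headers and response.body.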
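The keys listed above can be assembled with a small helper before the dict is handed to a regular scrapy.Request. This is a sketch in plain Python: make_splash_meta is a hypothetical function name, it covers only some of the keys, and it does not import Scrapy or scrapy-splash.

```python
def make_splash_meta(args, endpoint="render.json", splash_url=None,
                     splash_headers=None, dont_process_response=False):
    """Build the meta dict for a Splash-rendered request.

    Optional keys are only included when they were actually provided,
    so unspecified options keep the library defaults.
    """
    splash = {"args": dict(args), "endpoint": endpoint}
    if splash_url is not None:
        splash["splash_url"] = splash_url
    if splash_headers is not None:
        splash["splash_headers"] = dict(splash_headers)
    if dont_process_response:
        splash["dont_process_response"] = True
    return {"splash": splash}


# Possible usage inside a spider (assumes scrapy and scrapy-splash are installed):
#   yield scrapy.Request(url, self.parse_result,
#                        meta=make_splash_meta({'html': 1, 'wait': 0.5}))

if __name__ == "__main__":
    print(make_splash_meta({"html": 1, "wait": 0.5}))
```

In most spiders SplashRequest is the more convenient choice; building meta by hand is mainly useful when existing Request-based code has to be kept.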

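To submit a form request through Splash, use scrapy_splash.SplashFormRequest; it is used the same way as SplashRequest.

Responses

scrapy-splash returns different Response subclasses for different Splash requests:

SplashResponse: binary responses, for example the response to /render.png
SplashTextResponse: text responses, for example the response to /render.html
SplashJsonResponse: JSON responses, for example the response to /render.json, or to /execute when a Lua script is used

If you only want standard Response objects, set meta['splash']['dont_process_response']=True.

All of these responses set response.url to the URL of the original request (the URL of the page you want rendered), not the URL of the Splash endpoint. The actual requested URL is available as response.real_url.

Session handling

Splash itself is stateless, so to support sessions with scrapy-splash you must write a Lua script and use the /execute endpoint: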

 
function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
   

The standard Scrapy cookies argument can be used with SplashRequest to add cookies to the current Splash cookiejar.
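On the Python side, the Lua script is typically kept as a string and sent to the /execute endpoint through the lua_source argument. The sketch below only prepares the keyword arguments; the script body (loading the page and returning its HTML) and the helper name execute_request_kwargs are illustrative assumptions, and scrapy-splash is not imported.

```python
# Illustrative session script: restore cookies, load the page, and return
# the updated cookies together with the rendered HTML.
LUA_SESSION_SCRIPT = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


def execute_request_kwargs(lua_source):
    """Return keyword arguments for a request to Splash's execute endpoint."""
    return {
        "endpoint": "execute",
        "args": {"lua_source": lua_source},
    }


# Possible usage inside a spider (assumes scrapy-splash is installed):
#   yield SplashRequest(url, self.parse_result,
#                       **execute_request_kwargs(LUA_SESSION_SCRIPT))

if __name__ == "__main__":
    kwargs = execute_request_kwargs(LUA_SESSION_SCRIPT)
    print(kwargs["endpoint"], len(kwargs["args"]["lua_source"]))
```

Returning the cookies from the script is what lets later requests in the same session start from the updated cookie state.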

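Example

Next I will demonstrate the usage with a real example: scraping the asynchronously loaded content on the JD.com home page.

When the JD.com home page opens, only the navigation menu is loaded up front; the rest of the home page content is loaded asynchronously. The "猜你喜歡" ("Guess you like") section further down is loaded asynchronously as well. I will scrape those four characters, "猜你喜歡", to show the difference between plain Scrapy and Scrapy using Splash to load asynchronous content.

First, a simple test spider that does not use Splash: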

 
import logging

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["jd.com"]
    start_urls = [
        "http://www.jd.com/"
    ]

    def parse(self, response):
        # Log message: "simple test that fetches the JD.com home page directly"
        logging.info(u'---------我這個是簡單的直接獲取京東網首頁測試---------')
        guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first()
        logging.info(u"find：%s" % guessyou)
        logging.info(u'---------------success----------------')
  

Output:

2016-04-18 14:42:44 test_spider.py[line:20] INFO ---------我這個是簡單的直接獲取京東網首頁測試---------
2016-04-18 14:42:44 test_spider.py[line:22] INFO find：None
2016-04-18 14:42:44 test_spider.py[line:23] INFO ---------------success----------------

The text "猜你喜歡" was not found.

Next, I scrape the page using Splash:

 
import logging

import scrapy
from scrapy_splash import SplashRequest


class JsSpider(scrapy.Spider):
    name = "jd"
    allowed_domains = ["jd.com"]
    start_urls = [
        "http://www.jd.com/"
    ]

    def start_requests(self):
        splash_args = {
            'wait': 0.5,
        }
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result, endpoint='render.html',
                                args=splash_args)

    def parse_result(self, response):
        # Log message: "scraping the asynchronously loaded JD.com home page content with Splash"
        logging.info(u'----------使用splash爬取京東網首頁異步加載內容-----------')
        guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first()
        logging.info(u"find：%s" % guessyou)
        logging.info(u'---------------success----------------')
  

Output:

2016-04-18 14:42:51 js_spider.py[line:36] INFO ----------使用splash爬取京東網首頁異步加載內容-----------
2016-04-18 14:42:51 js_spider.py[line:38] INFO find：猜你喜歡
2016-04-18 14:42:51 js_spider.py[line:39] INFO ---------------success----------------

The result now contains "猜你喜歡", which shows that the asynchronously loaded content was scraped successfully.