pyspider Crawler Tutorial (3): Rendering JavaScript Pages with PhantomJS

Date: 2019-11-10

English original: http://docs.pyspider.org/en/latest/tutorial/Render-with-PhantomJS/

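In the previous two tutorials, we learned how to extract information from HTML and how to handle pages that load their content through more complicated requests. Some pages, however, are simply too complex: the API request URL is hard to work out, or the data is encrypted during rendering, so fetching the underlying requests directly becomes very tedious. This is where PhantomJS comes in.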

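Before using PhantomJS, you need to install it (see the installation documentation). Once it is installed, it is enabled automatically when you run pyspider in all mode. Of course, you can also try it out on demo.pyspider.org.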

Using PhantomJS

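Once pyspider is connected to the PhantomJS proxy, you can enable PhantomJS fetching by adding the parameter fetch_type='js' to self.crawl. For example, http://movie.douban.com/explore, which we tried to crawl in tutorial 2, can be fetched directly with PhantomJS: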

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' renders the page with PhantomJS before parsing
        self.crawl('http://movie.douban.com/explore',
                   fetch_type='js', callback=self.phantomjs_parser)

    def phantomjs_parser(self, response):
        return [{
            # keep only the text nodes directly inside <p>, skipping child tags
            "title": "".join(
                s for s in x('p').contents() if isinstance(s, str)
            ).strip(),
            "rate": x('p strong').text(),
            "url": x.attr.href,
        } for x in response.doc('a.item').items()]

I used some PyQuery APIs here; the full API reference is available in the PyQuery complete API.
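The title expression in phantomjs_parser keeps only the text sitting directly inside <p> and drops the text of child tags such as <strong>. As a rough illustration of that idea, here is a standard-library sketch that uses xml.etree.ElementTree as a stand-in for PyQuery. The sample markup is made up for the example and may not match Douban's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely shaped like one 'a.item' entry.
SAMPLE = (
    '<a class="item" href="http://example.com/movie/1">'
    '<p>\n  Some Movie\n  <strong>8.5</strong>\n</p>'
    '</a>'
)

item = ET.fromstring(SAMPLE)
p = item.find("p")

# Direct text of <p>: its leading text plus the tail text after each child.
# The text inside <strong> itself is not included.
direct_text = (p.text or "") + "".join(child.tail or "" for child in p)

record = {
    "title": direct_text.strip(),
    "rate": p.find("strong").text,
    "url": item.get("href"),
}
print(record)
```

In the real handler, PyQuery's contents() yields both text nodes and elements, and the isinstance check plays the role of the text/tail handling shown here.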

Running a Custom Script on the Page

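You will notice that the PhantomJS crawl above returns only 20 of Douban's popular movies. Clicking "Load more" (加載更多) on the page brings up more. To get more movies, we can use the js_script parameter of self.crawl to run a script on the page that clicks "Load more":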

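    # replaces on_start in the Handler class above
    def on_start(self):
        # js_script runs in the page: wait 1 s, then click the "load more" button
        self.crawl('http://movie.douban.com/explore#more',
                   fetch_type='js', js_script="""
                   function() {
                     setTimeout("$('.more').click()", 1000);
                   }""", callback=self.phantomjs_parser)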

By default this script runs after the page has finished loading; you can change this behavior with the js_run_at parameter.

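Because the list is loaded asynchronously via AJAX, the first page of movies may not have finished loading when the page load completes, so we use setTimeout to delay the click by 1 second.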

You can click several times, with a pause between clicks, to load more pages.
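One way to click repeatedly is to generate the js_script string in Python. The sketch below is only an illustration: build_click_script and its parameters are hypothetical names, not part of pyspider, and it assumes the page exposes jQuery as $ and a .more button, as in the example above. Whether the later clicks land before the fetcher captures the page depends on timing that this sketch does not address.

```python
def build_click_script(clicks=3, interval_ms=1000, selector=".more"):
    """Return a js_script string that clicks `selector` once per interval.

    Hypothetical helper for illustration; not a pyspider API.
    """
    lines = ["function() {"]
    for i in range(1, clicks + 1):
        # Schedule the i-th click at i * interval_ms after the script starts.
        delay = i * interval_ms
        lines.append("  setTimeout(\"$('%s').click()\", %d);" % (selector, delay))
    lines.append("}")
    return "\n".join(lines)


script = build_click_script(clicks=3, interval_ms=1000)
print(script)
```

The resulting string could then be passed as js_script=script in the self.crawl call shown earlier.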

Tasks with the same URL (strictly speaking, the same taskid) are deduplicated, so #more is appended to the URL here.
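To see why the fragment helps: pyspider's documentation describes the default taskid as the MD5 of the URL. The sketch below imitates that with hashlib. The helper name default_taskid is hypothetical, and the real implementation may differ in details such as encoding.

```python
import hashlib


def default_taskid(url):
    # Stand-in for the documented default (MD5 of the URL); hypothetical name.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


plain = default_taskid("http://movie.douban.com/explore")
with_fragment = default_taskid("http://movie.douban.com/explore#more")

# Different URL strings hash to different ids, so the second task is not
# treated as a duplicate of the first.
print(plain != with_fragment)  # True
```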

Both examples above can be found at http://demo.pyspider.org/debug/tutorial_douban_explore.

Chinese original: http://blog.binux.me/2015/01/pyspider-tutorial-level-3-render-with-phantomjs/