pyspider example code, part 1: using PhantomJS to handle JavaScript-generated content

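This series of articles records and explains pyspider example code, in the hope of prompting further discussion. The official site for pyspider example code is http://demo.pyspider.org/. There are so many examples there that it is hard to know where to start, so I have picked out some of the more classic ones and explain them briefly, hoping they are of some help to newcomers.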

About this example:

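If part of a page's data or text is generated by JavaScript, pyspider cannot extract that data directly: it fetches the page's source, but the JavaScript inside it is not executed. Using PhantomJS as the fetcher solves this, because PhantomJS executes the JavaScript before the page is handed to the callback.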

Usage:

Method 1: pass fetch_type="js" to the self.crawl call so that PhantomJS executes the JavaScript.

Method 2: decorate the callback function with @config(fetch_type="js").
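To make the relationship between the two methods concrete, here is a minimal sketch. It is a simplified stand-in and not pyspider's actual implementation: MiniHandler, its crawl method, the config helper and the example.com URLs are all hypothetical, and the sketch assumes that options attached by the decorator act as defaults that an explicit argument on the crawl call can override.

```python
# Simplified stand-in, NOT pyspider's real code. It only models how a
# decorator default and a per-call argument can both yield fetch_type="js".

def config(**defaults):
    """Attach default crawl options to a callback (hypothetical mini version)."""
    def wrapper(func):
        func._config = defaults
        return func
    return wrapper


class MiniHandler:
    """Hypothetical handler that records tasks instead of fetching anything."""

    def __init__(self):
        self.tasks = []

    def crawl(self, url, callback, **kwargs):
        # Explicit keyword arguments win; decorator defaults fill the gaps.
        for key, value in getattr(callback, "_config", {}).items():
            kwargs.setdefault(key, value)
        task = {"url": url, "callback": callback.__name__, **kwargs}
        self.tasks.append(task)
        return task

    def plain_page(self, response):
        pass

    @config(fetch_type="js")
    def js_page(self, response):
        pass


handler = MiniHandler()
# Method 1: pass fetch_type="js" on the individual crawl call.
task_one = handler.crawl("http://example.com/a", callback=handler.plain_page,
                         fetch_type="js")
# Method 2: the callback carries fetch_type="js" through the decorator.
task_two = handler.crawl("http://example.com/b", callback=handler.js_page)
# A callback with neither setting gets no fetch_type at all.
task_three = handler.crawl("http://example.com/c", callback=handler.plain_page)
```

In this model, Method 1 affects a single request, while Method 2 applies to every request whose callback is the decorated function, which is why the decorator is convenient when all pages of one kind need JavaScript rendering.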

Example code:

1. Example for the www.sciencedirect.com site

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Created on 2014-10-31 13:05:52

import re
# Older pyspider releases exposed this module as libs.base_handler.
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    crawl_config = {
        "headers": {
            "User-Agent": "BaiDu_Spider",
        },
        "timeout":300,
        "connect_timeout":100
    }
    
    def on_start(self):
        self.crawl('http://www.sciencedirect.com/science/article/pii/S1568494612005741',timeout=300,connect_timeout=100,
                   callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/article/pii/S0167739X12000581',timeout=300,connect_timeout=100,
                   age=0, callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/journal/09659978',timeout=300,connect_timeout=100,
                   age=0, callback=self.index_page)
        
    @config(fetch_type="js")
    def index_page(self, response):
        for each in response.doc('a').items():
            url = each.attr.href
            # Skip anchors without an href, then follow only article pages.
            if url is not None:
                if re.match(r'http://www\.sciencedirect\.com/science/article/pii/\w+$', url):
                    self.crawl(url, callback=self.detail_page, timeout=300, connect_timeout=100)
        
    @config(fetch_type="js")
    def detail_page(self, response):
        self.index_page(response)
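        # Follow the first related-article link, if the page has one.
        related_url = response.doc('#relArtList > li > .cLink').attr.href
        if related_url:
            self.crawl(related_url, callback=self.index_page, timeout=300, connect_timeout=100)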
        
        return {
                "url": response.url,
                "title": response.doc('.svTitle').text(),
                "authors": [x.text() for x in response.doc('.authorName').items()],
                "abstract": response.doc('.svAbstract > p').text(),
                "keywords": [x.text() for x in response.doc('.keyword span').items()],
                }
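
The link filter in index_page can be tried on its own, without running a crawl. The sketch below uses the pattern from index_page, here written as a raw string with escaped dots; the is_article_url helper and the list of candidate URLs are illustrative assumptions, not part of the original handler.

```python
import re

# The pattern from index_page, as a raw string with the dots escaped.
ARTICLE_URL = re.compile(
    r'http://www\.sciencedirect\.com/science/article/pii/\w+$')


def is_article_url(url):
    """Return True when url is an article link the handler would follow."""
    return url is not None and ARTICLE_URL.match(url) is not None


candidates = [
    # Article page: matches the pattern.
    'http://www.sciencedirect.com/science/article/pii/S1568494612005741',
    # Journal index page: a different path, so it is skipped.
    'http://www.sciencedirect.com/science/journal/09659978',
    # Article URL with a query string: "?" is not matched by \w, so skipped.
    'http://www.sciencedirect.com/science/article/pii/S1568494612005741?np=y',
    # An <a> tag without an href yields None.
    None,
]

accepted = [u for u in candidates if is_article_url(u)]
```

Only the first candidate ends up in accepted, which mirrors how index_page follows article pages and ignores every other link on the page.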