This series of articles records and explains pyspider example code, in the hope of sparking further discussion. The official pyspider demo site is http://demo.pyspider.org/. There are so many examples there that it is hard to know where to start, so I have picked out a few classic ones and explain them briefly, hoping they will be of some help to newcomers.
If part of a page's data or text is generated by JavaScript, pyspider cannot extract it directly: pyspider fetches the page's source but does not execute the JavaScript in it. phantomjs solves this by executing the JavaScript before the page is handed to your handler. There are two ways to enable it:
Method 1: add fetch_type="js" to the self.crawl() call, so that request is fetched by phantomjs, which executes the JavaScript.
Method 2: decorate the callback function with @config(fetch_type="js"), so every request dispatched to that callback is fetched by phantomjs, as sketched below.
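Before the full example, here is a minimal sketch contrasting the two approaches. The URLs, the handler name JsDemoHandler, and the callback names are placeholders for illustration, not part of the original example.

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from libs.base_handler import *


class JsDemoHandler(BaseHandler):
    def on_start(self):
        # Method 1: pass fetch_type="js" to this single self.crawl() call,
        # so only this request is rendered by phantomjs. (Placeholder URL.)
        self.crawl('http://example.com/js-rendered-page',
                   fetch_type="js",
                   callback=self.parse_per_request)

        # This request relies on the @config decorator of its callback instead.
        self.crawl('http://example.com/another-js-page',
                   callback=self.parse_per_callback)

    def parse_per_request(self, response):
        # response.doc is the DOM after the page's JavaScript has run.
        return {"title": response.doc('title').text()}

    # Method 2: every request whose callback is this method is fetched by phantomjs.
    @config(fetch_type="js")
    def parse_per_callback(self, response):
        return {"title": response.doc('title').text()}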
1. Example: the www.sciencedirect.com site
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Created on 2014-10-31 13:05:52

import re
from libs.base_handler import *


class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    crawl_config = {
        "headers": {
            "User-Agent": "BaiDu_Spider",
        },
        "timeout": 300,
        "connect_timeout": 100
    }

    def on_start(self):
        self.crawl('http://www.sciencedirect.com/science/article/pii/S1568494612005741',
                   timeout=300, connect_timeout=100,
                   callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/article/pii/S0167739X12000581',
                   timeout=300, connect_timeout=100, age=0,
                   callback=self.detail_page)
        self.crawl('http://www.sciencedirect.com/science/journal/09659978',
                   timeout=300, connect_timeout=100, age=0,
                   callback=self.index_page)

    @config(fetch_type="js")
    def index_page(self, response):
        # Follow every link on the page that looks like an article page.
        for each in response.doc('a').items():
            url = each.attr.href
            # print(url)
            if url is not None:
                if re.match(r'http://www.sciencedirect.com/science/article/pii/\w+$', url):
                    self.crawl(url, callback=self.detail_page,
                               timeout=300, connect_timeout=100)

    @config(fetch_type="js")
    def detail_page(self, response):
        # An article page also links to other articles, so reuse index_page,
        # and queue the "related articles" list as a further index page.
        self.index_page(response)
        self.crawl(response.doc('#relArtList > li > .cLink').attr.href,
                   callback=self.index_page, timeout=300, connect_timeout=100)
        return {
            "url": response.url,
            "title": response.doc('.svTitle').text(),
            "authors": [x.text() for x in response.doc('.authorName').items()],
            "abstract": response.doc('.svAbstract > p').text(),
            "keywords": [x.text() for x in response.doc('.keyword span').items()],
        }
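Both index_page and detail_page are decorated with @config(fetch_type="js"), so every journal and article page is rendered by phantomjs before the CSS selectors (.svTitle, .authorName, .svAbstract and so on) are evaluated. detail_page also reuses index_page to follow further article links and queues the related-articles list (#relArtList) as another index page, and the raised timeout/connect_timeout values allow for JavaScript rendering being slower than a plain HTTP fetch.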