Getting Started with pyspider

1. Installation

pip install pyspider

 

Also install PhantomJS: http://phantomjs.org/build.html

 

2. Check that pyspider starts

At the command prompt (cmd), run: pyspider
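
Once pyspider is running, one way to confirm that the web UI actually came up is to request it from Python. This is only a sketch: it assumes the web UI listens on http://localhost:5000/, which is pyspider's default address but is not stated in this post, and `webui_is_up` is a helper name made up for this example.

```python
import urllib.error
import urllib.request


def webui_is_up(url="http://localhost:5000/", timeout=3):
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        # A server answered, but with an error status (e.g. 401 if auth is on).
        return True
    except (urllib.error.URLError, OSError):
        # Nothing is listening there, or the connection timed out.
        return False


# Example (assumed default address): print(webui_is_up())
```

If the function returns False, adjust the URL to match your setup before assuming that the installation failed.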

 

Fixing installation problems

Python version: 3.6

1. Error on startup:
    raise ValueError("Invalid configuration:\n  - " + "\n  - ".join(errors))
ValueError: Invalid configuration:
  - Deprecated option 'domaincontroller': use 'http_authenticator.domain_controller' instead.
 Fix: replace wsgidav with version 2.4.1:
    python -m pip install wsgidav==2.4.1
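2. PhantomJS setup on Windows: download the build that matches your system from the official site, http://phantomjs.org/download.html, and put the executable in the Python installation directory, next to python.exe.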
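
Both fixes can be checked from Python. The sketch below is not part of pyspider; the helper names are made up for this example. It treats any wsgidav version above 2.4.1 as needing a downgrade, which is an assumption: the post only establishes that 2.4.1 works. The PhantomJS lookup checks PATH first and then the directory that contains the running Python interpreter.

```python
import os
import shutil
import sys


def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits:
            parts.append(int(digits))
    return tuple(parts)


def wsgidav_needs_downgrade(installed, pinned="2.4.1"):
    """True if the installed wsgidav version is newer than the pinned one (assumption)."""
    return parse_version(installed) > parse_version(pinned)


def find_phantomjs():
    """Look for phantomjs on PATH, then next to the Python interpreter."""
    found = shutil.which("phantomjs")
    if found:
        return found
    python_dir = os.path.dirname(sys.executable)
    for name in ("phantomjs.exe", "phantomjs"):
        candidate = os.path.join(python_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

The installed version string can be read from the output of `pip show wsgidav` and passed to `wsgidav_needs_downgrade`.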

 

Fixing the pyspider web preview pane being too small

CSS that works:

body{margin:0;padding:0;height:100%;overflow:hidden}.warning{color:#f0ad4e}.error{color:#d9534f}#control{z-index:9999;min-width:760px;width:100%;height:35px;position:fixed;left:0;right:0;box-shadow:0 1px 2px #999}#control div{line-height:35px;margin-left:10px;margin-right:10px}#control .webdav-btn{position:relative;float:right;padding:1px 7px 0;line-height:21px;border-radius:5px;border:1px solid #428bca;background:#fff;color:#428bca;cursor:pointer;margin:6px 0 0 10px}#control .webdav-btn:hover{background:#6aa3d5;color:#fff}#control .webdav-btn.active{background:#428bca;color:#fff}#editarea{width:100%;position:fixed;top:37px}#editarea,.debug-panel{left:0;right:0;bottom:0}.debug-panel{position:absolute;top:0}.resize{background-color:#555;cursor:ew-resize}.resize:hover+.debug-panel{border-left:1px dashed #555!important}.overlay{position:absolute;top:0;bottom:0;left:0;right:0;z-index:9999;background:rgba(0,0,0,.4)}.focus .CodeMirror-activeline-background{background:#e8f2ff!important}.CodeMirror-activeline-background{background:transparent!important}#task-panel{height:100%;overflow-x:auto}#run-task-btn{z-index:99;position:absolute;top:0;right:0;background:#5cb85c;border-radius:0 0 0 5px;color:#fff;margin:0;padding:3px 7px 5px 10px;cursor:pointer;font-weight:700;line-height:15px}#run-task-btn:hover{background:#449d44}#undo-redo-btn-group{z-index:99;position:absolute;top:0;right:0;background:#91cf91;border-radius:0 0 0 5px;color:#fff;margin:0;padding:3px 7px 5px 10px;cursor:pointer;font-weight:700;line-height:15px;top:auto;bottom:0;border-radius:5px 0 0 0;padding:5px 0 3px;overflow:hidden}#undo-redo-btn-group:hover{background:#6ec06e;background:#91cf91}#undo-redo-btn-group a{color:#fff;text-decoration:none;padding:5px 7px 3px 10px}#undo-redo-btn-group a:hover{background:#6ec06e}#save-task-btn{z-index:99;position:absolute;top:0;right:0;background:#428bca;border-radius:0 0 0 5px;color:#fff;margin:0;padding:3px 7px 5px 
10px;cursor:pointer;font-weight:700;line-height:15px}#save-task-btn:hover{background:#3071a9}#task-editor{position:relative}#task-editor .CodeMirror{height:auto;padding-bottom:3px;background:#c7e6c7}#task-editor .CodeMirror-scroll{overflow-x:auto;overflow-y:hidden}#task-editor.focus .CodeMirror-activeline-background{background:#eaf6ea!important}#tab-control{list-style-type:none;position:absolute;bottom:0;right:0;margin:8px 20px;padding:0}#tab-control li{position:relative;float:right;padding:1px 7px 0;line-height:21px;margin-left:10px;border-radius:5px;border:1px solid #428bca;background:#fff;color:#428bca;cursor:pointer}#tab-control li:hover{background:#6aa3d5;color:#fff}#tab-control li.active{background:#428bca;color:#fff}#tab-control li span{position:absolute;top:-5px;right:-10px;background:#d9534f;color:#fff;font-size:80%;font-weight:700;padding:2px 5px 0;border-radius:10px}#debug-tabs{margin-bottom:45px}#tab-web.fixed{padding-top:24px}#tab-web iframe{border-width:0;width:100%;height:900px !important}#tab-html{margin:0;padding:7px 5px}#tab-html pre{margin:0;padding:0}#tab-follows .newtask{position:relative;height:30px;line-height:30px;background:#fceedb;border-bottom:1px solid #f0ad4e;border-top:1px solid #f0ad4e;margin-top:-1px;padding-left:5px;padding-right:70px;overflow:hidden;white-space:nowrap;text-overflow:ellipsis;cursor:pointer}#tab-follows .newtask:hover,#tab-follows .newtask:hover .task-more{background:#f8d9ac}#tab-follows .newtask .task-callback{color:#ec971f}#tab-follows .newtask .task-url{font-size:95%;text-decoration:underline;font-weight:lighter;color:#428bca}#tab-follows .newtask .task-more{position:absolute;right:33px;top:0;float:right;color:#f0ad4e;padding:0 10px;background:#fceedb;border-radius:10px}#tab-follows .newtask .task-run{position:absolute;right:0;top:0;font-size:80%;padding:0 10px 0 30px;float:right;border-bottom:1px solid #a3d7a3;border-top:1px solid #a3d7a3;background:#80c780;color:#fff;text-shadow:0 0 10px 
#fff;font-weight:700}#tab-follows .newtask .task-run:hover{background:#5cb85c}#tab-follows .task-show pre{margin:5px 5px 10px}#python-editor{position:absolute;top:0;width:100%;bottom:0}#python-editor .CodeMirror{height:100%;padding-bottom:20px}#python-log{width:100%;min-height:10px;max-height:40%;background:rgba(0,0,0,.6);overflow:auto}#python-log #python-log-show{z-index:89;width:auto;padding-top:5px;background:#d9534f;box-shadow:0 2px 20px #d9534f;cursor:pointer}#python-log pre{margin:0;padding:10px;color:#fff}#css-selector-helper{padding:0;width:100%;height:24px;text-align:right;white-space:nowrap}#css-selector-helper.fixed{position:absolute;top:0}#css-selector-helper button{line-height:16px;vertical-align:2px}span.element{position:relative;height:24px;display:inline-block;padding:0 .2em;cursor:pointer;color:#afafaf;z-index:99999}span.element.invalid{display:none}span.element.selected{color:#000}span.element:hover>ul{display:block}span.element>ul{display:none;margin:0;padding:0;position:absolute;top:24px;left:0;border:1px solid #000;border-top-width:0;color:#afafaf}span.element>ul>li{display:block;text-align:left;white-space:nowrap;padding:0 4px}span.element>ul>li.selected{color:#000}span.element>ul>li:hover{padding:0;border:0;margin:0;padding-right:.2em;font-size:1em;text-align:right;width:100%;margin-left:-100px;background:#eee}

 

Replace the entire contents of /pyspider/webui/static/debug.min.css with the CSS above.
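
Since this overwrites a file that ships with pyspider, it is worth keeping a copy of the original. The sketch below shows one way to do that; `replace_css` is a hypothetical helper, not something pyspider provides, and the `.bak` suffix is an arbitrary choice.

```python
import os
import shutil


def replace_css(target_path, new_css):
    """Back up target_path to target_path + '.bak', then overwrite it with new_css."""
    backup_path = target_path + ".bak"
    if not os.path.exists(backup_path):
        # Keep only the first backup so the original file is never lost.
        shutil.copy2(target_path, backup_path)
    with open(target_path, "w", encoding="utf-8") as handle:
        handle.write(new_css)
    return backup_path


# Possible usage, assuming pyspider is importable in this environment:
#   import pyspider
#   css_path = os.path.join(os.path.dirname(pyspider.__file__),
#                           "webui", "static", "debug.min.css")
#   replace_css(css_path, open("my_debug.min.css", encoding="utf-8").read())
```

To undo the change, copy the `.bak` file back over debug.min.css.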

 

3. Quick introduction

Your first script

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
        # configuration
    }

    # class-level variables, e.g. a database handle
    client = pymongo.MongoClient(host='106.12.108.236', port=27017)
    db = client['trip']

    @every(minutes=24 * 60)
    def on_start(self):
        # entry point of the crawl (the start URL)
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # parse the start page and collect the detail-page URLs
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # extract the information from a detail page
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
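
The numbers passed to the decorators are plain arithmetic. As far as the pyspider documentation describes them, `every` takes minutes and `age` is given in seconds. The sketch below only converts the two values used in the script into days so the intent is easier to read; the variable names are made up for this example.

```python
from datetime import timedelta

# @every(minutes=24 * 60): how often on_start is scheduled.
every_interval = timedelta(minutes=24 * 60)

# @config(age=10 * 24 * 60 * 60): how long, in seconds, a fetched page is
# treated as still valid, so it is not crawled again within that period.
page_age = timedelta(seconds=10 * 24 * 60 * 60)

print(every_interval.days)  # 1
print(page_age.days)        # 10
```

So the script restarts from on_start once a day, and index pages are considered fresh for ten days.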

 

 

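  • def on_start(self) is the entry point of the script. It is called when you click the run button on the dashboard.
  • self.crawl(url, callback=self.index_page) is the most important API here. It adds a new task to be crawled. Most options are specified through the parameters of self.crawl.
  • def index_page(self, response) receives a Response object.
  • response.doc is a pyquery object, which has a jQuery-like API for selecting the elements to extract.
  • def detail_page(self, response) returns a dict object as the result. By default the result is captured into resultdb. You can override the on_result(self, result) method to manage results yourself.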
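
To make the selector used in index_page more concrete, the sketch below reproduces what `a[href^="http"]` matches, using only the standard library. It is a stand-in for illustration, not pyquery and not pyspider's own code; the class name and the sample HTML are made up.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href starts with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("http"):
            self.links.append(href)


sample_html = """
<a href="http://example.com/a">absolute</a>
<a href="/relative">relative</a>
<a href="https://example.com/b">secure</a>
<a name="anchor">no href</a>
"""

collector = HttpLinkCollector()
collector.feed(sample_html)
print(collector.links)
# ['http://example.com/a', 'https://example.com/b']
```

Note that pyspider's response.doc converts links to absolute URLs before you query it, so on a real page the relative link would usually match as well.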

 

4. The web interface

 

Click the save button

 

Then click run

 

 

The view that shows the page currently waiting to be parsed

 

 

 

Starting and deleting a project

 

 

5. Page crawling code

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2019-04-10 20:27:19
# Project: tripadvisor

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
    }

    # connect to the database
    client = pymongo.MongoClient(host='106.12.108.236', port=27017)
    db = client['trip']

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl(
            'https://www.tripadvisor.cn/Attractions-g186338-Activities-c47-t163-London_England.html',
            callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.listing_title > a').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)
        # follow the "next page" link; it is missing on the last page
        next_url = response.doc('.pagination .nav.next').attr.href
        if next_url:
            self.crawl(next_url, callback=self.index_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        url = response.url
        name = response.doc('.h1').text()
        rate = response.doc('a > .reviewCount').text()
        address = response.doc('.contactInfo > .address').text()
        phone = response.doc('.contact > .is-hidden-mobile > div').text()
        decs = response.doc('#component_5 > div > div:nth-child(2)').text()
        return {
            'url': url,
            'name': name,
            'rate': rate,
            'address': address,
            'phone': phone,
            'decs': decs,
        }

    def on_result(self, result):
        '''Save the result if the detail page returned one.'''
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # save to the database; insert_one replaces the deprecated insert()
        if self.db['london'].insert_one(result):
            print('save to mongo', result)
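
Because on_start runs again every day, inserting each result as a new document can store the same attraction more than once. One option, sketched here as a suggestion rather than something the original script does, is to key each record on its URL and upsert it. `build_upsert` is a hypothetical helper and the sample record is invented.

```python
def build_upsert(result):
    """Return (filter, update) arguments for an upsert keyed on the page URL."""
    query = {"url": result["url"]}
    update = {"$set": dict(result)}
    return query, update


record = {"url": "https://example.com/attraction/1", "name": "Example"}
query, update = build_upsert(record)
print(query)   # {'url': 'https://example.com/attraction/1'}
print(update)  # {'$set': {'url': 'https://example.com/attraction/1', 'name': 'Example'}}

# With a real pymongo collection this would be used as:
#   collection.update_one(query, update, upsert=True)
```

With this approach a repeated crawl updates the existing document instead of adding a duplicate.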

 

6. Local files created after running pyspider all
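
To see what was written to disk, you can list the files from Python. The sketch assumes pyspider used its default data directory, a folder named data under the directory where the command was run; that name is not given in this post, so adjust it if your setup differs. `list_files` is a made-up helper name.

```python
import os


def list_files(directory):
    """Return (relative_path, size_in_bytes) for every file under directory."""
    entries = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            entries.append((os.path.relpath(full, directory), os.path.getsize(full)))
    return sorted(entries)


# "data" is an assumed default location, not something stated in this post.
if os.path.isdir("data"):
    for path, size in list_files("data"):
        print(path, size)
```

With the default settings you would typically expect to find SQLite database files for tasks, projects and results there.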

 

7. Startup configuration
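
pyspider can read its startup options from a JSON file passed with -c. The sketch below writes a minimal file of that kind. The keys follow the example in pyspider's deployment documentation as best I recall it, and every value (port, user name, password, file name) is a placeholder chosen for this example, not a setting taken from this post.

```python
import json


def write_config(path):
    """Write a minimal pyspider config file and return the dict that was written."""
    config = {
        "webui": {
            "port": 5000,                # assumed default web UI port
            "username": "some_name",     # placeholder
            "password": "some_passwd",   # placeholder
            "need-auth": True,           # ask for the credentials above
        }
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


# Possible usage:
#   write_config("config.json")
# and then, from the command line:
#   pyspider -c config.json all
```

Check the options against the documentation for your installed version before relying on them.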
