Crawling Anjuke Property Listings with PyCharm + Scrapy

1. Overview

1.1 Development Environment

Development environment -- PyCharm

Crawler framework -- Scrapy

Language -- Python 3.6

Third-party libraries -- Scrapy, pymysql, matplotlib

Database -- MySQL 5.5 (listening on 127.0.0.1:3306, user root, password root, database anjuke)
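
If you prefer the command line to PyCharm's package manager, the same three libraries can be installed with pip (a minimal sketch; versions are not pinned here):

pip install Scrapy pymysql matplotlib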

 

1.2 Program Overview

This program uses Anjuke Shenzhen as its example; other cities share the same page structure, so the crawler can be ported by simply changing the URLs in start_urls and in the rules.

The program crawls both new-development (loupan) listings and second-hand house listings. A few minor issues remain, but it is basically usable.

The overall flow: a CrawlSpider starts from start_urls; every crawled URL that matches a rule triggers that rule's callback; the callback parses the page with XPath and fills a listing item; the pipeline writes each incoming item to the database; finally, a report script reads the data back out and generates charts.

The two targets differ in their anti-crawling behaviour. New-development pages have none. Second-hand pages, first, limit the request rate: above a certain frequency a CAPTCHA is required (handled here by sending at most one request every 5 seconds); second, they are post-processed by JavaScript: with JavaScript disabled the information sits in div[4], with it enabled it is moved into div[3]. Scrapy does not execute JavaScript, so the JavaScript-disabled paths must be used to extract the data.
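
One way to check which variant you are getting is the Scrapy shell, which, like the spider, does not execute JavaScript. The listing URL below is a hypothetical placeholder; the XPath is the title path used by the second-hand spider in 3.5:

scrapy shell "https://shenzhen.anjuke.com/prop/view/A000000000"
>>> response.xpath('/html/body/div[1]/div[2]/div[3]/h3/text()').extract_first()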

The project source has been uploaded to GitHub: https://github.com/PrettyUp/Anjuke

 

2. Creating the Database Tables

SQL creation script:

# Host: localhost  (Version: 5.5.53)
# Date: 2018-06-06 18:27:08
# Generator: MySQL-Front 5.3  (Build 4.234)

/*!40101 SET NAMES utf8 */;

#
# Structure for table "sz_loupan_info"
#

CREATE TABLE `sz_loupan_info` (
  `loupan_name` varchar(255) DEFAULT NULL,
  `loupan_status` varchar(255) DEFAULT NULL,
  `loupan_price` int(11) DEFAULT NULL,
  `loupan_discount` varchar(255) DEFAULT NULL,
  `loupan_layout` varchar(255) DEFAULT NULL,
  `loupan_location` varchar(255) DEFAULT NULL,
  `loupan_opening` varchar(255) DEFAULT NULL,
  `loupan_transfer` varchar(255) DEFAULT NULL,
  `loupan_type` varchar(255) DEFAULT NULL,
  `loupan_age` varchar(255) DEFAULT NULL,
  `loupan_url` varchar(255) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

#
# Structure for table "sz_sh_house_info"
#

CREATE TABLE `sz_sh_house_info` (
  `house_title` varchar(255) DEFAULT NULL,
  `house_cost` varchar(255) DEFAULT NULL,
  `house_code` varchar(255) DEFAULT NULL,
  `house_public_time` varchar(255) DEFAULT NULL,
  `house_community` varchar(255) DEFAULT NULL,
  `house_location` varchar(255) DEFAULT NULL,
  `house_build_years` varchar(255) DEFAULT NULL,
  `house_kind` varchar(255) DEFAULT NULL,
  `house_layout` varchar(255) DEFAULT NULL,
  `house_size` varchar(255) DEFAULT NULL,
  `house_face_to` varchar(255) DEFAULT NULL,
  `house_point` varchar(255) DEFAULT NULL,
  `house_price` varchar(255) DEFAULT NULL,
  `house_first_pay` varchar(255) DEFAULT NULL,
  `house_month_pay` varchar(255) DEFAULT NULL,
  `house_decorate_type` varchar(255) DEFAULT NULL,
  `house_agent` varchar(255) DEFAULT NULL,
  `house_agency` varchar(255) DEFAULT NULL,
  `house_url` varchar(255) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
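
Assuming the anjuke database has already been created, the script can be loaded from the command line like this (the file name is illustrative):

mysql -uroot -proot anjuke < anjuke_tables.sql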

 

3. Implementation

3.1 Create the Project with Scrapy

Open cmd, change to the PyCharm workspace directory, and run:

scrapy startproject Anjuke
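
This generates the standard Scrapy project skeleton:

Anjuke/
    scrapy.cfg
    Anjuke/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py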

 

3.2 Open the Project in PyCharm and Install the Third-Party Libraries

Install scrapy, pymysql, and matplotlib directly from PyCharm (their dependencies are pulled in automatically). Also note: once Scrapy is installed, copy a cmdline.py into the project's main directory so the spiders can be launched from a PyCharm run configuration.
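
Alternatively, instead of copying cmdline.py, a small launcher script achieves the same thing (a sketch; the file name main.py is arbitrary, the spider names are the ones defined in this project):

# main.py -- place next to scrapy.cfg and run it from PyCharm
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'anjuke_sz'])   # or 'anjuke_sz_sh'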

 

3.3 Create the Program Files

anjuke_sz_spider.py----spider for new-development (loupan) listings

anjuke_sz_sh_spider.py----spider for second-hand house listings

anjuke_sz_report.py----chart generator for the loupan data

anjuke_sz_sh_report.py----chart generator for the second-hand data

The project directory layout is as follows:
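
Relative to the skeleton shown in 3.1, the additions are the four scripts and the copied cmdline.py. The placement below is a plausible reconstruction; in particular, the report scripts need not live inside the package:

Anjuke/
    scrapy.cfg
    Anjuke/
        __init__.py
        cmdline.py              (copied from the Scrapy installation)
        items.py
        middlewares.py
        pipelines.py
        settings.py
        anjuke_sz_report.py
        anjuke_sz_sh_report.py
        spiders/
            __init__.py
            anjuke_sz_spider.py
            anjuke_sz_sh_spider.py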

 

3.4 Configure Scrapy to Run and Debug in PyCharm

Configure run parameters for anjuke_sz_spider.py and anjuke_sz_sh_spider.py.
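
In PyCharm this amounts to an ordinary Python run configuration that points at the copied cmdline.py; a typical setup (paths are illustrative) looks like:

Script path:        <project>\Anjuke\cmdline.py
Parameters:         crawl anjuke_sz        (or: crawl anjuke_sz_sh)
Working directory:  <project>\Anjuke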

 

3.5 File Implementations

settings.py

BOT_NAME = 'Anjuke'

SPIDER_MODULES = ['Anjuke.spiders']
NEWSPIDER_MODULE = 'Anjuke.spiders'

# Present a desktop Firefox UA instead of Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0'

ROBOTSTXT_OBEY = False

# At most one request every 5 seconds, to stay under the second-hand
# listing rate limit that would otherwise trigger a CAPTCHA (see 1.2)
DOWNLOAD_DELAY = 5
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'Anjuke.pipelines.AnjukeSZPipeline': 300,
    'Anjuke.pipelines.AnjukeSZSHPipeline': 300,
}

items.py

import scrapy


class AnjukeSZItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    loupan_id = scrapy.Field()
    loupan_name = scrapy.Field()
    loupan_status = scrapy.Field()
    loupan_price = scrapy.Field()
    loupan_discount = scrapy.Field()
    loupan_layout = scrapy.Field()
    loupan_location = scrapy.Field()
    loupan_opening = scrapy.Field()
    loupan_transfer = scrapy.Field()
    loupan_type = scrapy.Field()
    loupan_age = scrapy.Field()
    loupan_url = scrapy.Field()


class AnjukeSZSHItem(scrapy.Item):
    house_title = scrapy.Field()
    house_cost = scrapy.Field()
    house_code = scrapy.Field()
    house_public_time = scrapy.Field()
    house_community = scrapy.Field()
    house_location = scrapy.Field()
    house_build_years = scrapy.Field()
    house_kind = scrapy.Field()
    house_layout = scrapy.Field()
    house_size = scrapy.Field()
    house_face_to = scrapy.Field()
    house_point = scrapy.Field()
    house_price = scrapy.Field()
    house_first_pay = scrapy.Field()
    house_month_pay = scrapy.Field()
    house_decorate_type = scrapy.Field()
    house_agent = scrapy.Field()
    house_agency = scrapy.Field()
    house_url = scrapy.Field()

 pipelines.py

import pymysql

from Anjuke.items import AnjukeSZItem, AnjukeSZSHItem


class AnjukeSZPipeline(object):
    def __init__(self):
        self.db = pymysql.connect("localhost", "root", "root", "anjuke", charset="utf8")
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Both pipelines are registered in settings.py, so every item passes
        # through both; only handle loupan items here.
        if not isinstance(item, AnjukeSZItem):
            return item
        # Parameterized query: the driver does the quoting, so values that
        # contain quote characters cannot break the INSERT.
        sql = "insert into sz_loupan_info(loupan_name,loupan_status,loupan_price,loupan_discount," \
              "loupan_layout,loupan_location,loupan_opening,loupan_transfer,loupan_type,loupan_age,loupan_url) " \
              "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        # loupan_price is an int column; an empty string cannot be cast
        loupan_price = int(item['loupan_price']) if item['loupan_price'] else None
        self.cursor.execute(sql, (item['loupan_name'], item['loupan_status'], loupan_price,
                                  item['loupan_discount'], item['loupan_layout'], item['loupan_location'],
                                  item['loupan_opening'], item['loupan_transfer'], item['loupan_type'],
                                  item['loupan_age'], item['loupan_url']))
        self.db.commit()
        return item

    def __del__(self):
        self.db.close()


class AnjukeSZSHPipeline(object):
    def __init__(self):
        self.db = pymysql.connect("localhost", "root", "root", "anjuke", charset="utf8")
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        if not isinstance(item, AnjukeSZSHItem):
            return item
        sql = "insert into sz_sh_house_info(house_title,house_cost,house_code,house_public_time," \
              "house_community,house_location,house_build_years,house_kind,house_layout,house_size," \
              "house_face_to,house_point,house_price,house_first_pay,house_month_pay," \
              "house_decorate_type,house_agent,house_agency,house_url) " \
              "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (item['house_title'], item['house_cost'], item['house_code'],
                                  item['house_public_time'], item['house_community'], item['house_location'],
                                  item['house_build_years'], item['house_kind'], item['house_layout'],
                                  item['house_size'], item['house_face_to'], item['house_point'],
                                  item['house_price'], item['house_first_pay'], item['house_month_pay'],
                                  item['house_decorate_type'], item['house_agent'], item['house_agency'],
                                  item['house_url']))
        self.db.commit()
        return item

    def __del__(self):
        self.db.close()
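
After a crawl, a quick sanity check in the MySQL client confirms that the pipelines are writing rows:

select count(*) from sz_loupan_info;
select count(*) from sz_sh_house_info;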

anjuke_sz_spider.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from Anjuke.items import AnjukeSZItem


class AnjukeSpider(scrapy.spiders.CrawlSpider):
    name = 'anjuke_sz'
    allowed_domains = ["anjuke.com"]   # fixed: was 'allow_domains', which Scrapy ignores
    start_urls = [
        'https://sz.fang.anjuke.com/loupan/all/p1/',
    ]
    rules = [
        # follow the paginated listing pages
        Rule(LinkExtractor(allow=("https://sz\.fang\.anjuke\.com/loupan/all/p\d{1,}"))),
        # parse the individual loupan detail pages
        Rule(LinkExtractor(allow=("https://sz\.fang\.anjuke\.com/loupan/\d{1,}")), follow=False, callback='parse_item')
    ]

    def is_number(self, s):
        try:
            float(s)
            return True
        except ValueError:
            pass
        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass
        return False

    def get_sellout_item(self, response):
        # Sold-out (售罄) pages come in several layout variants, hence the
        # chain of XPath fallbacks keyed on where the price is found.
        loupan_nodes = {}
        loupan_nodes['loupan_name_nodes'] = response.xpath('//*[@id="j-triggerlayer"]/text()')
        loupan_nodes['loupan_status_nodes'] = response.xpath('/html/body/div[1]/div[3]/div/div[2]/i/text()')
        loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/span/text()')
        if loupan_nodes['loupan_price_nodes']:
            if self.is_number(loupan_nodes['loupan_price_nodes'].extract()[0].strip()):
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[4]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[1]/span/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[2]/span/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
            else:
                loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[1]/p/em/text()')
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[1]/span/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[2]/span/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/div[1]/ul[1]/li/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        else:
            loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[1]/p/em/text()')
            if loupan_nodes['loupan_price_nodes']:
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[1]/span/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/p[2]/span/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/div[1]/ul[1]/li/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
            else:
                loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[1]/p/em/text()')
                if loupan_nodes['loupan_price_nodes']:
                    loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                    loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[2]/div/text()')
                    loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[3]/span/text()')
                    loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/span/text()')
                    loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/span/text()')
                    loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/div[1]/ul[1]/li/span/text()')
                    loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
                else:
                    loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[2]/span/text()')
                    loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                    loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[3]/div/text()')
                    loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[4]/span/text()')
                    loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/span/text()')
                    loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/span/text()')
                    loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[3]/div[1]/ul[1]/li/span/text()')
                    loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        if not loupan_nodes['loupan_location_nodes']:
            loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/span/text()')
        loupan_item = self.struct_loupan_item(loupan_nodes)
        return loupan_item

    def get_sellwait_item(self, response):
        # Not-yet-on-sale (待售) pages, with their own layout fallbacks.
        loupan_nodes = {}
        loupan_nodes['loupan_name_nodes'] = response.xpath('//*[@id="j-triggerlayer"]/text()')
        loupan_nodes['loupan_status_nodes'] = response.xpath('/html/body/div[1]/div[3]/div/div[2]/i/text()')
        loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/span/text()')
        if loupan_nodes['loupan_price_nodes']:
            loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
            loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/div/text()')
            loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[4]/span/text()')
            loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/text()')
            loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/text()')
            loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
            loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        else:
            loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[2]/span/text()')
            if loupan_nodes['loupan_price_nodes']:
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[3]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[4]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[5]/p[1]/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[5]/p[2]/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
            else:
                loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[1]/p/em/text()')
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        if not loupan_nodes['loupan_location_nodes']:
            loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/span/text()')
        loupan_item = self.struct_loupan_item(loupan_nodes)
        return loupan_item

    def get_common_item(self, response):
        # Every other status (e.g. currently on sale).
        loupan_nodes = {}
        loupan_nodes['loupan_name_nodes'] = response.xpath('//*[@id="j-triggerlayer"]/text()')
        loupan_nodes['loupan_status_nodes'] = response.xpath('/html/body/div[1]/div[3]/div/div[2]/i/text()')
        loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[1]/p/em/text()')
        if loupan_nodes['loupan_price_nodes']:
            loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/text()')
            loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/div/text()')
            loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[4]/span/text()')
            loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/text()')
            loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/text()')
            loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
            loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        else:
            loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[1]/p/em/text()')
            if loupan_nodes['loupan_price_nodes']:
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[2]/a[1]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[3]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/dl/dd[4]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[5]/p[1]/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[5]/p[2]/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
            else:
                loupan_nodes['loupan_price_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[2]/span/text()')
                loupan_nodes['loupan_discount_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/text()')
                loupan_nodes['loupan_layout_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[4]/div/text()')
                loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[5]/span/text()')
                loupan_nodes['loupan_opening_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[1]/span/text()')
                loupan_nodes['loupan_transfer_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/p[2]/span/text()')
                loupan_nodes['loupan_type_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[1]/span/text()')
                loupan_nodes['loupan_age_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[4]/div/ul[1]/li[2]/span/text()')
        if not loupan_nodes['loupan_location_nodes']:
            loupan_nodes['loupan_location_nodes'] = response.xpath('/html/body/div[2]/div[1]/div[2]/div[1]/dl/dd[3]/span/text()')
        loupan_item = self.struct_loupan_item(loupan_nodes)
        return loupan_item

    def struct_loupan_item(self, loupan_nodes):
        # Turn the collected selector nodes into an item, defaulting every
        # missing field to an empty string.
        loupan_item = AnjukeSZItem()
        if loupan_nodes['loupan_name_nodes']:
            loupan_item['loupan_name'] = loupan_nodes['loupan_name_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_name'] = ''   # added: the original set no default for the name
        if loupan_nodes['loupan_status_nodes']:
            loupan_item['loupan_status'] = loupan_nodes['loupan_status_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_status'] = ''
        if loupan_nodes['loupan_price_nodes']:
            loupan_item['loupan_price'] = loupan_nodes['loupan_price_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_price'] = ''
        if loupan_nodes['loupan_discount_nodes']:
            loupan_item['loupan_discount'] = loupan_nodes['loupan_discount_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_discount'] = ''
        if loupan_nodes['loupan_layout_nodes']:
            loupan_item['loupan_layout'] = loupan_nodes['loupan_layout_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_layout'] = ''
        if loupan_nodes['loupan_location_nodes']:
            loupan_item['loupan_location'] = loupan_nodes['loupan_location_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_location'] = ''
        if loupan_nodes['loupan_opening_nodes']:
            loupan_item['loupan_opening'] = loupan_nodes['loupan_opening_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_opening'] = ''
        if loupan_nodes['loupan_transfer_nodes']:
            loupan_item['loupan_transfer'] = loupan_nodes['loupan_transfer_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_transfer'] = ''
        if loupan_nodes['loupan_type_nodes']:
            loupan_item['loupan_type'] = loupan_nodes['loupan_type_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_type'] = ''
        if loupan_nodes['loupan_age_nodes']:
            loupan_item['loupan_age'] = loupan_nodes['loupan_age_nodes'].extract()[0].strip()
        else:
            loupan_item['loupan_age'] = ''
        return loupan_item

    def parse_item(self, response):
        # Dispatch on the sales status shown in the page header.
        loupan_status_nodes = response.xpath('/html/body/div[1]/div[3]/div/div[2]/i/text()')
        if loupan_status_nodes.extract()[0].strip() == '售罄':
            loupan_item = self.get_sellout_item(response)
        elif loupan_status_nodes.extract()[0].strip() == '待售':
            loupan_item = self.get_sellwait_item(response)
        else:
            loupan_item = self.get_common_item(response)
        loupan_item['loupan_url'] = response.url
        return loupan_item
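
Outside PyCharm, the spider can also be run directly from the project root; the optional -o flag dumps the scraped items to a file for quick inspection alongside the database writes:

scrapy crawl anjuke_sz -o loupan.json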

anjuke_sz_sh_spider.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from Anjuke.items import AnjukeSZSHItem


class AnjukeSpider(scrapy.spiders.CrawlSpider):
    name = 'anjuke_sz_sh'
    allowed_domains = ["anjuke.com"]   # fixed: was 'allow_domains', which Scrapy ignores
    start_urls = [
        'https://shenzhen.anjuke.com/sale/p1',
    ]
    rules = [
        Rule(LinkExtractor(allow=("https://shenzhen\.anjuke\.com/sale/p\d{1,}"))),
        Rule(LinkExtractor(allow=("https://shenzhen\.anjuke\.com/prop/view/A\d{1,}")), follow=False, callback='parse_item')
    ]

    def get_house_item(self, response):
        # All paths below are the JavaScript-disabled variants (see 1.2).
        house_nodes = {}
        house_nodes["house_title_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[3]/h3/text()')
        house_nodes["house_cost_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[1]/span[1]/em/text()')
        house_nodes["house_code_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/h4/span[2]/text()')
        house_nodes["house_community_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[1]/dl[1]/dd/a/text()')
        house_nodes["house_location_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[1]/dl[2]/dd/p')
        house_nodes["house_build_years_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[1]/dl[3]/dd/text()')
        house_nodes["house_kind_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[1]/dl[4]/dd/text()')
        house_nodes["house_layout_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[2]/dl[1]/dd/text()')
        house_nodes["house_size_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[2]/dl[2]/dd/text()')
        house_nodes["house_face_to_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[2]/dl[3]/dd/text()')
        house_nodes["house_point_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[2]/dl[4]/dd/text()')
        house_nodes["house_price_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[3]/dl[1]/dd/text()')
        house_nodes["house_first_pay_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[3]/dl[2]/dd/text()')
        house_nodes["house_month_pay_nodes"] = response.xpath('//*[@id="reference_monthpay"]/text()')
        house_nodes["house_decorate_type_nodes"] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[1]/div[3]/div/div[1]/div/div[3]/dl[4]/dd/text()')
        house_nodes['house_agent_nodes'] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div/div[1]/div[1]/div/div/text()')
        house_nodes['house_agency_nodes'] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div/div[1]/div[5]/div/p[1]/a/text()')
        if not house_nodes['house_agency_nodes']:
            house_nodes['house_agency_nodes'] = response.xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div/div[1]/div[5]/div/p/text()')
        house_item = self.struct_house_item(house_nodes)
        return house_item

    def struct_house_item(self, house_nodes):
        # Turn the collected selector nodes into an item, defaulting every
        # missing field to an empty string.
        house_item = AnjukeSZSHItem()
        if house_nodes['house_title_nodes']:
            house_item['house_title'] = house_nodes['house_title_nodes'].extract()[0].strip()
        else:
            house_item['house_title'] = ''
        if house_nodes['house_cost_nodes']:
            house_item['house_cost'] = house_nodes['house_cost_nodes'].extract()[0].strip()
        else:
            house_item['house_cost'] = ''
        if house_nodes['house_code_nodes']:
            # The span holds both the house code and the publish time. The
            # delimiter was garbled in the original post; a full-width comma
            # is assumed here.
            temp_dict = house_nodes['house_code_nodes'].extract()[0].strip().split('，')
            house_item['house_code'] = temp_dict[0]
            house_item['house_public_time'] = temp_dict[1]
        else:
            house_item['house_code'] = ''
            house_item['house_public_time'] = ''
        if house_nodes['house_community_nodes']:
            house_item['house_community'] = house_nodes['house_community_nodes'].extract()[0].strip()
        else:
            house_item['house_community'] = ''
        if house_nodes['house_location_nodes']:
            house_item['house_location'] = house_nodes['house_location_nodes'].xpath('string(.)').extract()[0].strip().replace('\t', '').replace('\n', '')
        else:
            house_item['house_location'] = ''
        if house_nodes['house_build_years_nodes']:
            house_item['house_build_years'] = house_nodes['house_build_years_nodes'].extract()[0].strip()
        else:
            house_item['house_build_years'] = ''
        if house_nodes['house_kind_nodes']:
            house_item['house_kind'] = house_nodes['house_kind_nodes'].extract()[0].strip()
        else:
            house_item['house_kind'] = ''
        if house_nodes['house_layout_nodes']:
            house_item['house_layout'] = house_nodes['house_layout_nodes'].extract()[0].strip().replace('\t', '').replace('\n', '')
        else:
            house_item['house_layout'] = ''
        if house_nodes['house_size_nodes']:
            house_item['house_size'] = house_nodes['house_size_nodes'].extract()[0].strip()
        else:
            house_item['house_size'] = ''
        if house_nodes['house_face_to_nodes']:
            house_item['house_face_to'] = house_nodes['house_face_to_nodes'].extract()[0].strip()
        else:
            house_item['house_face_to'] = ''
        if house_nodes['house_point_nodes']:
            house_item['house_point'] = house_nodes['house_point_nodes'].extract()[0].strip()
        else:
            house_item['house_point'] = ''
        if house_nodes['house_price_nodes']:
            house_item['house_price'] = house_nodes['house_price_nodes'].extract()[0].strip()
        else:
            house_item['house_price'] = ''
        if house_nodes['house_first_pay_nodes']:
            house_item['house_first_pay'] = house_nodes['house_first_pay_nodes'].extract()[0].strip()
        else:
            house_item['house_first_pay'] = ''
        if house_nodes['house_month_pay_nodes']:
            house_item['house_month_pay'] = house_nodes['house_month_pay_nodes'].extract()[0].strip()
        else:
            house_item['house_month_pay'] = ''
        if house_nodes['house_decorate_type_nodes']:
            house_item['house_decorate_type'] = house_nodes['house_decorate_type_nodes'].extract()[0].strip()
        else:
            house_item['house_decorate_type'] = ''
        if house_nodes['house_agent_nodes']:
            house_item['house_agent'] = house_nodes['house_agent_nodes'].extract()[0].strip()
        else:
            house_item['house_agent'] = ''
        if house_nodes['house_agency_nodes']:
            house_item['house_agency'] = house_nodes['house_agency_nodes'].extract()[0].strip()
        else:
            house_item['house_agency'] = ''
        return house_item

    def parse_item(self, response):
        house_item = self.get_house_item(response)
        house_item['house_url'] = response.url
        return house_item

anjuke_sz_report.py

import matplotlib.pyplot as plt
import pymysql
import numpy as np


class AjukeSZReport():
    def __init__(self):
        self.db = pymysql.connect('127.0.0.1', 'root', 'root', 'anjuke', charset='utf8')
        self.cursor = self.db.cursor()

    def export_result_piture(self):
        district = ['南山', '寶安', '福田', '羅湖', '光明', '龍華', '龍崗', '坪山', '鹽田', '大鵬', '深圳', '惠州', '東莞']
        x = np.arange(len(district))
        house_price_avg = []
        for district_temp in district:
            if district_temp == '深圳':
                sql = "select avg(loupan_price) from sz_loupan_info where loupan_location not like '%周邊%' and loupan_price > 5000"
            else:
                sql = "select avg(loupan_price) from sz_loupan_info where loupan_location like '%" + district_temp + "%' and loupan_price > 5000"
            self.cursor.execute(sql)
            results = self.cursor.fetchall()
            house_price_avg.append(results[0][0])
        bars = plt.bar(x, house_price_avg)
        plt.xticks(x, district)
        plt.rcParams['font.sans-serif'] = ['SimHei']
        i = 0
        for bar in bars:
            plt.text((bar.get_x() + bar.get_width() / 2), bar.get_height(), '%d' % house_price_avg[i], ha='center', va='bottom')
            i += 1
        plt.show()

    def __del__(self):
        self.db.close()


if __name__ == '__main__':
    anjukeSZReport = AjukeSZReport()
    anjukeSZReport.export_result_piture()
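
Run the script once the crawl has finished; note that the SimHei font must be available to matplotlib for the Chinese axis labels to render:

python anjuke_sz_report.py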

anjuke_sz_sh_report.py

import matplotlib.pyplot as plt
import pymysql
import numpy as np


class AjukeSZSHReport():
    def __init__(self):
        self.db = pymysql.connect('127.0.0.1', 'root', 'root', 'anjuke', charset='utf8')
        self.cursor = self.db.cursor()

    def export_result_piture(self):
        district = ['南山', '寶安', '福田', '羅湖', '光明', '龍華', '龍崗', '坪山', '鹽田', '大鵬', '深圳', '惠州', '東莞']
        x = np.arange(len(district))
        house_price_avg = []
        for district_temp in district:
            if district_temp == '深圳':
                sql = "select house_price from sz_sh_house_info where house_location not like '%周邊%'"
            else:
                sql = "select house_price from sz_sh_house_info where house_location like '%" + district_temp + "%'"
            self.cursor.execute(sql)
            results = self.cursor.fetchall()
            house_price_sum = 0
            house_num = 0
            for result in results:
                house_price_dict = result[0].split(' ')
                house_price_sum += int(house_price_dict[0])
                house_num += 1
            house_price_avg.append(house_price_sum / house_num)
        bars = plt.bar(x, house_price_avg)
        plt.xticks(x, district)
        plt.rcParams['font.sans-serif'] = ['SimHei']
        i = 0
        for bar in bars:
            plt.text((bar.get_x() + bar.get_width() / 2), bar.get_height(), '%d' % house_price_avg[i], ha='center', va='bottom')
            i += 1
        plt.show()

    def __del__(self):
        self.db.close()


if __name__ == '__main__':
    anjukeSZReport = AjukeSZSHReport()
    anjukeSZReport.export_result_piture()

The other files are left unmodified, exactly as generated.

 

4. Results

Screenshot of some of the loupan data collected by anjuke_sz_spider.py:

Chart generated by anjuke_sz_report.py:

Screenshot of some of the second-hand house data collected by anjuke_sz_sh_spider.py:

Chart generated by anjuke_sz_sh_report.py:

 
