I had previously used Selenium and Requests to scrape data, but it felt slow, so after going through the Scrapy tutorial I decided to try this framework instead.
1. Goal: crawl second-hand housing listings for Chengdu from Lianjia, mainly the community name, the surrounding-area description, floor information, and prices, and write this information into MySQL.
2. Environment: Scrapy 1.5.1 + Python 3.6.
3. Create the project: create a Scrapy project by running the following command in the target directory: scrapy startproject LianJiaScrapy
4. Project layout: (run.py is newly added; it launches the Scrapy project from inside Eclipse, which makes debugging easier)
These files are:
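A freshly generated Scrapy 1.5 project, with the extra run.py assumed to sit at the project root, typically looks like this (lianjia_spider.py is generated in step 5):

LianJiaScrapy/
    scrapy.cfg
    run.py
    LianJiaScrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            lianjia_spider.py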
5. Create the main spider file: open cmd in the project root and run: scrapy genspider lianjia_spider cd.lianjia.com. A new lianjia_spider.py is created under the spiders directory.
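At this point the generated file is only a stub; it looks roughly like the following (exact contents depend on the Scrapy template), and step 7 replaces it with the real crawling logic. Note that the final spider renames the spider to "Lianjia", which is the name used by scrapy crawl later:

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    allowed_domains = ['cd.lianjia.com']
    start_urls = ['http://cd.lianjia.com/']

    def parse(self, response):
        pass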
6. Writing items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class ScrapylianjiaItem(Item):
'''
houseName: name of the residential community
description: description of the house
floor: floor information
positionIcon: district the house belongs to
followInfo: number of follows and posting time of the listing
subway: whether it is close to a subway line
taxfree: tax information
haskey: whether the house can be viewed at any time
totalPrice: total price
unitPrice: unit price
'''
houseName = Field()
description = Field()
floor = Field()
positionIcon = Field()
followInfo = Field()
subway = Field()
taxfree = Field()
haskey = Field()
totalPrice = Field()
unitPrice = Field()
7. Writing the spider file lianjia_spider.py:
# -*- coding: utf-8 -*-
'''
Created on 2018-08-23
@author: zww
'''
import scrapy
import random
import time

from LianJiaScrapy.items import ScrapylianjiaItem


class LianJiaSpider(scrapy.Spider):
    name = "Lianjia"
    start_urls = [
        "https://cd.lianjia.com/ershoufang/pg1/",
    ]

    def parse(self, response):
        # Base URL used to assemble the next page to crawl
        init_url = 'https://cd.lianjia.com/ershoufang/pg'
        # Each listing sits under //li[@class="clear LOGCLICKDATA"]; there are 30 per page
        sels = response.xpath('//li[@class="clear LOGCLICKDATA"]')
        # Grab all 30 entries of each field at once
        houseName_list = sels.xpath(
            '//div[@class="houseInfo"]/a/text()').extract()
        description_list = sels.xpath(
            '//div[@class="houseInfo"]/text()').extract()
        floor_list = sels.xpath(
            '//div[@class="positionInfo"]/text()').extract()
        positionIcon_list = sels.xpath(
            '//div[@class="positionInfo"]/a/text()').extract()
        followInfo_list = sels.xpath(
            '//div[@class="followInfo"]/text()').extract()
        subway_list = sels.xpath('//span[@class="subway"]/text()').extract()
        taxfree_list = sels.xpath('//span[@class="taxfree"]/text()').extract()
        haskey_list = sels.xpath('//span[@class="haskey"]/text()').extract()
        totalPrice_list = sels.xpath(
            '//div[@class="totalPrice"]/span/text()').extract()
        unitPrice_list = sels.xpath(
            '//div[@class="unitPrice"]/span/text()').extract()
        # Map the scraped data onto the fields defined in items.py
        i = 0
        for sel in sels:
            item = ScrapylianjiaItem()
            item['houseName'] = houseName_list[i].strip()
            item['description'] = description_list[i].strip()
            item['floor'] = floor_list[i].strip()
            item['positionIcon'] = positionIcon_list[i].strip()
            item['followInfo'] = followInfo_list[i].strip()
            item['subway'] = subway_list[i].strip()
            item['taxfree'] = taxfree_list[i].strip()
            item['haskey'] = haskey_list[i].strip()
            item['totalPrice'] = totalPrice_list[i].strip()
            item['unitPrice'] = unitPrice_list[i].strip()
            i += 1
            yield item
        # Get the current page number; the attribute value looks like {"totalPage":100,"curPage":98}
        has_next_page = sels.xpath(
            '//div[@class="page-box fr"]/div[1]/@page-data').extract()[0]
        # The value is a str, so convert it to a dict and read the curPage field
        to_dict = eval(has_next_page)
        current_page = to_dict['curPage']
        # Lianjia only exposes 100 pages, so stop the spider once page 100 is done
        if current_page != 100:
            next_page = current_page + 1
            url = ''.join([init_url, str(next_page), '/'])
            print('starting to crawl url:', url)
            # Sleep a random interval to avoid getting the IP banned
            time.sleep(round(random.uniform(1, 2), 2))
            yield scrapy.Request(url, callback=self.parse)
        else:
            print('scrapy done!')
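The parse method above collects each field into a separate list and then indexes all of them in parallel, which only stays aligned when every listing carries every tag (subway, taxfree, haskey). A minimal alternative sketch, assuming the same page structure and class names, extracts each field relative to its own listing node instead:

    def parse(self, response):
        for sel in response.xpath('//li[@class="clear LOGCLICKDATA"]'):
            item = ScrapylianjiaItem()
            # './/' keeps each query scoped to the current <li> listing node
            item['houseName'] = sel.xpath('.//div[@class="houseInfo"]/a/text()').extract_first(default='').strip()
            item['description'] = sel.xpath('.//div[@class="houseInfo"]/text()').extract_first(default='').strip()
            item['floor'] = sel.xpath('.//div[@class="positionInfo"]/text()').extract_first(default='').strip()
            item['positionIcon'] = sel.xpath('.//div[@class="positionInfo"]/a/text()').extract_first(default='').strip()
            item['followInfo'] = sel.xpath('.//div[@class="followInfo"]/text()').extract_first(default='').strip()
            item['subway'] = sel.xpath('.//span[@class="subway"]/text()').extract_first(default='').strip()
            item['taxfree'] = sel.xpath('.//span[@class="taxfree"]/text()').extract_first(default='').strip()
            item['haskey'] = sel.xpath('.//span[@class="haskey"]/text()').extract_first(default='').strip()
            item['totalPrice'] = sel.xpath('.//div[@class="totalPrice"]/span/text()').extract_first(default='').strip()
            item['unitPrice'] = sel.xpath('.//div[@class="unitPrice"]/span/text()').extract_first(default='').strip()
            yield item

With extract_first(default=''), listings that lack one of the optional tags simply get an empty string instead of shifting the other fields.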
8. Writing the item pipeline file pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from scrapy.utils.project import get_project_settings


class LianjiascrapyPipeline(object):

    InsertSql = '''insert into scrapy_lianjia
        (houseName,description,floor,followInfo,haskey,
        positionIcon,subway,taxfree,totalPrice,unitPrice)
        values('{houseName}','{description}','{floor}','{followInfo}',
        '{haskey}','{positionIcon}','{subway}','{taxfree}','{totalPrice}','{unitPrice}')'''

    def __init__(self):
        self.settings = get_project_settings()
        # Connect to the database
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)
        # All inserts, updates and queries go through this cursor
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sqltext = self.InsertSql.format(
            houseName=item['houseName'], description=item['description'],
            floor=item['floor'], followInfo=item['followInfo'],
            haskey=item['haskey'], positionIcon=item['positionIcon'],
            subway=item['subway'], taxfree=item['taxfree'],
            totalPrice=item['totalPrice'], unitPrice=item['unitPrice'])
        try:
            self.cursor.execute(sqltext)
            self.connect.commit()
        except Exception as e:
            print('Failed to insert data:', e)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
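One caveat: building the INSERT statement with str.format breaks as soon as a scraped value contains a single quote, and it leaves the table open to SQL injection. A safer sketch is to let pymysql substitute the parameters itself; the snippet below shows drop-in replacements for the InsertSql attribute and process_item method, using the same table and columns as above:

    InsertSql = '''insert into scrapy_lianjia
        (houseName, description, floor, followInfo, haskey,
         positionIcon, subway, taxfree, totalPrice, unitPrice)
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''

    def process_item(self, item, spider):
        # The driver handles quoting and escaping of the values
        params = (item['houseName'], item['description'], item['floor'],
                  item['followInfo'], item['haskey'], item['positionIcon'],
                  item['subway'], item['taxfree'], item['totalPrice'],
                  item['unitPrice'])
        try:
            self.cursor.execute(self.InsertSql, params)
            self.connect.commit()
        except Exception as e:
            self.connect.rollback()
            print('Failed to insert data:', e)
        return item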
9. To use the pipeline, the following must be set in settings.py:
ITEM_PIPELINES = {
'LianJiaScrapy.pipelines.LianjiascrapyPipeline': 300,
}
# MySQL connection settings:
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'test_scrapy'
MYSQL_USER = 'your database user here'
MYSQL_PASSWD = 'your password here'
MYSQL_PORT = 3306
# Default request headers for the spider:
DEFAULT_REQUEST_HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'your cookie here',
'Host': 'cd.lianjia.com',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
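The spider above throttles itself with time.sleep; Scrapy can also enforce a delay at the framework level. A possible addition to settings.py (the values are illustrative, not from the original project):

# Let Scrapy space out requests instead of sleeping inside the spider
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True
# New Scrapy projects default to ROBOTSTXT_OBEY = True; set it explicitly if robots.txt blocks these pages
ROBOTSTXT_OBEY = False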
10. Create the table in the MySQL database test_scrapy:
CREATE TABLE `scrapy_lianjia` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`houseName` varchar(255) DEFAULT NULL COMMENT 'community name',
`description` varchar(255) DEFAULT NULL COMMENT 'house description',
`floor` varchar(255) DEFAULT NULL COMMENT 'floor information',
`followInfo` varchar(255) DEFAULT NULL COMMENT 'number of follows and posting time',
`haskey` varchar(255) DEFAULT NULL COMMENT 'viewing requirements',
`positionIcon` varchar(255) DEFAULT NULL COMMENT 'district the house belongs to',
`subway` varchar(255) DEFAULT NULL COMMENT 'whether it is near a subway',
`taxfree` varchar(255) DEFAULT NULL COMMENT 'tax information',
`totalPrice` varchar(11) DEFAULT NULL COMMENT 'total price',
`unitPrice` varchar(255) DEFAULT NULL COMMENT 'unit price',
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=3001 DEFAULT CHARSET=utf8;
11. Run the spider:
You can simply run the command scrapy crawl Lianjia from cmd.
While writing the script I needed to debug, so I added run.py, which can be run directly or under the debugger.
My run.py file:
# -*- coding: utf-8 -*-
'''
Created on 2018-08-23
@author: zww
'''
from scrapy import cmdline

name = 'Lianjia'
cmd = 'scrapy crawl {0}'.format(name)
# Both of the calls below work; Python 2.7 and 3.6 behave slightly differently here,
# and with 2.7 the second form needs the arguments split out explicitly.
cmdline.execute(cmd.split())
# cmdline.execute(['scrapy', 'crawl', name])
12. The crawl in progress:
13. The crawl results: