Python + Scrapy crawler (scraping Lianjia second-hand housing listings)

 

 

I had previously scraped data with Selenium and Requests, but it felt slow. After going through the Scrapy tutorial, I decided to try crawling with this framework instead.

1. Goal: scrape the second-hand housing listings for Chengdu from Lianjia, mainly the estate name, surroundings, floor, price and similar fields, and write the results into MySQL.

2. Environment: Scrapy 1.5.1 + Python 3.6 (with MySQL for storage).

3. Create the project: create a Scrapy project by running the following command in the project directory: scrapy startproject LianJiaScrapy

4. Project layout (run.py is an extra file I added so the Scrapy project can be started from Eclipse for easier debugging):

These files are:

  • scrapy.cfg: the project's configuration file
  • LianJiaScrapy/: the project's Python module; this is where you will add your code
  • LianJiaScrapy/items.py: the item definitions for the project; each field name holds one piece of scraped data (it works much like a dict and is then handed to pipelines.py for processing)
  • LianJiaScrapy/pipelines.py: the project's pipelines; the scraped data is processed here (for example, this is where I write the data into the database)
  • LianJiaScrapy/spiders/: the directory that holds the spider code (the actual scraping, which maps the scraped data onto the item fields one by one)

5. Create the main spider file: open cmd in the project directory and run: scrapy genspider lianjia_spider cd.lianjia.com. A new lianjia_spider.py appears under the spiders directory.
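For reference, the generated lianjia_spider.py starts out as a near-empty skeleton, roughly like the following (the exact boilerplate depends on the Scrapy version); the parse method gets filled in later:

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    allowed_domains = ['cd.lianjia.com']
    start_urls = ['http://cd.lianjia.com/']

    def parse(self, response):
        pass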

6. Writing items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class ScrapylianjiaItem(Item):
  '''
  houseName: estate (complex) name
  description: description of the property
  floor: floor information
  positionIcon: district the property belongs to
  followInfo: follower count and listing date of this entry
  subway: whether it is near a subway line
  taxfree: whether the property carries tax
  haskey: whether it can be viewed at any time
  totalPrice: total price
  unitPrice: price per square metre
  '''
  houseName = Field()
  description = Field()
  floor = Field()
  positionIcon = Field()
  followInfo = Field()
  subway = Field()
  taxfree = Field()
  haskey = Field()
  totalPrice = Field()
  unitPrice = Field()
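As a quick illustration (not part of the project files), a Scrapy Item behaves much like a dict, which is exactly how the spider below fills it in:

from LianJiaScrapy.items import ScrapylianjiaItem

item = ScrapylianjiaItem()
item['houseName'] = 'Example Estate'  # hypothetical values, for illustration only
item['totalPrice'] = '120'
print(dict(item))  # -> {'houseName': 'Example Estate', 'totalPrice': '120'}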

7. Writing the spider file lianjia_spider.py

# -*- coding: utf-8 -*-
'''
Created on 2018-08-23

@author: zww
'''
import scrapy
import random
import time
import json
from LianJiaScrapy.items import ScrapylianjiaItem


class LianJiaSpider(scrapy.Spider):
    name = "Lianjia"
    start_urls = [
        "https://cd.lianjia.com/ershoufang/pg1/",
    ]

    def parse(self, response):
        # Base URL used to assemble the next page to crawl
        init_url = 'https://cd.lianjia.com/ershoufang/pg'
        # The listings sit under //li[@class="clear LOGCLICKDATA"]; there are 30 per page
        sels = response.xpath('//li[@class="clear LOGCLICKDATA"]')
        # Grab all 30 entries of each field at once
        houseName_list = sels.xpath(
            '//div[@class="houseInfo"]/a/text()').extract()
        description_list = sels.xpath(
            '//div[@class="houseInfo"]/text()').extract()
        floor_list = sels.xpath(
            '//div[@class="positionInfo"]/text()').extract()
        positionIcon_list = sels.xpath(
            '//div[@class="positionInfo"]/a/text()').extract()
        followInfo_list = sels.xpath(
            '//div[@class="followInfo"]/text()').extract()
        subway_list = sels.xpath('//span[@class="subway"]/text()').extract()
        taxfree_list = sels.xpath('//span[@class="taxfree"]/text()').extract()
        haskey_list = sels.xpath('//span[@class="haskey"]/text()').extract()
        totalPrice_list = sels.xpath(
            '//div[@class="totalPrice"]/span/text()').extract()
        unitPrice_list = sels.xpath(
            '//div[@class="unitPrice"]/span/text()').extract()
        # Map the scraped data onto the fields defined in items.py
        i = 0
        for sel in sels:
            item = ScrapylianjiaItem()

            item['houseName'] = houseName_list[i].strip()
            item['description'] = description_list[i].strip()
            item['floor'] = floor_list[i].strip()
            item['positionIcon'] = positionIcon_list[i].strip()
            item['followInfo'] = followInfo_list[i].strip()
            item['subway'] = subway_list[i].strip()
            item['taxfree'] = taxfree_list[i].strip()
            item['haskey'] = haskey_list[i].strip()
            item['totalPrice'] = totalPrice_list[i].strip()
            item['unitPrice'] = unitPrice_list[i].strip()
            i += 1
            yield item
        # Get the current page number; the attribute looks like {"totalPage":100,"curPage":98}
        has_next_page = sels.xpath(
            '//div[@class="page-box fr"]/div[1]/@page-data').extract()[0]
        # The attribute value is a string; parse it and read the curPage field
        to_dict = json.loads(has_next_page)
        current_page = to_dict['curPage']
        # Lianjia only exposes 100 result pages, so stop the crawl after page 100
        if current_page != 100:
            next_page = current_page + 1
            url = ''.join([init_url, str(next_page), '/'])
            print('starting to crawl url:', url)
            # Sleep a random interval between requests to reduce the risk of an IP ban
            time.sleep(round(random.uniform(1, 2), 2))
            yield scrapy.Request(url, callback=self.parse)
        else:
            print('scrapy done!')
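One caveat about the parse method above: the absolute //... XPaths collect each field for the whole page and then line the lists up by index, which silently assumes every listing card has a subway, taxfree and haskey span; if one card lacks a span, the lists fall out of step. A more defensive variant (my own sketch, not the original code) extracts each field relative to its listing:

        for sel in sels:
            item = ScrapylianjiaItem()
            # Relative XPaths: a missing <span> on one card no longer shifts later rows
            item['houseName'] = (sel.xpath('.//div[@class="houseInfo"]/a/text()').extract_first() or '').strip()
            item['subway'] = (sel.xpath('.//span[@class="subway"]/text()').extract_first() or '').strip()
            item['taxfree'] = (sel.xpath('.//span[@class="taxfree"]/text()').extract_first() or '').strip()
            item['haskey'] = (sel.xpath('.//span[@class="haskey"]/text()').extract_first() or '').strip()
            # ...the remaining fields follow the same pattern
            yield item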

  

8. Writing the data-processing file pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from scrapy.utils.project import get_project_settings


class LianjiascrapyPipeline(object):
    InsertSql = '''insert into scrapy_lianjia
        (houseName,description,floor,followInfo,haskey,
        positionIcon,subway,taxfree,totalPrice,unitPrice)  
        values('{houseName}','{description}','{floor}','{followInfo}',
        '{haskey}','{positionIcon}','{subway}','{taxfree}','{totalPrice}','{unitPrice}')'''

    def __init__(self):
        self.settings = get_project_settings()
        # Connect to the database
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)
        # The cursor is used to execute inserts, deletes, queries and updates
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sqltext = self.InsertSql.format(
            houseName=item['houseName'], description=item['description'], floor=item['floor'], followInfo=item['followInfo'],
            haskey=item['haskey'], positionIcon=item['positionIcon'], subway=item['subway'], taxfree=item['taxfree'],
            totalPrice=item['totalPrice'], unitPrice=item['unitPrice'])
        try:
            self.cursor.execute(sqltext)
            self.connect.commit()
        except Exception as e:
            print('Failed to insert data:', e)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
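One thing to watch with the string-formatted InsertSql above: if a scraped value contains a single quote, the generated SQL breaks. A hedged alternative sketch, using pymysql's own parameter substitution inside the same pipeline class (the names INSERT_SQL and values are mine, not from the original code):

    INSERT_SQL = '''insert into scrapy_lianjia
        (houseName, description, floor, followInfo, haskey,
         positionIcon, subway, taxfree, totalPrice, unitPrice)
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''

    def process_item(self, item, spider):
        # pymysql escapes the values itself, so quotes in the data are handled safely
        values = (item['houseName'], item['description'], item['floor'],
                  item['followInfo'], item['haskey'], item['positionIcon'],
                  item['subway'], item['taxfree'], item['totalPrice'], item['unitPrice'])
        self.cursor.execute(self.INSERT_SQL, values)
        self.connect.commit()
        return item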

9. To enable the pipeline, add the following to settings.py:

ITEM_PIPELINES = {
    'LianJiaScrapy.pipelines.LianjiascrapyPipeline': 300,
}

# MySQL connection settings:

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'test_scrapy'
MYSQL_USER = 'your MySQL username here'
MYSQL_PASSWD = 'your MySQL password here'
MYSQL_PORT = 3306

# Default request headers for the crawler

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'your cookie here',
    'Host': 'cd.lianjia.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
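A side note on the random time.sleep() used in the spider: sleeping inside parse() blocks Scrapy's whole event loop. As an optional alternative (an assumption on my part, not something in the original settings), Scrapy's built-in throttling does the same job from settings.py:

# Optional: built-in throttling instead of time.sleep() in the spider
DOWNLOAD_DELAY = 1.5              # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True   # randomises the delay to 0.5x-1.5x of DOWNLOAD_DELAY
ROBOTSTXT_OBEY = False            # new Scrapy projects default to True, which may block the crawl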

10. Create the table in the MySQL database test_scrapy:

CREATE TABLE `scrapy_lianjia` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`houseName` varchar(255) DEFAULT NULL COMMENT 'estate name',
`description` varchar(255) DEFAULT NULL COMMENT 'property description',
`floor` varchar(255) DEFAULT NULL COMMENT 'floor information',
`followInfo` varchar(255) DEFAULT NULL COMMENT 'follower count and listing date',
`haskey` varchar(255) DEFAULT NULL COMMENT 'viewing requirements',
`positionIcon` varchar(255) DEFAULT NULL COMMENT 'district the property belongs to',
`subway` varchar(255) DEFAULT NULL COMMENT 'near subway or not',
`taxfree` varchar(255) DEFAULT NULL COMMENT 'property tax status',
`totalPrice` varchar(11) DEFAULT NULL COMMENT 'total price',
`unitPrice` varchar(255) DEFAULT NULL COMMENT 'price per square metre',
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=3001 DEFAULT CHARSET=utf8;

11. Running the crawler:

You can run it directly from cmd with: scrapy crawl Lianjia

While writing the script I needed to debug, so I added run.py, which can be run directly or under a debugger.

My run.py file:

 

# -*- coding: utf-8 -*-
'''
Created on 2018-08-23

@author: zww
'''
from scrapy import cmdline
name = 'Lianjia'
cmd = 'scrapy crawl {0}'.format(name)

# Both of the two calls below work; Python 2.7 and 3.6 seem to behave a little
# differently here: with 2.7, the second form needs the arguments split on spaces
cmdline.execute(cmd.split())
# cmdline.execute(['scrapy', 'crawl', name])
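An equivalent way to launch the spider from an IDE (an alternative sketch, not what I used here) is Scrapy's CrawlerProcess, which keeps everything in the current Python process:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('Lianjia')  # the spider's name attribute, not the file name
process.start()           # blocks until the crawl finishes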

  

 


12. The crawl in progress:

13. The scraped results:
