I had previously used Selenium and Requests to scrape data, but it felt slow, so after going through the Scrapy tutorial I decided to try this framework instead.
1. Goal: crawl second-hand housing listings for Chengdu from Lianjia, mainly the community name, the surrounding-area description, floor information, and prices, and write this information into MySQL.
2. Environment: Scrapy 1.5.1 + Python 3.6.
3. Create the project: create a Scrapy project by running the following command in the target directory: scrapy startproject LianJiaScrapy
4. Project layout: (run.py is newly added; it launches the Scrapy project from inside Eclipse, which makes debugging easier)
These files are:
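A freshly generated Scrapy 1.5 project, with the extra run.py assumed to sit at the project root, typically looks like this (lianjia_spider.py is generated in step 5):

LianJiaScrapy/
    scrapy.cfg
    run.py
    LianJiaScrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            lianjia_spider.py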
5. Create the main spider file: open cmd in the project root and run: scrapy genspider lianjia_spider cd.lianjia.com. A new lianjia_spider.py is created under the spiders directory.
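At this point the generated file is only a stub; it looks roughly like the following (exact contents depend on the Scrapy template), and step 7 replaces it with the real crawling logic. Note that the final spider renames the spider to "Lianjia", which is the name used by scrapy crawl later:

# -*- coding: utf-8 -*-
import scrapy


class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    allowed_domains = ['cd.lianjia.com']
    start_urls = ['http://cd.lianjia.com/']

    def parse(self, response):
        pass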
6. Writing items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Field, Item
class ScrapylianjiaItem(Item):
'''
houseName: name of the residential community
description: description of the house
floor: floor information
positionIcon: district the house belongs to
followInfo: number of follows and posting time of the listing
subway: whether it is close to a subway line
taxfree: tax information
haskey: whether the house can be viewed at any time
totalPrice: total price
unitPrice: unit price
'''
houseName = Field()
description = Field()
floor = Field()
positionIcon = Field()
followInfo = Field()
subway = Field()
taxfree = Field()
haskey = Field()
totalPrice = Field()
unitPrice = Field()
7. Writing the spider file lianjia_spider.py:
# -*- coding: utf-8 -*-
'''
Created on 2018-08-23
@author: zww
'''
import scrapy
import random
import time

from LianJiaScrapy.items import ScrapylianjiaItem


class LianJiaSpider(scrapy.Spider):
    name = "Lianjia"
    start_urls = [
        "https://cd.lianjia.com/ershoufang/pg1/",
    ]

    def parse(self, response):
        # Base URL used to assemble the next page to crawl
        init_url = 'https://cd.lianjia.com/ershoufang/pg'
        # Each listing sits under //li[@class="clear LOGCLICKDATA"]; there are 30 per page
        sels = response.xpath('//li[@class="clear LOGCLICKDATA"]')
        # Grab all 30 entries of each field at once
        houseName_list = sels.xpath(
            '//div[@class="houseInfo"]/a/text()').extract()
        description_list = sels.xpath(
            '//div[@class="houseInfo"]/text()').extract()
        floor_list = sels.xpath(
            '//div[@class="positionInfo"]/text()').extract()
        positionIcon_list = sels.xpath(
            '//div[@class="positionInfo"]/a/text()').extract()
        followInfo_list = sels.xpath(
            '//div[@class="followInfo"]/text()').extract()
        subway_list = sels.xpath('//span[@class="subway"]/text()').extract()
        taxfree_list = sels.xpath('//span[@class="taxfree"]/text()').extract()
        haskey_list = sels.xpath('//span[@class="haskey"]/text()').extract()
        totalPrice_list = sels.xpath(
            '//div[@class="totalPrice"]/span/text()').extract()
        unitPrice_list = sels.xpath(
            '//div[@class="unitPrice"]/span/text()').extract()
        # Map the scraped data onto the fields defined in items.py
        i = 0
        for sel in sels:
            item = ScrapylianjiaItem()
            item['houseName'] = houseName_list[i].strip()
            item['description'] = description_list[i].strip()
            item['floor'] = floor_list[i].strip()
            item['positionIcon'] = positionIcon_list[i].strip()
            item['followInfo'] = followInfo_list[i].strip()
            item['subway'] = subway_list[i].strip()
            item['taxfree'] = taxfree_list[i].strip()
            item['haskey'] = haskey_list[i].strip()
            item['totalPrice'] = totalPrice_list[i].strip()
            item['unitPrice'] = unitPrice_list[i].strip()
            i += 1
            yield item
        # Get the current page number; the attribute value looks like {"totalPage":100,"curPage":98}
        has_next_page = sels.xpath(
            '//div[@class="page-box fr"]/div[1]/@page-data').extract()[0]
        # The value is a str, so convert it to a dict and read the curPage field
        to_dict = eval(has_next_page)
        current_page = to_dict['curPage']
        # Lianjia only exposes 100 pages, so stop the spider once page 100 is done
        if current_page != 100:
            next_page = current_page + 1
            url = ''.join([init_url, str(next_page), '/'])
            print('starting to crawl url:', url)
            # Sleep a random interval to avoid getting the IP banned
            time.sleep(round(random.uniform(1, 2), 2))
            yield scrapy.Request(url, callback=self.parse)
        else:
            print('scrapy done!')
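The parse method above collects each field into a separate list and then indexes all of them in parallel, which only stays aligned when every listing carries every tag (subway, taxfree, haskey). A minimal alternative sketch, assuming the same page structure and class names, extracts each field relative to its own listing node instead:

    def parse(self, response):
        for sel in response.xpath('//li[@class="clear LOGCLICKDATA"]'):
            item = ScrapylianjiaItem()
            # './/' keeps each query scoped to the current <li> listing node
            item['houseName'] = sel.xpath('.//div[@class="houseInfo"]/a/text()').extract_first(default='').strip()
            item['description'] = sel.xpath('.//div[@class="houseInfo"]/text()').extract_first(default='').strip()
            item['floor'] = sel.xpath('.//div[@class="positionInfo"]/text()').extract_first(default='').strip()
            item['positionIcon'] = sel.xpath('.//div[@class="positionInfo"]/a/text()').extract_first(default='').strip()
            item['followInfo'] = sel.xpath('.//div[@class="followInfo"]/text()').extract_first(default='').strip()
            item['subway'] = sel.xpath('.//span[@class="subway"]/text()').extract_first(default='').strip()
            item['taxfree'] = sel.xpath('.//span[@class="taxfree"]/text()').extract_first(default='').strip()
            item['haskey'] = sel.xpath('.//span[@class="haskey"]/text()').extract_first(default='').strip()
            item['totalPrice'] = sel.xpath('.//div[@class="totalPrice"]/span/text()').extract_first(default='').strip()
            item['unitPrice'] = sel.xpath('.//div[@class="unitPrice"]/span/text()').extract_first(default='').strip()
            yield item

With extract_first(default=''), listings that lack one of the optional tags simply get an empty string instead of shifting the other fields.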
8. Writing the item pipeline file pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from scrapy.utils.project import get_project_settings


class LianjiascrapyPipeline(object):

    InsertSql = '''insert into scrapy_lianjia
        (houseName,description,floor,followInfo,haskey,
        positionIcon,subway,taxfree,totalPrice,unitPrice)
        values('{houseName}','{description}','{floor}','{followInfo}',
        '{haskey}','{positionIcon}','{subway}','{taxfree}','{totalPrice}','{unitPrice}')'''

    def __init__(self):
        self.settings = get_project_settings()
        # Connect to the database
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)
        # All inserts, updates and queries go through this cursor
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        sqltext = self.InsertSql.format(
            houseName=item['houseName'], description=item['description'],
            floor=item['floor'], followInfo=item['followInfo'],
            haskey=item['haskey'], positionIcon=item['positionIcon'],
            subway=item['subway'], taxfree=item['taxfree'],
            totalPrice=item['totalPrice'], unitPrice=item['unitPrice'])
        try:
            self.cursor.execute(sqltext)
            self.connect.commit()
        except Exception as e:
            print('Failed to insert data:', e)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
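One caveat: building the INSERT statement with str.format breaks as soon as a scraped value contains a single quote, and it leaves the table open to SQL injection. A safer sketch is to let pymysql substitute the parameters itself; the snippet below shows drop-in replacements for the InsertSql attribute and process_item method, using the same table and columns as above:

    InsertSql = '''insert into scrapy_lianjia
        (houseName, description, floor, followInfo, haskey,
         positionIcon, subway, taxfree, totalPrice, unitPrice)
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''

    def process_item(self, item, spider):
        # The driver handles quoting and escaping of the values
        params = (item['houseName'], item['description'], item['floor'],
                  item['followInfo'], item['haskey'], item['positionIcon'],
                  item['subway'], item['taxfree'], item['totalPrice'],
                  item['unitPrice'])
        try:
            self.cursor.execute(self.InsertSql, params)
            self.connect.commit()
        except Exception as e:
            self.connect.rollback()
            print('Failed to insert data:', e)
        return item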
9. To use the pipeline, the following must be set in settings.py:
ITEM_PIPELINES = {
'LianJiaScrapy.pipelines.LianjiascrapyPipeline': 300,
}
# MySQL connection settings:
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'test_scrapy'
MYSQL_USER = 'your database user here'
MYSQL_PASSWD = 'your password here'
MYSQL_PORT = 3306
# Default request headers for the spider:
DEFAULT_REQUEST_HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'your cookie here',
'Host': 'cd.lianjia.com',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
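The spider above throttles itself with time.sleep; Scrapy can also enforce a delay at the framework level. A possible addition to settings.py (the values are illustrative, not from the original project):

# Let Scrapy space out requests instead of sleeping inside the spider
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True
# New Scrapy projects default to ROBOTSTXT_OBEY = True; set it explicitly if robots.txt blocks these pages
ROBOTSTXT_OBEY = False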
10. Create the table in the MySQL database test_scrapy:
CREATE TABLE `scrapy_lianjia` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`houseName` varchar(255) DEFAULT NULL COMMENT 'community name',
`description` varchar(255) DEFAULT NULL COMMENT 'house description',
`floor` varchar(255) DEFAULT NULL COMMENT 'floor information',
`followInfo` varchar(255) DEFAULT NULL COMMENT 'number of follows and posting time',
`haskey` varchar(255) DEFAULT NULL COMMENT 'viewing requirements',
`positionIcon` varchar(255) DEFAULT NULL COMMENT 'district the house belongs to',
`subway` varchar(255) DEFAULT NULL COMMENT 'whether it is near a subway',
`taxfree` varchar(255) DEFAULT NULL COMMENT 'tax information',
`totalPrice` varchar(11) DEFAULT NULL COMMENT 'total price',
`unitPrice` varchar(255) DEFAULT NULL COMMENT 'unit price',
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=3001 DEFAULT CHARSET=utf8;
11. Run the spider:
You can simply run the command scrapy crawl Lianjia from cmd.
While writing the script I needed to debug, so I added run.py, which can be run directly or under the debugger.
My run.py file:
# -*- coding: utf-8 -*-
'''
Created on 2018-08-23
@author: zww
'''
from scrapy import cmdline

name = 'Lianjia'
cmd = 'scrapy crawl {0}'.format(name)
# Both of the calls below work; Python 2.7 and 3.6 behave slightly differently here,
# and with 2.7 the second form needs the arguments split out explicitly.
cmdline.execute(cmd.split())
# cmdline.execute(['scrapy', 'crawl', name])
12. The crawl in progress:
13. The crawl results: