As a developer active in the Beijing-Tianjin-Hebei region, I like to poke at data about Shijiazhuang, that great international metropolis, whenever I have some spare time. In this post we scrape the rental listings on Lianjia; the scraped data can serve as raw material for data analysis in later posts.
The URL we want to scrape is: https://sjz.lianjia.com/zufang/
First, let's pin down which data we actually need.
As the screenshot shows, the yellow boxes mark the data we want.
Next, let's work out the pagination pattern:
https://sjz.lianjia.com/zufang/pg1/
https://sjz.lianjia.com/zufang/pg2/
https://sjz.lianjia.com/zufang/pg3/
https://sjz.lianjia.com/zufang/pg4/
https://sjz.lianjia.com/zufang/pg5/
...
https://sjz.lianjia.com/zufang/pg80/
With the pagination pattern in hand, we can piece together the full list of links quickly; we'll then use the lxml module to parse the page source and pull out the data we want.
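Since only the page number changes, building every link is a one-liner; a minimal sketch (the page count 1–76 is taken from the crawl code later in this post):

```python
base = "https://sjz.lianjia.com/zufang/pg{}/"
urls = [base.format(page) for page in range(1, 77)]  # pg1 .. pg76
print(urls[0], urls[-1])
```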
This time the code uses a new module, fake_useragent, which returns a random UA (User-Agent) string. The module is simple to use, and a quick web search turns up plenty of tutorials.
This post only uses it to grab a random UA:
```python
self._ua = UserAgent()
self._headers = {"User-Agent": self._ua.random}  # pick a random UA
```
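Each read of `ua.random` picks a fresh browser string, so repeated requests don't all advertise the same client; a quick check, purely for illustration:

```python
from fake_useragent import UserAgent

ua = UserAgent()
for _ in range(3):
    print(ua.random)  # typically a different browser string each time
```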
Since the page URLs can be generated up front, we use coroutines for the crawl, and write the CSV file with the pandas module.
```python
from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd


class LianjiaSpider(object):
    def __init__(self):
        self._ua = UserAgent()
        self._headers = {"User-Agent": self._ua.random}
        self._data = list()

    async def get(self, url):
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, headers=self._headers, timeout=3) as resp:
                    if resp.status == 200:
                        result = await resp.text()
                        return result
            except Exception as e:
                print(e.args)

    async def parse_html(self):
        for page in range(1, 77):
            url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
            print("Scraping {}".format(url))
            html = await self.get(url)  # fetch the page source
            if html is None:            # the request failed or timed out
                continue
            html = etree.HTML(html)     # parse the page
            self.parse_page(html)       # extract the fields we want

        print("Saving the data....")
        ######################### write the data
        data = pd.DataFrame(self._data)
        data.to_csv("lianjia_rental_data.csv", encoding='utf_8_sig')  # write to file
        ######################### write the data

    def run(self):
        loop = asyncio.get_event_loop()
        tasks = [asyncio.ensure_future(self.parse_html())]
        loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    l = LianjiaSpider()
    l.run()
```
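Note that `parse_html` above still awaits one page at a time, so the coroutine mostly buys us non-blocking I/O rather than parallel downloads. If you want the requests genuinely in flight together, a variant of `parse_html` using `asyncio.gather` could look like this sketch (same class, same helpers; be aware that 76 simultaneous requests may get you rate-limited):

```python
    async def parse_html(self):
        urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 77)]
        # fire all the requests concurrently and wait for every response
        pages = await asyncio.gather(*(self.get(u) for u in urls))
        for html in pages:
            if html is None:          # skip pages that failed or timed out
                continue
            self.parse_page(etree.HTML(html))
        pd.DataFrame(self._data).to_csv("lianjia_rental_data.csv", encoding='utf_8_sig')
```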
The code above is still missing the function that parses each page, so let's complete it.
```python
    def parse_page(self, html):
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))

            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            type = con[1]   # layout (note: shadows the built-in `type`)

            agent = info.xpath(".//div[@class='con']/a/text()")[0]
            has = info.xpath(".//div[@class='left agency']//text()")
            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]

            one_data = {
                "region": region,
                "zone": zone,
                "meters": meters,
                "where": where,
                "louceng": floor,
                "type": type,
                "xiaoshou": agent,
                "has": has,
                "price": price,
                "price_pre": price_pre,
                "num": look_num
            }
            self._data.append(one_data)  # append the record
```
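One gap worth flagging: `parse_page` calls `self.remove_space`, which never appears in the post. Judging by how it's used, it most likely just collapses the whitespace around the text nodes an XPath query returns; a minimal guess at its body:

```python
    def remove_space(self, item):
        # item is the list returned by an XPath text() query;
        # join the pieces and strip the surrounding whitespace (assumed behavior)
        return "".join(item).strip()
```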
In no time at all, most of the data has been scraped.
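Since the whole point is the analysis to come, loading the file back is a single pandas call (the filename matches the one written by `to_csv` above):

```python
import pandas as pd

df = pd.read_csv("lianjia_rental_data.csv", index_col=0)  # the index column written by to_csv
print(df.head())                     # a peek at the first few listings
print(df["region"].value_counts())  # rough listing count per neighborhood
```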