A First Look at Web Scraping: Youtx (遊天下) Rental Listings

Date: 2020-12-20


Over the past two days I learned the basics of web scraping and really enjoyed it, so I am writing up what I learned to share.

First, the question everyone cares about: what do you need to know before learning to scrape? Roughly the following:

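Python basics (defining functions, list operations, file operations, regular expressions). Difficulty: ***
Python extras (BeautifulSoup, requests, re (regular expressions))
HTML + CSS basics (class selectors, id selectors, and the DOM). Difficulty: ******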

Next comes the basic workflow for building a scraper:

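Define your goal: which website to scrape and which data to collect
Work out the pattern for extracting the required data from a single page
Turn that pattern into a function
Loop over the pages to collect the data
Save the data
Analyze the data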
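
As a rough illustration, the workflow above might map onto code like the following minimal sketch. The inline HTML pages, the `.price` selector, and the helper names `extract_prices` and `save_rows` are all made up for illustration; a real scraper would download the pages with requests instead of hard-coding them.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-ins for downloaded pages (steps 1-2 would normally fetch these).
PAGES = [
    '<ul><li class="price">120</li><li class="price">95</li></ul>',
    '<ul><li class="price">200</li></ul>',
]


def extract_prices(html):
    """Step 3: the per-page extraction rule, wrapped in a function."""
    soup = BeautifulSoup(html, 'html.parser')
    return [int(tag.get_text()) for tag in soup.select('.price')]


def save_rows(rows, fileobj):
    """Step 5: write the collected values as CSV, one per row."""
    writer = csv.writer(fileobj)
    writer.writerow(['price'])
    for row in rows:
        writer.writerow([row])


# Step 4: loop over every page and collect the data.
prices = []
for page in PAGES:
    prices.extend(extract_prices(page))

buffer = io.StringIO()  # an in-memory file keeps the sketch self-contained
save_rows(prices, buffer)

# Step 6: a trivial analysis of the collected data.
average = sum(prices) / len(prices)
print(prices, round(average, 2))
```

Keeping the extraction rule in its own function means the loop, the storage, and the analysis can each change without touching the parsing code.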

For example, suppose our task is as follows:
[Screenshot: the task description]

From this screenshot we can extract at least two important pieces of information:

The target website
The target data

Next is how to actually build the scraper, so we need to learn the relevant tools:

BeautifulSoup(https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)
requests(https://cn.python-requests.org/zh_CN/latest/)

Then we can analyze the data on a single page. For example: http://www.youtx.com/chengdu/page1/

First we enter the address in the browser:

Then (using Chrome as the example) we press F12 and get the following view:

Then we click the arrow icon (the element picker):

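Then we pick the part of the page we want to inspect, for example:

We first click the arrow, then click that spot on the page, and get the following:

The element the arrow points to is the link to the listing's main detail page, so we click through to take a look:

This is where the scraper does most of its work; much of the information we need comes from this page.
As before, we press F12, click the arrow, and then click the spot we care about, for example: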

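By inspecting the HTML we can see where the value lives. The next step is working out how to locate it. Looking closely, we find the following:

It sits under the li whose class is housemessage, so to get that content we first have to get that li tag and then read its content.
In CSS, a class selector can match many tags, whereas an id selector should match only one. We therefore need to know which tags this class selector applies to, which we can check as follows (this involves a little JS):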

Then:

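We find there is exactly one match, which saves a lot of work.
From the analysis above we have a rough plan: get the element with class housemessage from the HTML -> read the value we need.
That raises some questions: how do we get the HTML page, how do we select the element, and once we have the element, how do we read its value?
This is where the bs4 and requests modules come in.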
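
The uniqueness check described above can also be reproduced from Python rather than the browser console. A minimal sketch, assuming a made-up HTML fragment (only the class name housemessage comes from the page discussed here; the real markup will differ): `select` returns every element that matches a selector, so the length of the result tells us how many tags a class applies to.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the ids and the "other" class are hypothetical.
html = """
<ul id="listing">
  <li class="housemessage"><span>2 bedrooms</span></li>
  <li class="other">unrelated</li>
  <li class="other">also unrelated</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

house_items = soup.select('.housemessage')  # class selector: may match many tags
other_items = soup.select('.other')         # here it matches two tags
listing = soup.select('#listing')           # id selector: one tag in valid HTML

print(len(house_items), len(other_items), len(listing))
```

If a class matches more than one tag, the selector needs to be made more specific, or the right element has to be picked by index.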

Taking www.baidu.com as an example:

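# Step 1: imports
import requests
from bs4 import BeautifulSoup

# Step 2: fetch the source of the target page (the URL must include a scheme such as https://)
source = requests.get('https://www.baidu.com')
# Take a look at the methods and attributes that source provides
print(dir(source))
print(source.__dict__)

# Step 3: parse the source so it can be pretty-printed and queried
soup = BeautifulSoup(source.text, 'html.parser')
# Check the result
print(soup.prettify())

# Step 4: select tags; 'title' is only an example, substitute your own CSS selector
tags = soup.select('title')
print(tags)

# Step 5: read a tag's content; select() returns a list, so index into it first
if tags:
    print(tags[0].contents)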
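
Because `select` returns a list rather than a single tag, it helps to try the last two steps on a small document before pointing them at a live page. The sketch below uses an invented HTML fragment with invented span texts; it only shows how indexing, `contents`, `get_text()`, and attribute access behave.

```python
import re

from bs4 import BeautifulSoup

# Invented fragment for illustration; the texts and the link are hypothetical.
html = """
<li class="housemessage">
  <span>1 bathroom</span>
  <span>Area: 85 sq m</span>
  <a href="/room/1/" title="Owner name">details</a>
</li>
"""
soup = BeautifulSoup(html, 'html.parser')

spans = soup.select('.housemessage span')     # a list of matching Tag objects
first_text = spans[0].contents[0]             # first child node of the first span
area_text = spans[1].get_text()               # all text inside the second span
area = re.findall(r'\d+', area_text)[0]       # first run of digits, as a string

link = soup.select('.housemessage a')[0]
owner = link['title']                         # raises KeyError if the attribute is missing
href = link.get('href')                       # returns None if the attribute is missing

print(first_text, area, owner, href)
```

Reading `tag.contents` or `tag['title']` directly gives the same values as going through `tag.__dict__`, and is the more common way to write it.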

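With the steps above we can extract the data we want from a single page, so scraping many pages is straightforward: find the pattern the pages share, then use the same approach to iterate over the links on each page, open them, and extract the data, or keep collecting further links and go deeper step by step.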
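
The multi-page idea can be sketched as two nested steps: generate each list-page URL from the pattern, then collect the detail links found on that page. In the sketch below, `fake_fetch` and its canned HTML are hypothetical stand-ins for `requests.get(url).text`, so it runs without network access; the URL pattern and the `#results>ul>li` selector are the ones used in the full script that follows.

```python
from bs4 import BeautifulSoup

PAGE_URL = 'http://www.youtx.com/chengdu/page{}/'


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).text; returns canned HTML."""
    page_no = url.rstrip('/').rsplit('page', 1)[1]
    return (
        '<div id="results"><ul>'
        f'<li><a href="http://example.com/room/{page_no}-1/">A</a></li>'
        f'<li><a href="http://example.com/room/{page_no}-2/">B</a></li>'
        '</ul></div>'
    )


def detail_links(list_html):
    """Return the href of the first link inside each result item."""
    soup = BeautifulSoup(list_html, 'html.parser')
    links = []
    for item in soup.select('#results>ul>li'):
        # Skip items that have no link instead of raising AttributeError.
        if item.a is not None and item.a.get('href'):
            links.append(item.a.get('href'))
    return links


all_links = []
for page_no in range(1, 3):  # two pages only, to keep the sketch short
    list_url = PAGE_URL.format(page_no)
    all_links.extend(detail_links(fake_fetch(list_url)))

print(all_links)
```

Each collected link would then be passed to a per-page extraction function, which is the shape the full script takes.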

Below is the code that completes the task above, for reference. If you spot any shortcomings, please point them out. Thanks!

# -*- coding:UTF-8 -*-
"""
Created on 2020/12/19 16:40

@author : Jonny Jiang
"""


import requests
import re
from bs4 import BeautifulSoup
import csv
import sys
import time

def get_info(url):
    content = requests.get(url)
    soup = BeautifulSoup(content.text, 'html.parser')
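    housepercity = None  # city
    housedistrict = None  # district
    house_area = None  # floor area
    bedroom_num = 0  # number of bedrooms
    bathroom_num = 0  # number of bathrooms
    house_style = None  # layout (bedrooms, bathrooms)
    amount = None  # recommended number of guests
    today_price = None  # today's price
    owner = None  # landlord
    money = None  # whether a deposit is charged
    days = None  # minimum stay in days
    score = 0  # overall rating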
    result = []

    # Regular expressions that match the city and the district
    addr_1_p = r"housepercity = (.+);"
    addr_2_p = r"housedistrict = (.+);"

    # Render the page once and reuse the text for both searches
    page_text = soup.prettify()
    housepercity = re.findall(addr_1_p, page_text)[0]
    housedistrict = re.findall(addr_2_p, page_text)[0]
    # Select the spans once; raw strings keep \d from being read as a string escape
    spans = soup.select('.housemessage span')
    house_area = re.findall(r'\d+', spans[2].contents[0])[0]
    bedroom_num = re.findall(r'\d+', spans[-3].contents[0])
    if not bedroom_num:
        bedroom_num = 0
    else:
        bedroom_num = bedroom_num[0]
    bathroom_num = re.findall(r'\d+', spans[1].contents[0])
    if not bathroom_num:
        bathroom_num = 0
    else:
        bathroom_num = bathroom_num[0]
    house_style = (bedroom_num, bathroom_num)
    # Same span index as bedroom_num above; check it against the live page
    amount = re.findall(r'\d+', spans[-3].contents[0])[0]
    today_price = soup.select('.part-two p span')[0].contents[1]
    owner = soup.select(".left a")[0]['title']
    money = soup.select('.dsection-4 span')[-4].contents[0][1:]
    days = soup.select('.dsection-4 span')[2].contents[0][0]
    score_s = soup.select('.sec-1 div')[0].get('class')
    score_int = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'half': 0.5}
    tmp = ''
    if len(score_s) > 1:
        for s in score_s[1][:-4]:
            tmp += s
            if tmp in score_int:
                score += score_int[tmp]
                tmp = ''
    result = [housepercity, housedistrict, house_area, house_style, amount, today_price, owner, money, days, score]

    return result


def main():
    page_url = 'http://www.youtx.com/chengdu/page{}/'

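    # Output file name
    filename = 'test.csv'
    # Start writing
    # filename = sys.argv[1]
    # newline='' stops the csv module from writing blank rows on Windows
    f = open(filename, 'w+', encoding='utf-8', newline='')
    csv_writer = csv.writer(f)
    csv_writer.writerow(['City', 'District', 'Area', 'Layout', 'Max guests', 'Price today', 'Owner', 'Deposit required', 'Minimum stay', 'Overall rating'])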

    try:
        for i in range(1, 33):
            now_url = page_url.format(i)
            page = requests.get(now_url)
            soup = BeautifulSoup(page.text, 'html.parser')
            url_markups = soup.select('#results>ul>li')
            for url_markup in url_markups:
                url = url_markup.a.get('href')
                print(url)
                next_page = url
                result = get_info(next_page)
                print(result)
                csv_writer.writerow(result)
                time.sleep(0.5)
    finally:

        # Finished writing; close the file
        f.close()


if __name__ == '__main__':
    main()