Batch Web Scraping with Python

Date: 2020-06-14


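On most websites, older weather records generally have to be purchased. To get the complete data at lower cost, we often scrape a website instead. This post records what I learned from my first scraper. Because it was my first attempt, the Python program takes a fairly long time to run; if you spot any mistakes, please point them out.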

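The goal is to scrape the monthly average weather data for Kunming from https://en.tutiempo.net/climate/ws-567780.html. Take July 1942 as an example and look at the page https://en.tutiempo.net/climate/07-1942/ws-567780.html: in the URL, the part marked in green (07) is the month and the part marked in blue (1942) is the year. We need the data for every month from 1942 to 2019, that is, the information inside the red box in Figure 1 on every page from https://en.tutiempo.net/climate/01-1942/ws-567780.html to https://en.tutiempo.net/climate/12-2019/ws-567780.html.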
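The URL scheme described above can be sketched as follows. This is only an illustration: the names `BASE` and `monthly_urls` are made up for this example and do not appear in the original program. It relies on the `{:02d}` format specifier to zero-pad the month.

```python
# Sketch: build every monthly URL for the Kunming station, 1942 through 2019.
# BASE and monthly_urls are hypothetical names used only for this example.
BASE = "https://en.tutiempo.net/climate/{month:02d}-{year}/ws-567780.html"


def monthly_urls(first_year=1942, last_year=2019):
    """Return the monthly page URLs in chronological order."""
    return [
        BASE.format(month=month, year=year)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


urls = monthly_urls()
print(len(urls))  # 78 years x 12 months
print(urls[0])    # first page: January 1942
print(urls[-1])   # last page: December 2019
```

Generating the list up front makes it easy to check how many pages will be requested before any request is sent.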

Figure 1: The website

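Press F12 to inspect the page structure, as shown in Figure 2, and find the code that corresponds to the red box. (If you are new to HTML, hover the mouse over a line of code: the blue box that appears on the page is the element that this code produces.)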

Figure 2

The HTML that corresponds to the red box is shown in Figure 3:

Figure 3

From this we can construct a regular-expression pattern in Python:

'<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>'
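To see what this pattern captures, here is a minimal sketch that applies it to a single made-up table row. The sample values and the names `values`, `counts`, and `sample_row` are assumptions for illustration and are not taken from the real page; the pattern is rebuilt programmatically but is character-for-character the same as the one above.

```python
import re

# Rebuild the pattern above: nine cells with classes tc2..tc10, one &nbsp;
# cell, then four plain cells. That gives 13 capture groups in total.
numbered = "".join('<td class="tc{}">(.*)</td>'.format(n) for n in range(2, 11))
pattern = re.compile(numbered + "<td>&nbsp;</td>" + "<td>(.*)</td>" * 4)

# Made-up sample data (NOT real values from the site).
values = ["19.8", "24.1", "16.9", "-", "78", "111.0", "9.5", "6.3", "18.5"]
counts = ["12", "0", "3", "1"]
sample_row = (
    "<tr><td>1</td>"
    + "".join(
        '<td class="tc{}">{}</td>'.format(n, v)
        for n, v in zip(range(2, 11), values)
    )
    + "<td>&nbsp;</td>"
    + "".join("<td>{}</td>".format(c) for c in counts)
    + "</tr>"
)

# findall returns one tuple per matching row, with one entry per group.
matches = pattern.findall(sample_row)
print(len(matches))     # number of rows matched
print(len(matches[0]))  # number of captured cells in the first row
print(matches[0])
```

Note that `(.*)` is greedy and `.` does not match a newline, so this only behaves as intended when each table row sits on its own line in the HTML. If several rows share one line, a stricter group such as `([^<]*)` may be safer.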

The complete Python code is as follows:

import re

import requests
import xlwt

# Pattern that captures the 13 data cells of a table row (compiled once, outside the loops)
ROW_PATTERN = re.compile('<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>')

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('Sheet1')  # create a worksheet
for j in range(78):
    # 78 years in total (1942 to 2019)
    for k in range(12):
        # 12 months per year
        print(j, k)
        try:
            # {:02d} zero-pads the month, so January to September become 01 to 09
            url = "https://en.tutiempo.net/climate/{:02d}-{}/ws-567780.html".format(k + 1, j + 1942)
            # Fetch the page's HTML; the timeout keeps a stalled request from hanging the run
            response = requests.get(url, timeout=30)
            html = response.content.decode()
            # Find every row that matches the pattern
            rows = ROW_PATTERN.findall(html)
            row_index = j * 12 + k  # one spreadsheet row per month
            for i in range(13):
                # Write each captured value of the first matching row into the sheet
                print(rows[0][i])
                sheet.write(row_index, i, label=rows[0][i])
        except Exception as e:
            # Request failed or nothing matched: report it and leave this row empty
            print("failed:", url, e)
# Save the workbook to a spreadsheet file
book.save("weather.xls")
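Each month is written to spreadsheet row `j*12 + k`, where `j` counts years from 1942 and `k` counts months from 0. When reading `weather.xls` afterwards, it can help to convert between a row number and its year and month. The helpers below are a small sketch of that arithmetic; `row_index` and `year_month` are hypothetical names and are not part of the original program.

```python
FIRST_YEAR = 1942  # first year scraped, as in the program above


def row_index(year, month):
    """Spreadsheet row (0-based) that holds the data for a given year and month."""
    return (year - FIRST_YEAR) * 12 + (month - 1)


def year_month(row):
    """Inverse of row_index: recover (year, month) from a 0-based row number."""
    years, month0 = divmod(row, 12)
    return FIRST_YEAR + years, month0 + 1


print(row_index(1942, 7))   # July 1942
print(row_index(2019, 12))  # December 2019, the last row
print(year_month(935))
```

Rows for months whose page could not be fetched or parsed stay empty, so a blank row in the file still corresponds to a definite year and month.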