Scraping the Maoyan Movies Top 100 data with Python

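I recently needed to do some web scraping, so I found a video on Bilibili (link at the end of this post), watched it, and wrote a small program. It is largely unchanged from the video; the only difference is the final storage step, where I write to Excel instead of a txt file.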

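  • Brief requirement: scrape the Maoyan Movies TOP100 board data with a crawler
  • Language: Python
  • Tool: PyCharm
  • Libraries: requests, re, openpyxl (a library for newer Excel formats, i.e. .xlsx)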

Implementation

Maoyan Movies Robots

# -*- coding: utf-8 -*-
# @Author  : yocichen
# @Email   : yocichen@126.com
# @File    : MaoyanTop100.py
# @Software: PyCharm
# @Time    : 2019/11/6 9:52

import re

import openpyxl
import requests
from requests import RequestException

# Get a page's HTML with the requests module
def get_one_page(url):
    try:
        headers = {
            'user-agent': 'Mozilla/5.0'
        }
        # Send a User-Agent header to avoid a 403 Forbidden error (the site rejects crawlers)
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# Extract the useful fields from one page of HTML with the re module
def parse_one_page(html):
    # Each match is a 7-tuple: rank, title, poster URL, cast, release time,
    # integer part of the score, fraction part of the score
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"'
                         r'.*?data-src="(.*?)".*?</a>.*?star">\s*(.*?)\n\s*</p>.*?'
                         r'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?'
                         r'fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    return items

# Fetch and parse one page
def main(url):
    page_html = get_one_page(url)
    # get_one_page returns None when the request fails; return no items in that case
    if page_html is None:
        return []
    parse_res = parse_one_page(page_html)
    return parse_res

# Write the useful info to an Excel (*.xlsx) file
def write_excel_xlsx(items):
    wb = openpyxl.Workbook()
    ws = wb.active
    rows = len(items)
    # Guard against an empty result so items[0] does not raise IndexError
    cols = len(items[0]) if items else 0
    # First, write the column titles
    ws.cell(1, 1).value = 'No.'
    ws.cell(1, 2).value = 'Title'
    ws.cell(1, 3).value = 'Poster'
    ws.cell(1, 4).value = 'Starring'
    ws.cell(1, 5).value = 'Release date'
    ws.cell(1, 6).value = 'Score'
    # Write each film's info
    for i in range(0, rows):
        for j in range(0, cols):
            # print(items[i-1][j-1])
            if j != 5:
                ws.cell(i+2, j+1).value = items[i][j]
            else:
                # Join the integer and fraction parts of the score into one cell
                ws.cell(i+2, j+1).value = items[i][j]+items[i][j+1]
                break
    # Save the workbook as *.xlsx
    wb.save('maoyan_top100.xlsx')

if __name__ == '__main__':
    res = []
    url = 'https://maoyan.com/board/4?'
    # 10 pages of 10 films each; pages after the first are selected with offset=10, 20, ...
    for i in range(0, 10):
        if i == 0:
            res = main(url)
        else:
            newUrl = url+'offset='+str(i*10)
            res.extend(main(newUrl))
    # test spider
    # for item in res:
    #     print(item)
    # test write to excel
    # res = [
    #     [1, 2, 3, 4, 9],
    #     [2, 3, 4, 5, 9],
    #     [4, 5, 6, 7, 9]
    # ]

    write_excel_xlsx(res)
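
To see what the pattern in parse_one_page captures without hitting the site, here is a minimal sketch that runs it against a small hand-written snippet. SAMPLE_HTML is hypothetical markup made up for illustration (the real Maoyan page may differ), and the pattern is written here with raw strings.

```python
import re

# The pattern from parse_one_page, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"'
    r'.*?data-src="(.*?)".*?</a>.*?star">\s*(.*?)\n\s*</p>.*?'
    r'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?'
    r'fraction">(.*?)</i>.*?</dd>', re.S)

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/0000" title="Example Film" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="Example Film" />
  </a>
  <p class="star">
    Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

items = PATTERN.findall(SAMPLE_HTML)

# Each item is a 7-tuple; the score arrives in two pieces.
rank, title, poster, cast, release, integer, fraction = items[0]
score = integer + fraction
print(rank, title, cast, release, score)
```

The score is split across two tags in this snippet, which is why the Excel-writing step joins the sixth and seventh fields into a single column.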

Current result

Afterword

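Now that I have learned the basics, I have found that when scraping data with regular expressions and the requests library, the key steps are analyzing the HTML page structure and constructing the regular expression. The rest of the work is just swapping out the URL.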
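
The "swapping out the URL" part can be sketched as a small helper that builds one URL per page. page_urls is a hypothetical name, and the page size of 10 and the offset query parameter are taken from the script in this post rather than from any documented API.

```python
from urllib.parse import urlencode

BASE_URL = 'https://maoyan.com/board/4'


def page_urls(base_url, pages=10, page_size=10):
    """Build one URL per page; the first page is requested without an offset."""
    urls = [base_url]
    for page in range(1, pages):
        query = urlencode({'offset': page * page_size})
        urls.append(base_url + '?' + query)
    return urls


urls = page_urls(BASE_URL)
print(urls[0])
print(urls[-1])
```

Scraping a different board would then, in principle, only need a different base URL and a regex matched to that page's HTML.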


An additional example: analyzing HTML to build a regex

Maoyan classic sci-fi films, sorted by rating

Inspecting the elements, we find that every item has the form <dd>****</dd>.

I want to get the movie title and the rating, so I first pull out the HTML code and take a look.

Try constructing a regex:

'.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<.*?fraction">(.*?)<.*?</dd>' (written offhand, not verified)
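
One way to check a regex like this before pointing it at the live site is to run it on a small saved snippet. The sketch below uses SAMPLE_HTML, a hypothetical fragment written to match the class names the regex looks for; it is not copied from the real page, so a pass here only shows that the pattern is internally consistent.

```python
import re

# The regex from the text, as a raw string; re.S lets '.' match newlines.
PATTERN = re.compile(
    r'.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<'
    r'.*?fraction">(.*?)<.*?</dd>', re.S)

# Hypothetical markup for illustration only.
SAMPLE_HTML = '''
<dd>
  <div class="channel-detail movie-item-title" title="Example Film A">
    <a href="/films/1">Example Film A</a>
  </div>
  <div class="channel-detail"><i class="integer">9.</i><i class="fraction">1</i></div>
</dd>
<dd>
  <div class="channel-detail movie-item-title" title="Example Film B">
    <a href="/films/2">Example Film B</a>
  </div>
  <div class="channel-detail"><i class="integer">8.</i><i class="fraction">7</i></div>
</dd>
'''

results = PATTERN.findall(SAMPLE_HTML)
for title, integer, fraction in results:
    print(title, integer + fraction)
```

Because `.*?` with re.S can cross tag boundaries, an item that lacked the integer or fraction tags would cause the match to run on into the following <dd>, so the pattern still needs to be checked against the real page.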


 

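References

[Bilibili video: 2018 latest Python 3.6 web scraping in practice] https://www.bilibili.com/video/av19057145/?p=14

[Maoyan Movies robots] https://maoyan.com/robots.txt (best to read it before scraping, to see which paths may be crawled and which are not allowed)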
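
The robots.txt check can also be done in code with the standard library's urllib.robotparser. In this minimal sketch, SAMPLE_ROBOTS is a made-up rule set for illustration and is not the real contents of https://maoyan.com/robots.txt; the /private/ path is likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the site's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# can_fetch(useragent, url) reports whether the rules allow that URL.
allowed = parser.can_fetch('Mozilla/5.0', 'https://maoyan.com/board/4')
blocked = parser.can_fetch('Mozilla/5.0', 'https://maoyan.com/private/page')
print(allowed, blocked)
```

To check against the live file instead, call parser.set_url('https://maoyan.com/robots.txt') followed by parser.read() in place of parser.parse(...).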
