I Used Python to Scrape 100 GB of Photo Sets from Mzitu

Date: 2020-02-04

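Preface

I have recently been building supporting infrastructure for monitoring and found that many of the scripts involved are written in Python. I had heard of it long ago: "Life is short, I use Python" is no joke. With the rise of artificial intelligence, machine learning, and deep learning, most AI code available today is written in Python. So in the AI era, it is time to learn some Python.

Getting Started

For readers with no programming experience in any language, I recommend learning systematically from the beginning; books, videos, and written tutorials all work.

If you already have experience in other languages, I recommend starting from a concrete case, such as scraping the photo sets of some website.

Languages have a lot in common, so once you have a feel for syntax you can read most code and get it roughly right.

For that reason I do not recommend that experienced developers start from scratch. Whether through videos or books, that wastes too much time when picking up a new language.

Of course, once you go deeper you still need to study systematically, but that is for later.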

Software Tools

Python3

Python 3.7.1 is used here, the latest version at the time of writing.

Recommended installation tutorial:

http://www.runoob.com/python3/python3-install.html

Windows download:

https://www.python.org/downloads/windows

Linux download:

https://www.python.org/downloads/source

PyCharm

Visual development tool:

http://www.jetbrains.com/pycharm

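Case Study

Implementation Steps

Taking Mzitu as the example, it is quite simple and takes the following four steps:

Get the number of pages on the home page and create a folder for each page number
Get the addresses of the photo sets listed on each page
Enter each set and get its page count (each set contains multiple pictures, shown one per page)
Find the picture in the corresponding tag on each page of the set and download it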
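The first step can be sketched without touching the network. The block below is a minimal illustration, assuming the URL scheme used by the full script later in this article (the home page for page 1, `page/<n>` for the rest). The helper names `page_url` and `make_page_dirs` and the temporary directory are made up for this example.

```python
import os
import tempfile

BASE_URL = 'http://www.mzitu.com/'  # site address used by the script in this article


def page_url(base, page_no):
    """Page 1 is the home page itself; later pages live under page/<n>."""
    return base if page_no == 1 else base + 'page/' + str(page_no)


def make_page_dirs(root, total_pages):
    """Create one folder per listing page and return the folder paths."""
    paths = []
    for page_no in range(1, total_pages + 1):
        path = os.path.join(root, str(page_no))
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


if __name__ == '__main__':
    root = tempfile.mkdtemp()  # hypothetical scratch directory for the demo
    for page_no, path in enumerate(make_page_dirs(root, 3), start=1):
        print(page_url(BASE_URL, page_no), '->', path)
```

In the real script the total page count comes from the site's pagination links rather than a hard-coded 3.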

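Notes

While crawling, keep the following points in mind; they may help:

1) Importing libraries. These are similar to frameworks or utility classes in Java, with the low-level details already encapsulated.

Install the third-party libraries

# On Windows, with python3 installed directly
pip install bs4
pip install requests
# On Linux, where python2 and python3 coexist
pip3 install bs4
pip3 install requests

Import the third-party libraries

# Import the requests library
import requests
# Import the file-operations library
import os
# bs4 (BeautifulSoup) is one of the most commonly used libraries for writing Python crawlers; it is mainly used to parse HTML tags.
import bs4
from bs4 import BeautifulSoup
# Base library
import sys
# Intended to work around Chinese encoding problems (a habit carried over from Python 2;
# Python 3 strings are already Unicode, so this reload is generally unnecessary)
import importlib
importlib.reload(sys)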

2) Define functions. A crawler can run to several hundred lines, so try not to write it as one big lump.

def download(page_no, file_path):
    # Code logic goes here
    pass

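3) Define global variables

# Give the request a header that mimics the Chrome browser.
# A name assigned at module level is already global, so no `global` statement is needed here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

# Before assigning to it inside a function, declare it with `global` to tell the interpreter
# that the headers used in this function is the global variable defined above, not a local variable.
def download(page_no, file_path):
    global headers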

4) Hotlink protection

Some websites use hotlink protection. Python handles this by sending a Referer header with the request:

headers = {'Referer': href}
img = requests.get(url, headers=headers)
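One thing to watch: if `headers` is also the global that carries the User-Agent, reassigning it as above drops the User-Agent from later requests. A possible way around this, shown as a sketch only, is to build a separate dictionary for each image request. The helper name `image_headers` and the example URLs are hypothetical.

```python
BASE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}


def image_headers(base_headers, referer):
    """Return a copy of base_headers with a Referer added; the original is not modified."""
    merged = dict(base_headers)
    merged['Referer'] = referer
    return merged


# Hypothetical set page address, used only to show the result
hdrs = image_headers(BASE_HEADERS, 'http://www.mzitu.com/12345')
print(sorted(hdrs))            # ['Referer', 'User-Agent']
print('Referer' in BASE_HEADERS)  # False

# With requests installed, the merged dict would be passed along like this:
#   img = requests.get(url, headers=hdrs)
```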

5) Switching versions

The Linux server is an Alibaba Cloud server whose default version is python2; python3 was installed separately.

[root@AY140216131049Z mzitu]# python2 -V
Python 2.7.5
[root@AY140216131049Z mzitu]# python3 -V
Python 3.7.1
# Default version
[root@AY140216131049Z mzitu]# python -V
Python 2.7.5
# Temporarily switch versions <whereis python>
[root@AY140216131049Z mzitu]# alias python='/usr/local/bin/python3.7'
[root@AY140216131049Z mzitu]# python -V
Python 3.7.1
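Because plain `python` on such a server may still start Python 2, a script can also check its own interpreter at startup. The following is an optional sketch, not part of the original script; the function name `running_python3` is made up here.

```python
import sys


def running_python3():
    """Return True when the current interpreter is Python 3 or newer."""
    return sys.version_info[0] >= 3


if not running_python3():
    # Stop early with a readable message instead of failing later on
    sys.exit('This script needs Python 3; try running it with python3')

print('Running on Python %d.%d' % (sys.version_info[0], sys.version_info[1]))
```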

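6) Exception handling

Some pages may raise errors during crawling; here we catch the exception so that later operations are not affected.

try:
    # Business logic
    pass
except Exception as e:
    print(e)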
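As an illustration of why the `try` belongs inside the loop, the sketch below processes a list item by item and records failures instead of stopping. `process_all` and the sample data are hypothetical; in the crawler, the handler would be the code that fetches and saves one photo set.

```python
def process_all(items, handler):
    """Apply handler to every item; a failing item is recorded and skipped."""
    results = []
    failures = []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as e:
            # Remember what failed and carry on with the next item
            failures.append((item, e))
            print('skipped', repr(item), '-', e)
    return results, failures


# int() raises ValueError for 'oops', but the loop still reaches '3'
results, failures = process_all(['1', 'oops', '3'], int)
print(results)        # [1, 3]
print(len(failures))  # 1
```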

Code Implementation

Edit the script: vi mzitu.py

#!/usr/bin/env python3
# coding=utf-8
# Import the requests library
import requests
# Import the file-operations library
import os
import bs4
from bs4 import BeautifulSoup
import sys
import importlib
importlib.reload(sys)

# Give the request a header that mimics the Chrome browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
# Site to crawl
mziTu = 'http://www.mzitu.com/'
# Storage location
save_path = '/mnt/data/mzitu'


# Create a folder
def createFile(file_path):
    if not os.path.exists(file_path):
        os.makedirs(file_path)
    # Switch into the folder created above
    os.chdir(file_path)


# Download files
def download(page_no, file_path):
    res_sub = requests.get(page_no, headers=headers)
    # Parse the HTML
    soup_sub = BeautifulSoup(res_sub.text, 'html.parser')
    # Get the addresses of the photo sets on the page
    all_a = soup_sub.find('div', class_='postlist').find_all('a', target='_blank')
    count = 0
    for a in all_a:
        count = count + 1
        # Only every second link is processed
        if (count % 2) == 0:
            print("Link number on this page: " + str(count))
            # Extract href
            href = a.attrs['href']
            print("Set address: " + href)
            res_sub_1 = requests.get(href, headers=headers)
            soup_sub_1 = BeautifulSoup(res_sub_1.text, 'html.parser')
            # ------ Exception handling is advisable here ------
            try:
                # Get the number of pictures in the set
                pic_max = soup_sub_1.find('div', class_='pagenavi').find_all('span')[6].text
                print("Pictures in set: " + pic_max)
                for j in range(1, int(pic_max) + 1):
                    # j is an int and must be converted to a string
                    href_sub = href + "/" + str(j)
                    print(href_sub)
                    res_sub_2 = requests.get(href_sub, headers=headers)
                    soup_sub_2 = BeautifulSoup(res_sub_2.text, "html.parser")
                    img = soup_sub_2.find('div', class_='main-image').find('img')
                    if isinstance(img, bs4.element.Tag):
                        # Extract src
                        url = img.attrs['src']
                        file_name = url.split('/')[-1]
                        # Hotlink protection: add a Referer. A separate dict is used so that
                        # the global headers (and their User-Agent) are not overwritten.
                        img_headers = dict(headers, Referer=href)
                        img = requests.get(url, headers=img_headers)
                        # Save the picture; 'wb' avoids appending to a file left by an earlier run
                        with open(file_name, 'wb') as f:
                            f.write(img.content)
            except Exception as e:
                print(e)


# Main method
def main():
    res = requests.get(mziTu, headers=headers)
    # Parse with the built-in html.parser
    soup = BeautifulSoup(res.text, 'html.parser')
    # Create the folder
    createFile(save_path)
    # Get the total number of pages from the home page
    img_max = soup.find('div', class_='nav-links').find_all('a')[3].text
    # print("Total pages: " + img_max)
    for i in range(1, int(img_max) + 1):
        # Get the URL of each page
        if i == 1:
            page = mziTu
        else:
            page = mziTu + 'page/' + str(i)
        file = save_path + '/' + str(i)
        createFile(file)
        # Download the pictures on each page
        print("Listing page: " + page)
        download(page, file)


if __name__ == '__main__':
    main()
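The script derives the file name by taking the last `/`-separated piece of the image URL. That works for plain paths. As a slightly more defensive variant, offered only as a sketch, the standard library can parse the URL first so that a query string does not end up in the file name. The function name `file_name_from_url` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(url):
    """Return the last path component of a URL, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(url).path)


print(file_name_from_url('http://example.com/2018/11/01a01.jpg'))      # 01a01.jpg
print(file_name_from_url('http://example.com/2018/11/01a01.jpg?v=2'))  # 01a01.jpg
```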

The script runs on a Linux server; execute the following command:

python3 mzitu.py
# Or run it in the background
nohup python3 -u mzitu.py > mzitu.log 2>&1 &

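So far only one category's photo sets have been crawled: 17 GB in total, 5332 pictures.

[root@itstyle mzitu]# du -sh
17G     .
[root@itstyle mzitu]# ll -s
total 5332

Now, keep your eyes open: the exciting moment of the photo sets has arrived.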

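Summary

As a beginner, my script certainly has some problems or room for optimization. If any Python experts come across it, advice is welcome.

The script is actually simple. From configuring the environment and installing the IDE to writing the script and getting it to run took about four or five hours, and in the end the script just plods straight through. Limited by the server's bandwidth and configuration, the 17 GB of pictures took about three or four hours to download. As for the remaining 83 GB, readers can download it themselves.

Learn Python together, example code: https://gitee.com/52itstyle/Python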
