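A small crawler for the recommended posts on the main strip of the Hupu NBA homepage, used to grab the daily Not-Cold Jokes (不冷笑話) post for fun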

Date: 2019-12-12



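Hupu is home to its many jrs, and Buxingjie (the "Pedestrian Street" board) is the busiest part of that home. Legend has it that the average jr holds a degree from a 985 university and that salaries on the street start at 300,000 a year.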

A roommate recommended Hupu to me in college, and I started following it, mainly for NBA news.

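Now and then I also wander onto the street to watch the jrs show off their relationships to the singles, chat about everyday family matters, and so on. Say what you will, the jrs have a very sound sense of right and wrong.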

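The Not-Cold Jokes thread is one I read almost every day. The poster seems very dedicated and puts out high-quality content daily, and the highlighted replies under the post deliver too, with plenty of goodies.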

I am learning Python at the moment, and it suddenly occurred to me to download all the images from Not-Cold Jokes.

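Hupu restricts this, though: without logging in you cannot view a user's posts, and for now I cannot be bothered to handle login authentication (mainly because I have not fully learned it yet -_-||).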

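After watching for a long time, I found that Not-Cold Jokes always sits at the same position on the homepage's main strip, which gave me the idea of going straight from the homepage to the post.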

No sooner said than done. After some analysis I finally got the program written. The crawler works as follows:

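1. Locate Not-Cold Jokes on the homepage and get its link and title.

2. Create a directory named after the title; if the directory already exists, the post has been downloaded and the program exits.

3. Open the Not-Cold Jokes page, collect the image links in the post body, and store them in a set.

4. Collect the image links in the highlighted replies (亮貼) and store them in a set.

5. Save the images, naming each one according to the argument passed in (body or comment) so that its source is clear.

6. Done.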
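Step 2 treats the directory itself as the "already downloaded" marker. A minimal sketch of that check in isolation: the helper name `ensure_new_dir`, the character filter, and the temporary base directory are assumptions for illustration and are not part of the original program. Post titles may contain characters that Windows forbids in folder names, so the sketch replaces them before creating the directory.

```python
import os
import re
import tempfile

def ensure_new_dir(base, title):
    """Create base/<title>; return the path, or None if it already exists.

    Hypothetical helper: the original script returns 0 instead of None
    and does not filter the title.
    """
    # Replace characters that are not allowed in Windows file names
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    path = os.path.join(base, safe_title)
    if os.path.exists(path):
        return None          # treated as "already downloaded"
    os.makedirs(path)
    return path

if __name__ == "__main__":
    base = tempfile.mkdtemp()               # throwaway directory for the demo
    first = ensure_new_dir(base, 'jokes: day 1?')
    second = ensure_new_dir(base, 'jokes: day 1?')
    print(first is not None, second is None)
```

The second call finds the directory left by the first and returns None, which is the signal the crawler uses to stop early.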

#-*- coding: utf-8 -*-
import os
import requests
from bs4 import BeautifulSoup

#Hupu NBA homepage
url = 'https://nba.hupu.com/'


#Locate Not-Cold Jokes on the homepage; return its url and title
def get_buleng_title_url(url):
    index_html = requests.get(url)
    index_html_s = BeautifulSoup(index_html.text,'lxml')
    main_street = index_html_s.find(class_ = 'gray-list main-stem max250')
    url_list = []
    url_name_list = []
    #Only the first five <dd> entries of the main strip are needed
    for dd in main_street.find_all('dd',limit = 5):
        url_list.append(dd.a.get('href'))
        url_name_list.append(dd.a.get_text())
    #The post is expected at the fifth position
    return [url_list[4],url_name_list[4]]

#Collect the image URLs in the post body; a set removes duplicates
def get_pic_url(buleng_list):
    pic_url_list = set()
    buleng_html = requests.get(buleng_list[0])
    buleng_html_s = BeautifulSoup(buleng_html.text,'lxml')
    buleng_content = buleng_html_s.find(class_='quote-content')
    for pic_url in buleng_content.find_all('img'):
        #Prefer data-original and drop its query string; fall back to src
        original_url = pic_url.get('data-original')
        if original_url:
            pic_url_list.add(original_url.split('?')[0])
        elif pic_url.get('src'):
            pic_url_list.add(pic_url.get('src'))
    return pic_url_list

#Create a folder named after the title; return its path, or 0 if it already exists
def makedir(buleng_list):
    path = ('E:\\pic\\%s' % buleng_list[1])
    if os.path.exists(path):
        return 0
    else:
        os.makedirs(path)
        return path

#Collect the image URLs in the highlighted replies; a set removes duplicates
def get_comment_pic_url(buleng_list):
    comment_pic_url_list = set()
    buleng_html = requests.get(buleng_list[0])
    buleng_html_s = BeautifulSoup(buleng_html.text,'lxml')
    buleng_comment = buleng_html_s.find(id='readfloor')
    for floor in buleng_comment.find_all('table'):
        for pic_url in floor.find_all('img'):
            #Prefer data-original and drop its query string; fall back to src
            original_url = pic_url.get('data-original')
            if original_url:
                comment_pic_url_list.add(original_url.split('?')[0])
            elif pic_url.get('src'):
                comment_pic_url_list.add(pic_url.get('src'))
    return comment_pic_url_list


#Download the images; gif, jpg and png are supported, other formats are skipped
def download_pic(pic_url_list,path,pic_from = 'body'):
    a = 1
    for url in pic_url_list :
        #The URL suffix decides the file extension
        for ext in ('.gif','.jpg','.png'):
            if url.endswith(ext):
                pic = requests.get(url)
                #The with block closes the file, so no explicit close is needed
                with open((path+('\\%s-%s%s' % (pic_from,a,ext))),'wb') as f:
                    f.write(pic.content)
                print('Downloaded one %s %s image' % (pic_from,ext[1:]))
                a += 1
                break

if __name__ == "__main__":
    buleng = get_buleng_title_url(url)
    path = makedir(buleng)
    if path != 0:
        pic_url_list = get_pic_url(buleng)
        comment_pic_url_list = get_comment_pic_url(buleng)
        download_pic(pic_url_list,path)
        download_pic(comment_pic_url_list,path,'comment')
    else:
        print('Directory already exists; waiting for Hupu to update')
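
The script strips everything after `?` from each image address before adding it to a set, and later picks the file suffix from the end of the URL. Both ideas can be sketched with the standard library alone, without touching the network. The function name `normalize` and the sample addresses are made up for illustration, and the query strings are only a guess at what a resized-image URL might look like.

```python
import os
from urllib.parse import urlsplit, urlunsplit

ALLOWED = ('.gif', '.jpg', '.png')   # formats the script downloads

def normalize(url):
    """Drop the query string and fragment; return (clean_url, ext or None)."""
    parts = urlsplit(url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    ext = os.path.splitext(parts.path)[1].lower()
    return clean, (ext if ext in ALLOWED else None)

if __name__ == "__main__":
    # Hypothetical addresses: two sizes of one image plus an unsupported format
    samples = [
        'https://img.example.com/a/1.jpg?w=800',
        'https://img.example.com/a/1.jpg?w=400',
        'https://img.example.com/a/2.webp',
    ]
    unique = {normalize(u) for u in samples}
    for clean, ext in sorted(unique, key=lambda t: t[0]):
        print(clean, ext)
```

Because the two `1.jpg` addresses differ only in their query strings, they collapse into one entry once normalized, which is why the set ends up holding each picture only once.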