Python Web Scraper

Date: 2019-11-18

Tags: python, web scraping. Category: Python


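1. Topic

This project simply scrapes the campus news of Guangdong Industry Polytechnic (廣東輕工職業技術學院) and turns the scraped text into a word cloud for analysis.

2. Implementation

1. On the college's official website, open the campus news section and click on any article. Inspecting the page with the browser's developer tools (F12) shows where the title, publication time and link are located. These are stored in a dictionary, news{}, and the article body is written to content.txt.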

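# Fetch the details of a single news article
def getNewDetails(newsUrl):
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soup = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['標題'] = soup.select('div > h3')[0].text  # title
    a = news['標題']
    # The .title block holds the title, then the '發佈時間' label, then the date.
    # Remove those two pieces explicitly: str.lstrip strips a character set, not a prefix.
    header = soup.select('.title')[0].text
    news['發佈時間'] = header.replace(a, '', 1).replace('發佈時間', '', 1).strip()  # publication time
    news['連接'] = newsUrl  # link
    content = soup.find_all('div', class_='content')[0].text
    writeNewsDetail(content)
    return news

# Append the article body to content.txt
def writeNewsDetail(content):
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content)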
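One detail in the publication-time extraction is worth checking: `str.lstrip` takes a set of characters rather than a prefix, so it keeps stripping for as long as the leading characters belong to that set. Below is a minimal sketch of the difference. The helper `strip_prefix` and the sample title and date are made up for illustration and are not taken from the site.

```python
def strip_prefix(text, prefix):
    """Remove `prefix` from the start of `text` only if it is really there."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# Hypothetical header text: title, then a label, then the date.
title = "2019 annual report"
label = "Published: "
header = title + label + "2019-11-18"

# lstrip treats its argument as a set of characters to drop,
# so the leading digits of the date are eaten as well.
by_charset = header.lstrip(title + label)
# strip_prefix removes exactly the given prefix and nothing else.
by_prefix = strip_prefix(header, title + label)

print(repr(by_charset))  # '-11-18'
print(repr(by_prefix))   # '2019-11-18'
```

On Python 3.9 and later, `str.removeprefix` does the same job as the helper.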

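2. Once the details of one article can be fetched, every article on a list page can be fetched the same way. How do we get the URLs of all the articles on the page? Same approach as before: press F12 to open the developer tools and find the tag that holds the article links.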

# Collect the details of every news article linked from one list page
def getListPage(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # The article links sit inside <div class="mainl">
    soup = soup.find('div', class_='mainl')
    newslist = []
    for new in soup.select('ul > li > a'):
        newsUrl = new.attrs['href']
        newslist.append(getNewDetails(newsUrl))
    return newslist
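The function above passes each `href` straight to `requests.get`, which works only if the site emits absolute URLs. If a list page ever contained relative links, they could be resolved against the list page URL first. A small sketch using the standard library follows; the three `href` values are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

# List page URL from the article; the href values below are hypothetical.
list_page = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index.html'
hrefs = [
    'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/a.html',  # already absolute
    '/viscms/xiaoyuanxinwen2538/b.html',                        # root-relative
    'c.html',                                                   # relative to the list page
]

# urljoin leaves absolute URLs untouched and resolves the others.
absolute = [urljoin(list_page, h) for h in hrefs]
for url in absolute:
    print(url)
```

All three results share the `http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/` base, so the rest of the scraper would not need to care which form the page used.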

3. Once one list page works, the remaining pages can be handled the same way. The college's campus news has only 54 pages, so I scraped all of them.

# The first list page uses a different URL pattern from the other pages,
# so collect its articles into newsTotal first
newsurl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index.html'
newsTotal = []
newsTotal.extend(getListPage(newsurl))
# Then loop over the remaining pages (2 to 54)
for i in range(2,55):
    listPageUrl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index_{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))
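The URL pattern used above (`index.html` for page 1, `index_N.html` for pages 2 to 54) can also be captured in one helper, which makes it easy to check the page list before sending any requests. This is only a sketch; `BASE` and `list_page_urls` are names introduced here for illustration.

```python
BASE = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/'

def list_page_urls(total_pages):
    """Return the list-page URLs: index.html first, then index_2.html onward."""
    urls = [BASE + 'index.html']
    urls += [BASE + 'index_{}.html'.format(i) for i in range(2, total_pages + 1)]
    return urls

urls = list_page_urls(54)
print(len(urls))   # 54
print(urls[0])     # ends with index.html
print(urls[-1])    # ends with index_54.html
```

Looping over `urls` and calling `getListPage` on each entry would then replace the separate first-page call and the `range(2,55)` loop.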

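4. Save the news details collected so far (title, publication time and link) to an Excel sheet.

# Save the collected news details to the Excel file news.xlsx
# (the encoding argument is dropped: it was removed from DataFrame.to_excel in pandas 2.0)
df = pandas.DataFrame(newsTotal)
df.to_excel('news.xlsx')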

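5. At this point all the information has been scraped. Next, segment the text with jieba and generate the word cloud.

# Read the scraped article text
with codecs.open('content.txt', 'r', 'utf-8') as file:
    word = file.read()
# Use the given image as the mask that shapes the word cloud
image = np.array(Image.open('tree.jpg'))
# Font for the word cloud (it must contain Chinese glyphs)
font = r'C:\Windows\Fonts\simhei.ttf'
# Strip ASCII letters, digits and common punctuation, keeping the Chinese text
resultword = re.sub(r"[A-Za-z0-9\[\`\~\!\@\#\$\^\&\*\(\)\=\|\{\}\'\:\;\'\,\[\]\.\<\>\/\?\~\！\@\#\\\&\*\%]", "", word)
# Segment the text with jieba in full mode
wordlist_after_jieba = jieba.cut(resultword, cut_all=True)
# WordCloud expects space-separated tokens
wl_space_split = " ".join(wordlist_after_jieba)
print(wl_space_split)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white', max_words=100, max_font_size=100, random_state=50).generate(wl_space_split)
# Build a color function from the image (used only if the recolor line below is enabled)
image_colors = ImageColorGenerator(image)
# my_wordcloud.recolor(color_func=image_colors)
# Display the generated word cloud
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
# Save the image. plt.show() blocks, so this runs only after the window is closed,
# and nothing is saved if the program is interrupted first
my_wordcloud.to_file('result.jpg')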
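The cleaning step above lists the characters to delete one by one. An alternative, shown here as a sketch rather than as the author's method, is to keep only runs of Chinese characters and drop everything else. The second helper counts token frequencies, which is roughly what the word cloud visualizes. The function names and sample strings are made up for illustration, and the token list is a hand-written stand-in for `jieba.cut` output.

```python
import re
from collections import Counter

def keep_chinese(text):
    """Keep runs of CJK Unified Ideographs (U+4E00..U+9FFF), joined by spaces."""
    return ' '.join(re.findall(r'[\u4e00-\u9fff]+', text))

def top_words(tokens, n=3):
    """Count tokens longer than one character and return the n most common."""
    counts = Counter(t for t in tokens if len(t) > 1)
    return counts.most_common(n)

sample = 'News2019: 校園新聞, abc 設計學院!'
cleaned = keep_chinese(sample)
print(cleaned)  # 校園新聞 設計學院

# Hand-made stand-in for the tokens jieba.cut would produce.
tokens = ['校園', '新聞', '設計', '學院', '新聞', '的']
print(top_words(tokens, 1))  # [('新聞', 2)]
```

The allow-list version does not need updating when a new punctuation mark shows up in the text, at the cost of also dropping any Latin words that might be worth keeping.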

The generated word cloud image (you can use any image you like to shape the word cloud).

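3. Problems encountered and solutions

Problem: installing wordcloud failed (I did not take a screenshot of the error at the time).

Solution: After searching online, I found the cause: the Python on my machine was a 32-bit build, while the wordcloud package I had tried to install was a 64-bit build. Downloading and installing the 32-bit build of wordcloud fixed the error.

Steps: 1. Download wordcloud-1.4.1-cp36-cp36m-win32.whl.

2. Open a command prompt and run pip install wordcloud-1.4.1-cp36-cp36m-win32.whl and pip install wordcloud to install it.

3. Once the installation succeeds, add the dependency in PyCharm.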
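Before downloading a wheel, it helps to confirm whether the interpreter itself is a 32-bit or 64-bit build, since the wheel has to match the interpreter and not the operating system. A quick check using only the standard library (the function name `interpreter_bits` is introduced here for illustration):

```python
import platform
import struct
import sys

def interpreter_bits():
    """Pointer size in bits for the running interpreter (32 or 64)."""
    return struct.calcsize('P') * 8

bits = interpreter_bits()
print('Python', platform.python_version(), '-', bits, 'bit')
# sys.maxsize exceeds 2**32 only on 64-bit builds.
print('64-bit build:', sys.maxsize > 2**32)
```

A 32-bit result on Windows corresponds to the `win32` tag in the wheel filename, and `cp36` corresponds to Python 3.6.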

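4. Data analysis and conclusion

The data shows that the campus news mainly reports on the college's individual schools (the School of Design most of all), Party committee work, and vocational and technical education.

5. Complete code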

import requests
from bs4 import BeautifulSoup
import pandas
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import codecs
import numpy as np
from PIL import Image
import re

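# Append the article body to content.txt
def writeNewsDetail(content):
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content)

# Fetch the details of a single news article
def getNewDetails(newsUrl):
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soup = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['標題'] = soup.select('div > h3')[0].text  # title
    a = news['標題']
    # The .title block holds the title, then the '發佈時間' label, then the date.
    # Remove those two pieces explicitly: str.lstrip strips a character set, not a prefix.
    header = soup.select('.title')[0].text
    news['發佈時間'] = header.replace(a, '', 1).replace('發佈時間', '', 1).strip()  # publication time
    news['連接'] = newsUrl  # link
    content = soup.find_all('div', class_='content')[0].text
    writeNewsDetail(content)
    return news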

# Collect the details of every news article linked from one list page
def getListPage(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # The article links sit inside <div class="mainl">
    soup = soup.find('div', class_='mainl')
    newslist = []
    for new in soup.select('ul > li > a'):
        newsUrl = new.attrs['href']
        newslist.append(getNewDetails(newsUrl))
    return newslist


# The first list page uses a different URL pattern from the other pages,
# so collect its articles into newsTotal first
newsurl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index.html'
newsTotal = []
newsTotal.extend(getListPage(newsurl))
# Then loop over the remaining pages (2 to 54)
for i in range(2,55):
    listPageUrl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index_{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))

for news in newsTotal:
    print(news)

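# Save the collected news details to the Excel file news.xlsx
# (the encoding argument is dropped: it was removed from DataFrame.to_excel in pandas 2.0)
df = pandas.DataFrame(newsTotal)
df.to_excel('news.xlsx')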

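# Read the scraped article text
with codecs.open('content.txt', 'r', 'utf-8') as file:
    word = file.read()
# Use the given image as the mask that shapes the word cloud
image = np.array(Image.open('tree.jpg'))
# Font for the word cloud (it must contain Chinese glyphs)
font = r'C:\Windows\Fonts\simhei.ttf'
# Strip ASCII letters, digits and common punctuation, keeping the Chinese text
resultword = re.sub(r"[A-Za-z0-9\[\`\~\!\@\#\$\^\&\*\(\)\=\|\{\}\'\:\;\'\,\[\]\.\<\>\/\?\~\！\@\#\\\&\*\%]", "", word)
# Segment the text with jieba in full mode
wordlist_after_jieba = jieba.cut(resultword, cut_all=True)
# WordCloud expects space-separated tokens
wl_space_split = " ".join(wordlist_after_jieba)
print(wl_space_split)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white', max_words=100, max_font_size=100, random_state=50).generate(wl_space_split)
# Build a color function from the image (used only if the recolor line below is enabled)
image_colors = ImageColorGenerator(image)
# my_wordcloud.recolor(color_func=image_colors)
# Display the generated word cloud
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
# Save the image. plt.show() blocks, so this runs only after the window is closed,
# and nothing is saved if the program is interrupted first
my_wordcloud.to_file('result.jpg')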

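6. Personal thoughts

I originally wanted to scrape the esports news section of the official League of Legends website, but I ran into many problems along the way. For example, the URL of an individual article did not match the URL in the list page's tags. Even after the URL issue was solved, selectors that looked correct either returned the wrong information or found nothing at all. In the end I had to give up and scrape the campus news of Guangdong Industry Polytechnic instead. Both attempts ran into plenty of problems, but watching the scraped records appear one by one still gave me a sense of achievement, and I learned a lot along the way, such as how to generate a word cloud. All in all, every attempt teaches something, and I still need more practice.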
