Scraping Tang and Song Poetry to Generate a Word Cloud

Date: 2019-11-06

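Scraping poetry with concurrent Python threads, then analyzing it

Topics covered in this section:

1. Crawler concepts in five minutes

2. BeautifulSoup matching rules

3. Using wordcloud in detail

Hands-on: scrape classical Chinese Tang and Song poetry and find out which words the literati used most often!

1. Crawlers in five minutes

A crawler (spider) is a program that uses a script in place of a browser to send requests to a server and fetch its resources. Typical uses:

Data collection (data analysis, artificial intelligence)

Simulated operation (testing, data gathering)

API operation (automation)

How a crawler works:

At bottom, a crawler simulates web requests. Whichever framework you learn, you need some understanding of HTTP requests and responses:

Take a quick look at the figure (HTTP request and response).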
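As a rough illustration of the request half of that exchange, the sketch below builds a request with the standard library and prints the pieces a server would see, without sending anything over the network. The URL and the User-Agent string are placeholders chosen for this example, not values from the article.

```python
from urllib.request import Request

# Build (but do not send) the kind of request a browser would make.
# The URL and User-Agent below are placeholders for illustration only.
req = Request(
    "http://example.com/index.html",
    headers={"User-Agent": "Mozilla/5.0 (demo crawler)"},
    method="GET",
)

# An HTTP request starts with a request line: method, path, protocol version.
request_line = f"{req.get_method()} {req.selector} HTTP/1.1"
print(request_line)

# It is followed by headers; Host names the server, User-Agent names the client.
print(f"Host: {req.host}")
for name, value in req.header_items():
    print(f"{name}: {value}")
```

The response that comes back has the same shape: a status line, headers, and then the body (the HTML a crawler goes on to parse).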

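2. BeautifulSoup matching rules

If a regular expression is even slightly off, the program may fail to match or appear to hang, and some readers are not yet fluent in writing regular expressions. No matter: there is a more convenient tool called Beautiful Soup, which makes it easy to extract the content of HTML or XML tags. In this section, let's get a feel for what Beautiful Soup can do.

What is Beautiful Soup

Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages.

The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands the user the data to be extracted; because it is simple, a complete application takes very little code.

2.1 Installing bs4

Environment: PyCharm 2017.2.3 + Python 3.5.0

pip install bs4

The examples below use the lxml parser, which is installed separately with pip install lxml.

First import the bs4 library and create a BeautifulSoup object: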

from bs4 import BeautifulSoup as BS

text = '''
<html>
<head>
    <meta charset='UTF-8'>
    <title id='1' href='http://example.com/elsie' class='title'>Test</title>
</head>
<body>
    <div class='ok'>
        <div class='nice'>
            <p class='p'>
                Hello World
            </p>
            <p class='e'>
                風一般的男人
            </p>
        </div>
    </div>
</body>
</html>
'''

soup = BS(text, "lxml")  # first argument: the markup to parse; second: the parser to use
print(soup.prettify())  # prettify() returns the parse tree as an indented string
print(type(soup.prettify()))  # <class 'str'>
print(type(soup))  # a BeautifulSoup object

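2.2 Searching the document tree

find() and find_all()

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter.

The difference between find() and find_all() is that find() returns the first matching element directly (or None if nothing matches), while find_all() returns a list of every matching element.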
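A minimal sketch of that difference, using a small made-up document and the built-in "html.parser" parser (an assumption made here so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

# A small made-up document for illustration.
html = "<div><p class='p'>Hello World</p><p class='e'>second</p></div>"
soup = BeautifulSoup(html, "html.parser")  # built-in parser, so lxml is not required

first = soup.find("p")                # a single Tag: the first match
every = soup.find_all("p")            # a list-like ResultSet of all matches
missing = soup.find("table")          # None when nothing matches
none_found = soup.find_all("table")   # an empty result when nothing matches

print(first.get_text())
print(len(every))
print(missing, list(none_found))
```

Because find() can return None, it is worth checking the result before reading attributes or text from it.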

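find_all( name , attrs , recursive , text , **kwargs ): a brief look at the parameters (newer versions of Beautiful Soup call the text argument string)

The name argument finds every tag whose name is name; text strings in the document are ignored automatically. name can be a string, a regular expression, a list, True, or a custom function, and each of these means something different.

String: when you pass a string to a search method, Beautiful Soup looks for tags whose name matches that string exactly.

print(soup.find('body'))
print(soup.find_all('body'))

When the match succeeds, find_all() returns every matching tag.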
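The other kinds of name filter can be sketched the same way. The document below is made up for this example, and "html.parser" is used only so the snippet does not depend on lxml; the helper is_leaf is a hypothetical name.

```python
import re
from bs4 import BeautifulSoup

html = ("<html><head><title>Test</title></head>"
        "<body><div><p>one</p><b>two</b></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Regular expression: tag names that start with "b".
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# List: tags whose name is any item in the list.
by_list = [t.name for t in soup.find_all(["title", "p"])]

# True: every tag, but no text strings.
by_true = [t.name for t in soup.find_all(True)]

# Function: called with each tag; keep the tag when it returns True.
def is_leaf(tag):
    """Hypothetical helper: True for tags that contain no child tags."""
    return tag.find(True) is None

by_func = [t.name for t in soup.find_all(is_leaf)]

print(by_regex, by_list, by_true, by_func, sep="\n")
```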

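If a keyword argument's name is not one of the built-in parameter names, the search treats it as a filter on the tag attribute of that name, for example id=1.

If you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute;

if you pass href, Beautiful Soup filters on each tag's "href" attribute;

passing several keyword arguments filters on several attributes at once.

Because class is a reserved word in Python, search by CSS class with class_.

# Returns the tags with class='p'.

print(soup.find_all('p',class_='p'))

Some attribute names cannot be used as keyword arguments (for example, HTML5 data-* attributes); search for those with the attrs argument.

# Returns the tags with class 'e'

print(soup.find_all(attrs={'class':'e'}))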
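The keyword filters described above can be tried on a small document. This is a sketch with made-up markup (the ids, URLs, and the data-kind attribute are placeholders), parsed with "html.parser" so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = """
<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>
<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>
<div data-kind="poem">text</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link1")                          # filter on the id attribute
by_href = soup.find_all(href="http://example.com/lacie")   # filter on the href attribute
by_both = soup.find_all("a", id="link2", class_="sister")  # several filters at once

# data-kind is not a valid Python keyword argument name, so it goes through attrs.
by_attrs = soup.find_all(attrs={"data-kind": "poem"})

print(by_id[0].get_text(), by_href[0]["id"], by_both[0].get_text(), by_attrs[0].name)
```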

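3. Using wordcloud in detail

As the English name suggests, wordcloud draws a word cloud: it takes words as the basic unit and presents our content in a more intuitive, visual way.

Installing wordcloud

pip install wordcloud

While you are at it, install jieba as well: pip install jieba

1. Basic usage

# Import the word cloud class
from wordcloud import WordCloud

# Open the file and read all of it (assuming it is UTF-8 encoded)
with open('1.txt', 'r', encoding='utf-8') as fh:
    f = fh.read()

# Create a WordCloud instance, passing whatever parameters you need;
# generate() builds the word cloud from the text
wc = WordCloud(
    background_color='white',
    width=500,
    height=366,
    margin=2
).generate(f)

# to_file() writes the image to a file (the ./image directory must already exist)
wc.to_file('./image/0.jpg')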

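Hands-on: scrape classical Chinese Tang and Song poetry and find out which words the literati used most often!

Step 1: download the Tang and Song poems

Step 2: save the data locally

Step 3: segment the text with jieba

Step 4: generate a word cloud for a simple analysis

The code is as follows:

Download the poems and save them locally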

# -*- coding: utf-8 -*-
# @Time    : 2019/2/25 10:23
# @Author : for
# @File    : test01.py
# @Software: PyCharm
import re
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
# Index pages that list the poems
urls = ['https://so.gushiwen.org/gushi/tangshi.aspx',
        'https://so.gushiwen.org/gushi/sanbai.aspx',
        'https://so.gushiwen.org/gushi/songsan.aspx',
        'https://so.gushiwen.org/gushi/songci.aspx'
        ]
# Collect the URL of every individual poem
poem_links = []
for url in urls:
    # Request headers with a random User-Agent
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    req = requests.get(url, headers=headers)
    # Parse the response text into a BeautifulSoup tree
    soup = BeautifulSoup(req.text, "lxml")
    # Take the first element with class="sons"
    content = soup.find_all('div', class_="sons")[0]
    # Get every <a> tag under content
    links = content.find_all('a')
    print(links)
    # Loop over the links and build absolute URLs
    for link in links:
        poem_links.append('https://so.gushiwen.org'+link['href'])

poem_list = []
def get_poem(url):
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "lxml")
    node = soup.find('div', class_='contson')
    if node is None:  # no poem body on this page: skip it instead of raising
        return
    poem = node.text.strip()
    poem = poem.replace(' ', '')
    # Strip bracketed annotations: half-width, full-width, then mixed brackets after a full stop
    poem = re.sub(r"\([\s\S]*?\)", '', poem)
    poem = re.sub(r"（[\s\S]*?）", '', poem)
    poem = re.sub(r"。\([\s\S]*?）", '', poem)
    # Normalize half-width ! and ? to their full-width forms
    poem = poem.replace('!', '！').replace('?', '？')
    poem_list.append(poem)
# Scrape concurrently
executor = ThreadPoolExecutor(max_workers=10) # max_workers is the number of threads; adjust as needed
# submit() takes the function first, then the arguments to pass to it (more than one is allowed)
future_tasks = [executor.submit(get_poem, url) for url in poem_links]
# Wait for every thread to finish before continuing
wait(future_tasks, return_when=ALL_COMPLETED)

# Write the scraped poems to a txt file
poems = list(set(poem_list))
poems = sorted(poems, key=lambda x:len(x))
print(poems)
# Open the file once and append one poem per line
with open('poem.txt', 'a',encoding='utf-8') as f:
    for poem in poems:
        poem = poem.replace('《','').replace('》','').replace('：', '').replace('「', '')
        print(poem)
        f.write(poem)
        f.write('\n')
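The script above collects results by appending to a shared list from every thread. An alternative pattern, sketched here under the assumption that the worker simply returns its result, is to let the executor gather the return values. The function fetch_length is a hypothetical stand-in for get_poem and performs no network access, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_length(url):
    """Hypothetical stand-in for get_poem(): returns a value instead of appending to a list."""
    return url, len(url)

# Placeholder URLs for illustration only.
urls = ["https://example.com/a", "https://example.com/bb", "https://example.com/ccc"]

# The with block shuts the pool down and waits for all tasks on exit.
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the same order as the inputs,
    # and re-raises any exception a worker raised.
    results = list(executor.map(fetch_length, urls))

print(results)
```

With this shape, a failure inside a worker surfaces when the results are read, rather than being silently held in a future that nobody inspects.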

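Results:

Generate a word cloud for analysis:
import jieba
from wordcloud import WordCloud,STOPWORDS
wc = WordCloud(background_color='white', # background color
               max_words=1000, # maximum number of words
               # mask=back_color, # draw the cloud in the shape of this image; when set, width and height are ignored
               max_font_size=100, # largest font size
               stopwords=STOPWORDS | {'國'}, # built-in stop words plus '國' (set.add() returns None, so build a new set instead)
               # font_path="C:/Windows/Fonts/STFANGSO.ttf", # fixes Chinese rendering as empty boxes; pick another font from C:/Windows/Fonts/ if needed
               font_path=r'C:\Windows\Fonts\simfang.ttf', # raw string, so the backslashes are not read as escapes
               random_state=42, # seed for the random layout and colors, so the image is reproducible
               # width=1000, # image width
               # height=860 # image height
               )
# poem.txt was written as UTF-8, so read it back the same way
text = open('poem.txt', encoding='utf-8').read()
# Despite its name, this function only segments the text with jieba and joins the words
# with spaces so that WordCloud can split them; stop words are removed by the
# stopwords argument above, not here
def stop_words(texts):
    words_list = []
    word_generator = jieba.cut(texts, cut_all=False) # returns a generator
    for word in word_generator:
        words_list.append(word)
    print(words_list)
    return ' '.join(words_list) # note: joined with spaces
text = stop_words(text)
wc.generate(text)
# Save the image to a file
wc.to_file('maikou.png')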
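If you would rather filter stop words yourself before building the cloud, one possible approach is sketched below with the standard library only. It assumes the text has already been segmented into tokens (for example by jieba.cut); the function name remove_stop_words, the sample tokens, and the stop-word set are all made up for illustration.

```python
from collections import Counter

def remove_stop_words(tokens, stop_words):
    """Hypothetical helper: drop stop words and whitespace-only tokens."""
    return [t for t in tokens if t.strip() and t not in stop_words]

# Made-up tokens standing in for the output of a segmenter such as jieba.cut().
tokens = ["明月", "，", "不", "知", "明月", "。", "春風", "不", "\n"]
stop_words = {"不", "，", "。"}

kept = remove_stop_words(tokens, stop_words)

# Space-joined text is the form passed to WordCloud.generate() in the script above.
text_for_wordcloud = " ".join(kept)

# A word cloud sizes words by frequency; Counter shows the same ranking as numbers.
top = Counter(kept).most_common(1)

print(text_for_wordcloud)
print(top)
```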

Final result:
