With Python, life still has poetry and faraway places

Date: 2019-12-04

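It is often said that code today matters as much as poetry did in the Tang dynasty.
For us, writing a few lines of code is no big deal, but actually composing a Tang poem is a real headache. So why not simply write a Tang poem with code?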

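Preparation:

Python 3.6 environment.
Anaconda is recommended for managing Python packages: you can create a separate environment for each project and install the packages that project needs into it.
PyCharm is recommended as the editor (IDE).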

Code on GitHub: theodore3131/TangshiGenerator

Steps:

Use a crawler to scrape the Complete Tang Poems (全唐詩); 71,000 poems were collected in total.

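import certifi
import urllib3
from bs4 import BeautifulSoup

# Build the urllib3 connection pool with certificate verification (CA bundle from certifi);
# the original uses this to cope with the site's anti-crawler mechanism
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())
# url is the address of the page to scrape (not specified here)
r = http.request('GET', url)
# Parse the returned HTML and pick out the element that holds the poem body
soup = BeautifulSoup(r.data, 'html.parser')
content = soup.find('div', class_="contson")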
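To make the extraction step easier to try without network access, here is a minimal sketch that pulls the text out of a `<div class="contson">` element using only the standard library's `html.parser`. It is a stand-in for the BeautifulSoup call, not the author's code: the `ContsonExtractor` class and the `SAMPLE_HTML` snippet are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class ContsonExtractor(HTMLParser):
    """Collect the text inside the first <div class="contson"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth of <div>s inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []     # text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        elif "contson" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target element

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data.strip())


# Hypothetical page fragment; the real site's markup may differ.
SAMPLE_HTML = """
<html><body>
<div class="nav">首頁</div>
<div class="contson">床前明月光，疑是地上霜。<br/>舉頭望明月，低頭思故鄉。</div>
</body></html>
"""

parser = ContsonExtractor()
parser.feed(SAMPLE_HTML)
text = "".join(parser.chunks)
print(text)
```

In the real crawler the HTML would come from `r.data` rather than an inline string.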

Process the scraped data with a regular expression

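import re

# 5 to 7 Chinese characters followed by a Chinese full stop or comma
p1 = r"[\u4e00-\u9fa5]{5,7}[\u3002\uff0c]"
pattern1 = re.compile(p1)             # compile the regular expression
result = pattern1.findall(poemfile)   # list of every matching line in the scraped text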
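As a quick illustration of what this kind of pattern returns, the sketch below runs it on a single well-known quatrain. The sample string and the names `LINE_PATTERN`, `lines` and `bare` are assumptions chosen for the example; the real input would be the scraped corpus.

```python
import re

# 5 to 7 Chinese characters followed by a Chinese full stop or comma
LINE_PATTERN = re.compile(r"[\u4e00-\u9fa5]{5,7}[\u3002\uff0c]")

sample = "床前明月光，疑是地上霜。舉頭望明月，低頭思故鄉。"

lines = LINE_PATTERN.findall(sample)   # each match still ends with its punctuation mark
bare = [line[:-1] for line in lines]   # drop the trailing punctuation

print(lines)
print(bare)
```

One thing to keep in mind: a line longer than seven characters is not skipped. The pattern matches the last seven characters before the punctuation mark instead, so the results may need a further check.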

Segment the poem text into words

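import jieba.analyse

# Use the TextRank algorithm from the jieba Chinese segmentation library to find high-frequency words with the given part-of-speech tags
for x in jieba.analyse.textrank(content, topK=600, allowPOS=('n', 'nr', 'ns', 'nt', 'nz', 'm')):
    print(x)  # the loop body is omitted in the original; printing each keyword here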

Generating Tang poems
- Handling the rhyme

Use the pinyin library

pip install pinyin

import pinyin

verse = pinyin.get("天", format="strip")
# Output: tian

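For the rhymes, I originally planned to collect every rhyme and store them in a dictionary, but there are more than 20 of them.
I later noticed that all of these 20-odd rhymes begin with a vowel letter, so we can use that rule to detect them: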

rhythm = ""
rhythmList = ["a", "e", "i", "o", "u"]
verse = pinyin.get(nounlist[i1][1], format="strip")
ind = 0
# Scan the pinyin backwards; the last vowel reached (the first vowel of the syllable) is where the rhyme begins
for p in range(len(verse)-1, -1, -1):
    if verse[p] in rhythmList:
        ind = p

rhythm = verse[ind:len(verse)]
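The same rule can be sketched as a small helper that works directly on toneless pinyin strings, which makes it easy to test without installing the pinyin library. The function name `rhyme_of` and the sample syllables are assumptions for illustration; scanning forward to the first vowel gives the same result as the backward scan above.

```python
VOWELS = "aeiou"


def rhyme_of(syllable):
    """Return the part of a toneless pinyin syllable from its first vowel onward."""
    for idx, letter in enumerate(syllable):
        if letter in VOWELS:
            return syllable[idx:]
    return syllable  # no plain vowel found: fall back to the whole syllable


for s in ["tian", "shan", "guang", "yang"]:
    print(s, "->", rhyme_of(s))
```

Note that the rule compares the extracted strings exactly, so "tian" ("ian") and "shan" ("an") would not count as rhyming here, while "zhang" and "yang" (both "ang") would.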

For now it only produces the most basic five-character poems, using a noun-verb-noun sentence pattern

import random
import pinyin

# nounlist and verblist hold the two-character nouns and verbs found in the segmentation step
rhythm = ""
rhythmList = ["a", "e", "i", "o", "u"]
num = 0
while num < 4:
    # Pick random words
    i = random.randint(1, len(nounlist)-1)
    i1 = random.randint(1, len(nounlist)-1)
    j = random.randint(1, len(verblist)-1)

    # Record the rhyme of the second line
    ind = 0
    ind1 = 0
    if num == 1:
        verse = pinyin.get(nounlist[i1][1], format="strip")
        # Scan the pinyin backwards; the last vowel reached is where the rhyme begins
        for p in range(len(verse)-1, -1, -1):
            if verse[p] in rhythmList:
                ind = p
        rhythm = verse[ind:len(verse)]

    # Make sure lines 2 and 4 share the same rhyme
    if num == 3:
        verse1 = pinyin.get(nounlist[i1][1], format="strip")
        for p in range(len(verse1)-1, -1, -1):
            if verse1[p] in rhythmList:
                ind1 = p
        # Keep drawing a new closing noun until its rhyme matches line 2
        while verse1[ind1:len(verse1)] != rhythm:
            i1 = random.randint(1, len(nounlist)-1)
            verse1 = pinyin.get(nounlist[i1][1], format="strip")
            ind1 = 0
            for p in range(len(verse1)-1, -1, -1):
                if verse1[p] in rhythmList:
                    ind1 = p

    # Combine the randomly chosen words: noun + verb character + noun
    print(nounlist[i] + verblist[j][1] + nounlist[i1])
    num += 1
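The generation loop can also be sketched in a self-contained form that runs without the corpus or the pinyin library. Everything below is an assumption for illustration: the tiny `NOUNS` and `VERBS` lists, the hand-written `PINYIN` table standing in for `pinyin.get(..., format="strip")`, and the helper names. One deliberate difference from the loop above: instead of redrawing at random until the rhyme matches, line 4 picks from a pre-filtered list of nouns with the right rhyme, which always terminates.

```python
import random

VOWELS = "aeiou"

# Hypothetical sample data standing in for the corpus-derived word lists.
NOUNS = ["西風", "細雨", "山川", "建章", "蕭索", "斜陽"]
VERBS = ["相看", "獨釣", "不成"]

# Hypothetical toneless pinyin for the last character of each noun.
PINYIN = {"風": "feng", "雨": "yu", "川": "chuan",
          "章": "zhang", "索": "suo", "陽": "yang"}


def rhyme_of(char):
    """Rhyme of a character: its pinyin from the first vowel onward."""
    syllable = PINYIN[char]
    for idx, letter in enumerate(syllable):
        if letter in VOWELS:
            return syllable[idx:]
    return syllable


def make_poem(rng):
    """Build four noun + verb-character + noun lines; lines 2 and 4 rhyme."""
    lines = []
    rhyme = None
    for num in range(4):
        first = rng.choice(NOUNS)
        verb = rng.choice(VERBS)
        if num == 3:
            # Only nouns whose last character matches the rhyme of line 2.
            # The list is never empty: the noun that closed line 2 qualifies.
            candidates = [n for n in NOUNS if rhyme_of(n[1]) == rhyme]
            last = rng.choice(candidates)
        else:
            last = rng.choice(NOUNS)
            if num == 1:
                rhyme = rhyme_of(last[1])  # remember the rhyme of line 2
        lines.append(first + verb[1] + last)
    return lines


poem = make_poem(random.Random(0))
print("\n".join(poem))
```

Passing a seeded `random.Random` keeps the output reproducible while experimenting; swapping in the real word lists and `pinyin.get` should only require changing `NOUNS`, `VERBS` and `rhyme_of`.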

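Acrostic poems

The idea is simple: since we already have the corpus, each time we combine words we only need to make sure that the first character of the first noun in each line is, in order, the corresponding character of a given four-character idiom.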

# idiom is the given four-character idiom; num is the index of the current line
for x in range(len(nounlist)):
    if nounlist[x][0] == idiom[num]:
        i = x
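A minimal runnable sketch of the acrostic idea follows. The word lists are hypothetical sample data, and the function name `acrostic` is an assumption. Unlike the loop above, which keeps the last matching noun, this version chooses at random among all nouns that start with the required character and raises an error when none exists.

```python
import random

# Hypothetical sample data standing in for the corpus-derived word lists.
NOUNS = ["落暉", "落日", "花枝", "流水", "水聲", "南宮", "公子", "朝廷", "白石"]
VERBS = ["回首", "不成", "有名", "取勝"]


def acrostic(idiom, rng):
    """Build one noun + verb-character + noun line per character of idiom."""
    lines = []
    for head in idiom:
        # Nouns whose first character is the required head character
        starters = [n for n in NOUNS if n[0] == head]
        if not starters:
            raise ValueError(f"no noun starts with {head!r}")
        first = rng.choice(starters)
        lines.append(first + rng.choice(VERBS)[1] + rng.choice(NOUNS))
    return lines


poem = acrostic("落花流水", random.Random(0))
print("\n".join(poem))
```

Reading the first character of each printed line top to bottom gives back the idiom.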

Let's look at the results:

Four-character poem:

所思浮雲
關山車馬
高樓流水
閒人腸斷

Five-character poem:

西風時細雨
山川釣建章
龍門看蕭索
幾年鄉斜陽

Acrostic poem:

落花流水

落暉首南宮
花枝成公子
流水名朝廷
水聲勝白石

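Of course, the Tang poems generated so far are still fairly crude: they are basically permutations of classical poetry vocabulary. The next step is to improve the templates by extracting common five- and seven-character sentence patterns. I am also considering machine learning, writing an RNN so that the computer can generate poems with real charm on its own.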