I Analyzed 1 Million Characters of Government Work Reports From More Than Sixty Years, and These Are the Changes I Saw

Date: 2019-12-15

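Copyright notice: this is an original article by the blogger AzureSky. Please credit the source when reposting: http://www.cnblogs.com/jiayongji/p/7420915.html

Every year the Chinese government publishes its annual Government Work Report. The TopN most frequent keywords in the report become a hot topic in the media, and they also reflect the priorities and direction of government work over the past year and going forward.

The central government website hosts every year's Government Work Report from 1954 to the present: http://www.gov.cn/guoqing/2006-02/16/content_2616810.htm

So I had a sudden idea: what changes can be seen in the Government Work Reports of these sixty-plus years? No sooner said than done; what follows is how I went about it.

What is the goal

Get the full text of every Government Work Report from 1954 to the present, compute the Top20 keywords of each year's report, and visualize them in charts.
Compute the merged Top20 keywords of the reports for each ten-year period, display them in charts, and analyze the trend of change from them.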

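Preparation

Getting the data

The data-fetching stage needs two things prepared:

Page links:

2017 Government Work Report: http://www.gov.cn/premier/2017-03/16/content_5177940.htm

Index page of the 1954~2017 Government Work Reports: http://www.gov.cn/guoqing/2006-02/16/content_2616810.htm

Technical preparation

Use requests, a very handy HTTP library, to fetch page content.

Parsing the data

Use the BeautifulSoup library to parse the page HTML and extract the text of the Government Work Report.

Processing and analyzing the data

Use the jieba word segmentation library to segment the report text, filter out meaningless words, and count word frequencies.

Presenting the results

Use matplotlib to draw a scatter plot of the keywords of each ten-year period's reports, then compare the plots of different periods to analyze how things changed.

Getting hands-on

With the preparation done, we carry out the plan step by step.

Fetching the page HTML

For code reuse, create a file html_utils.py that provides a function for downloading page content and an exception class for HTML parsing errors: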

# coding:utf-8
# HTML utility functions
import requests

# Common request headers
HEADERS = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'}

# Raised when an HTML page cannot be parsed
class HtmlParseError(Exception):
    def __init__(self,value):
        self.value = value
        
    def __str__(self):
        return repr(self.value)

# Fetch the full HTML text of a web page
def get_html(url):
    resp = requests.get(url,headers = HEADERS)
    resp.encoding = 'utf-8'
    # A Response is truthy only when its status code is below 400
    if resp:
        return resp.text
    return None
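
get_html returns None when the server answers with an error status, and the later listings pass that value straight on to the parser. As a minimal sketch (written for Python 3, not part of the original code), a small wrapper can retry a failed download a few times and raise a clear error instead. The name get_html_with_retry and its retries and delay parameters are assumptions made for illustration; fetch stands for any download function such as get_html.

```python
import time


def get_html_with_retry(url, fetch, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times and return the first non-empty result.

    `fetch` is any callable that returns the page text, and that returns None
    or raises an exception on failure (for example html_utils.get_html).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            html = fetch(url)
            if html:
                return html
            last_error = RuntimeError('empty response')
        except Exception as exc:  # network errors, timeouts, and so on
            last_error = exc
        # Wait before the next attempt, but not after the last one
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError('failed to fetch {0} after {1} attempts: {2}'.format(
        url, retries, last_error))
```

With this in place, a call such as get_html_with_retry(url, html_utils.get_html) either returns page text or fails loudly at the download step rather than later during parsing.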

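Creating a word segmentation utility

The overall approach is to fetch the page content, extract the body of the Government Work Report from it, and then segment it into words. Segmentation uses the jieba module, so we create a file cut_text_utils.py that provides the segmentation functions: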

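# coding:utf-8
# Word segmentation utility functions
import sys
import jieba
from collections import Counter

# Segment a piece of text and drop words shorter than 2 characters (punctuation, particles, etc.).
# jieba.cut is called with its defaults, i.e. accurate mode, not full mode (cut_all=True).
def cut_text(text):
    cut_list = jieba.cut(text.lower())
    return [word for word in cut_list if len(word) >= 2]
    
# Return the n most frequent keywords in a piece of text together with their counts
def get_topn_words(text,topn):
    cut_list = cut_text(text)
    counter = Counter(cut_list)
    return counter.most_common(topn)

if __name__ == '__main__':
    # Set the default byte-string encoding to utf-8 (Python 2 only)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    s = u'我想和女友一塊兒去北京故宮博物院參觀和閒逛。'
    # print cut_text(s)
    print get_topn_words(s,5)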

Running the demo script above prints:

[(u'參觀', 1), (u'北京故宮博物院', 1), (u'一塊兒', 1), (u'女友', 1), (u'閒逛', 1)]
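
cut_text only drops tokens shorter than two characters, so common function words of two or more characters still count as keywords. A possible refinement, sketched here for Python 3 under the assumption that the text has already been segmented into a list of tokens, is to filter against a stop-word list as well. STOP_WORDS and top_words are hypothetical names, and the five entries are only an illustration taken from words that show up in the decade results later in the article; a real list would be much longer.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only
STOP_WORDS = {u'咱們', u'已經', u'必須', u'方面', u'進行'}


def top_words(tokens, topn, stop_words=STOP_WORDS):
    """Count tokens, skipping short tokens and stop words, and return the topn."""
    kept = [w for w in tokens if len(w) >= 2 and w not in stop_words]
    return Counter(kept).most_common(topn)
```

The token list would come from a segmenter such as cut_text; the filter itself does not depend on jieba.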

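Creating a plotting utility

In the end we want to draw scatter plots of the keywords with matplotlib, which makes the analysis more intuitive, so we write one more utility file, visual_utils.py: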

# coding:utf-8
import matplotlib.pyplot as plt

# Set the default font so that Chinese text is not garbled in the plots
plt.rcParams['font.sans-serif'] = ['SimHei']  

# Given a list of (keyword, frequency) pairs, draw a scatter plot of the frequencies from high to low
# Example keywords: [(u'張三',10),(u'李四',12),(u'王五',7)]
def draw_keywords_scatter(keywords,title = None,xlabel = None,ylabel = None):
    # Sort keywords by frequency, highest first
    keywords = sorted(keywords,key = lambda item:item[1],reverse = True)

    # Extract the list of keywords
    words = [x[0] for x in keywords]
    # Extract the matching list of frequencies
    times = [x[1] for x in keywords]
    
    x = range(len(keywords))
    y = times
    plt.plot(x, y, 'b^')
    plt.xticks(x, words, rotation=45)
    plt.margins(0.08)
    plt.subplots_adjust(bottom=0.15)
    # Chart title
    plt.title(title)
    # x and y axis labels
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
    
def main():
    draw_keywords_scatter([(u'張三',10),(u'李四',12),(u'王五',7)],u'出勤統計圖',u'姓名',u'出勤次數')
    
if __name__ == '__main__':
    main()

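Running the demo script above produces the following plot:

Parsing the 2017 Government Work Report

Next, we test the waters with the 2017 Government Work Report. Create a file year2017_analysis.py: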

# coding:utf-8
# Analyze the 2017 Government Work Report and extract its 20 most frequent words
import sys
from bs4 import BeautifulSoup as BS
import html_utils
import cut_text_utils

# URL of the full text of the 2017 Government Work Report
REPORT_URL = 'http://www.gov.cn/premier/2017-03/16/content_5177940.htm'

# Extract the body text from the HTML of the 2017 report page
def parse_report_article(html):
    soup = BS(html,'html.parser')
    article = soup.find('div',attrs = {'id':'UCAP-CONTENT'})  # report body; the selector comes from inspecting the page's HTML structure
    return article.text
    
# Given the URL of the full text of the 2017 report, return its topn keywords and their frequencies
def get_topn_words(url,topn):
    html = html_utils.get_html(url)
    article = parse_report_article(html)
    return cut_text_utils.get_topn_words(article,topn)
    
def main():
    # Set the default byte-string encoding to utf-8 (Python 2 only)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    with open('out.tmp','w+') as fout:
        fout.write(str(get_topn_words(REPORT_URL,20)))
    
if __name__ == '__main__':
    main()

Run the script above; an out.tmp file then appears in the current directory with the following content:

[(u'發展', 125), (u'改革', 68), (u'推動', 65), (u'建設', 54), (u'經濟', 52), (u'增強', 45), (u'推進', 42), (u'加快', 40), (u'政府', 39), (u'創新', 36), (u'完善', 35), (u'全面', 35), (u'企業', 35), (u'促進', 34), (u'提升', 32), (u'就業', 31), (u'實施', 31), (u'中國', 31), (u'工做', 29), (u'支持', 29)]

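This shows that the top five keywords of 2017 are 發展 (development), 改革 (reform), 推動 (promoting), 建設 (construction), and 經濟 (economy), which matches what we often see in the media.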
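
The script stores the result with str(), which under Python 2 writes the u'...' form shown above. As an alternative sketch for Python 3 (not the article's code), the list can be saved as JSON with ensure_ascii=False so the Chinese keywords stay readable in the file and can be loaded back later. save_keywords and load_keywords are hypothetical helper names.

```python
import io
import json


def save_keywords(keywords, path):
    """Write a list of (word, count) pairs to `path` as UTF-8 JSON."""
    with io.open(path, 'w', encoding='utf-8') as fout:
        fout.write(json.dumps(keywords, ensure_ascii=False))


def load_keywords(path):
    """Read the pairs back; JSON has no tuples, so convert the inner lists."""
    with io.open(path, 'r', encoding='utf-8') as fin:
        return [tuple(item) for item in json.load(fin)]
```

Opening the file with an explicit encoding also removes the need for the sys.setdefaultencoding workaround in this step.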

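Parsing every year's Government Work Report from 1954 to 2017

The idea is as follows: first get the link to each year's report page from the index page, then crawl each link to fetch the page content, then extract the body of each year's report, and finally merge the reports of each ten-year period, extract the Top20 keywords, and display them.

Imports:

# coding:utf-8
# Combined analysis of the 1954~2017 Government Work Reports
import sys
import json
from collections import OrderedDict
from bs4 import BeautifulSoup as BS
import html_utils
from html_utils import HtmlParseError
import cut_text_utils

Index page URL:

# Index page URL
SUMMARY_URL = 'http://www.gov.cn/guoqing/2006-02/16/content_2616810.htm'

Parse the list of full-text report page URLs, one per year, from the index page: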

# Parse the URL of each year's full-text report page from the index page
# Note: only the 2017 entry links to a special-topic page rather than the full-text page
def get_report_urls(summary_url):
    html = html_utils.get_html(summary_url)
    soup = BS(html,'html.parser')
    reports_table = soup.select('#UCAP-CONTENT table tbody')[0]
    reports = [(atag.text,atag['href']) for trtag in reports_table.select('tr') for tdtag in trtag.select('td') if len(tdtag.select('a')) != 0 for atag in tdtag.select('a')]
    
    # Drop the 2017 entry and add the full-text URL instead
    # (REPORT2017_URL is defined in the complete summary_analysis.py listing)
    report_urls = [x for x in reports if x[0] != '2017']
    report_urls.append(('2017',REPORT2017_URL))
    # Sort by year, ascending
    report_urls = sorted(report_urls,key = lambda item:item[0])
    return report_urls
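
The long list comprehension in get_report_urls boils down to: collect (link text, href) for every link that sits inside a table cell. The sketch below shows the same idea with the standard library's html.parser instead of BeautifulSoup, written for Python 3; it is a swapped-in technique for illustration, not the article's implementation, and TableLinkParser and extract_report_links are hypothetical names. Sorting the year strings works as a chronological sort because every year has four digits.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect (link text, href) pairs for every <a href=...> inside a <td>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._td_depth = 0   # how many <td> elements we are currently inside
        self._href = None    # href of the link being read, if any
        self._text = []      # text fragments of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._td_depth += 1
        elif tag == 'a' and self._td_depth > 0:
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._td_depth > 0:
            self._td_depth -= 1
        elif tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def extract_report_links(html):
    """Return the (text, href) pairs found in table cells, sorted by link text."""
    parser = TableLinkParser()
    parser.feed(html)
    return sorted(parser.links, key=lambda item: item[0])
```

Unlike the original function, this sketch looks at every table on the page rather than only the one under #UCAP-CONTENT.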

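Parse the body text from a report page's HTML:

Note: two different page structures have to be handled here.

# Extract the body text from a report page's HTML
# Reports from different years use 2 different HTML structures, so two parsing strategies are tried
def parse_report_article(html):
    soup = BS(html,'html.parser')
    # Strategy 1
    article = soup.select('#UCAP-CONTENT')
    # If article is empty, fall back to strategy 2
    if len(article) == 0:
        article = soup.select('.p1')
        # If it is still empty, raise an exception
        if len(article) == 0:
            raise HtmlParseError('parse report error!')
            
    return article[0].text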

Combining the functions above, we can crawl the text of all Government Work Reports from 1954 to 2017, a little over 1,007,000 characters in total.
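
The article does not say how this total was counted. One plausible way, sketched here for Python 3 and assuming the text of all reports has been collected into a single string, is to count the characters that are not whitespace, and separately those that are Chinese characters. count_characters is a hypothetical helper, and the CJK check covers only the basic U+4E00 to U+9FFF block.

```python
def count_characters(text):
    """Return (non-whitespace characters, CJK ideographs) for a piece of text."""
    non_space = sum(1 for ch in text if not ch.isspace())
    # Basic CJK Unified Ideographs block only
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    return non_space, cjk
```

The two numbers differ by the digits, punctuation, and Latin letters in the text, so the total reported depends on which definition is used.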

The following functions then extract the keywords:

# Given the URL of the full text of one year's report, return its topn keywords and their frequencies
def get_topn_words(url,topn):
    html = html_utils.get_html(url)
    article = parse_report_article(html)
    return cut_text_utils.get_topn_words(article,topn)

# Given the URLs of the full text of several reports, return their merged topn keywords
# save_reports: whether to save the text to a file (reports.txt)
def get_topn_words_from_urls(urls,topn,save_reports = False):
    htmls = [html_utils.get_html(url) for url in urls]
    # Merged text
    summary_article = '\n'.join([parse_report_article(html) for html in htmls])
    if save_reports:
        with open('reports.txt','w+') as fout:
            fout.write(summary_article)
    return cut_text_utils.get_topn_words(summary_article,topn)

# Given the full-text URL of each year's report, return each year's topn keywords
def get_topn_words_yearly(report_urls,topn):
    keywords = OrderedDict()
    # Walk the URL list, extract the topn keywords of each year's report, and store them in the keywords dict
    for year,url in report_urls:
        print 'start to parse {year} report...'.format(year = year)
        keywords[year] = get_topn_words(url,topn)
    return keywords

# Given the full-text URL of each year's report, return the merged topn keywords of each ten-year period
def get_topn_words_decadal(report_urls,topn):
    # Years that make up each ten-year period
    decade1 = ['1964','1960','1959','1958','1957','1956','1955','1954']
    decade2 = ['1987','1986','1985','1984','1983','1982','1981','1980','1979','1978','1975']
    decade3 = ['1997','1996','1995','1994','1993','1992','1991','1990','1989','1988']
    decade4 = ['2007','2006','2005','2004','2003','2002','2001','2000','1999','1998']
    decade5 = ['2017','2016','2015','2014','2013','2012','2011','2010','2009','2008']
    
    keywords = OrderedDict()
    decade_items = [('1954-1964',decade1),('1975-1987',decade2),('1988-1997',decade3),('1998-2007',decade4),('2008-2017',decade5)]
    for years,decade in decade_items:
        print 'start to parse {years} reports...'.format(years = years)
        urls = [item[1] for item in report_urls if item[0] in decade]
        keywords[years] = get_topn_words_from_urls(urls,topn)
        
    return keywords
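
get_topn_words_decadal spells out every year of every period by hand. A more compact variant, sketched here for Python 3, describes each period by its first and last year and assigns each report URL to the period its year falls into. PERIODS and group_urls_by_period are hypothetical names; the boundaries are the ones used in the article, and the sketch assumes the index page only lists years for which a report exists, so a range selects the same URLs as the explicit lists.

```python
from collections import OrderedDict

# (label, first year, last year) for each period used in the article
PERIODS = [
    ('1954-1964', 1954, 1964),
    ('1975-1987', 1975, 1987),
    ('1988-1997', 1988, 1997),
    ('1998-2007', 1998, 2007),
    ('2008-2017', 2008, 2017),
]


def group_urls_by_period(report_urls, periods=PERIODS):
    """Map each period label to the URLs of the reports published in it.

    report_urls is a list of (year string, url) pairs, as returned by
    get_report_urls. Entries whose first element is not a year are skipped.
    """
    grouped = OrderedDict((label, []) for label, _, _ in periods)
    for year, url in report_urls:
        if not year.isdigit():
            continue
        for label, first, last in periods:
            if first <= int(year) <= last:
                grouped[label].append(url)
                break
    return grouped
```

Each list in the result can then be passed to get_topn_words_from_urls, in the same way the loop above does.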

Putting all the code above together into a file summary_analysis.py gives:

# coding:utf-8
# Combined analysis of the 1954~2017 Government Work Reports
import sys
import json
from collections import OrderedDict
from bs4 import BeautifulSoup as BS
import html_utils
from html_utils import HtmlParseError
import cut_text_utils
import visual_utils

# Index page URL
SUMMARY_URL = 'http://www.gov.cn/guoqing/2006-02/16/content_2616810.htm'

# URL of the full text of the 2017 Government Work Report
REPORT2017_URL = 'http://www.gov.cn/premier/2017-03/16/content_5177940.htm'

# Parse the URL of each year's full-text report page from the index page
# Note: only the 2017 entry links to a special-topic page rather than the full-text page
def get_report_urls(summary_url):
    html = html_utils.get_html(summary_url)
    soup = BS(html,'html.parser')
    reports_table = soup.select('#UCAP-CONTENT table tbody')[0]
    reports = [(atag.text,atag['href']) for trtag in reports_table.select('tr') for tdtag in trtag.select('td') if len(tdtag.select('a')) != 0 for atag in tdtag.select('a')]
    
    # Drop the 2017 entry and add the full-text URL instead
    report_urls = [x for x in reports if x[0] != '2017']
    report_urls.append(('2017',REPORT2017_URL))
    # Sort by year, ascending
    report_urls = sorted(report_urls,key = lambda item:item[0])
    return report_urls

# Extract the body text from a report page's HTML
# Reports from different years use 2 different HTML structures, so two parsing strategies are tried
def parse_report_article(html):
    soup = BS(html,'html.parser')
    # Strategy 1
    article = soup.select('#UCAP-CONTENT')
    # If article is empty, fall back to strategy 2
    if len(article) == 0:
        article = soup.select('.p1')
        # If it is still empty, raise an exception
        if len(article) == 0:
            raise HtmlParseError('parse report error!')
            
    return article[0].text

# Given the URL of the full text of one year's report, return its topn keywords and their frequencies
def get_topn_words(url,topn):
    html = html_utils.get_html(url)
    article = parse_report_article(html)
    return cut_text_utils.get_topn_words(article,topn)

# Given the URLs of the full text of several reports, return their merged topn keywords
# save_reports: whether to save the text to a file (reports.txt)
def get_topn_words_from_urls(urls,topn,save_reports = False):
    htmls = [html_utils.get_html(url) for url in urls]
    # Merged text
    summary_article = '\n'.join([parse_report_article(html) for html in htmls])
    if save_reports:
        with open('reports.txt','w+') as fout:
            fout.write(summary_article)
    return cut_text_utils.get_topn_words(summary_article,topn)

# Given the full-text URL of each year's report, return each year's topn keywords
def get_topn_words_yearly(report_urls,topn):
    keywords = OrderedDict()
    # Walk the URL list, extract the topn keywords of each year's report, and store them in the keywords dict
    for year,url in report_urls:
        print 'start to parse {year} report...'.format(year = year)
        keywords[year] = get_topn_words(url,topn)
    return keywords

# Given the full-text URL of each year's report, return the merged topn keywords of each ten-year period
def get_topn_words_decadal(report_urls,topn):
    # Years that make up each ten-year period
    decade1 = ['1964','1960','1959','1958','1957','1956','1955','1954']
    decade2 = ['1987','1986','1985','1984','1983','1982','1981','1980','1979','1978','1975']
    decade3 = ['1997','1996','1995','1994','1993','1992','1991','1990','1989','1988']
    decade4 = ['2007','2006','2005','2004','2003','2002','2001','2000','1999','1998']
    decade5 = ['2017','2016','2015','2014','2013','2012','2011','2010','2009','2008']
    
    keywords = OrderedDict()
    decade_items = [('1954-1964',decade1),('1975-1987',decade2),('1988-1997',decade3),('1998-2007',decade4),('2008-2017',decade5)]
    for years,decade in decade_items:
        print 'start to parse {years} reports...'.format(years = years)
        urls = [item[1] for item in report_urls if item[0] in decade]
        keywords[years] = get_topn_words_from_urls(urls,topn)
        
    return keywords

def main():
    # Set the default byte-string encoding to utf-8 (Python 2 only)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    # Analyze the reports period by period, ten years at a time
    report_urls = get_report_urls(SUMMARY_URL)
    keywords = get_topn_words_decadal(report_urls,20)
    
    # Save the results to a file
    with open('out.tmp','w+') as fout:
        for years,words in keywords.items():
            fout.write('【{years}】\n'.format(years = years.decode('unicode-escape').encode('utf-8')))
            for word,count in words:
                fout.write('{word}:{count};'.format(word = word,count = count))
            fout.write('\n\n')
            
    # Draw the scatter plots
    for years,words in keywords.items():
        visual_utils.draw_keywords_scatter(words[:20],u'{years}年政府工作報告關鍵詞Top{topn}'.format(years = years,topn = 20),u'關鍵詞',u'出現總次數')
    

if __name__ == '__main__':
    main()

Run the file; the out.tmp file in the current directory then contains:

【1954-1964】
咱們:932;人民:695;國家:690;我國:664;建設:650;發展:641;社會主義:618;生產:509;工業:491;農業:481;工做:396;增加:385;增長:376;必須:361;計劃:339;已經:328;方面:299;進行:298;全國:295;企業:267;

【1975-1987】
發展:1012;咱們:1011;經濟:875;建設:664;我國:609;企業:586;人民:577;國家:569;社會主義:535;改革:499;工做:488;生產:486;必須:451;提升:368;增加:349;方面:349;進行:349;問題:320;增長:290;增強:288;

【1988-1997】
發展:1182;經濟:789;建設:696;改革:537;工做:495;增強:485;企業:485;繼續:455;國家:435;社會:432;咱們:399;我國:378;社會主義:350;積極:340;進一步:334;人民:331;提升:311;政府:289;增長:276;必須:275;

【1998-2007】
發展:814;建設:597;增強:536;經濟:459;工做:430;改革:402;企業:368;繼續:320;社會:287;政府:284;推動:261;增長:245;加快:240;積極:240;進一步:236;堅持:228;咱們:221;提升:217;農村:207;管理:203;

【2008-2017】
發展:1115;建設:597;經濟:554;推動:507;改革:479;增強:456;社會:345;加快:344;政府:320;提升:312;實施:301;促進:301;咱們:294;工做:287;制度:272;增加:259;完善:248;政策:240;就業:240;企業:240;
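
One simple way to read a trend off these lists is to check which keywords drop out of the Top20 from one period to another and which ones enter it. The sketch below, written for Python 3, parses a line in the word:count; format used by out.tmp and compares two periods with set operations. parse_keyword_line and compare_periods are hypothetical helper names, not part of the article's code.

```python
def parse_keyword_line(line):
    """Parse 'word:count;word:count;' (the out.tmp format) into a dict."""
    pairs = [item.split(':') for item in line.strip().split(';') if item]
    return {word: int(count) for word, count in pairs}


def compare_periods(old, new):
    """Return (words only in old, words only in new, words in both), each sorted."""
    old_words, new_words = set(old), set(new)
    return (sorted(old_words - new_words),
            sorted(new_words - old_words),
            sorted(old_words & new_words))
```

Feeding it the keyword lines of two periods, for example 1954-1964 and 2008-2017, gives the words that left the Top20, the words that joined it, and the words that stayed.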

Five plots are drawn as well: