Scraping Douban reviews of 武林外傳 (My Own Swordsman) with Selenium, plus data visualization and sentiment analysis

The overall workflow breaks down roughly into three steps:

1. Data collection: Selenium plus multiprocessing (Selenium with multiprocessing can be problematic on Linux), writing to Kafka (the preferred route on Linux); on Windows the data is written straight to MySQL.

2. Data storage (Kafka + Hive, or MySQL) plus data cleaning with shell and Python 3.

3. Data visualization: word clouds with pyecharts, jieba segmentation, SnowNLP for sentiment analysis.

 

step 1

Log in to Douban with Selenium and scrape the short reviews of 武林外傳:

When this crawler was first written, the Douban comment API could be spotted straight from the F12 network panel. Douban has since been updated and the data is now loaded asynchronously by JS, so no usable endpoint was found; instead Selenium is used to drive a real browser.
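Because the comments arrive asynchronously, waiting a fixed number of seconds is fragile. A minimal sketch of an explicit wait, my own addition rather than part of the original script (the comments container id comes from the XPaths used throughout this post):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 s for the JS-rendered comment list to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "comments"))
)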

Douban's login page was also restyled; the login form now sits in a separate iframe.

So the code is as follows:

# -*- coding:utf-8 -*-
# Imports
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Create the Chrome options object
opt = webdriver.ChromeOptions()
# Run Chrome headless; this works on both Windows and Linux
opt.add_argument('--headless')
# Using Chrome (note: do not create a second, option-less driver here,
# or the headless options above are silently discarded)
driver = webdriver.Chrome(options=opt)
# Open the Douban home page
driver.get("http://www.douban.com/")

# Switch into the login iframe
driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])
# Click "password login"
bottom1 = driver.find_element_by_xpath('/html/body/div[1]/div[1]/ul[1]/li[2]')
bottom1.click()

# Enter the username and password
input1 = driver.find_element_by_xpath('//*[@id="username"]')
input1.clear()
input1.send_keys("xxxxx")

input2 = driver.find_element_by_xpath('//*[@id="password"]')
input2.clear()
input2.send_keys("xxxxx")

# Log in
bottom = driver.find_element_by_class_name('account-form-field-submit ')
bottom.click()

Then navigate to the comment page: https://movie.douban.com/subject/3882715/comments?sort=new_score

Clicking "next page" changes the URL to https://movie.douban.com/subject/3882715/comments?start=20&limit=20&sort=new_score, so the start offset grows by 20 per page and the pages can be walked with a simple loop.
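A minimal sketch of that loop (the totals match the 24 pages of 20 comments used in the full script below):

# Walk the comment pages by bumping the start offset by 20
base = 'https://movie.douban.com/subject/3882715/comments?start={}&limit=20&sort=new_score'
for start in range(0, 480, 20):
    driver.get(base.format(start))
    # ... extract the 20 comments on this page ...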

 

Get the user's name:

 

driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).text
And the user's comment:

driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/p/span'.format(str(i))).text
Then, to find out where the user lives:

# Get the user's profile URL, then open it to read the location
userInfo = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).get_attribute('href')
driver.get(userInfo)
try:
    userLocation = driver.find_element_by_xpath('//*[@id="profile"]/div/div[2]/div[1]/div/a').text
    print("The user's location is: ")
    print(userLocation)
except Exception as e:
    print(e)

Note that some users never filled in a location, so the exception must be caught.
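A small helper makes that pattern reusable; this wrapper is my own addition, not part of the original script:

def find_text_or_default(driver, xpath, default='未填寫'):
    # Return the element's text, or a default when the element is missing
    try:
        return driver.find_element_by_xpath(xpath).text
    except Exception:
        return default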

Full code:

# -*- coding:utf-8 -*-
# Imports
import time
from selenium import webdriver
import pymysql
from selenium.webdriver.common.keys import Keys
from multiprocessing import Pool

class doubanwlwz_spider():
    def writeMysql(self,userName,userConment,userLocation):
        # Open the database connection (credentials redacted)
        db = pymysql.connect("123XXX1", "zXXXan", "XXX1", "huXXXt")
        # Create a cursor with the cursor() method
        cursor = db.cursor()
        sql = "insert into userinfo(username,commont,location) values(%s, %s, %s)"
        cursor.execute(sql, [userName, userConment, userLocation])
        db.commit()
        # Close the connection
        cursor.close()
        db.close()

    def getInfo(self,page):
        # Log in to Douban
        opt = webdriver.ChromeOptions()

        # Headless-friendly Chrome flags; they work on both Windows and Linux
        opt.add_argument('--no-sandbox')
        opt.add_argument('--disable-gpu')
        opt.add_argument('--hide-scrollbars')  # hide scrollbars, for some odd pages
        opt.add_argument('blink-settings=imagesEnabled=false')  # skip images to speed things up
        # Using Chrome
        driver = webdriver.Chrome('D:\chromedriver_win32\chromedriver', options=opt)

        driver.get("http://www.douban.com/")
        # Switch into the login iframe
        driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])
        # Click "password login"
        bottom1 = driver.find_element_by_xpath('/html/body/div[1]/div[1]/ul[1]/li[2]')
        bottom1.click()
        # Enter the username and password
        input1 = driver.find_element_by_xpath('//*[@id="username"]')
        input1.clear()
        input1.send_keys("1XXX2")

        input2 = driver.find_element_by_xpath('//*[@id="password"]')
        input2.clear()
        input2.send_keys("zXXX344")

        # Log in
        bottom = driver.find_element_by_class_name('account-form-field-submit ')
        bottom.click()

        time.sleep(1)
        # Fetch all the comments: 24 pages, 20 comments per page, 480 in total
        for start in range((page-1)*240, page*240, 20):
            driver.get('https://movie.douban.com/subject/3882715/comments?start={}&limit=20&sort=new_score'.format(start))
            print("Scraping comments from offset %i" % (start))
            search_window = driver.current_window_handle
            # Get the 20 users on this page
            for i in range(1, 21):
                userName = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).text
                print("Username: %s" % (userName))
                # Get the user's comment
                userConment = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/p/span'.format(str(i))).text
                print("Comment: %s" % (userConment))
                # Get the user's profile URL, open it, and read the location
                userInfo = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).get_attribute('href')
                driver.get(userInfo)
                try:
                    userLocation = driver.find_element_by_xpath('//*[@id="profile"]/div/div[2]/div[1]/div/a').text
                    print("Location: %s" % (userLocation))
                    driver.back()
                    self.writeMysql(userName, userConment, userLocation)
                except Exception as e:
                    userLocation = '未填寫'
                    self.writeMysql(userName, userConment, userLocation)
                    driver.back()
        driver.close()



if __name__ == '__main__':
    AAA=doubanwlwz_spider()
    p = Pool(3)
    startTime = time.time()
    for i in range(1, 3):
        p.apply_async(AAA.getInfo, args=(i,))
    p.close()
    p.join()
    stopTime = time.time()
    print('Running time: %0.2f Seconds' % (stopTime - startTime))
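The script assumes the userinfo table already exists. The original post never shows the schema, so the DDL below is a hedged guess that simply matches the INSERT statement above (including its commont spelling):

import pymysql

# Hypothetical DDL matching the INSERT above; adjust types and sizes as needed
db = pymysql.connect("123XXX1", "zXXXan", "XXX1", "huXXXt")
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS userinfo (
        id INT AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(255),
        commont TEXT,
        location VARCHAR(255)
    ) DEFAULT CHARSET=utf8mb4
""")
db.commit()
db.close()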

 

step 2

Linux code. (Note: Selenium and this multiprocessing setup do not get along well on Linux, so Windows is recommended; on Linux the data is written to Kafka instead.) The Kafka producer code is as follows:

# -*- coding: utf-8 -*-
from kafka import KafkaProducer
from kafka import KafkaConsumer
from kafka.errors import KafkaError
import json
import time
import sys


class Kafka_producer():
    '''
    Producer side of Kafka
    '''

    def __init__(self, kafkahost, kafkaport, kafkatopic):
        self.kafkaHost = kafkahost
        self.kafkaPort = kafkaport
        self.kafkatopic = kafkatopic
        self.producer = KafkaProducer(bootstrap_servers='{kafka_host}:{kafka_port}'.format(
            kafka_host=self.kafkaHost,
            kafka_port=self.kafkaPort
        ))

    def sendjsondata(self, params):
        try:
            # params_message = json.dumps(params)
            params_message = params
            producer = self.producer
            kafkamessage = params_message.encode('utf-8')
            producer.send(self.kafkatopic, kafkamessage, partition=0)
            producer.flush()
        except KafkaError as e:
            print(e)


def main(message, topicname):
    '''
    Send one message to the given topic
    '''
    producer = Kafka_producer("10XXX2", "9098", topicname)
    print("Sending to Kafka: %s" % (message))
    producer.sendjsondata(message)
    time.sleep(1)
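Before the producer can write, the three topics used below (username, usercomment, userlocation) have to exist unless the broker auto-creates topics. A hedged sketch using kafka-python's admin client, with the same redacted broker address as above:

from kafka.admin import KafkaAdminClient, NewTopic

# Create the three topics used by this pipeline, one partition each
admin = KafkaAdminClient(bootstrap_servers='10XXX2:9098')
topics = [NewTopic(name=t, num_partitions=1, replication_factor=1)
          for t in ('username', 'usercomment', 'userlocation')]
admin.create_topics(topics)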

 

 

# -*- coding:utf-8 -*-
# Imports
import time
from selenium import webdriver
import pymysql
from selenium.webdriver.common.keys import Keys
from multiprocessing import Pool
import producer


class doubanwlwz_spider():
    def writeMysql(self,userName,userConment,userLocation):
        # Open the database connection (credentials redacted)
        db = pymysql.connect("123XXX1", "zXXXan", "XXX1", "huXXXt")
        # Create a cursor with the cursor() method
        cursor = db.cursor()
        sql = "insert into userinfo(username,commont,location) values(%s, %s, %s)"
        cursor.execute(sql, [userName, userConment, userLocation])
        db.commit()
        # Close the connection
        cursor.close()
        db.close()



    def getInfo(self,page):
        # Log in to Douban
        opt = webdriver.ChromeOptions()

        # Headless-friendly Chrome flags; they work on both Windows and Linux
        opt.add_argument('--no-sandbox')
        opt.add_argument('--disable-gpu')
        opt.add_argument('--hide-scrollbars')  # hide scrollbars, for some odd pages
        opt.add_argument('blink-settings=imagesEnabled=false')  # skip images to speed things up
        opt.add_argument('--headless')  # no visible browser window
        # Using Chrome
        driver = webdriver.Chrome('/opt/scripts/zf/douban/chromedriver', options=opt)

        driver.get("http://www.douban.com/")
        # Switch into the login iframe
        driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])
        # Click "password login"
        bottom1 = driver.find_element_by_xpath('/html/body/div[1]/div[1]/ul[1]/li[2]')
        bottom1.click()
        # Enter the username and password
        input1 = driver.find_element_by_xpath('//*[@id="username"]')
        input1.clear()
        input1.send_keys("XXX2")

        input2 = driver.find_element_by_xpath('//*[@id="password"]')
        input2.clear()
        input2.send_keys("XX44")
        # Log in
        bottom = driver.find_element_by_class_name('account-form-field-submit ')
        bottom.click()

        time.sleep(1)
        print("Logged in")
        # Fetch all the comments: 24 pages of 20 comments, split across 4 workers
        for start in range((page-1)*120, page*120, 20):
            driver.get('https://movie.douban.com/subject/3882715/comments?start={}&limit=20&sort=new_score'.format(start))
            print("Scraping comments from offset %i" % (start))
            search_window = driver.current_window_handle
            # Get the 20 users on this page
            for i in range(1, 21):
                userName = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).text
                print("Username: %s" % (userName))
                # Two arguments: the message itself, then the topic name
                producer.main(userName, "username")
                # Get the user's comment
                userConment = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/p/span'.format(str(i))).text
                print("Comment: %s" % (userConment))
                producer.main(userConment, "usercomment")
                # Get the user's profile URL, open it, and read the location
                userInfo = driver.find_element_by_xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(str(i))).get_attribute('href')
                driver.get(userInfo)
                try:
                    userLocation = driver.find_element_by_xpath('//*[@id="profile"]/div/div[2]/div[1]/div/a').text
                    print("Location: %s" % (userLocation))
                    producer.main(userLocation, "userlocation")
                except Exception as e:
                    producer.main("未填寫", "userlocation")
                # self.writeMysql(userName, userConment, userLocation)
                # Return to the comment page before reading the next user;
                # the original left this commented out, so the next lookups would fail
                driver.back()


if __name__ == '__main__':
    AAA=doubanwlwz_spider()
    p = Pool(4)
    startTime = time.time()
    for i in range(1, 5):
        p.apply_async(AAA.getInfo, args=(i,))
    p.close()
    p.join()
    stopTime = time.time()
    print('Running time: %0.2f Seconds' % (stopTime - startTime))
                                                                                              

 

step 3

On Windows the visualization reads straight from MySQL (a read-back sketch follows the consumer code below); on Linux the Kafka messages have to be consumed first. The consumer code is as follows:

from kafka import KafkaConsumer
from multiprocessing import Pool
import json
import re


def writeTxt(topic):
    consumer = KafkaConsumer(topic,
                             auto_offset_reset='earliest',
                             group_id="test1",
                             bootstrap_servers=['10.8.31.2:9098'])

    for message in consumer:
        # Keep only the text: replace Chinese and ASCII punctuation with spaces.
        # Note that Kafka payloads are bytes, so the value must be decoded first.
        string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+", " ", message.value.decode('utf-8'))
        print(string)
        f = open(topic, 'a', encoding='utf-8')
        f.write(string)
        f.close()

p = Pool(4)
for i in {"userlocation", "usercomment", "username"}:
    print(i)
    p.apply_async(writeTxt, args=(i,))
p.close()
p.join()
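For the Windows/MySQL path, a hedged sketch of dumping the stored comments back out to a text file for the visualization steps (same redacted connection details and table as writeMysql above):

import pymysql

# Hypothetical read-back: dump all stored comments into comment.txt
db = pymysql.connect("123XXX1", "zXXXan", "XXX1", "huXXXt")
cursor = db.cursor()
cursor.execute("select commont from userinfo")
with open("comment.txt", "w", encoding="utf-8") as f:
    for (commont,) in cursor.fetchall():
        f.write(commont + "\n")
db.close()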

 

------------------------------------------ the fancy divider ---------------------------- at this point all of the data has been collected ----------------------------------------------------

 

 

 

With the data in hand, start by visualizing the users' locations.

Step one: clean the data by replacing all the Chinese punctuation with spaces.

import re
f = open('name.txt', 'r', encoding='utf-8')
for line in f.readlines():
    string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+", " ", line)
    print(line)
    print(string)
    f1 = open("newname.txt", 'a', encoding='utf-8')
    f1.write(string)
    f1.close()
f.close()

Step two: turn the data into word-cloud input; stop words must be filtered out, then the text segmented.

# jieba segmentation and the processing pipeline
import jieba.analyse
import jieba

# Takes the name of the text file to analyse and the name of the output file
class worldAnalysis():
    def __init__(self, inputfilename, outputfilename):
        self.inputfilename = inputfilename
        self.outputfilename = outputfilename
        self.start()
# -------------------------------- Segmentation and stop-word removal ---------------------------------------
    # Build the stop-word list
    def stopwordslist(self):
        stopwords = [line.strip() for line in open('ting.txt', encoding='UTF-8').readlines()]
        return stopwords

    # Segment one sentence of Chinese text
    def seg_depart(self, sentence):
        # Each line of the document is segmented separately
        print("Segmenting...")
        sentence_depart = jieba.cut(sentence.strip())
        # Build the stop-word list
        stopwords = self.stopwordslist()
        # Accumulate the result in outstr
        outstr = ''
        # Drop stop words
        for word in sentence_depart:
            if word not in stopwords:
                if word != '\t':
                    outstr += word
                    outstr += " "
        return outstr

    def start(self):
        # Open the input and output files
        filename = self.inputfilename
        outfilename = self.outputfilename
        inputs = open(filename, 'r', encoding='UTF-8')
        outputs = open(outfilename, 'w', encoding='UTF-8')

        # Write the segmented result to the output file
        for line in inputs:
            line_seg = self.seg_depart(line)
            outputs.write(line_seg + '\n')
            print("------------------- Segmenting and removing stop words -----------")
        outputs.close()
        inputs.close()
        print("Stop-word removal and segmentation done!")

        self.LyricAnalysis()

    # Word-frequency statistics
    def splitSentence(self):
        # Print the 100 most frequent words as (word, count) tuples
        f = open(self.outputfilename, 'r', encoding='utf-8')
        a = f.read().split()
        b = sorted([(x, a.count(x)) for x in set(a)], key=lambda x: x[1], reverse=True)
        for i in range(0, 100):
            print("(" + '"' + b[i][0] + '"' + "," + str(b[i][1]) + ')' + ',')

    # Per-character frequency; also calls splitSentence for the top-100 word analysis
    def LyricAnalysis(self):
        import jieba
        file = self.outputfilename
        # Note the trick: all lines are pulled into one string
        alllyric = str([line.strip() for line in open(file, encoding="utf-8").readlines()])
        # The whole text on one line, punctuation stripped
        alllyric1 = alllyric.replace("'", "").replace(" ", "").replace("?", "").replace(",", "").replace('"', '').replace("?", "").replace(".", "").replace("!", "").replace(":", "")
        # print(alllyric1)
        self.splitSentence()
        # Single-character frequency count below
        import collections
        # Read the file and split it into a list of characters
        f = open(file, 'r', encoding='utf8')
        txt1 = f.read()
        txt1 = txt1.replace('\n', '')  # drop newlines
        txt1 = txt1.replace(' ', '')   # drop spaces
        txt1 = txt1.replace('.', '')   # drop periods
        txt1 = txt1.replace('o', '')   # drop stray 'o' characters
        mylist = list(txt1)
        mycount = collections.Counter(mylist)
        for key, val in mycount.most_common(10):  # top 10 characters, in order
            print("Character frequency:")
            print(key, val)

# Input file: newcomment.txt, output file: test.txt
AAA = worldAnalysis("newcomment.txt", "test.txt")

The output is shown below. It is printed in this tuple form because pyecharts consumes exactly these pairs for the word cloud.
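As an alternative to copy-pasting the printed tuples into the words list below, the pairs could be built programmatically. A minimal sketch, assuming the segmented text is in test.txt as above:

# Build the pyecharts word list directly instead of copy-pasting printed tuples
f = open("test.txt", 'r', encoding='utf-8')
a = f.read().split()
words = sorted([(x, a.count(x)) for x in set(a)], key=lambda x: x[1], reverse=True)[:100]
f.close()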

 

Step three: word-cloud visualization

from pyecharts import options as opts
from pyecharts.charts import Page, WordCloud
from pyecharts.globals import SymbolType
from pyecharts.charts import Bar

from pyecharts.render import make_snapshot
from snapshot_selenium import snapshot

words = [
("經典",100),
("喜歡",65),
("情景喜劇",47),
("喜劇",42),
("搞笑",37),
("武林",36),
("如今",33),
("",29),
("外傳",29),
("不少",29),
("點穴",28),
("葵花",28),
("電視劇",28),
("",27),
("以爲",27),
("排山倒海",26),
("真的",25),
("",25),
("",24),
("一個",23),
("小時候",23),
("",22),
("好看",21),
("這部",21),
("一部",20),
("每一個",20),
("掌櫃",19),
("臺詞",19),
("回憶",19),
("看過",18),
("裏面",18),
("百看不厭",18),
("童年",18),
("秀才",18),
("國產",17),
("",17),
("中國",17),
("",16),
("很是",16),
("一集",15),
("",15),
("寧財神",15),
("沒有",15),
("不錯",14),
("不會",14),
("道理",14),
("重溫",14),
("",14),
("演員",14),
("",13),
("",13),
("哈哈哈",13),
("人生",13),
("老白",13),
("人物",12),
("故事",12),
("",12),
("情景劇",11),
("開心",11),
("感受",11),
("以後",11),
("",11),
("",11),
("幽默",11),
("每次",11),
("角色",10),
("",10),
("",10),
("客棧",10),
("看看",10),
("發現",10),
("生活",10),
("江湖",10),
("",10),
("記得",10),
("起來",9),
("特別",9),
("劇情",9),
("一直",9),
("一遍",9),
("印象",9),
("看到",9),
("很差",9),
("當時",9),
("最近",9),
("歡樂",9),
("知道",9),
("芙蓉",8),
("之做",8),
("絕對",8),
("沒法",8),
("十年",8),
("依然",8),
("巔峯",8),
("好像",8),
("長大",8),
("深入",8),
("無聊",8),
("之前",7),
("時間",7),
    
]


def wordcloud_base() -> WordCloud:
    c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape="triangle-forward")
        .set_global_opts(title_opts=opts.TitleOpts(title="WordCloud - basic example"))
    )
    return c
make_snapshot(snapshot, wordcloud_base().render(), "bar.png")
# wordcloud_base().render()

 

 

 

Part 2: visualizing user locations

 

 

City heat map of user locations

Script notes: the entries fed to the chart must be city names that the China map recognises. The raw location data was wrangled with a bit of shell tooling plus the small Python script below.

 

Data processing

f=open("city.txt",'r')
for i in f.readlines():
        #print(i,end=",")
        print('"'+i.strip()+'"',end=",")

 

 

from pyecharts import options as opts
from pyecharts.charts import Geo
from pyecharts.globals import ChartType, SymbolType

def geo_base() -> Geo:
    c = (
        Geo()
        .add_schema(maptype="china")
        .add("geo", [list(z) for z in zip(["北京","廣東","上海","廣州","江蘇","四川","武漢","湖北","深圳","成都","浙江","山東","福建","南京","福州","河北","江西","南寧","杭州","湖南","長沙","河南","鄭州","蘇州","重慶","濟南","黑龍江","石家莊","西安","南昌","陝西","哈爾濱","吉林","廈門","天津","瀋陽","香港","青島","無錫","貴州"], ["86","52","42","29","26","20","16","16","16","16","13","12","12","12","8","7","7","7","7","6","6","6","6","6","6","6","5","5","5","5","5","5","4","4","4","4","3","3","3","3"])])
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(),
            title_opts=opts.TitleOpts(title="City heat map"),
        )
    )
    return c
geo_base().render()

 

 

 

Funnel chart. Because of page-layout constraints, the city list here has been trimmed considerably.

from pyecharts import options as opts
from pyecharts.charts import Funnel


def funnel_base() -> Funnel:
    c = (
        Funnel()
        .add("geo", [list(z) for z in zip(
            ["北京", "廣東", "上海", "廣州", "江蘇", "四川", "武漢", "湖北", "深圳", "成都", "浙江", "山東", "福建", "南京", "福州", "河北", "江西", "南寧",
             "杭州", "湖南", "長沙", "河南", "鄭州", "蘇州", "重慶", "濟南"],
            ["86", "52", "42", "29", "26", "20", "16", "16", "16", "16", "13", "12", "12", "12", "8", "7", "7", "7",
             "7", "6", "6", "6", "6", "6", "6", "6"])])
        .set_global_opts(title_opts=opts.TitleOpts())
    )
    return c
funnel_base().render('漏斗圖.html')

 

 

 

 

 

Pie chart

from pyecharts import options as opts
from pyecharts.charts import Pie


def pie_base() -> Pie:
    c = (
        Pie()
        .add("", [list(z) for z in zip(
            ["北京", "廣東", "上海", "廣州", "江蘇", "四川", "武漢", "湖北", "深圳", "成都", "浙江", "山東", "福建", "南京", "福州", "河北", "江西", "南寧",
             "杭州", "湖南", "長沙", "河南", "鄭州", "蘇州", "重慶", "濟南"],
            ["86", "52", "42", "29", "26", "20", "16", "16", "16", "16", "13", "12", "12", "12", "8", "7", "7", "7",
             "7", "6", "6", "6", "6", "6", "6", "6"])])
        .set_global_opts(title_opts=opts.TitleOpts())
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c

pie_base().render("餅圖.html")

 

 

The comment sentiment-analysis code is as follows:

from snownlp import SnowNLP

f = open("comment.txt", 'r')
sentiments = 0
count = 0

point2 = 0
point3 = 0
point4 = 0
point5 = 0
point6 = 0
point7 = 0
point8 = 0
point9 = 0
for i in f.readlines():
    s = SnowNLP(i)
    # Score each sentence of the comment separately
    for p in s.sentences:
        s1 = SnowNLP(p)
        count += 1
        if s1.sentiments > 0.9:
            point9 += 1
        elif s1.sentiments > 0.8 and s1.sentiments <= 0.9:
            point8 += 1
        elif s1.sentiments > 0.7 and s1.sentiments <= 0.8:
            point7 += 1
        elif s1.sentiments > 0.6 and s1.sentiments <= 0.7:
            point6 += 1
        elif s1.sentiments > 0.5 and s1.sentiments <= 0.6:
            point5 += 1
        elif s1.sentiments > 0.4 and s1.sentiments <= 0.5:
            point4 += 1
        elif s1.sentiments > 0.3 and s1.sentiments <= 0.4:
            point3 += 1
        elif s1.sentiments > 0.2 and s1.sentiments <= 0.3:
            point2 += 1  # the original had "point2 = 1", which only ever counted one hit
        print(s1.sentiments)
        sentiments += s1.sentiments

print(sentiments)
print(count)
avg1 = sentiments / count  # don't truncate the running total to an int
print(avg1)


print(point9)
print(point8)
print(point7)
print(point6)
print(point5)
print(point4)
print(point3)
print(point2)

 

Sentiment visualization
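A hedged sketch of charting the point2 through point9 buckets with the same pyecharts Bar API used for the character chart below; it assumes it runs in the same script as the sentiment code above, where those variables are defined:

from pyecharts.charts import Bar
from pyecharts import options as opts

# Bucket labels for the sentiment score ranges counted above
buckets = ["0.2-0.3", "0.3-0.4", "0.4-0.5", "0.5-0.6", "0.6-0.7", "0.7-0.8", "0.8-0.9", ">0.9"]
values = [point2, point3, point4, point5, point6, point7, point8, point9]

bar = (
    Bar()
    .add_xaxis(buckets)
    .add_yaxis("sentence count", values)
    .set_global_opts(title_opts=opts.TitleOpts(title="Sentiment distribution"))
)
bar.render("sentiment.html")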

 

Heat chart of the main characters

cat comment.txt  | grep -E  '佟|掌櫃|湘玉|閆妮'  | wc -l 
33
cat comment.txt  | grep -E  '老白|展堂|盜聖'  | wc -l 
25
cat comment.txt  | grep -E  '大嘴'  | wc -l 
8
cat comment.txt  | grep -E  '小郭|郭|芙蓉'  | wc -l 
17
cat comment.txt  | grep -E  '秀才|呂輕侯'  | wc -l 
17
cat comment.txt  | grep -E  '小六'  | wc -l 
2

 

from pyecharts.charts import Bar
from pyecharts import options as opts

# The V1 API supports chained calls
bar = (
    Bar()
    # x labels ordered to match the grep counts above
    .add_xaxis(["佟湘玉", "老白", "大嘴", "小郭", "秀才", "小六"])
    .add_yaxis("Character mentions", [33, 25, 8, 17, 17, 2])
    .set_global_opts(title_opts=opts.TitleOpts(title="Character mentions"))
    # Or pass plain dict parameters instead
    # .set_global_opts(title_opts={"text": "main title", "subtext": "subtitle"})
)
bar.render("人物熱力.html")

 
