Final Project

Date: 2019-11-25


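I. Regression Models and House-Price Prediction

Requirements:

1. Load the Boston housing dataset.

2. Univariate linear regression: build a model that predicts house price from one variable, and plot it.

3. Multiple linear regression: build a model that predicts house price from all 13 variables, evaluate how good the model is, and plot the evaluation results.

4. Univariate polynomial regression: build a model that predicts house price from one variable, and plot it.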
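
As a minimal sketch of what item 2 involves, the line fitted by univariate least squares can be written in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The data points below are made up for illustration (they are not taken from the Boston dataset), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (w, b) minimising the squared error of y ~ w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a factor of n)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (variance up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Made-up points lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

On real data the points will not lie on a line, and the same formulas give the line with the smallest total squared error.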

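# Load the Boston housing dataset
# (load_boston was removed in scikit-learn 1.2, so this needs an older release)
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)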

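# Univariate linear regression: predict house price from one variable and plot the fit
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

x = boston.data[:, 5]  # column 5 is RM, the average number of rooms per dwelling
y = boston.target
LinR = LinearRegression()
LinR.fit(x.reshape(-1, 1), y)
w = LinR.coef_       # slope, an array of shape (1,)
b = LinR.intercept_  # intercept
print(w, b)

plt.scatter(x, y)
plt.plot(x, w * x + b, 'orange')
plt.show()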

# Univariate linear regression on column 12 (LSTAT); the 13-variable model is built further below
x = boston.data[:, 12].reshape(-1, 1)
y = boston.target
plt.figure(figsize=(10, 6))
plt.scatter(x, y)

lineR = LinearRegression()
lineR.fit(x, y)
y_pred = lineR.predict(x)
plt.plot(x, y_pred, 'r')
print(lineR.coef_, lineR.intercept_)
plt.show()

# Univariate polynomial regression: predict house price from one variable and plot the fit

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)  # columns: 1, x, x^2, x^3
print(x_poly)
lrp = LinearRegression()
lrp.fit(x_poly, y)
y_poly_pred = lrp.predict(x_poly)
plt.scatter(x, y)            # observed prices
plt.scatter(x, y_pred)       # linear fit
plt.scatter(x, y_poly_pred)  # cubic fit
plt.show()
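
To make the degree-3 expansion concrete: for a single input column, PolynomialFeatures(degree=3) with its default settings produces the columns 1, x, x^2, x^3, and the linear model is then fitted on those columns. The sketch below reproduces that expansion with NumPy only; the sample values and the `x_demo` names are made up for illustration.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0])  # made-up input values

# np.vander with increasing=True gives the columns x^0, x^1, x^2, x^3
x_demo_poly = np.vander(x_demo, N=4, increasing=True)
print(x_demo_poly)
```

Polynomial regression is therefore still linear regression; only the feature matrix changes.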

# Multiple linear regression
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Boston housing dataset
data = load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)

# Build the multiple linear regression model
mlr = LinearRegression()
mlr.fit(x_train, y_train)
print('Coefficients', mlr.coef_, "\nIntercept", mlr.intercept_)

# Evaluate the model
# (sklearn.metrics.regression is not available in current releases; import the metrics directly)
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_predict = mlr.predict(x_test)
# Compute the prediction metrics
print("Mean squared error:", mean_squared_error(y_test, y_predict))
print("Mean absolute error:", mean_absolute_error(y_test, y_predict))
# Print the model score (R^2)
print("Model score:", mlr.score(x_test, y_test))

# Multivariate polynomial regression
# Expand the features into polynomial terms
poly2 = PolynomialFeatures(degree=2)
x_poly_train = poly2.fit_transform(x_train)
x_poly_test = poly2.transform(x_test)

# Build the model
mlrp = LinearRegression()
mlrp.fit(x_poly_train, y_train)

# Predict
y_predict2 = mlrp.predict(x_poly_test)

# Evaluate the model
# Compute the prediction metrics
print("Mean squared error:", mean_squared_error(y_test, y_predict2))
print("Mean absolute error:", mean_absolute_error(y_test, y_predict2))
# Print the model score (R^2)
print("Model score:", mlrp.score(x_poly_test, y_test))
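
For reference, the three figures printed above (mean squared error, mean absolute error, and the model score, which for a regressor is R^2) can be computed by hand. The following is a sketch with made-up values; the function names `mse`, `mae`, and `r2` are hypothetical helpers, not scikit-learn functions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up true values and predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]
print(mse(y_true_demo, y_pred_demo), mae(y_true_demo, y_pred_demo), r2(y_true_demo, y_pred_demo))
```

Lower MSE and MAE are better, while a higher R^2 is better, with 1.0 meaning a perfect fit.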

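Output:

Multiple regression model built on the 13 variables and house price

Evaluating the linear regression model

Conclusion:

Comparing the univariate linear regression model with the multiple linear regression model shows that the multiple-regression fit follows the distribution of the sample points more closely than the straight line of the univariate model. The multiple linear regression model is therefore the better of the two: it performs better and has smaller error.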
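
The comparison in the conclusion can be illustrated on synthetic data. The sketch below is an assumption-laden illustration, not the author's experiment: the three features and the target are randomly generated, `r2_of_fit` is a hypothetical helper, and ordinary least squares is solved with NumPy instead of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_syn = rng.normal(size=(n, 3))  # three synthetic features
# Synthetic target that depends on all three features plus a little noise
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + 0.5 * X_syn[:, 2] + rng.normal(scale=0.1, size=n)

def r2_of_fit(features, target):
    """Fit ordinary least squares with an intercept and return R^2 on the same data."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    return 1 - (residual @ residual) / np.sum((target - target.mean()) ** 2)

r2_single = r2_of_fit(X_syn[:, [0]], y_syn)  # one feature only
r2_multi = r2_of_fit(X_syn, y_syn)           # all three features
print(r2_single, r2_multi)
```

R^2 measured on the training data never decreases when features are added, so a held-out test split, as used in the code above, is the fairer basis for the comparison.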

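II. Chinese Text Classification

Requirements:

Download the dataset that corresponds to the last digit of your student ID.

258: home, education, technology, society, fashion

Build a Chinese text classification model that assigns each text to one of these categories.

The basic steps are as follows:

1. Collect the files for each category and write them out.

2. Remove noise, e.g. convert formats, strip symbols, and normalise the text as a whole.

3. Iterate over every text file in every folder.

4. Segment the Chinese text with jieba.

Chinese word segmentation splits a sentence into individual words. The same character sequence can be split differently depending on context, so segmentation quality matters a great deal.

Use jieba.add_word('word') to add a single word, and jieba.load_userdict('wordDict.txt') to load a whole dictionary.

Maintain a custom dictionary.

5. Remove stop words.

Maintain a stop-word list.

6. Compute term weights for the processed text with the TF-IDF algorithm.

7. Predict the category with naive Bayes.

8. Evaluate the model.

9. Predict the category of new text.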
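
Step 6 can be sketched by hand to show what a TF-IDF weight measures. This is the textbook variant (term frequency times the log of the inverse document frequency), written with the standard library only; the three documents are made up and assumed to be already segmented, and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

# Made-up documents, already segmented into space-separated words
docs = [
    "手機 芯片 手機",
    "芯片 軟件",
    "學校 老師",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs
doc_freq = Counter(word for words in tokenized for word in set(words))

def tfidf(words):
    """Textbook weights: (count / document length) * log(n_docs / document frequency)."""
    counts = Counter(words)
    return {w: (c / len(words)) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

weights = [tfidf(words) for words in tokenized]
print(weights[0])
```

A word that is frequent in one document but rare across the collection gets a high weight. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its numbers differ from this sketch.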

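Points to note during processing:

While experimenting, start with a few files and scale up; process the full set only after debugging is done.
Check the file size to decide how to read each file.
Save intermediate results so that the files do not have to be read and processed from scratch on every run.
Process in batches when memory runs short.
Save arrays with np.save('x1.npy',x1), load them with np.load('x1.npy'), and join them with np.concatenate((x1,x2),axis=0).
Free large blocks promptly with del(x1) and reclaim memory with gc.collect().
Save data as you go rather than all at once at the end; if an exception occurs midway, all the work is lost.
Use Python exception handling to record failing files separately so the program can keep running, then go back and handle those files on their own.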
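
The batching advice above can be sketched as follows. The batch contents, the temporary output folder, and the file names are made up for illustration; only np.save, np.load, np.concatenate, del, and gc.collect come from the notes.

```python
import gc
import os
import tempfile

import numpy as np

out_dir = tempfile.mkdtemp()  # stand-in for a real output folder
batch_paths = []

# Process the data in batches and save each batch as soon as it is ready
for i in range(3):
    batch = np.full((2, 4), i, dtype=np.float64)  # made-up batch result
    batch_path = os.path.join(out_dir, f"x{i}.npy")
    np.save(batch_path, batch)
    batch_paths.append(batch_path)
    del batch     # drop the reference to the large array
    gc.collect()  # ask the garbage collector to reclaim memory now

# Later: reload the saved batches and join them row-wise
combined = np.concatenate([np.load(p) for p in batch_paths], axis=0)
print(combined.shape)
```

Because every batch is on disk before the next one starts, a crash in a later batch loses only that batch.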

Before starting a long unattended run, turn off Windows automatic updates, the screen saver, automatic shutdown, and so on.
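
For an unattended run, the note about recording failing files matters most, since nobody is there to restart the program. A minimal sketch, assuming UTF-8 text files: the two files are created in a temporary folder for the demonstration, and the `contents` and `failed` names are hypothetical.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
good_path = os.path.join(work_dir, "good.txt")
bad_path = os.path.join(work_dir, "bad.txt")
with open(good_path, "w", encoding="utf-8") as fh:
    fh.write("正常的文本")
with open(bad_path, "wb") as fh:
    fh.write(b"\xff\xfe\xfa")  # bytes that are not valid UTF-8

contents = {}
failed = []  # (path, error message) for every file that could not be read

for file_path in [good_path, bad_path]:
    try:
        with open(file_path, encoding="utf-8") as fh:
            contents[file_path] = fh.read()
    except (OSError, UnicodeDecodeError) as exc:
        # Record the failure and carry on with the next file
        failed.append((file_path, str(exc)))

print(len(contents), "read,", len(failed), "failed")
```

The `failed` list can be written to disk at the end of the run and the listed files handled separately.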

Code:

# Import the data
import os
import numpy as np
import sys
from datetime import datetime
import gc
path = 'C:\\Users\\s2009\\Desktop\\dididi\\258'

# Import jieba
import jieba
# Load the stop words into a variable
with open(r'258\stopsCN.txt', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

# Function that processes one text (a string)
def processing(tokens):
    # Drop every character that is not a letter or a Chinese character
    tokens = "".join([char for char in tokens if char.isalpha()])
    # Segment with jieba, keeping words of at least 2 characters
    tokens = [token for token in jieba.cut(tokens, cut_all=True) if len(token) >= 2]
    # Remove stop words
    tokens = " ".join([token for token in tokens if token not in stopwords])
    return tokens
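
The first step of the function relies on str.isalpha(), which is true for Chinese characters as well as Latin letters. A small sketch on a made-up string shows what the filter keeps and what it drops:

```python
# Made-up sample mixing digits, Chinese text, punctuation, and English
sample = "2019年，房價上漲3%！Hello, world"

# Keep only characters for which isalpha() is true
letters_only = "".join(ch for ch in sample if ch.isalpha())
print(letters_only)
```

Digits, punctuation, and whitespace are all removed, so English words separated only by spaces are fused together. That is harmless for a corpus that is almost entirely Chinese, but worth knowing for mixed text.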

# Collect the processed texts and their labels in two lists
tokenList = []
targetList = []
# os.walk yields (root directory, subdirectory names, file names); join root and file name to get each article's path
for root, dirs, files in os.walk(path):
    for name in files:
        filePath = os.path.join(root, name)
        with open(filePath, encoding='utf-8') as fh:
            content = fh.read()
        # The parent folder name is the news category; record it and process the article
        target = filePath.split('\\')[-2]
        targetList.append(target)
        tokenList.append(processing(content))


# Split into training and test sets, then compute term weights with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
x_train, x_test, y_train, y_test = train_test_split(tokenList, targetList, test_size=0.2, stratify=targetList)
# Vectorise the data: build the feature vectors with TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)
# Build the model: multinomial naive Bayes, trained with fit
mnb = MultinomialNB()
module = mnb.fit(X_train, y_train)

# Predict on the test set
y_predict = module.predict(X_test)
# Print the accuracy (5-fold cross-validation run on the test split, which refits copies of mnb)
scores = cross_val_score(mnb, X_test, y_test, cv=5)
print("Accuracy:%.3f" % scores.mean())
# Print the classification report; the true labels go first, then the predictions
print("classification_report:\n", classification_report(y_test, y_predict))
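
Step 9 of the outline, predicting the category of a new text, is not covered by the code above. A minimal sketch, assuming the new text has gone through the same cleaning and segmentation as the training texts: the tiny training set, the labels, and the `demo_` names are made up so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts, already segmented into space-separated words
train_texts = ["手機 芯片 系統", "電腦 芯片 軟件", "學校 老師 學生", "考試 學生 課程"]
train_labels = ["科技", "科技", "教育", "教育"]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB()
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

# A new text is vectorised with transform (not fit_transform),
# so it is mapped onto the vocabulary learned from the training set
new_text = "芯片 手機"
predicted = demo_model.predict(demo_vectorizer.transform([new_text]))
print(predicted[0])
```

In the real pipeline the new article would be cleaned and segmented the same way as the training articles before being vectorised and passed to the trained model.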



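# Compare the predictions with the actual labels
import collections
import matplotlib.pyplot as plt
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['FangSong']  # default font that can render Chinese
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from showing as a box in saved figures

# Count the articles per category in the test labels and in the predictions
testCount = collections.Counter(y_test)
predCount = collections.Counter(y_predict)
print('Actual:', testCount, '\n', 'Predicted:', predCount)


# Build the label list, the actual-count list, and the predicted-count list
nameList = list(testCount.keys())
testList = list(testCount.values())
# Look up predicted counts by category name so they line up with nameList
predictList = [predCount[name] for name in nameList]
x = list(range(len(nameList)))
print("Categories:", nameList, '\n', "Actual:", testList, '\n', "Predicted:", predictList)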
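
One thing to watch when comparing two counters side by side: a Counter keeps its keys in first-seen order, so the values() of two counters built from different sequences need not line up, and a category that was never predicted is missing altogether. The sketch below uses made-up labels to show how looking counts up by name keeps the lists aligned.

```python
from collections import Counter

# Made-up labels; the category "時尚" is never predicted
actual = ["科技", "教育", "科技", "時尚"]
predicted_labels = ["教育", "科技", "科技", "科技"]

actual_count = Counter(actual)
pred_count = Counter(predicted_labels)

names = list(actual_count.keys())
actual_values = [actual_count[name] for name in names]
# A Counter returns 0 for a missing key, so every category gets an entry
pred_values = [pred_count[name] for name in names]
print(names, actual_values, pred_values)
```

Here `list(pred_count.values())` would have only two entries, in a different category order, which would mislabel a chart drawn from it.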