Reading Notes on "Practical Data Analysis Cookbook" (Tomasz Drabas), Chapter 2: Variable Distributions, Correlations, and Charts

Date: 2020-03-29


Chapter 2: Exploring the Data

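This chapter introduces techniques that help you understand your data and explore the relationships between features. You will learn about the following topics:

· Producing descriptive statistics

· Exploring correlations between features

· Visualizing the interactions between features

· Producing histograms

· Creating multivariate charts

· Sampling the data

· Splitting the dataset into training, cross-validation, and testing sets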

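2.1 Introduction

The recipes that follow use Python and D3.js to build an understanding of the data. We analyze the distributions of the variables, untangle the relationships between features, and visualize their interactions. You will learn how to produce histograms and how to show three-dimensional data on a two-dimensional chart. Finally, you will learn how to draw stratified samples and split a dataset into testing and training sets.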

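2.2 Producing descriptive statistics
To fully understand the distribution of any random variable, we need to know its mean and standard deviation, minimum and maximum, median, quartiles, skewness, and kurtosis.
Install pandas separately

pip install pandas

Producing descriptive statistics, simple example 1: using pandas

pandas has a very handy .describe() method that does most of the work for us. It produces most of the descriptive statistics we want; the code below calls it on a subset of columns and then adds the statistics it does not cover: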

import pandas as pd

# name of the file to read from
r_filenameCSV = '../../Data/Chapter02/' + \
    'realEstate_trans_full.csv'

# name of the output file
w_filenameCSV = '../../Data/Chapter02/' + \
    'realEstate_descriptives.csv'

# read the data
csv_read = pd.read_csv(r_filenameCSV)

# calculate the descriptives: count, mean, std,
# min, 25%, 50%, 75%, max
# for a subset of columns
csv_desc = csv_read[
    [
        'beds','baths','sq__ft','price','s_price_mean',
        'n_price_mean','s_sq__ft','n_sq__ft','b_price',
        'p_price','d_Condo','d_Multi-Family',
        'd_Residential','d_Unkown'
    ]
].describe().transpose()

# and add skewness, mode and kurtosis
csv_desc['skew'] = csv_read.skew(numeric_only=True)
# mode() may return several rows when values tie; keep the first row
csv_desc['mode'] = \
    csv_read.mode(numeric_only=True).iloc[0]
csv_desc['kurtosis'] = csv_read.kurt(numeric_only=True)

print(csv_desc)

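The index of the DataFrame returned by .describe() holds the names of the descriptive statistics, and each column stands for one variable in our dataset. We are still missing skewness, kurtosis, and the mode, however. To make them easier to add to csv_desc, we transposed the output of .describe() with .transpose(), so that the variables sit in the index and each column holds one descriptive statistic.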
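To see the effect of the transpose without the book's CSV file, here is a minimal sketch on a made-up DataFrame. The toy values and the names toy and desc are assumptions for illustration, not data from the book.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of the real-estate file
toy = pd.DataFrame({
    'beds':  [2, 3, 3, 4, 3, 2],
    'price': [100, 150, 160, 300, 155, 90],
})

# After the transpose: one row per variable, one column per statistic
desc = toy.describe().transpose()

# skew() and kurt() return one value per column, aligned on the index of desc
desc['skew'] = toy.skew()
desc['kurtosis'] = toy.kurt()

# mode() can return several rows when values tie; keep the first row
desc['mode'] = toy.mode().iloc[0]

print(desc[['mean', 'std', 'skew', 'kurtosis', 'mode']])
```

Because the variables are in the index, each new statistic is a plain column assignment that pandas aligns by variable name.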

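Simple example 2: using SciPy and NumPy

Install SciPy separately
pip install SciPy

The .genfromtxt(...) method takes the file name as its first (and only required) argument. In this example the delimiter is ','; it could also be \t. Setting the names parameter to True means the variable names are stored in the first row. Finally, the usecols parameter specifies which columns of the file are stored in the csv_read object.

With that, the required statistics can be computed: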

import scipy.stats as st
import numpy as np

# name of the file to read from
r_filenameCSV = '../../Data/Chapter02/' + \
    'realEstate_trans_full.csv'

# read the data
csv_read = np.genfromtxt(
    r_filenameCSV, 
    delimiter=',',
    names=True,
    # only numeric columns
    usecols=[4,5,6,8,11,12,13,14,15,16,17,18,19,20]
)

# calculate the descriptives
desc = st.describe([list(item) for item in csv_read])

# and print out to the screen
print(desc)

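The data created by .genfromtxt(...) is a sequence of tuples. The .describe(...) method accepts only list-like data, so each tuple first has to be converted to a list (using a list comprehension).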
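The object returned by scipy.stats.describe can be inspected field by field. The sketch below uses a small made-up array (the values and the name obs are assumptions) rather than the book's file.

```python
import numpy as np
import scipy.stats as st

# Hypothetical data: 5 observations (rows) of 2 variables (columns)
obs = np.array([
    [2, 100.0],
    [3, 150.0],
    [3, 160.0],
    [4, 300.0],
    [3, 155.0],
])

# axis=0 (the default) summarizes each column separately
res = st.describe(obs)

print(res.nobs)      # number of observations
print(res.minmax)    # (array of minima, array of maxima)
print(res.mean)      # per-column means
print(res.variance)  # per-column sample variance
print(res.skewness, res.kurtosis)
```

Unlike pandas' .describe(), this result has no median or quartiles, so the two approaches complement each other.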

http://docs.scipy.org/doc/scipy/reference/stats.html#statistical-functions

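2.3 Exploring correlations between features

The correlation coefficient between two variables measures the strength of the relationship between them.

A coefficient of 1 means the two variables are perfectly correlated; a coefficient of -1 means the second variable is perfectly negatively correlated with the first; a coefficient of 0 means there is no measurable relationship between the two.

One basic fact needs stressing here: two variables being correlated does not imply a causal relationship between them. For more, see https://web.cn.edu/kwheeler/logic_causation.html.

We will measure the correlations between the number of bedrooms, the number of bathrooms, the floor area, and the price of the properties.

How it works: pandas can compute three kinds of correlation: the Pearson product-moment correlation coefficient, the Kendall rank correlation coefficient, and the Spearman rank correlation coefficient. The latter two are less sensitive to random variables that are not normally distributed.

We compute all three coefficients and store the results in the csv_corr variable.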

import pandas as pd 

# name of the file to read from
r_filenameCSV = '../../Data/Chapter02/' + \
    'realEstate_trans_full.csv'

# name of the output file
w_filenameCSV = '../../Data/Chapter02/' + \
    'realEstate_corellations.csv'

# read the data and select only 4 variables
csv_read = pd.read_csv(r_filenameCSV)
csv_read = csv_read[['beds','baths','sq__ft','price']]

# calculate the correlations
# Pearson product-moment, Kendall rank, and Spearman rank correlation coefficients
coefficients = ['pearson', 'kendall', 'spearman']

csv_corr = {}

for coefficient in coefficients:
    csv_corr[coefficient] = csv_read \
        .corr(method=coefficient) \
        .transpose()

# output to a file
with open(w_filenameCSV,'w') as write_csv:
    for corr in csv_corr:
        write_csv.write(corr + '\n')
        write_csv.write(csv_corr[corr].to_csv(sep=','))
        write_csv.write('\n')

Reference: NumPy can also compute the Pearson correlation coefficient: http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html.
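The difference between the three coefficients shows up on data that is monotonic but not linear. The following sketch uses made-up values (x and its cube, an assumption for illustration) and compares pandas with numpy.corrcoef. Note that pandas relies on SciPy for the Kendall coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically, but not linearly, with x
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
df['y'] = df['x'] ** 3

# One coefficient per method for the (x, y) pair
corr = {m: df.corr(method=m).loc['x', 'y']
        for m in ['pearson', 'kendall', 'spearman']}
print(corr)

# numpy.corrcoef returns the Pearson correlation matrix of the two series
pearson_np = np.corrcoef(df['x'], df['y'])[0, 1]
print(pearson_np)
```

The rank-based coefficients only look at the ordering of the values, so they report a perfect relationship here, while Pearson falls short of 1 because the relationship is not a straight line.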

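2.4 Visualizing the interactions between features
D3.js is a powerful framework by Mike Bostock for visualizing data. It lets you manipulate data using HTML, SVG, and CSS. This recipe explores whether there is a relationship between the floor area of a house and its price.

The code has two parts: preparing the data (pandas and SQLAlchemy) and presenting the data (HTML and D3.js).

Section 1.7 of the previous chapter already wrote the CSV data into MySQL; the code is as follows: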

import pandas as pd
import sqlalchemy as sa
# from sqlalchemy.ext.declarative import declarative_base
# from sqlalchemy import create_engine

# name of the CSV file to read from and SQLite database
r_filenameCSV = '../../Data/Chapter01/realEstate_trans.csv'
rw_filenameSQLite = '../../Data/Chapter01/realEstate_trans__Result.db'

# create the connection to the database
engine = sa.create_engine("mysql+pymysql://root:downmoon@localhost:3306/test?charset=utf8")

#===============================================================================
# Base = declarative_base()
# engine = sa.create_engine("mysql+pymysql://root:downmoon@localhost:3306/test?charset=utf8")
#
# Base.metadata.reflect(engine)
# tables = Base.metadata.tables
#
# print(tables)
# # get the real_estate table from the local test database
# real_estates = tables["real_estate"]
#
# # list the table names known to the engine
# print(engine.table_names())
#===============================================================================

# read the data
csv_read = pd.read_csv(r_filenameCSV)

# transform sale_date to a datetime object
csv_read['sale_date'] = pd.to_datetime(csv_read['sale_date'])

# store the data in the database
csv_read.to_sql('real_estate', engine, if_exists='replace')

# print the top 5 rows from the database
query = 'SELECT * FROM real_estate LIMIT 5'
top5 = pd.read_sql_query(query, engine)
print(top5)

We use SQLAlchemy to extract the data from the MySQL database. The query below (data_interaction.py file) is an example; the extracted data is saved to the /Data/Chapter02/realEstate_d3.csv file.

import pandas as pd
import sqlalchemy as sa

# names of the files to output the samples
w_filenameD3js = '../../Data/Chapter02/realEstate_d3.csv'

# database credentials
usr  = 'root'
pswd = 'downmoon'
dbname='test'

# create the connection to the database
engine = sa.create_engine(
    'mysql+pymysql://{0}:{1}@localhost:3306/{2}?charset=utf8' \
    .format(usr, pswd,dbname)
)

# read prices from the database
query = '''SELECT sq__ft,
            price / 1000 AS price
        FROM real_estate
        WHERE sq__ft > 0
        AND beds BETWEEN 2 AND 4'''
data = pd.read_sql_query(query, engine)

# output the samples to files
with open(w_filenameD3js,'w',newline='') as write_csv:
    write_csv.write(data.to_csv(sep=',', index=False))

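The framework has to be loaded before it can be used. What follows is an outline of the code that builds a scatter plot with D3.js; we go through it step by step.

How it works:
First, append an SVG (Scalable Vector Graphics) object to the HTML DOM (Document Object Model):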

        var width   = 800;
        var height  = 600;
        var spacing = 60;

        // Append an SVG object to the HTML body
        var chart = d3.select('body')
            .append('svg')
            .attr('width',  width  + spacing)
            .attr('height', height + spacing)
        ;

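Using D3.js, we select the body object from the DOM, append an SVG object to it, and set its width and height attributes. With the SVG object attached to the DOM, it is time to read the data, which the following code does: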

   // Read in the dataset (from CSV on the server)
  d3.csv('http://localhost:8080/examples/realEstate_d3.csv', function(d) {
            draw(d) });

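We read the CSV file with the method D3.js provides. The first argument of .csv(...) specifies the dataset; in this example it is a CSV file served by Tomcat (realEstate_d3.csv).
The second argument is a callback function, which calls the draw(...) function to process the data.
D3.js cannot read local files (files stored on your computer) directly. You need to set up a web server (Apache or Node.js will both do). With Apache, you put the file on the server; with Node.js, you can read the data and then pass it to D3.js.

The draw(...) function first finds the maximum price and floor area; these values define the extent of the chart axes: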

function draw(dataset){
            // Find the maximum price and area
            var limit_max = {
                'price': d3.max(dataset, function(d) {
                    return parseInt(d['price']); }),
                'sq__ft': d3.max(dataset, function(d) {
                    return parseInt(d['sq__ft']); })
            };

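We use the .max(...) method of D3.js, which returns the largest element of the array passed to it. Optionally, you can supply an accessor function that reads the data in the dataset and processes it as needed. Our anonymous function returns, for each element of the dataset, the value parsed as an integer (the data is read in as strings).
Next, define the scales: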

 // Define the scales
            var scale_x = d3.scale.linear()
                .domain([0, limit_max['price']])
                .range([spacing, width - spacing * 2]);

            var scale_y = d3.scale.linear()
                .domain([0, limit_max['sq__ft']])
                .range([height - spacing, spacing]);

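The scale.linear() method of D3.js creates a linear scale from the parameters passed in. domain specifies the extent of the data: our prices run from 0 to $884,000 (limit_max['price']) and floor areas from 0 to 5,822 (limit_max['sq__ft']). The arguments of range(...) specify how the domain maps onto the SVG window. For scale_x this means: if price = 0, the dot is placed 60 pixels from the left; if the price is $884,000, the dot is placed 800 (width) - 60 (spacing) * 2 = 680 pixels from the left.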
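The arithmetic of a linear scale can be checked outside the browser. Below is a minimal Python sketch of the same mapping; linear_scale is a hypothetical helper written for this note, not part of D3.js, and the constants are the ones from the SVG snippet and the maxima quoted in the text.

```python
def linear_scale(domain, rng):
    """Return a function mapping domain linearly onto rng (like a D3 linear scale)."""
    d0, d1 = domain
    r0, r1 = rng

    def scale(value):
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    return scale


width, height, spacing = 800, 600, 60

# Domain maxima as quoted in the text
scale_x = linear_scale((0, 884000), (spacing, width - spacing * 2))
scale_y = linear_scale((0, 5822), (height - spacing, spacing))

# Left and right ends of the x range
print(scale_x(0), scale_x(884000))

# The y range is inverted because the SVG origin is the top-left corner
print(scale_y(0), scale_y(5822))
```

The inverted y range is what makes larger floor areas appear higher up on the chart.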
Next, define the axes:

//Define the axes
            var axis_x = d3.svg.axis()
                .scale(scale_x)
                .orient('bottom')
                .ticks(5);

            var axis_y = d3.svg.axis()
                .scale(scale_y)
                .orient('left')
                .ticks(5);

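An axis needs a scale, so that is what we pass first. We want axis_x at the bottom, hence .orient('bottom'); axis_y goes on the left of the chart (.orient('left')). .ticks(5) specifies how many tick marks to show on the axis; D3.js picks suitable intervals automatically.
You can use .tickValues(...) instead of .ticks(...) to specify the tick values yourself.
Now we are ready to draw the dots on the chart: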

// Draw dots on the chart
            chart.selectAll('circle')
                .data(dataset)
                .enter()
                .append('circle')
                .attr('cx', function(d) {
                    return scale_x(parseInt(d['price']));
                })
                .attr('cy', function(d) {
                    return scale_y(parseInt(d['sq__ft']));
                })
                .attr('r', 3)
                ;

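We first select all the circles on the chart; since there are none yet, this returns an empty selection. The .data(dataset).enter() chain acts like a for loop: for every element of the dataset, it adds a dot to the chart through .append('circle'). Each circle needs three parameters: cx (horizontal position), cy (vertical position), and r (radius), all set with .attr(...). As the code shows, our anonymous functions return the scaled price and area.
With the dots drawn, we can draw the axes: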

// Append X axis to chart
            chart.append('g')
                .attr('class', 'axis')
                .attr('transform', 'translate(0,' + (height - spacing) + ')')
                .call(axis_x);
            
            // Append Y axis to chart
            chart.append('g')
                .attr('class', 'axis')
                .attr('transform', 'translate(' + spacing + ',0)')
                .call(axis_y);

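According to the D3.js documentation, adding an axis requires appending a g element first. We then set the class of the g element so that the axis is drawn thin and black (otherwise the axis and tick lines would be very thick; see the CSS part of the HTML file). The transform attribute moves the axis from the top of the chart to the bottom: the first argument of translate gives the horizontal shift and the second the vertical shift. Finally, we call axis_x. It may look odd to call a variable, but the reason is simple: axis_x is not an array but a function that generates many SVG elements. Calling it adds those elements to the g element.
Finally, add labels so that people know what the axes represent: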

// Append axis labels    
            chart.append('text')
                .attr("transform", "translate(" + (width / 2) + " ," + (height - spacing / 3) + ")")
                .style('text-anchor', 'middle')
                .text('Price $ (,000)');
                
             chart.append("text")
                .attr("transform", "rotate(-90)")
                .attr("y", 14)
                .attr("x",0 - (height / 2))
                .style("text-anchor", "middle")
                .text("Floor area sq. ft.");

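We append a text element, set its position, and anchor the text at its midpoint. The rotate transform turns the label 90 degrees counterclockwise. .text(...) sets the label text.
That gives us the chart.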

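D3.js is very capable when it comes to visualizing data. A good set of D3.js tutorials is available at http://alignedleft.com/tutorials/d3/. Mike Bostock's examples are also worth studying: https://github.com/mbostock/d3/wiki/Gallery.

Note by Yaoyue (邀月): there are plenty of other open-source charting components; this is just the book author's preference.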

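2.5 Producing histograms

Install Matplotlib separately
pip install Matplotlib
Install Seaborn separately

pip install Seaborn
A histogram helps you grasp the distribution of your data quickly. It groups the observations into bins and draws a bar for the number of observations in each bin. It is a simple but powerful tool for checking whether the data has problems and whether it follows a known distribution.
This recipe produces a histogram of all the prices in the dataset. You need pandas and SQLAlchemy to retrieve the data; Matplotlib and Seaborn handle the presentation. Matplotlib is a 2D library for presenting scientific data. Seaborn builds on Matplotlib and offers a simpler way to produce statistical charts (such as histograms).
We assume the data can be read from the MySQL database. The following code produces the histogram of prices and saves it to a PDF file (data_histograms.py file):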

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sqlalchemy as sa

# database credentials
usr  = 'root'
pswd = 'downmoon'
dbname='test'

# create the connection to the database
engine = sa.create_engine(
    'mysql+pymysql://{0}:{1}@localhost:3306/{2}?charset=utf8' \
    .format(usr, pswd,dbname)
)

# read prices from the database
query = 'SELECT price FROM real_estate'
price = pd.read_sql_query(query, engine)

# generate the histograms
# (distplot is deprecated since seaborn 0.11;
#  sns.histplot(..., kde=True) is its replacement)
ax = sns.distplot(
    price['price'],   # pass the column as a 1D series
    bins=10,
    kde=True    # show estimated kernel function
)

# set the title for the plot
ax.set_title('Price histogram with estimated kernel function')

# and save to a file
plt.savefig('../../Data/Chapter02/Figures/price_histogram.pdf')

# finally, show the plot
plt.show()

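How it works: first, we read the data from the database; the connection is set up the same way as in the earlier recipes. The price variable holds all the prices in the dataset.
Producing a histogram with Seaborn is easy and takes a single call. The .distplot(...) method takes a sequence of numbers (the price variable) as its first (and only required) argument. All other parameters are optional. The bins parameter specifies how many bins to create. The kde parameter specifies whether to show the estimated kernel density.
Kernel density estimation is a useful non-parametric technique for estimating the probability density function (PDF) of an unknown distribution. The kernel function integrates to 1 (that is, the cumulative density over the whole domain reaches a maximum of 1) and is centered at 0.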
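The claim that the estimated density integrates to 1 can be checked numerically. The sketch below uses scipy.stats.gaussian_kde on made-up prices. This is an assumption for illustration, since Seaborn's own estimate may use a different bandwidth or implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical prices (in $1000s); not taken from the book's dataset
rng = np.random.default_rng(0)
prices = rng.normal(loc=230, scale=60, size=500)

# Gaussian kernel; the bandwidth defaults to Scott's rule
kde = gaussian_kde(prices)

# The estimated density should integrate to (approximately) 1
area = kde.integrate_box_1d(-np.inf, np.inf)
print(area)

# Evaluate the density on a small grid, for example for plotting
grid = np.linspace(prices.min(), prices.max(), 5)
print(kde(grid))
```

The curve that kde=True overlays on the histogram is an estimate of this kind, evaluated on a much finer grid.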
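The .distplot(...) method returns an axes object (see http://matplotlib.org/api/axes_api.html) that serves as the canvas for our chart. The .set_title(...) method sets the chart title.
We save the chart with Matplotlib's .savefig(...) method. Its only required argument is the path and name of the file. .savefig(...) is smart enough to infer the right format from the file extension. According to the book, the accepted extensions include raw and rgba (raw RGBA bitmap), pdf, svg and svgz (Scalable Vector Graphics), eps (Encapsulated PostScript), jpeg or jpg, bmp, gif, pgf (PGF code for LaTeX), tif or tiff, and ps (PostScript); the exact set depends on the Matplotlib version and backend.
The last call, plt.show(), displays the chart on screen.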

2.6 Creating multivariate charts
Install Bokeh separately (note by Yaoyue: this is not a small package)
pip install Bokeh

/*
Installing collected packages: PyYAML, MarkupSafe, Jinja2, pillow, packaging, tornado, bokeh
Successfully installed Jinja2-2.10.3 MarkupSafe-1.1.1 PyYAML-5.2 bokeh-1.4.0 packaging-19.2 pillow-6.2.1 tornado-6.0.3
FINISHED
*/

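The previous recipe showed that few houses with fewer than two bedrooms were sold in the Sacramento area. In section 2.4 we used D3.js to show the relationship between price and floor area. This recipe adds another dimension to that two-dimensional chart: the number of bedrooms.
Bokeh is a module that combines strengths of Seaborn and D3.js: it produces attractive data visualizations (like Seaborn) and lets you interact with the chart in an HTML file (as D3.js does). Producing a similar chart with D3.js would take more code. The source code is in data_multivariate_charts.py: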

import bokeh.plotting as b
import pandas as pd

# engine is the SQLAlchemy connection created as in the earlier recipes;
# colormap is the dict of colors shown further below

# prepare the query to extract the data from the database
query = 'SELECT beds, sq__ft, price / 1000 AS price \
    FROM real_estate \
    WHERE sq__ft > 0 \
    AND beds BETWEEN 2 AND 4'

# extract the data
data = pd.read_sql_query(query, engine)

# attach the color based on the bed count
data['color'] = data['beds'].map(lambda x: colormap[x])

# create the figure and specify label for axes
fig = b.figure(title='Price vs floor area and bed count')
fig.xaxis.axis_label = 'Price ($ \'000)'
fig.yaxis.axis_label = 'Feet sq'

# and plot the data
for i in range(2,5):
    d = data[data.beds == i]

    # legend_label replaces the older legend keyword (Bokeh 1.4 and later)
    fig.circle(d['price'], d['sq__ft'], color=d['color'],
        fill_alpha=.1, size=8, legend_label='{0} beds'.format(i))

# specify the output HTML file
b.output_file(
    '../../Data/Chapter02/Figures/price_bed_area.html',
    title='Price vs floor area for different bed count'
)

# write the file and open the chart in a browser
b.show(fig)

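How it works: first, as usual, we need data; reading it from the MySQL database should be familiar by now. We then assign a color to every record, which helps us see the relationship between price and floor area for different bedroom counts. The colormap variable looks like this: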

# colors for different bed count
colormap = {
    2: 'firebrick',
    3: '#228b22',
    4: 'navy'
}

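Colors can be given as readable names or as hexadecimal values (as in the code above).
To map the bedroom count to a specific color, we use a lambda. Lambdas are a built-in Python feature that lets you use a short unnamed function, instead of a regular function, to perform a single task in place. It also makes the code more readable.
See the following tutorial to understand why lambdas are useful and where they come in handy: https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/.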
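As a small illustration of the point about lambdas, the sketch below performs the same color mapping twice, once with a lambda and once with a named function. The beds series and the bed_color function are made up for the example.

```python
import pandas as pd

colormap = {2: 'firebrick', 3: '#228b22', 4: 'navy'}

# Hypothetical bedroom counts
beds = pd.Series([2, 4, 3, 2])

# Anonymous function, defined right where it is used
colors_lambda = beds.map(lambda x: colormap[x])


# The same mapping with a regular named function
def bed_color(x):
    return colormap[x]


colors_named = beds.map(bed_color)

print(colors_lambda.tolist())
```

Both calls produce the same series; the lambda simply saves defining and naming a one-line function that is used once.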
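Now we can create the figure with Bokeh's .figure(...) method. You can specify the chart title (as we did) as well as the width and height of the figure. We also define the axis labels so that readers know what they are looking at.
Then we plot the data. We loop over the possible bedroom counts, range(2, 5), which yields 2, 3, and 4 (see the documentation of range(...): https://docs.python.org/3/library/stdtypes.html#typesseq-range). For each number, we extract a subset of the data and store it in the DataFrame object d.
For every record in d, we add a circle to the chart. The .circle(...) method takes the x and y coordinates as its first and second arguments. Since we want to see the relationship between price and area for different bedroom counts, we specify the color of the circles. The fill_alpha parameter sets the transparency of the circles; it ranges over [0, 1], where 0 is fully transparent and 1 fully opaque. The size parameter determines how big the circles are, and the legend argument (legend_label in Bokeh 1.4 and later) sets the legend entry attached to the chart.
Bokeh saves the chart as an interactive HTML file. The first argument of .output_file(...) is the file name, and the title parameter sets the title of the HTML page. The file is a complete HTML document, including JavaScript and CSS. Our code produces the chart described below.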

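The chart can be panned, zoomed, and resized.
Reference: Bokeh is a very useful visualization library. To get an idea of what it can do, have a look at the examples here: http://bokeh.pydata.org/en/latest/docs/gallery.html.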

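2.7 Sampling the data
Sometimes a dataset is too large to build a model on conveniently. For practical reasons (so that estimating the model does not take forever), it is best to draw stratified samples from the full dataset.
This recipe reads data from MongoDB and samples it with Python.
To follow this recipe you need PyMongo (note by Yaoyue: this can be replaced with MySQL), pandas, and NumPy.
There are two approaches: fix the fraction to sample (say, 20%), or fix the number of records to draw. The data_sampling.py file shows how to extract a given fraction of the data.
How it works: first, we set the sampling fraction in the strata_frac variable. We read the data from MySQL (note by Yaoyue: the book's code uses MongoDB; it was changed to MySQL). In the book's version, MongoDB returns dictionaries, and the pandas .from_dict(...) method turns them into a DataFrame object, which is more convenient to work with.
The pandas .sample(...) method is a convenient way to get a subset of the dataset. There is a catch, though: every observation has the same probability of being selected, so the distribution of a variable in the sample may not be representative of the whole dataset.
In this simple example, to avoid that catch, we loop over the values of the bedroom count and use .sample(...) to draw a sample from each subset. We can set the frac parameter to return a given share of each subset (bedroom count).
We also use the DataFrame .append(...) method: given one DataFrame object (sample in the example), it appends another DataFrame after the existing records. When ignore_index is set to True, the index values of both DataFrames are discarded and the result is renumbered from 0. Note that .append(...) was removed in pandas 2.0; pd.concat(...) is the replacement.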
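Since the fraction-based listing is not reproduced in these notes, here is a minimal sketch of the idea on made-up data. The sales DataFrame is a stand-in for the table read from the database, and pd.concat is used in place of .append(...) so that the sketch also runs on recent pandas versions; it is not the book's data_sampling.py.

```python
import pandas as pd

# Hypothetical stand-in for the sales table read from the database
sales = pd.DataFrame({
    'beds':  [2] * 20 + [3] * 50 + [4] * 30,
    'price': range(100),
})

strata_frac = 0.2   # sample 20% of each stratum

# Sample each bedroom count separately so that the proportions are preserved
parts = [
    sales[sales.beds == bed].sample(frac=strata_frac, random_state=42)
    for bed in [2, 3, 4]
]

# ignore_index=True renumbers the combined rows from 0
sample = pd.concat(parts, ignore_index=True)

print(sample['beds'].value_counts().sort_index())
```

Each stratum contributes the same share of its rows, so the bedroom distribution in the sample matches the one in the full table.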
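There's more: sometimes you want to specify the number of records to sample rather than a fraction of the original dataset. As mentioned earlier, the pandas .sample(...) method handles this case well too (data_sampling_alternative.py file).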

 1 #import pymongo
 2 import pandas as pd
 3 import numpy as np
 4 import sqlalchemy as sa
 5 
 6 # define a specific count of observations to get back
 7 strata_cnt = 200
 8 
 9 # name of the file to output the sample
10 w_filenameSample = \
11     '../../Data/Chapter02/realEstate_sample2.csv'
12 
13 # limiting sales transactions to those of 2, 3, and 4 bedroom
14 # properties
15 beds = [2,3,4]
16 
17 # database credentials
18 usr  = 'root'
19 pswd = 'downmoon'
20 dbname='test'
21  
22 # create the connection to the database
23 engine = sa.create_engine(
24     'mysql+pymysql://{0}:{1}@localhost:3306/{2}?charset=utf8' \
25     .format(usr, pswd,dbname)
26 )
27 
28 
29 query = 'SELECT zip,city,price,beds,sq__ft FROM real_estate where \
30             beds in ("2","3","4")\
31             '
32 sales = pd.read_sql_query(query, engine)
33 
34 # calculate the expected counts
35 ttl_cnt = sales['beds'].count()
36 strata_expected_counts = sales['beds'].value_counts() / \
37                          ttl_cnt * strata_cnt
38 
39 # and select the sample
40 sample = pd.DataFrame()
41 
42 for bed in beds:
43     sample = sample.append(
44         sales[sales.beds == bed] \
45         .sample(n=np.round(strata_expected_counts[bed])),
46         ignore_index=True
47     )
48 
49 # check if the counts selected match those expected
50 strata_sampled_counts = sample['beds'].value_counts()
51 print('Expected: ', strata_expected_counts)
52 print('Sampled: ', strata_sampled_counts)
53 print(
54     'Total: expected -- {0}, sampled -- {1}' \
55     .format(strata_cnt, strata_sampled_counts.sum())
56 )
57 
58 # output to the file
59 with open(w_filenameSample,'w') as write_csv:
60     write_csv.write(sample.to_csv(sep=',', index=False))

Running the code above raises an error; the traceback and the fix are as follows:

/* Traceback (most recent call last):
  File "D:\Java2018\practicalDataAnalysis\Codes\Chapter02\data_sampling_alternative_mysql.py", line 45, in <module>
    .sample(n=np.round(strata_expected_counts[bed])),
  File "D:\tools\Python37\lib\site-packages\pandas\core\generic.py", line 4970, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 847, in numpy.random.mtrand.RandomState.choice
TypeError: 'numpy.float64' object cannot be interpreted as an integer 

Fix: change .sample(n=np.round(strata_expected_counts[bed])), to .sample(n=int(np.round(strata_expected_counts[bed]))),
*/
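The fix can be tried on made-up data without a database. In the sketch below, the sales DataFrame and the target of 10 records are assumptions for illustration, and pd.concat stands in for .append(...), which no longer exists in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sales table
sales = pd.DataFrame({'beds': [2] * 20 + [3] * 50 + [4] * 30})

strata_cnt = 10   # total number of records to draw

# Expected number of records per bedroom count (floats)
expected = sales['beds'].value_counts() / len(sales) * strata_cnt

parts = []
for bed in [2, 3, 4]:
    # n must be a Python int; a numpy.float64 triggers the TypeError above
    n = int(np.round(expected.loc[bed]))
    parts.append(sales[sales.beds == bed].sample(n=n, random_state=0))

sample = pd.concat(parts, ignore_index=True)

print(sample['beds'].value_counts().sort_index())
```

Because each stratum is rounded separately, the total can differ slightly from strata_cnt on real data, which is why the original script prints expected and sampled counts side by side.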

2.8 Splitting the dataset into training, cross-validation, and testing sets
Install scikit-learn separately
pip install scikit-learn

/*
Installing collected packages: joblib, scikit-learn, sklearn
Successfully installed joblib-0.14.1 scikit-learn-0.22 sklearn-0.0
FINISHED
*/

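To build a credible statistical model, we need to be confident that it accurately abstracts the phenomenon we are dealing with. To gain that confidence, we need to test the model, and to keep the test honest, we cannot train and test on the same data.
In this recipe you will learn how to quickly split your dataset into two subsets: one for training the model and one for testing it.
To follow this recipe you need pandas, SQLAlchemy, and NumPy.
We read the data from the MySQL database into a DataFrame. Typically, 20% to 40% of the data is set aside for testing. In this example we hold out one third of the data (data_split_mysql.py file):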

import numpy as np
import pandas as pd
import sqlalchemy as sa

# specify what proportion of data to hold out for testing
test_size = 0.33

# names of the files to output the samples
w_filenameTrain = '../../Data/Chapter02/realEstate_train.csv'
w_filenameTest  = '../../Data/Chapter02/realEstate_test.csv'

# database credentials
usr  = 'root'
pswd = 'downmoon'
dbname='test'

# create the connection to the database
engine = sa.create_engine(
    'mysql+mysqlconnector://{0}:{1}@localhost:3306/{2}?charset=utf8' \
    .format(usr, pswd,dbname)
)

# read prices from the database
query = 'SELECT * FROM real_estate'
data = pd.read_sql_query(query, engine)

# create a variable to flag the training sample
data['train']  = np.random.rand(len(data)) < (1 - test_size)

# split the data into training and testing
train = data[data.train]
test  = data[~data.train]

# output the samples to files
with open(w_filenameTrain,'w') as write_csv:
    write_csv.write(train.to_csv(sep=',', index=False))

with open(w_filenameTest,'w') as write_csv:
    write_csv.write(test.to_csv(sep=',', index=False))

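How it works:
We start by specifying the proportion for the split and where to store the data: two files, one for the training set and one for the testing set.
We want the test data to be chosen at random. Here we use NumPy's pseudo-random number generator. The .rand(...) method produces an array of random numbers of the given length (len(data)). The numbers lie between 0 and 1.
We then compare these numbers with the share of data that should go to the training set (1 - test_size): if the number is smaller than that share, the record goes to the training set (the train attribute is True); otherwise it goes to the testing set (the train attribute is False).
The last two lines of the split divide the dataset into the training and testing sets. ~ is the logical "not" operator; so where the train attribute is False, negating it gives True.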
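The mask-based split can be reproduced on made-up data. In the sketch below the DataFrame is a stand-in for the table read from MySQL, and the seeded RandomState (with an arbitrary seed) is an addition that makes the split repeatable.

```python
import numpy as np
import pandas as pd

test_size = 0.33

# Hypothetical stand-in for the data read from the database
data = pd.DataFrame({'price': np.arange(1000)})

# Seeded generator so that the split is reproducible (the seed is arbitrary)
rng = np.random.RandomState(42)
data['train'] = rng.rand(len(data)) < (1 - test_size)

train = data[data.train]
test = data[~data.train]

# Every row lands in exactly one subset; the shares are only approximately 67/33
print(len(train), len(test), len(train) / len(data))
```

Because each row is assigned independently, the realized training share fluctuates around 1 - test_size rather than hitting it exactly, which matters little for large datasets.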

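Scikit-learn provides another way to split the dataset. We first divide the original dataset into two parts: the dependent variable y and the independent variables x: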

# select the independent and dependent variables
x = data[['zip', 'beds', 'sq__ft']]
y = data['price']

Then we can perform the split:

# and perform the split (sk is the scikit-learn model_selection module)
import sklearn.model_selection as sk

x_train, x_test, y_train, y_test = sk.train_test_split(
    x, y, test_size=0.33, random_state=42)

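The .train_test_split(...) method splits the dataset into complementary subsets: a training set and a testing set. Each of the two comes as a pair of datasets: one holding the dependent variable and one holding the independent variables.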
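For a quick check of what train_test_split returns, here is a minimal sketch on a made-up table that uses the same column names as the text; the values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the same column names as in the text
data = pd.DataFrame({
    'zip':    [95815, 95816, 95817, 95818, 95819, 95820],
    'beds':   [2, 3, 3, 4, 2, 3],
    'sq__ft': [850, 1200, 1350, 2100, 900, 1400],
    'price':  [120, 180, 200, 320, 130, 210],
})

x = data[['zip', 'beds', 'sq__ft']]
y = data['price']

# random_state fixes the shuffle so that the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)

print(x_train.shape, x_test.shape)
```

Unlike the random-mask approach, the subset sizes here are fixed by test_size, and x and y stay aligned row by row in both subsets.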

Tip 1:

ModuleNotFoundError: No module named 'sklearn.cross_validation'
/*
sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20, so it cannot be imported in recent versions.

Simply change cross_validation to model_selection.
 */
Tip 2:
/*  File "D:\Java2018\practicalDataAnalysis\Codes\Chapter02\data_split_alternative_mysql.py", line 40, in <module>
    y_train.reshape((x_train.shape[0], 1)), \
...
AttributeError: 'Series' object has no attribute 'reshape'

Simply change y_train.reshape((x_train.shape[0], 1)) to y_train.values.reshape((x_train.shape[0], 1)).
*/
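The reshape fix from Tip 2 in isolation, as a minimal sketch with a made-up Series:

```python
import pandas as pd

# Hypothetical target values
y_train = pd.Series([120, 180, 200, 320])

# A Series has no .reshape(); go through the underlying NumPy array instead
y_col = y_train.values.reshape((len(y_train), 1))

print(y_col.shape)
```

The result is a two-dimensional NumPy array with one column, which is the shape the original script expects.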