Reading Notes on Practical Data Analysis Cookbook (《數據分析實戰》) by Tomasz Drabas, Chapter 1: Data Formats and Data Interaction

Date: 2020-03-29


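While staying home over the Chinese New Year holiday, I read Tomasz Drabas's 《數據分析實戰》 (published June 2018). This series is my reading notes, written mainly to organize the material systematically and to help me remember it.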

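About the author
Tomasz Drabas is a data scientist at Microsoft, currently working in Seattle. He has more than 13 years of experience in data analysis, across industries including high technology, aviation, telecommunications, finance, and consulting.
In 2003, after earning a master's degree in strategic management, Tomasz began his career at LOT Polish Airlines in Warsaw, Poland. In 2007 he moved to Sydney to pursue a PhD in operations research at the School of Aviation of the University of New South Wales; his research combined discrete choice models with airline operations. While in Sydney he worked as a data analyst at Beyond Analysis Australia and as a senior data analyst/data scientist at Vodafone Hutchison Australia, among other roles. He has also published academic papers, attended international conferences, and served as a reviewer for academic journals.
In 2015 he moved to Seattle and started working at Microsoft, where he works on problems in high-dimensional feature spaces.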


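This book dives into the world of data analysis and modeling, offering a wealth of techniques that draw on many methods, tools, and algorithms.
The first part of the book teaches practical techniques for reading, writing, cleaning, reformatting, exploring, and understanding data. The second part covers more advanced topics such as classification, clustering, and prediction. The third part introduces more advanced topics still, from graph theory and natural language processing to discrete choice models and simulation.
By reading this book, you will learn to:
- Read, clean, transform, and store data with Pandas and OpenRefine
- Understand data and explore the relationships between variables with Pandas and D3.js
- Apply a variety of techniques with Pandas, mlpy, NumPy, and Statsmodels to classify and cluster a bank's marketing calls
- Reduce the dimensionality of a dataset and extract the important features with Pandas, NumPy, and mlpy
- Explore interactions in a social network with NetworkX and Gephi, and use graph-theory concepts to identify fraudulent behavior
- Learn agent-based modeling and simulation techniques through a gas station example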

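Chapter 1 explains how to read and write data using a variety of data formats and databases, and how to clean data with OpenRefine and Python.
Chapter 2 describes techniques for understanding data: how to compute the distributions of variables and the correlations between them, and how to produce a variety of charts.
Chapter 3 introduces techniques for classification problems, from the naive Bayes classifier to sophisticated neural networks and random forests.
Chapter 4 explains a range of clustering models, starting with the most common k-means algorithm and going on to the more advanced BIRCH and DBSCAN algorithms.
Chapter 5 presents many dimensionality-reduction techniques, starting from the best-known principal component analysis, moving through its kernel and randomized versions, and ending with linear discriminant analysis.
Chapter 6 covers many regression models, both linear and nonlinear. It also revisits random forests and support vector machines, which can be used for either classification or regression.
Chapter 7 explores how to process and understand time series data, and builds ARMA and ARIMA models.
Chapter 8 shows how to process, understand, visualize, and analyze graph data with NetworkX and Gephi.
Chapter 9 describes techniques for analyzing streams of text: part-of-speech tagging, topic extraction, and classification of text data.
Chapter 10 explains choice-model theory and some popular models: the multinomial logit, nested logit, and mixed logit models.
Chapter 11 covers agent-based simulation. The simulated scenarios are refueling at a gas station, electric vehicles running out of charge, and wolf-sheep predation.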

This post mainly records the Python tooling I used and the content of Chapter 1.

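(1) Tools used

1. I use Python 3.7.5, 64-bit, which can be downloaded from the official site: https://www.python.org/getit/

2. For the IDE, the book recommends Anaconda

Anaconda official download: https://www.anaconda.com/distribution/#download-section
Note: if you want to run the book's examples directly, use this distribution. If you want to learn how to deploy each component, follow the book's instructions and configure each one by hand.
Anaconda is a Python distribution with many tools built in, so they do not need to be installed separately; because it is pre-tuned, it also avoids some of the trouble that separate installs bring.
Anaconda is a freemium open-source distribution of the Python language for large-scale data processing, predictive analytics, and scientific computing, aimed at simplifying package management and deployment.
During installation, do not add it to the PATH variable; Anaconda will automatically look for the Python versions already installed on your machine.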

Mirror download within China:
https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/

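Spyder is a simple, efficient IDE, developed by the author of Python(x,y) as a simple integrated development environment for it. Compared with other Python development environments, its biggest advantage is that it mimics MATLAB's "workspace" feature, which makes it easy to inspect and modify array values.
My impressions:
1) An ordinary laptop struggles a little when running Anaconda. Dedicating 4 GB of memory to it is fine; the main bottlenecks are the disk and the CPU. I could clearly hear the SATA hard disk grinding (moving to an SSD partition helped), and CPU usage approached 60% at startup.
2) Anaconda + Spyder is far more specialized than the general-purpose Eclipse + PyDev, although the latter has stronger project management.

邀月's recommendation: beginners can simply use Anaconda, since it bundles Spyder. I suggest manually upgrading Spyder to 4.0 or later. I am fond of Eclipse myself, so I use Eclipse 2019-09.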

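3. PIP mirror within China (mainly to work around unstable overseas servers)
Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/

How to configure it (using the Tsinghua mirror as the example; other mirrors work the same way):

(1) Temporary use:
When running pip, add the -i option followed by the mirror address (for example https://pypi.tuna.tsinghua.edu.cn/simple).
For example, pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas installs the pandas library from the Tsinghua mirror.

(2) Permanent change, once and for all:
(a) On Linux, edit ~/.pip/pip.conf (create the directory and the file if they do not exist; the directory name starts with "." because it is a hidden directory).
The contents are as follows: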

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn

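(b) On Windows, create a pip directory directly under your user directory, for example C:\Users\xx\pip, then create a file named pip.ini in it, i.e. %HOMEPATH%\pip\pip.ini, and enter the following contents (again using the Tsinghua mirror):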

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host = pypi.tuna.tsinghua.edu.cn

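4. If you do not like pip, you can use conda instead. I will not go into it here, but remember to configure the Tsinghua mirror for it as well.

I work on Windows 10 and have run more than 96% of the book's code; the places where it raised errors are described in detail later in these notes. The download address for the book's source code is given at the end of the post.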

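(2) Chapter 1 contents:

Chapter 1: Preparing the Data

This chapter covers the basic tasks of reading, storing, and cleaning data with Python and OpenRefine. You will learn the following:

·Reading and writing CSV/TSV files with Python

·Reading and writing JSON files with Python

·Reading and writing Excel files with Python

·Reading and writing XML files with Python

·Retrieving HTML pages with pandas

·Storing to and retrieving from a relational database

·Storing to and retrieving from MongoDB

·Opening and transforming data with OpenRefine

·Exploring data with OpenRefine

·Deduplicating

·Cleaning data with regular expressions and GREL

·Imputing missing values

·Normalizing and standardizing features

·Binning data

·Encoding categorical variables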

1.2 Reading CSV/TSV files with Python

Install pandas on its own:
pip install pandas

The official documentation is here:
https://pandas.pydata.org/pandas-docs/stable/

/*
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
........
Installing collected packages: pytz, numpy, python-dateutil, pandas
Successfully installed numpy-1.17.4 pandas-0.25.3 python-dateutil-2.8.1 pytz-2019.3
FINISHED
*/

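Reading CSV:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table

pandas reads large volumes of data impressively fast.

Tips:
1. Writing a CSV file in Python 3 leaves an extra blank line between rows
Solution: pass one more argument, newline='', when opening the file.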

#write to files
with open(w_filenameCSV,'w',newline='') as write_csv:
    write_csv.write(tsv_read.to_csv(sep=',', index=False))
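
The TSV-to-CSV round trip can be sketched in a self-contained way as follows. This is a minimal illustration, assuming a tiny in-memory table in place of the book's data file; the sample rows and the use of io.StringIO are assumptions made so that the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for the book's TSV file
tsv_text = (
    "street\tcity\tprice\n"
    "3526 HIGH ST\tSACRAMENTO\t59222\n"
    "51 OMAHA CT\tSACRAMENTO\t68212\n"
)

# sep='\t' tells read_csv that the fields are tab-separated
tsv_read = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Serialize as comma-separated text; index=False omits the row index column
csv_text = tsv_read.to_csv(sep=",", index=False)

# Reading the CSV text back should give the same table
csv_read = pd.read_csv(io.StringIO(csv_text))
print(csv_read)
```

When writing to a real file, the same csv_text can be written through a handle opened with newline='' as in the tip above.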

1.3 Reading JSON files with Python
Install pandas as above (omitted)

Reading JSON:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-json-reader
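
A minimal sketch of reading and writing JSON with pandas, assuming a small in-memory document rather than the book's data file. The two records and the choice of orient="records" are assumptions for illustration only.

```python
import io
import json

import pandas as pd

# Hypothetical JSON sample: a list of records, one object per row
json_text = (
    '[{"street": "3526 HIGH ST", "price": 59222},'
    ' {"street": "51 OMAHA CT", "price": 68212}]'
)

# read_json accepts a file path or a file-like object
json_read = pd.read_json(io.StringIO(json_text))

# orient="records" writes the frame back as a list of row objects
json_out = json_read.to_json(orient="records")

# Parse the output again to compare it with the input structurally
round_trip = json.loads(json_out)
print(round_trip)
```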

1.4 Reading Excel files with Python
Install pandas as above (omitted)

Tips:

/*
 Module Not Found Error: No module named 'openpyxl'
 */

Solution: pip install openpyxl

Tips2:

/*
Traceback (most recent call last):
  File "D:\Java2018\practicalDataAnalysis\Codes\Chapter01\read_xlsx_alternative.py", line 16, in <module>
    labels = [cell.value for cell in xlsx_ws.rows[0]]
TypeError: 'generator' object is not subscriptable
*/

Solutions:
1. Change the code

import openpyxl as oxl

# name of the file to read from
r_filenameXLSX = '../../Data/Chapter01/realEstate_trans.xlsx'

# open the Excel file
xlsx_wb = oxl.load_workbook(filename=r_filenameXLSX)

# use the active worksheet (by default the first sheet)
xlsx_ws = xlsx_wb.active

# ws.rows is a generator here, so read the header row through iter_rows
labels = [cell.value for cell in next(xlsx_ws.iter_rows(min_row=1, max_row=1))]

# extract the data and store in a list
# each element is a row from the Excel file; row 1 is the header, so start at row 2
data = []
for row in xlsx_ws.iter_rows(min_row=2, max_row=11, max_col=10):
    data.append([cell.value for cell in row])

# print the prices of the first 10 properties
print([item[labels.index('price')] for item in data[0:10]])

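2. Change the openpyxl version
pip install -I openpyxl==2.3.3

3. Use another library such as xlrd (xlrd 2.0 and later read only .xls files, so reading .xlsx this way needs an earlier xlrd release)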

import xlrd as exls

# name of the file to read from
r_filenameXLSX = '../../Data/Chapter01/realEstate_trans.xlsx'

try:
    # open the workbook
    xlsx_wb = exls.open_workbook(r_filenameXLSX)
    # sheet_names = xlsx_wb.sheet_names()  # get all sheet names
    # print(sheet_names)
    # get a sheet by index or by name
    xlsx_ws = xlsx_wb.sheet_by_index(0)  # sheet indexes start at 0
    # xlsx_ws = xlsx_wb.sheet_by_name('sheet2')
    # sheet name, number of rows, number of columns
    print("sheet: " + xlsx_ws.name, str(xlsx_ws.nrows) + " rows", str(xlsx_ws.ncols) + " columns")

    # header row, columns 1 to 9 (the end column index is exclusive)
    labels = xlsx_ws.row_values(0, 1, 10)

    data = []
    # first 10 data rows (row 0 is the header)
    for rownum in range(1, 11):
        # same column range as the header
        data.append(xlsx_ws.row_values(rownum, 1, 10))
    # print(labels)
    # print(data)

    print([item[labels.index('price')] for item in data[0:10]])

except Exception as e:
    print(e)

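1.5 Reading XML files with Python
Install pandas as above (omitted)

1.6 Reading HTML files with Python
Install pandas as above (omitted)
The re regular-expression module ships with Python; install html5lib separately: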
pip install html5lib

/*
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
........
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.0.1 webencodings-0.5.1
FINISHED
 */

Tips:ImportError: lxml not found, please install it

/*  
   File "D:\tools\Python37\lib\site-packages\pandas\io\html.py", line 843, in _parser_dispatch
    raise ImportError("lxml not found, please install it")
ImportError: lxml not found, please install it
 */

Solution: pip install lxml

Tips:ImportError: BeautifulSoup4 (bs4) not found, please install it

/*  
  File "D:\tools\Python37\lib\site-packages\pandas\io\html.py", line 837, in _parser_dispatch
    raise ImportError("BeautifulSoup4 (bs4) not found, please install it")
ImportError: BeautifulSoup4 (bs4) not found, please install it
 */

Solution: pip install BeautifulSoup4

Examples:
https://programtalk.com/python-examples/pandas.read_html/

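1.7 Storing to and retrieving from a relational database
Install pandas as above (omitted)
Install sqlalchemy and psycopg2 separately:
pip install sqlalchemy
pip install psycopg2
sqlalchemy supports all the common databases, such as MySQL, Oracle, SQLite, and PostgreSQL.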

/*
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
........
Installing collected packages: sqlalchemy
Successfully installed sqlalchemy-1.3.11
FINISHED


Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting psycopg2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1a/85/853f11abfccfd581b099e5ae5f2dd807cc2919745b13d14e565022fd821c/psycopg2-2.8.4-cp37-cp37m-win_amd64.whl (1.1MB)
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.4
FINISHED



D:\tools\Python37\lib\site-packages\dateutil\parser\_parser.py:1218: UnknownTimezoneWarning: tzname EDT identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  category=UnknownTimezoneWarning)
   index            street        city  ...  price   latitude   longitude
0      0      3526 HIGH ST  SACRAMENTO  ...  59222  38.631913 -121.434879
1      1       51 OMAHA CT  SACRAMENTO  ...  68212  38.478902 -121.431028
2      2    2796 BRANCH ST  SACRAMENTO  ...  68880  38.618305 -121.443839
3      3  2805 JANETTE WAY  SACRAMENTO  ...  69307  38.616835 -121.439146
4      4   6001 MCMAHON DR  SACRAMENTO  ...  81900  38.519470 -121.435768

[5 rows x 13 columns] */
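
The store-then-query pattern can be tried without a database server. The sketch below swaps in an in-memory SQLite database from Python's standard library in place of the PostgreSQL setup implied by the psycopg2 install above; the table name real_estate and the sample rows are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows standing in for the book's real-estate data
csv_read = pd.DataFrame({
    "street": ["3526 HIGH ST", "51 OMAHA CT", "2796 BRANCH ST"],
    "city": ["SACRAMENTO", "SACRAMENTO", "SACRAMENTO"],
    "price": [59222, 68212, 68880],
})

# An in-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")

# Store the frame as a table; index=False skips the row index column
csv_read.to_sql("real_estate", conn, index=False, if_exists="replace")

# Retrieve a filtered, ordered subset back into a DataFrame
query = "SELECT street, price FROM real_estate WHERE price > 60000 ORDER BY price"
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

With sqlalchemy, the connection object would instead be an engine created for the target database; the to_sql and read_sql_query calls stay the same.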


1.8 Storing to and retrieving from MongoDB
Install pandas as above (omitted)
Install PyMongo separately:
pip install --upgrade PyMongo

1.9 Opening and transforming data with OpenRefine
Download: https://github.com/OpenRefine/OpenRefine/releases

(substring(value,4,10)+','+substring(value,24,29)).toDate()

1.10 Exploring data with OpenRefine

1.11 Deduplicating

1.12 Cleaning data with regular expressions and GREL

 --value.match(/(.*)(..)(\d{5})/)[0] (note the spaces in the patterns below)
 value.match(/(.*) (..) (\d{5})/)[0]--city
 value.match(/(.*) (..) (\d{5})/)[1]--State
 value.match(/(.*) (..) (\d{5})/)[2]--zip
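
For comparison, the same pattern can be tried in Python's re module. This is only a sketch of the regular expression itself, not of GREL; the sample strings are assumptions for illustration.

```python
import re

# Same pattern as the GREL expressions: city, two-character state, 5-digit ZIP,
# separated by single spaces
pattern = re.compile(r"(.*) (..) (\d{5})")

# Hypothetical sample values
single_word = pattern.match("Sacramento CA 95838")
multi_word = pattern.match("El Dorado Hills CA 95762")

# groups() returns the three captured parts in order
city, state, zip_code = single_word.groups()
print(city, state, zip_code)
print(multi_word.groups())
```

The greedy (.*) keeps as much text as it can for the city, so multi-word city names stay in the first group.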

1.13 Imputing missing values

# impute the mean price of the same ZIP code in place of NaNs
csv_read['price_mean'] = csv_read['price'] \
    .fillna(
        csv_read.groupby('zip')['price'].transform('mean')
    )

# impute the median price of the same ZIP code in place of NaNs
csv_read['price_median'] = csv_read['price'] \
    .fillna(
        csv_read.groupby('zip')['price'].transform('median')
    )
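
A small worked example of the group-wise imputation, assuming a made-up five-row table (the ZIP codes and prices are illustrative only). transform('mean') returns one value per row, aligned with the original index, so fillna can use it directly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two ZIP codes, each with one missing price
csv_read = pd.DataFrame({
    "zip": [95838, 95838, 95838, 95823, 95823],
    "price": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Per-row mean of the row's own ZIP group (NaNs are ignored by mean)
zip_mean = csv_read.groupby("zip")["price"].transform("mean")

# Only the missing prices are replaced; existing values are kept
csv_read["price_mean"] = csv_read["price"].fillna(zip_mean)
print(csv_read)
```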

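1.14 Normalizing and standardizing features
Normalization: rescale the data so that every value falls within the closed interval from 0 to 1.
Standardization: shift and rescale the distribution so that the data has a mean of 0 and a standard deviation of 1.
To normalize the data, that is, to make every value fall between 0 and 1, we subtract the minimum and divide by the sample range. In statistics, the range is the difference between the maximum and the minimum. The normalize(...) method does exactly what was just described: for a collection of data, it subtracts the minimum and divides by the range.
Standardization is similar: subtract the mean and divide by the sample standard deviation. The processed data then has a mean of 0 and a standard deviation of 1. The standardize(...) method does this: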

def normalize(col):
    # Normalize column: rescale values to the interval [0, 1]
    return (col - col.min()) / (col.max() - col.min())

def standardize(col):
    # Standardize column: mean 0, standard deviation 1
    return (col - col.mean()) / col.std()
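
A quick check of both functions on a small pandas Series, assuming made-up prices. Note that pandas' std() computes the sample standard deviation (ddof=1), which matches the description above.

```python
import pandas as pd


def normalize(col):
    # Rescale values to the interval [0, 1]
    return (col - col.min()) / (col.max() - col.min())


def standardize(col):
    # Shift to mean 0 and scale to sample standard deviation 1
    return (col - col.mean()) / col.std()


# Hypothetical sample values
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

normalized = normalize(prices)
standardized = standardize(prices)

print(normalized.tolist())
print(standardized.mean(), standardized.std())
```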

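1.15 Binning data
Binning is useful when we want to look at the shape of the data's distribution or convert the data into an ordinal form.
Install pandas as above (omitted)
Install NumPy separately:
pip install numpy

Quantiles are closely related to percentiles. The difference is that a percentile returns the value at a given percentage (evenly spaced), while a quantile returns the value at a given quantile point (roughly equal counts).
To learn more, see https://www.stat.auckland.ac.nz/~ihaka/787/lectures-quantiles-handouts.pdf
1) The following code builds evenly spaced bins: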

# create evenly spaced bin edges across the price range
# e.g. np.linspace(0, 6, 6) ==> [0.0, 1.2, 2.4, 3.6, 4.8, 6.0]
bins = np.linspace(
    csv_read['price_mean'].min(),
    csv_read['price_mean'].max(),
    6
)

# and apply the bins to the data
# the first argument of digitize is the column to bin, the second is the array of bin edges
csv_read['b_price'] = np.digitize(
    csv_read['price_mean'],
    bins
)

# print out the counts for the bins
counts_b = csv_read['b_price'].value_counts()
print(counts_b.sort_index())

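Sometimes, instead of evenly spaced values, we want each bin to hold the same number of records. Quantiles achieve this.
We want to split the column into deciles, that is, 10 bins of (roughly) equal size. The following code does that: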

# create bins based on deciles
# so that each bin holds roughly the same number of records
decile = csv_read['price_mean'].quantile(np.linspace(0,1,11))

# and apply the decile bins to the data
csv_read['p_price'] = np.digitize(
    csv_read['price_mean'],
    decile
)

# print out the counts for the percentile bins
counts_p = csv_read['p_price'].value_counts()
print(counts_p.sort_index())

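The .quantile(...) method accepts a number between 0 and 1 that indicates which quantile to return (for example, 0.5 is the median, and 0.25 and 0.75 are the lower and upper quartiles). It also accepts a list of quantiles and returns the corresponding array of values. np.linspace(0, 1, 11) generates the array [0.0, 0.1, 0.2, ..., 1.0], so .quantile(...) returns the list of deciles of the price_mean column, from its minimum up to its maximum.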

/*
1    350
2    480
3    118
4     26
5      4
6      3
Name: b_price, dtype: int64
1      96
2     100
3      98
4      98
5      97
6      97
7      99
8      97
9      99
10     97
11      3
Name: p_price, dtype: int64
 */
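
The small last bin in each count above (bin 6 for 6 edges, bin 11 for 11 edges) appears to come from how np.digitize treats the final edge: with its default right=False, a value equal to the last edge falls past all intervals and gets the index len(bins). A minimal sketch, assuming made-up prices:

```python
import numpy as np

# Hypothetical prices; the maximum (60.0) is also the last bin edge
prices = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# 4 evenly spaced edges define 3 intervals: [10, 26.67), [26.67, 43.33), [43.33, 60)
edges = np.linspace(prices.min(), prices.max(), 4)

# Each value gets the 1-based index of the interval it falls in;
# the maximum itself lands in an extra bin numbered len(edges)
labels = np.digitize(prices, edges)

print(edges)
print(labels)
```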

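1.16 Encoding categorical variables
The last step is handling categorical variables.
Statistical models can only accept ordered data. Categorical variables (which, depending on context, are sometimes represented as numbers) cannot be used in a model directly. To use them, we first encode them, that is, give each category a unique numeric code.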

Tips:

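# dummy code the column
# here the property type column ['type'] is dummy-coded
# prefix='d' makes the new column names start with d (d_Condo in this example); the separator can be changed with prefix_sep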
csv_read = pd.get_dummies(
    csv_read,
    prefix='d',
    columns=['type']
)
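
A minimal sketch of what get_dummies produces, assuming a made-up three-row table with two property types. Each category becomes its own indicator column, and the original type column is dropped.

```python
import pandas as pd

# Hypothetical sample with a categorical 'type' column
csv_read = pd.DataFrame({
    "type": ["Condo", "Residential", "Condo"],
    "price": [59222, 68212, 68880],
})

# One indicator column per category, named d_<category>
encoded = pd.get_dummies(
    csv_read,
    prefix="d",
    columns=["type"]
)

print(encoded)
```

The indicator columns hold 0/1 or True/False values depending on the pandas version.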