A Brand-New Python Data Analysis Framework: DaPy Brings You a Silky-Smooth Data Analysis Experience Like No Other

Date: 2019-11-07



DaPy - A Silky-Smooth Experience Like No Other

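Are you constantly frustrated by Pandas's strict data structure requirements? Does it give you a headache to dig through pages of documentation just to perform one simple operation?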

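DaPy is here to set you free! With DaPy you can fluently implement the ideas you have already worked out in your head, without having your train of thought interrupted by an API you cannot find or a data format error. DaPy is a data analysis framework that has focused on ease of use from the very start of its design. It is built for data analysts, not programmers. As a data analyst, your value lies in how you approach a problem, not in the hundreds of lines of code that keep you on a 996 schedule!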

How friendly is DaPy?

1. Multiple ways to display data in the command line (CMD)

Don't underestimate the way you browse data! For a data analyst, getting a feel for the data is very important!

>>> from DaPy.datasets import iris
>>> sheet, info = iris()
 - read() in 0.001s.
>>> sheet
sheet:data
==========
sepal length: <5.1, 4.9, 4.7, 4.6, 5.0, ... ,6.7, 6.3, 6.5, 6.2, 5.9>
 sepal width: <3.5, 3.0, 3.2, 3.1, 3.6, ... ,3.0, 2.5, 3.0, 3.4, 3.0>
petal length: <1.4, 1.4, 1.3, 1.5, 1.4, ... ,5.2, 5.0, 5.2, 5.4, 5.1>
 petal width: <0.2, 0.2, 0.2, 0.2, 0.2, ... ,2.3, 1.9, 2.0, 2.3, 1.8>
       class: <setos, setos, setos, setos, setos, ... ,virginic, virginic, virginic, virginic, virginic>
>>> sheet.info
sheet:data
==========
1.  Structure: DaPy.SeriesSet
2. Dimensions: Lines=150 | Variables=5
3. Miss Value: 0 elements
                               Descriptive Statistics                                   
=======================================================================================
    Title     | Miss |    Min    |    Mean   |     Max     |     Std      |    Mode    
--------------+------+-----------+-----------+-------------+--------------+------------
 sepal length |  0   |  4.300001 | 5.8433333 | 7.900000095 | 0.8253012767 |          5
 sepal width  |  0   |         2 | 3.0540000 | 4.400000095 | 0.4321465798 |          3
 petal length |  0   |         1 | 3.7586666 | 6.900000095 |  1.758529178 |        1.5
 petal width  |  0   | 0.1000000 | 1.1986666 |         2.5 | 0.7606126088 |        0.2
    class     |  0   |         - |         - |           - |            - |      setos
=======================================================================================
>>> sheet.show(5)
sheet:data
==========
 sepal length | sepal width | petal length | petal width |  class  
--------------+-------------+--------------+-------------+----------
     5.1      |     3.5     |     1.4      |     0.2     |  setos   
     4.9      |     3.0     |     1.4      |     0.2     |  setos   
     4.7      |     3.2     |     1.3      |     0.2     |  setos   
     4.6      |     3.1     |     1.5      |     0.2     |  setos   
     5.0      |     3.6     |     1.4      |     0.2     |  setos   
                          .. Omit 140 Ln ..                          
     6.7      |     3.0     |     5.2      |     2.3     | virginic 
     6.3      |     2.5     |     5.0      |     1.9     | virginic 
     6.5      |     3.0     |     5.2      |     2.0     | virginic 
     6.2      |     3.4     |     5.4      |     2.3     | virginic 
     5.9      |     3.0     |     5.1      |     1.8     | virginic 

2. A two-dimensional table structure that fits Python syntax habits

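Processing data row by row matches how most of us think, which is why most traditional database designs store data by row. Pandas was originally developed for time-series data, so it stores data by column. This layout performs well for whole-table operations, but unoptimized row operations are painfully slow. With no better alternative available, people have had to spend a lot of time adapting to the Pandas way of programming. For example, Pandas does not support assigning to the rows produced by DataFrame.iterrows(): a row is often yielded as a copy, so writing to it may not change the DataFrame. Common as this operation is, and natively supported in NumPy, it is off limits in Pandas.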

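To address the problems caused by row operations, DaPy introduces the concept of a "view" and re-optimizes row-wise operations, the way of working that people naturally expect.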

>>> import DaPy as dp
>>> sheet = dp.SeriesSet({'A': [1, 2, 3], 'B': [4, 5, 6]})
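>>> for row in sheet:
...     print(row.A, row[0])   # access a value in the row by column name or by index
...     row[1] = 'b'           # assign to the row by index
...
1 1
2 2
3 3
>>> sheet.show()   # your changes are reflected in the original table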
 A | B
---+---
 1 | b 
 2 | b 
 3 | b 
>>> row0 = sheet[0]   # get a reference to the row
>>> row0
[1, 'b']
>>> sheet.append_col(series=[7, 8, 9], variable_name='newColumn') # add a new column to the table
>>> sheet.show()
 A | B | newColumn
---+---+-----------
 1 | b |     7     
 2 | b |     8     
 3 | b |     9     
>>> row0   # changes to the table are always reflected in the row
[1, 'b', 7]
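
To make the "view" idea concrete, here is a minimal sketch in plain Python. It is not DaPy's implementation: the `RowView` class and the dict-of-lists column store are hypothetical names used only for illustration. The point is that a row object holding a reference to the columns, rather than a copy of the values, writes through to the table and also sees columns added later.

```python
class RowView:
    """A lightweight view of one row in a column-oriented table (illustrative only)."""

    def __init__(self, columns, index):
        # Hold a reference to the table's columns, not a copy of the values.
        self._columns = columns
        self._index = index

    def __getitem__(self, position):
        name = list(self._columns)[position]
        return self._columns[name][self._index]

    def __setitem__(self, position, value):
        name = list(self._columns)[position]
        self._columns[name][self._index] = value  # writes through to the table

    def __repr__(self):
        return repr([values[self._index] for values in self._columns.values()])


# Hypothetical column store: a dict mapping column name -> list of values.
columns = {"A": [1, 2, 3], "B": [4, 5, 6]}

row0 = RowView(columns, 0)
for i in range(3):
    RowView(columns, i)[1] = "b"   # assign through the view

columns["newColumn"] = [7, 8, 9]   # add a column after the view was created
print(columns["B"])                # ['b', 'b', 'b']
print(row0)                        # [1, 'b', 7]
```

Because the view stores only a reference and a row index, creating one is cheap, which is one plausible reason a view-based design suits row-by-row loops over column-stored data.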

3. By the way, I hear some people like method chaining?

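Let's build a slightly more interesting chained expression! I want to complete the following 6 operations on the classic iris dataset in a single line of code.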

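(1) Normalize each column separately;

(2) then find the records where, after normalization, sepal length is less than petal length;

(3) group the filtered dataset by the iris category, class;

(4) sort each group by petal width in ascending order;

(5) take the first 10 rows of each sorted group;

(6) compute descriptive statistics for each sub-dataset made up of those first ten rows.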

>>> from DaPy.datasets import iris
>>> sheet, info = iris()
 - read() in 0.001s.
>>> sheet.normalized().query('sepal length < petal length').groupby('class').sort('petal width')[:10].info
 - normalized() in 0.005s.
 - query() in 0.000s.
 - groupby() in 0.000s.
 - sort() in 0.000s.
sheet:('virginic',)
===================
1.  Structure: DaPy.SeriesSet
2. Dimensions: Lines=10 | Variables=5
3. Miss Value: 0 elements
                                Descriptive Statistics                                 
=======================================================================================
    Title     | Miss |    Min    |   Mean   |    Max     |     Std      |     Mode     
--------------+------+-----------+----------+------------+--------------+--------------
 sepal length |  0   |   0.16666 | 0.572218 | 0.83333331 |  0.177081977 | 0.5555555173
 sepal width  |  0   | 0.0833333 | 0.295832 | 0.41666665 |  0.102824685 | 0.3749999851
 petal length |  0   |  0.593220 | 0.747457 | 0.89830505 | 0.0852358577 | 0.8135593089
 petal width  |  0   |  0.541666 | 0.654166 | 0.70833331 | 0.0619419457 | 0.7083333332
    class     |  0   |         - |        - |          - |            - |     virginic
=======================================================================================
sheet:('setos',)
================
1.  Structure: DaPy.SeriesSet
2. Dimensions: Lines=6 | Variables=5
3. Miss Value: 0 elements
                                Descriptive Statistics                                       
=======================================================================================
    Title     | Miss |    Min    |   Mean   |    Max    |     Std      |      Mode     
--------------+------+-----------+----------+-----------+--------------+---------------
 sepal length |  0   | -5.29e-08 | 0.050925 |  0.138888 | 0.0465272020 | 0.02777772553
 sepal width  |  0   |     0.375 |  0.45833 |  0.583333 | 0.0680413746 |  0.4166666501
 petal length |  0   |    0.0169 | 0.070621 |  0.152542 | 0.0419945431 | 0.05084745681
 petal width  |  0   | -6.20e-10 | 0.034722 | 0.0416666 | 0.0155282505 | 0.04166666607
    class     |  0   |         - |        - |         - |            - |         setos
=======================================================================================
sheet:('versicolo',)
====================
1.  Structure: DaPy.SeriesSet
2. Dimensions: Lines=10 | Variables=5
3. Miss Value: 0 elements
                                Descriptive Statistics                                   
=======================================================================================
    Title     | Miss |   Min    |   Mean   |    Max     |     Std      |     Mode     
--------------+------+----------+----------+------------+--------------+--------------
 sepal length |  0   |  0.16666 | 0.308333 | 0.47222217 |   0.10126514 | 0.1944443966
 sepal width  |  0   |        0 |  0.16666 | 0.29166665 | 0.0790569387 |   0.16666666
 petal length |  0   | 0.338983 | 0.442372 | 0.52542370 | 0.0564434752 | 0.3898305022
 petal width  |  0   |    0.375 |  0.38749 | 0.41666665 | 0.0190940608 | 0.3749999996
    class     |  0   |        - |        - |          - |            - |    versicolo
=======================================================================================
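
For readers who want to see what the six chained steps amount to, here is a rough standard-library sketch. It does not use DaPy, and it assumes that `normalized()` performs min-max scaling, which the 0-to-1 ranges in the output suggest but the source does not state. The records are made-up toy values, not the real iris data, and `min_max` and `summarize` are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy records: (sepal length, petal length, petal width, class).
records = [
    (5.0, 1.4, 0.2, "a"),
    (4.0, 1.0, 0.1, "a"),
    (6.0, 5.0, 2.0, "b"),
    (7.0, 6.0, 2.5, "b"),
    (6.5, 5.5, 1.5, "b"),
]


def min_max(values):
    """Scale values to the range [0, 1] (assumed meaning of 'normalize')."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# (1) normalize each numeric column separately
sepal_len, petal_len, petal_wid, classes = zip(*records)
normalized = list(
    zip(min_max(sepal_len), min_max(petal_len), min_max(petal_wid), classes)
)

# (2) keep rows where sepal length < petal length after normalization
kept = [row for row in normalized if row[0] < row[1]]

# (3) group the remaining rows by class
groups = defaultdict(list)
for row in kept:
    groups[row[3]].append(row)


def summarize(rows, n=10):
    # (4) sort by petal width ascending, (5) take the first n rows,
    # (6) compute a small descriptive summary of that subset
    top = sorted(rows, key=lambda row: row[2])[:n]
    return {"lines": len(top), "mean petal width": mean(row[2] for row in top)}


summary = {label: summarize(rows) for label, rows in groups.items()}
print(summary)
```

Written out this way, each step needs its own intermediate variable, which is the bookkeeping the chained expression hides.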

4. It also keeps some of the nice features of NumPy and pandas!

>>> sheet.A + sheet.B # access columns and do arithmetic on them
>>> sheet[sheet.A > sheet.B] # a very Pythonic way to slice out matching rows!

OK, these interface features are cool, but is there anything more heavy-duty?

1. Seriously powerful, extremely robust I/O tools!

We have all run into this problem: how do you convert a CSV file to Excel, or, the other way around, Excel back to CSV?

>>> from DaPy.datasets import iris
>>> sheet, info = iris()
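>>> sheet.groupby('class').save('iris.xls') # Yes! One chained expression converts it straight to xls! Remember that Excel supports multiple worksheets, so after the groupby DaPy saved three worksheets for you!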
 - groupby() in 0.000s.
 - save() in 0.241s.
>>> import DaPy as dp
>>> dp.read('iris.xls').shape # DaPy read all three worksheets back in one go!
 - read() in 0.004s.
sheet:('virginic',)
===================
sheet(Ln=50, Col=5)
sheet:('setos',)
================
sheet(Ln=50, Col=5)
sheet:('versicolo',)
====================
sheet(Ln=50, Col=5)

Think that is all the read function can do? Let's look at something even flashier!

>>> import DaPy as dp
>>> dp.read('iris.xls').save('iris.db') # Excel to SQLite3
>>> dp.read('iris.sav').save('iris.html') # SPSS to HTML
>>> dp.read('https://sofifa.com/players').save('mysql://root:123123@localhost:3306/fifa_db') # scrape FIFA player data and store it in a MySQL database
>>> dp.read('mysql://root:123123@localhost:3306/fifa_db').save('fifa.csv') # MySQL database to CSV
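
For a sense of what a one-line conversion like these replaces, here is a standard-library sketch of a CSV to SQLite conversion. It does not use DaPy and says nothing about how `read` or `save` work internally. The table name `iris`, the in-memory database, and the inline CSV text (two rows borrowed from the preview earlier) are assumptions made to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk.
csv_text = "sepal_length,sepal_width,class\n5.1,3.5,setos\n4.9,3.0,setos\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)          # first line holds the column names
rows = list(reader)            # remaining lines are the records

conn = sqlite3.connect(":memory:")   # pass a file path for a real database
column_defs = ", ".join(f'"{name}"' for name in header)
placeholders = ", ".join("?" for _ in header)

# Create an untyped table and bulk-insert the rows with bound parameters.
conn.execute(f"CREATE TABLE iris ({column_defs})")
conn.executemany(f"INSERT INTO iris VALUES ({placeholders})", rows)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM iris").fetchone()[0]
print(stored)  # 2
```

Even this small case involves parsing, schema creation, and parameter binding, and every value is stored as text unless the types are converted by hand.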

2. Support for a great many data preprocessing and feature engineering operations

First, some data preprocessing:

>>> sheet.drop_duplicates(keep='first') # remove duplicate records
>>> sheet.fillna(method='linear') # fill missing values by linear interpolation
>>> sheet.drop('ID', axis=1) # drop an unneeded variable
>>> sheet.count_values('gender') # count the values of a variable
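
As a rough illustration of what the first two operations mean, here is a plain-Python sketch. The functions below are hypothetical stand-ins, not DaPy's code, and the handling of edge cases (for example, gaps at the start or end of a column are left unfilled here) is an assumption.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


print(drop_duplicates([(1, "a"), (2, "b"), (1, "a")]))  # [(1, 'a'), (2, 'b')]
print(fill_linear([1.0, None, None, 4.0]))              # [1.0, 2.0, 3.0, 4.0]
```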

Next, some feature engineering:

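>>> sheet.get_date_label('birth') # transform a date variable; this automatically generates a whole set of periodic variables
>>> sheet.get_categories(cols='age', cutpoints=[18, 30, 50], group_name=['young', 'prime', 'middle-aged', 'senior']) # bin a continuous variable
>>> sheet.get_dummies(['city', 'education']) # introduce dummy variables for categorical variables
>>> sheet.get_interactions(n_power=3, col=['income', 'age', 'gender', 'education']) # build higher-order interaction terms among the variables you select; the order n_power can be anything you like!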
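
The binning and dummy-variable steps can be sketched in a few lines of plain Python. This is not DaPy's implementation: `bin_values` and `one_hot` are hypothetical helpers, and the choice to place a value equal to a cutpoint in the higher bin is an assumption, since the source does not say how boundaries are handled.

```python
from bisect import bisect_right


def bin_values(values, cutpoints, labels):
    """Map each value to a label; labels must have len(cutpoints) + 1 entries."""
    # bisect_right returns how many cutpoints are <= the value,
    # which is exactly the index of the bin the value falls into.
    return [labels[bisect_right(cutpoints, v)] for v in values]


def one_hot(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


ages = [15, 25, 40, 70]
print(bin_values(ages, [18, 30, 50], ["young", "prime", "middle-aged", "senior"]))
# ['young', 'prime', 'middle-aged', 'senior']
print(one_hot(["x", "y", "x"]))
# {'x': [1, 0, 1], 'y': [0, 1, 0]}
```

Three cutpoints define four bins, which is why the example above passes four group names for the cutpoints 18, 30, and 50.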

3. Last of all, and most important: the machine learning module!

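DaPy ships with four built-in models: linear regression, logistic regression, a multilayer perceptron, and a C4.5 decision tree. As far as models go, the DaPy development team believes that sklearn and tensorflow already do a very good job. Because the core members of the team are statistics students, their approach is to add more statistical test reports instead. Let's start with a demo-level example.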

>>> from DaPy.datasets import iris
>>> sheet, info = iris()
 - read() in 0.001s.
>>> sheet = sheet.shuffle().normalized()
 - shuffle() in 0.001s.
 - normalized() in 0.005s.
>>> X, Y = sheet[:'petal width'], sheet['class']
>>> 
>>> from DaPy.methods.classifiers import MLPClassifier
>>> mlp = MLPClassifier().fit(X[:120], Y[:120])
 - Structure | Input:4 - Dense:4 - Output:3
 - Finished: 0.2%	Epoch: 1	Rest Time: 0.24s	Accuracy: 0.33%
                   ### some log lines omitted here ###
 - Finished: 99.8%	Epoch: 500	Rest Time: 0.00s	Accuracy: 0.88%
 - Finish Train | Time:1.9s	Epoch:500	Accuracy:88.33%
>>> 
>>> from DaPy.methods.evaluator import Performance
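>>> Performance(mlp, X[120:], Y[120:], mode='clf') # the performance report includes accuracy, the kappa coefficient, and the confusion matrix; binary classification tasks also include AUC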
 - Classification Accuracy: 86.6667%
 - Classification Kappa: 0.8667
┏                   ┓
┃ 11   0    0    11 ┃
┃ 0    8    1    9  ┃
┃ 0    3    7    10 ┃
┃11.0 11.0 8.0  30.0┃
┗                   ┛
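
To show how the numbers in such a report relate to the confusion matrix, here is a small sketch that computes accuracy and Cohen's kappa with the textbook formulas. It is not DaPy's `Performance` code, and `accuracy_and_kappa` is a hypothetical helper. The matrix is copied from the output above with the totals row and column removed.

```python
# Confusion matrix from the report, without the totals row and column.
matrix = [
    [11, 0, 0],
    [0, 8, 1],
    [0, 3, 7],
]


def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: the share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)


accuracy, kappa = accuracy_and_kappa(matrix)
print(f"accuracy={accuracy:.4f}, kappa={kappa:.4f}")  # accuracy=0.8667, kappa=0.8000
```

The accuracy matches the 86.6667% in the report. The textbook kappa for this matrix is 0.8, not the 0.8667 printed in the log, so the report may define or compute kappa differently.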