Overview
Jupyter notebook: https://nbviewer.jupyter.org/github/chenjieyouge/jupyter_share/blob/master/share/pandas-%20%E6%8F%8F%E8%BF%B0%E6%80%A7%E7%BB%9F%E8%AE%A1.ipynb
```python
import numpy as np
import pandas as pd
```
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame -> (pandas provides the commonly used statistical functions; the input is generally a Series, or the rows/columns of a DataFrame. Notably, missing values are excluded from the computation.)
```python
df = pd.DataFrame([[1.4, np.nan],
                   [7.6, -4.5],
                   [np.nan, np.nan],
                   [3, -1.5]],
                  index=list('abcd'), columns=['one', 'two'])
df
```
```
   one  two
a  1.4  NaN
b  7.6 -4.5
c  NaN  NaN
d  3.0 -1.5
```
Calling DataFrame's sum method returns a Series containing column sums:
"默認axis=0, 行方向, 下方, 展現每列, 忽略缺失值" df.sum() df.mean() "在計算平均值時, NaN 不計入樣本"
'默認axis=0, 行方向, 下方, 展現每列, 忽略缺失值'
one 12.0 two -6.0 dtype: float64
one 4.0 two -3.0 dtype: float64
'在計算平均值時, NaN 不計入樣本'
Passing axis='columns' or axis=1 sums across the columns instead. -> (the axis argument picks the direction)
"按行統計, aixs=1, 列方向, 右邊" df.sum(axis=1)
'按行統計, aixs=1, 列方向, 右邊'
a 1.4 b 3.1 c 0.0 d 1.5 dtype: float64
NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option: -> (the statistics automatically leave missing values out of the sample)
"默認是忽略缺失值的, 要缺失值, 則手動指定一下" df.mean(skipna=False, axis='columns') # 列方向, 行哦
'默認是忽略缺失值的, 要缺失值, 則手動指定一下'
a NaN b 1.55 c NaN d 0.75 dtype: float64
See Table 5-7 for a list of common options for each reduction method.
Method | Description |
---|---|
axis | Axis to reduce over, 0 for DataFrame's rows and 1 for columns |
skipna | Exclude missing values; True by default |
level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
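The level option only applies when the axis carries a MultiIndex. A minimal sketch of what that means (the index and the key1/key2 level names are made up for illustration; recent pandas versions spell the same reduction as groupby(level=...)):

```python
# A Series with a two-level (hierarchical) index
midx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                 names=['key1', 'key2'])
s = pd.Series([1.0, 2.0, 3.0, np.nan], index=midx)

# Sum within each 'key1' group, skipping NaN as usual;
# equivalent to the older s.sum(level='key1')
s.groupby(level='key1').sum()  # key1: a -> 3.0, b -> 3.0
```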
Some methods, like idxmax and idxmin, return indirect statistics, like the index where the minimum or maximum values are attained.
"idxmax() 返回最大值的第一個索引標籤" df.idxmax()
'idxmax() 返回最大值的第一個索引標籤'
one b two d dtype: object
Other methods are accumulations: -> (cumulative sums; default axis=0, down the rows)
"累積求和, 默認axis=0, 忽略NA" df.cumsum() "也可指定axis=1列方向" df.cumsum(axis=1)
'累積求和, 默認axis=0, 忽略NA'
<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>one</th> <th>two</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1.4</td> <td>NaN</td> </tr> <tr> <th>b</th> <td>9.0</td> <td>-4.5</td> </tr> <tr> <th>c</th> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>d</th> <td>12.0</td> <td>-6.0</td> </tr> </tbody> </table>
</div>
'也可指定axis=0列方向'
<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>one</th> <th>two</th> </tr> </thead> <tbody> <tr> <th>a</th> <td>1.4</td> <td>NaN</td> </tr> <tr> <th>b</th> <td>7.6</td> <td>3.1</td> </tr> <tr> <th>c</th> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>d</th> <td>3.0</td> <td>1.5</td> </tr> </tbody> </table>
</div>
Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot: -> (describe() computes descriptive statistics for each column)
"describe() 返回列變量分位數, 均值, count, std等經常使用統計指標" " roud(2)保留2位小數" df.describe().round(2)
'describe() 返回列變量分位數, 均值, count, std等經常使用統計指標'
' roud(2)保留2位小數'
<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>one</th> <th>two</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>3.00</td> <td>2.00</td> </tr> <tr> <th>mean</th> <td>4.00</td> <td>-3.00</td> </tr> <tr> <th>std</th> <td>3.22</td> <td>2.12</td> </tr> <tr> <th>min</th> <td>1.40</td> <td>-4.50</td> </tr> <tr> <th>25%</th> <td>2.20</td> <td>-3.75</td> </tr> <tr> <th>50%</th> <td>3.00</td> <td>-3.00</td> </tr> <tr> <th>75%</th> <td>5.30</td> <td>-2.25</td> </tr> <tr> <th>max</th> <td>7.60</td> <td>-1.50</td> </tr> </tbody> </table>
</div>
On non-numeric data, describe produces alternative summary statistics: -> (categorical columns are detected automatically and summarized by category)
```python
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
# On categorical data, describe() reports count/unique/top/freq instead
obj.describe()
```
```
count     16
unique     3
top        a
freq       8
dtype: object
```
See Table 5-8 for a full list of summary statistics and related methods.
Method | Description |
---|---|
count | Number of non-NA values |
describe | Summary statistics for a Series or for each DataFrame column |
min, max | Minimum and maximum values |
argmin, argmax | Integer positions of the minimum and maximum values |
idxmin, idxmax | Index labels of the minimum and maximum values |
quantile | Sample quantile |
sum | Sum of values |
mean | Mean of values |
median | Median (50% quantile) of values |
var | Sample variance |
std | Sample standard deviation |
skew | Sample skewness |
kurt | Sample kurtosis |
cumsum | Cumulative sum |
cumprod | Cumulative product |
diff | Compute first arithmetic difference (useful for time series) |
pct_change | Compute percent change |
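diff and pct_change are easiest to see on a tiny Series (a quick sketch, not from the original notebook):

```python
s = pd.Series([100.0, 110.0, 99.0])
s.diff()        # NaN, 10.0, -11.0 -- difference from the previous element
s.pct_change()  # NaN, 0.10, -0.10 -- relative change vs the previous element
```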
```python
df.idxmax()
```
```
one    b
two    d
dtype: object
```
```python
df['one'].argmax()
```
```
c:\python\python36\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.
  """Entry point for launching an IPython kernel.
'b'
```
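Following the warning's advice, the positional maximum can be read off the underlying NumPy array (a small sketch; nanargmax is used here because the column contains NaN):

```python
# Integer position of the maximum; np.nanargmax skips NaN, whereas plain
# argmax on a NaN-containing array would point at the NaN itself
np.nanargmax(df['one'].values)  # 1, i.e. the row labelled 'b'
```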
Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don't have it installed already, it can be obtained via conda or pip:
```
conda install pandas-datareader   # or: pip install pandas-datareader
```
I use the pandas_datareader module to download some data for a few stock tickers:
```python
import pandas_datareader.data as web

# Dict comprehension (commented out -- requires network access)
# all_data = {ticker: web.get_data_yahoo(ticker)
#             for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
```
"讀取二進制數據 read_pickle(), 存爲 to_pickle()" returns = pd.read_pickle("../examples/yahoo_volume.pkl") returns.tail()
'讀取二進制數據 read_pickle(), 存爲 to_pickle()'
```
                AAPL     GOOG       IBM      MSFT
Date
2016-10-17  23624900  1089500   5890400  23830000
2016-10-18  24553500  1995600  12770600  19149500
2016-10-19  20034600   116600   4632900  22878400
2016-10-20  24125800  1734200   4023100  49455600
2016-10-21  22384800  1260500   4401900  79974200
```
The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance: -> (corr computes the correlation coefficient, cov the covariance)
```python
returns.describe()
```
```
               AAPL          GOOG           IBM          MSFT
count  1.714000e+03  1.714000e+03  1.714000e+03  1.714000e+03
mean   9.595085e+07  4.111642e+06  4.815604e+06  4.630359e+07
std    6.010914e+07  2.948526e+06  2.345484e+06  2.437393e+07
min    1.304640e+07  7.900000e+03  1.415800e+06  9.009100e+06
25%    5.088832e+07  1.950025e+06  3.337950e+06  3.008798e+07
50%    8.270255e+07  3.710000e+06  4.216750e+06  4.146035e+07
75%    1.235752e+08  5.243550e+06  5.520500e+06  5.558810e+07
max    4.702495e+08  2.976060e+07  2.341650e+07  3.193179e+08
```
"微軟和IBM的相關係數是: {}".format(returns['MSFT'].corr(returns['IBM'])) "微軟和IBM的協方差爲是: {}".format(returns['MSFT'].cov(returns['IBM']))
'微軟和IBM的相關係數是: 0.42589249800808743'
'微軟和IBM的協方差爲是: 24347708920434.156'
Since MSFT is a valid Python attribute, we can also select these columns using more concise syntax:
"經過 DF.col_name 這樣的屬性來選取字段, 面對對象, 支持" returns.MSFT.corr(returns.IBM)
'經過 DF.col_name 這樣的屬性來選取字段, 面對對象, 支持'
0.42589249800808743
DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively: -> (df.corr() returns the correlation matrix, df.cov() the covariance matrix)
"DF.corr() 返回矩陣, 這個厲害了, 不知道有無中心化過程" returns.corr() "DF.cov() 返回協方差矩陣" returns.cov()
'DF.corr() 返回矩陣, 這個厲害了, 不知道有無中心化過程'
<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>AAPL</th> <th>GOOG</th> <th>IBM</th> <th>MSFT</th> </tr> </thead> <tbody> <tr> <th>AAPL</th> <td>1.000000</td> <td>0.576030</td> <td>0.383942</td> <td>0.490353</td> </tr> <tr> <th>GOOG</th> <td>0.576030</td> <td>1.000000</td> <td>0.438424</td> <td>0.490446</td> </tr> <tr> <th>IBM</th> <td>0.383942</td> <td>0.438424</td> <td>1.000000</td> <td>0.425892</td> </tr> <tr> <th>MSFT</th> <td>0.490353</td> <td>0.490446</td> <td>0.425892</td> <td>1.000000</td> </tr> </tbody> </table>
</div>
'DF.cov() 返回協方差矩陣'
<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>AAPL</th> <th>GOOG</th> <th>IBM</th> <th>MSFT</th> </tr> </thead> <tbody> <tr> <th>AAPL</th> <td>3.613108e+15</td> <td>1.020917e+14</td> <td>5.413005e+13</td> <td>7.184135e+14</td> </tr> <tr> <th>GOOG</th> <td>1.020917e+14</td> <td>8.693806e+12</td> <td>3.032022e+12</td> <td>3.524694e+13</td> </tr> <tr> <th>IBM</th> <td>5.413005e+13</td> <td>3.032022e+12</td> <td>5.501297e+12</td> <td>2.434771e+13</td> </tr> <tr> <th>MSFT</th> <td>7.184135e+14</td> <td>3.524694e+13</td> <td>2.434771e+13</td> <td>5.940884e+14</td> </tr> </tbody> </table>
</div>
Using the DataFrame's corrwith method, you can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column.
"corrwith() 計算成對相關" "計算IMB與其餘幾個的相關" returns.corrwith(returns.IBM)
'corrwith() 計算成對相關'
'計算IMB與其餘幾個的相關'
AAPL 0.383942 GOOG 0.438424 IBM 1.000000 MSFT 0.425892 dtype: float64
```python
returns.corrwith(returns)
```
```
AAPL    1.0
GOOG    1.0
IBM     1.0
MSFT    1.0
dtype: float64
```
Passing axis='columns' does things row-by-row instead, as sketched below. In all cases, the data points are aligned by label before the correlation is computed. -> (row-wise computation; the data are label-aligned first)
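A self-contained sketch of the row-wise variant (the df1/df2 frames here are made up for illustration):

```python
df1 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('xyz'))
df2 = df1 + np.random.randn(4, 3)   # a noisy copy of df1
df1.corrwith(df2, axis='columns')   # one correlation coefficient per row
```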
Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:
```python
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
# unique() returns the array of distinct values
obj.unique()
```
```
array(['c', 'a', 'd', 'b'], dtype=object)
```
The unique values are not necessarily returned in sorted order, but could be sorted after the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing value frequencies: -> (value_counts() computes frequencies)
"統計詞頻, value_counts()" obj.value_counts()
'統計詞頻, value_counts()'
a 3 c 3 b 2 d 1 dtype: int64
The Series is sorted by value in descending order as a convenience. value_counts is also available as a top-level pandas method that can be used with any array or sequence: -> (count frequencies and sort in descending order)
"統計詞頻並降序排列" "默認是降序的" pd.value_counts(obj.values) "手動自動不排序" pd.value_counts(obj.values, sort=False)
'統計詞頻並降序排列'
'默認是降序的'
a 3 c 3 b 2 d 1 dtype: int64
'手動自動不排序'
c 3 b 2 d 1 a 3 dtype: int64
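Relatedly, relative frequencies are available through the normalize flag (a small aside, not in the original notes):

```python
obj.value_counts(normalize=True)  # a/c: 0.333..., b: 0.222..., d: 0.111...
```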
isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame: -> (isin tests membership)
```python
obj
```
```
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
```
```python
mask = obj.isin(['b', 'c'])
mask
```
```
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
```
```python
# Boolean filter: only rows where the mask is True are returned
obj[mask]
```
```
0    c
5    b
6    b
7    c
8    c
dtype: object
```
Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly non-distinct values into another array of distinct values:
```python
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
# For each element of to_match, return its integer position in unique_vals
pd.Index(unique_vals).get_indexer(to_match)
```
```
array([0, 2, 1, 1, 0, 2], dtype=int64)
```
See Table 5-9 for a reference on these methods.
Method | Description |
---|---|
isin | Compute a boolean array indicating whether each value is contained in the passed sequence |
match | Compute integer indices for each value in an array of distinct values; useful for data alignment |
unique | Compute the array of unique values, in the order observed |
value_counts | Compute value frequencies, in descending order by default |
In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here's an example:
```python
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
```
```
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
```
Passing pandas.value_counts to this DataFrame's apply method gives: -> (per-column frequency counts, with missing entries filled with 0)
```python
result = data.apply(pd.value_counts).fillna(0)
result
```
```
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
```
Here, the row labels in the result are the distinct values occurring across all of the columns. The values are the respective counts of these values in each column.
Conclusion
In the next chapter, we will discuss tools for reading (or loading) and writing datasets with pandas. After that, we will dig deeper into data cleaning, wrangling, analysis, and visualization tools using pandas. -> (the chapters that follow cover reading and writing data, cleaning, transformation, wrangling, analysis and modeling, mining, and visualization)