Introduction to Python Data Analysis: A Summary of pandas Basics

Date: 2019-11-21


Pandas: "Giant Panda" Basics

Series

Series: the long spear of pandas (a column or a row of a data table, an observation vector, a one-dimensional array, ...)

import numpy as np

import pandas as pd

Series1 = pd.Series(np.random.randn(4))

print Series1,type(Series1) 

print Series1.index

print Series1.values

Output:

0   -0.676256

1    0.533014

2   -0.935212

3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[-0.67625578  0.53301431 -0.93521212 -0.94082195]

np.random.randn() draws samples from the standard normal distribution.

Series supports filtering on the same principle as NumPy

print Series1>0 

print Series1[Series1>0]

Output:

0 0.030480

1 0.072746

2 -0.186607

3 -1.412244

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[ 0.03048042 0.07274621 -0.18660749 -1.41224432]

I noticed that a logical expression on a Series yields True or False for each element. To get the actual values back, you still need the X[y] form.
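The mask-then-index pattern can be sketched as follows. This is a minimal example written for Python 3 and a recent pandas (the rest of this post uses Python 2 syntax); the fixed values in `s` are made up so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values instead of random draws.
s = pd.Series([0.5, -1.2, 0.3, -0.7])

mask = s > 0          # Boolean Series aligned on the same index
positives = s[mask]   # keeps only the rows where the mask is True

print(mask.tolist())       # [True, False, True, False]
print(positives.tolist())  # [0.5, 0.3]
```

Note that the surviving rows keep their original index labels (0 and 2 here).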

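Of course, broadcasting is supported too

I am not entirely sure yet what broadcasting is in general; here a scalar is simply applied to every element. An example: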

print Series1*2 

print Series1+5

Output:

0 0.060960

1 0.145492

2 -0.373215 

3 -2.824489 

dtype: float64 

0 5.030480 

1 5.072746 

2 4.813393 

3 3.587756 

dtype: float64
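In this setting, broadcasting just means that the scalar is stretched across all elements. A small sketch (Python 3, made-up values) that compares the result with an explicit element-by-element computation:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])  # illustrative values

doubled = s * 2   # the scalar 2 is applied to every element
shifted = s + 5

# Same result as computing element by element.
print(doubled.tolist() == [x * 2 for x in s])  # True
print(shifted.tolist())                        # [6.0, 7.0, 8.0]
```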

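As well as universal functions (ufuncs)

numpy.frompyfunc(func, nin, nout) returns a function (a ufunc) built from an ordinary Python function; nin is the number of input arguments and nout is the number of objects the function returns.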
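A hedged sketch of both flavours, for Python 3 and a recent NumPy/pandas: a built-in ufunc applied to a Series, and a ufunc created with `np.frompyfunc`. The helper `clip_at_five` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])  # illustrative values

# A built-in ufunc works element-wise and keeps the index.
roots = np.sqrt(s)

# Wrap an ordinary Python function: 1 input, 1 output.
def clip_at_five(x):  # hypothetical helper
    return min(x, 5.0)

clip_ufunc = np.frompyfunc(clip_at_five, 1, 1)
# frompyfunc returns an object-dtype array, so convert back to float.
clipped = clip_ufunc(s.values).astype(float)

print(roots.tolist())    # [1.0, 2.0, 3.0]
print(clipped.tolist())  # [1.0, 4.0, 5.0]
```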

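Use row labels on a Series instead of building a two-column table, so that data and metadata are easy to tell apart

My understanding of this sentence is that a Series should stay a single column; there is no need to build two columns, because the index can be used to address the data.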

Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

print Series2,type(Series2)

print Series2.index

print type(Series2.index)

print Series2.values

The output is below. As you can see, only the index labels were changed; no second column was created.

norm_0   -0.676256

norm_1    0.533014

norm_2   -0.935212

norm_3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')

<class 'pandas.core.index.Index'>

[-0.67625578  0.53301431 -0.93521212 -0.94082195]

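Although rows are ordered, the data can still be accessed through the row index:

(Of course it is not quite like an Ordered Dict, because row index labels can even repeat. Duplicate row labels are discouraged, but that does not mean they cannot be used.)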

print Series2[['norm_0','norm_3']]

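As you can see, reading data does use the X[y] form. It is X[[y]] here because two values are being read: the index labels of those two values are put into a list, and the list is used for the lookup. Output: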

norm_0   -0.676256

norm_3   -0.940822

dtype: float64

Another example:

print 'norm_0' in Series2

print 'norm_6' in Series2

Output:

True

False

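The result of a logical expression is a Boolean value.

Defining a Series from an Ordered Dict or a Dict, whose keys are unique, means you do not need to worry about duplicate row labels: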

Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

Series3_pdSeries = pd.Series(Series3_Dict)

print Series3_pdSeries

print Series3_pdSeries.values

print Series3_pdSeries.index

Output:

China Beijing

Japan Tokyo

S.Korea Seoul

dtype: object

['Beijing' 'Tokyo' 'Seoul']

Index([u'China', u'Japan', u'S.Korea'], dtype='object')

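The output above shows that the entries do not come out in input order (here they are sorted by key).

Want the Series stored in your own order? No problem, even if some values are missing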

Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)

print Series4_pdSeries

print Series4_pdSeries.values

print Series4_pdSeries.index

print Series4_pdSeries.isnull()

print Series4_pdSeries.notnull()

This way the output follows the order defined in the list.
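What this produces can be sketched as below (Python 3, recent pandas). The label "Singapore" has no entry in the dict, so its value comes out as missing.

```python
import pandas as pd

capitals = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
order = ["Japan", "China", "Singapore", "S.Korea"]

# The index argument fixes the order; labels missing from the dict become NaN.
s = pd.Series(capitals, index=order)

print(s.index.tolist())     # ['Japan', 'China', 'Singapore', 'S.Korea']
print(s.isnull().tolist())  # [False, False, True, False]
```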

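Series-level metadata: name

Once the Series and its index have names of their own, later joins on the data become more convenient!

To me this feels like what a column name does. An example: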

print Series4_pdSeries.name

print Series4_pdSeries.index.name

Obviously both print None, because we have not assigned a name yet!

Series4_pdSeries.name = "Capital Series"

Series4_pdSeries.index.name = "Nation"

print Series4_pdSeries

Output:

Nation

Japan Tokyo

China Beijing

Singapore NaN

S.Korea Seoul

Name: Capital Series, dtype: object

A "dictionary"? Not quite: row index labels can repeat, although this is not recommended.

Series5_IndexList = ['A','B','B','C']

Series5 = pd.Series(Series1.values,index = Series5_IndexList)

print Series5

print Series5[['B','A']]

Output:

A 0.030480

B 0.072746

B -0.186607

C -1.412244

dtype: float64

B 0.072746

B -0.186607

A 0.030480

dtype: float64

We can see that Series5['B'] returns two values, so try not to duplicate index labels!

DataFrame

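DataFrame: the war hammer of pandas (a data table, a two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

Come to think of it, the vast majority of data can be represented as a DataFrame.

Defining from a NumPy 2-D array, a file, or a database: the data is nice, but don't forget the column names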

dataNumPy = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])

DF1

Here columns in DataFrame means the column names. The printed result now looks pleasant, just like an Excel sheet.
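One thing worth checking, sketched here under Python 3 and a recent pandas: `np.asarray` on tuples that mix strings and numbers produces an array of strings, so the GDP column is not numeric. Building the frame from the list of tuples directly is one way (an assumption of this sketch, not something the original does) to keep per-column types.

```python
import numpy as np
import pandas as pd

rows = [('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1300), ('China', 'Beijing', 9100)]

# Via np.asarray every cell becomes a string.
df_str = pd.DataFrame(np.asarray(rows), columns=['nation', 'capital', 'GDP'])
print(isinstance(df_str.loc[0, 'GDP'], str))  # True

# From the list of tuples, GDP stays an integer column.
df_num = pd.DataFrame(rows, columns=['nation', 'capital', 'GDP'])
print(df_num['GDP'].sum())  # 14400
```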

Equal-length column data stored in a dictionary (JSON): unfortunately, dictionary keys are unordered

dataDict = {'nation':['Japan','S.Korea','China'],'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}

DF2 = pd.DataFrame(dataDict)

DF2

The output shows that the columns are not in input order!

GDP capital nation

0   4900    Tokyo   Japan

1   1300    Seoul   S.Korea

2   9100    Beijing China

PS: I was too lazy to paste a screenshot, so there are no table borders here.

Defining a DataFrame from another DataFrame: ah, my OCD is kicking in!

DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF21

Clearly, DF21 is defined from DF2, and specifying columns also changes the order of the columns.

DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

DF22

Clearly, this sets the order of the columns as well as the order of the index.

nation capital GDP

2 China Beijing 9100

0 Japan Tokyo 4900

1 S.Korea Seoul 1300

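Getting a column out of a DataFrame? Two ways (exactly like JavaScript!)

OMG, I had almost forgotten JS syntax. Now I remember: an object's property can be accessed either as obj.x or as obj['x'].

The '.' form can easily clash with reserved keywords and existing attribute names
The '[ ]' form is the safest.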
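A small sketch of the two access forms and of one such clash, written for Python 3 and a recent pandas. The column name `count` is made up for this example; it was picked because DataFrame already has a method with that name.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'China'], 'count': [1, 2]})  # made-up data

# Both forms return the same column.
print(df.nation.equals(df['nation']))  # True

# 'count' is also a DataFrame method, so the dot form returns the method.
print(callable(df.count))     # True
print(df['count'].tolist())   # [1, 2]
```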

Getting a row out of a DataFrame? (At least) two ways:

Method 1 and method 2:

print DF22[0:1] #the result is actually a DataFrame

print DF22.ix[0] #returns the row by its Index label; **ix** feels great.

Output:

nation  capital   GDP

2  China  Beijing  9100

nation     Japan

capital    Tokyo

GDP         4900

Name: 0, dtype: object

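Method 3, the ultimate move, just like NumPy slicing: iloc

print DF22.iloc[0,:]    #the first argument selects rows, the second selects columns; here: row 0, all columns

print DF22.iloc[:,0]    #likewise, this is all rows, column 0

The output, to verify: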

nation       China

capital    Beijing

GDP           9100

Name: 2, dtype: object

2      China

0      Japan

1    S.Korea

Name: nation, dtype: object
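For readers on a newer pandas: `.ix` was deprecated and then removed in pandas 1.0, with `.loc` (by label) and `.iloc` (by position) covering its uses. A minimal sketch, assuming Python 3 and a frame laid out like DF22:

```python
import pandas as pd

df = pd.DataFrame(
    {'nation': ['Japan', 'S.Korea', 'China'],
     'capital': ['Tokyo', 'Seoul', 'Beijing'],
     'GDP': [4900, 1300, 9100]},
).reindex([2, 0, 1])  # same row order as DF22

by_label = df.loc[0]      # row whose index label is 0
by_position = df.iloc[0]  # first row by position (its label is 2)

print(by_label['nation'])     # Japan
print(by_position['nation'])  # China
```

The difference only shows up because the index labels are not in positional order here.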

Adding a column dynamically: this cannot be done with ".", only with "[]"

An example makes it clear:

DF22['population'] = [1600,130,55]

DF22

Output:

nation  capital GDP population

2   China   Beijing 9100    1600

0   Japan   Tokyo   4900    130

1   S.Korea Seoul   1300    55

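Index: row-level index

Index: the joker card of pandas data manipulation (row-level index)

A row-level index is:

metadata
possibly derived from real data, so it can also be treated as data
possibly a multi-level index, that is, a combination of several columns
exchangeable with column names, and can be stacked and unstacked to get the effect of an Excel pivot table

There are four kinds of Index... oh no, many kinds. Some important index types include:

pd.Index (plain)
Int64Index (numeric index)
MultiIndex (multi-level index, described in more detail under data manipulation)
DatetimeIndex (timestamps as the index)
PeriodIndex (time periods as the index)

Defining a plain index directly: it looks just like an ordinary Series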

index_names = ['a','b','c']

Series_for_Index = pd.Series(index_names)

print pd.Index(index_names)

print pd.Index(Series_for_Index)

Output:

Index([u'a', u'b', u'c'], dtype='object')

Index([u'a', u'b', u'c'], dtype='object')

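Unfortunately it is immutable. Remember: it cannot be changed! An example follows. (There is a pitfall here that I do not fully understand yet...)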

index_names = ['a','b','c'] 

index0 = pd.Index(index_names) 

print index0.get_values() 

index0[2] = 'd'

Output:

['a' 'b' 'c']

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-36-f34da0a8623c> in <module>()

      2 index0 = pd.Index(index_names)

      3 print index0.get_values()

----> 4 index0[2] = 'd'



C:\Anaconda\lib\site-packages\pandas\core\index.pyc in __setitem__(self, key, value)

   1055 

   1056     def __setitem__(self, key, value):

-> 1057         raise TypeError("Indexes does not support mutable operations")

   1058 

   1059     def __getitem__(self, key):



TypeError: Indexes does not support mutable operations
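The traceback above comes from item assignment on an Index. A sketch of the same failure and of one workaround, building a new Index instead of editing the old one (Python 3, recent pandas; the `delete`/`insert` combination is just one possible choice):

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'])

try:
    index0[2] = 'd'          # item assignment is not allowed
except TypeError as exc:
    print('TypeError:', exc)

# Build a new Index; index0 itself is left untouched.
index1 = index0.delete(2).insert(2, 'd')
print(index1.tolist())  # ['a', 'b', 'd']
print(index0.tolist())  # ['a', 'b', 'c']
```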

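Throw in a List of tuples and you get a MultiIndex

Unfortunately, if this list comprehension is written with parentheses instead (which makes it a generator expression), it no longer works.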

multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])

multi1.name = ['index1','index2']

print multi1

Output:

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

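For a Series, once it has a multi-level Index: data, transform!

The code below shows that:

a Series with a two-level MultiIndex can be turned into a DataFrame with unstack()
a DataFrame can be turned into a Series with a two-level MultiIndex with stack()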

data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

data_for_multi1

Output:

Row_1  Col_1     0
       Col_2     1

       Col_3     2

       Col_4     3
Row_2  Col_1     4

       Col_2     5

       Col_3     6

       Col_4     7
Row_3  Col_1     8

       Col_2     9

       Col_3    10

       Col_4    11

Row_4  Col_1    12
       Col_2    13

       Col_3    14

       Col_4    15

dtype: int32

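Looking at the output, I think I get it a bit: it is somewhat like a grouped summary in Excel. I still need to read up on it later, though.

A Series with a two-level MultiIndex can be turned into a DataFrame with unstack()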

data_for_multi1.unstack()

A DataFrame can be turned into a Series with a two-level MultiIndex with stack()

data_for_multi1.unstack().stack()

Output:

Row_1  Col_1     0

       Col_2     1

       Col_3     2

       Col_4     3

Row_2  Col_1     4

       Col_2     5

       Col_3     6

       Col_4     7

Row_3  Col_1     8

       Col_2     9

       Col_3    10

       Col_4    11

Row_4  Col_1    12

       Col_2    13

       Col_3    14

       Col_4    15

dtype: int32
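The round trip can be sketched on a smaller 2 x 2 case (Python 3, recent pandas). `MultiIndex.from_tuples` is used here in place of `pd.Index`, and the level names `row` and `col` are labels chosen for this sketch only.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Row_%d' % (x + 1), 'Col_%d' % (y + 1)) for x in range(2) for y in range(2)],
    names=['row', 'col'],  # hypothetical level names
)
s = pd.Series(range(4), index=idx)

wide = s.unstack()   # inner level becomes the columns -> DataFrame
back = wide.stack()  # columns go back into the index -> Series

print(wide.shape)                   # (2, 2)
print(wide.loc['Row_2', 'Col_1'])   # 2
print(back.tolist())                # [0, 1, 2, 3]
```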

An example of unbalanced data:

multi2 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(5) for y in xrange(x)])

multi2

Output:

MultiIndex(levels=[[u'Row_2', u'Row_3', u'Row_4', u'Row_5'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10),index = multi2)

data_for_multi2

Output:

Row_2  Col_1    0

Row_3  Col_1    1

       Col_2    2

Row_4  Col_1    3

       Col_2    4

       Col_3    5

Row_5  Col_1    6

       Col_2    7

       Col_3    8

       Col_4    9

dtype: int32

The datetime standard library is so handy, you deserve to have it

import datetime

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

pd.DatetimeIndex(dates)

Output:

DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)

If you need not only a uniform time format but also a uniform time frequency

periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

print periodindex1

Output:

PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')

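How do you convert between monthly and daily precision?

Some companies use the 1st to represent the month, others use the last day of the month. Converting between the two is tedious, so use asfreq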

print periodindex1.asfreq('D',how='start')

print periodindex1.asfreq('D',how='end')

Output:

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')

And finally, how do I actually align data of two different time frequencies?

periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')

print periodindex_mon

print periodindex_day

Output:

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',

             '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',

             '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',

             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',

             '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',

             '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',

             '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',

             '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',

             '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',

             '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',

             '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',

             '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',

             '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',

             '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',

             '2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',

             '2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',

             '2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',

             '2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',

             '2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',

             '2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',

             '2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',

             '2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',

             '2015-03-30', '2015-03-31'],

            dtype='int64', freq='D')

Coarse-grained data + `reindex` + `ffill/bfill`

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

full_ts
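The same idea in a self-contained sketch, for Python 3 and a recent pandas. Two things here are assumptions of the sketch rather than the original: a DatetimeIndex is swapped in for the PeriodIndex to keep the example simple, and the monthly values are made up.

```python
import pandas as pd

# One value per month, stamped on the first day of the month (made-up numbers).
monthly = pd.Series(
    [100, 200, 300],
    index=pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01']),
)
days = pd.date_range('2015-01-01', '2015-03-31', freq='D')

# Reindex to daily resolution and carry each monthly value forward.
daily = monthly.reindex(days, method='ffill')

print(len(daily))                            # 90
print(daily.loc[pd.Timestamp('2015-02-15')]) # 200
```

Using `method='bfill'` instead would fill each day from the next available value rather than the previous one.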

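What convenient operations does an index offer?

As described earlier, an index is ordered and may contain duplicates, yet to some extent it can still be accessed by key. In other words, certain set operations are supported.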

index1 = pd.Index(['A','B','B','C','C'])

index2 = pd.Index(['C','D','E','E','F'])

index3 = pd.Index(['B','C','A'])

print index1.append(index2)

print index1.difference(index2)

print index1.intersection(index2)

print index1.union(index2) # Support unique-value Index well

print index1.isin(index2)

print index1.delete(2)

print index1.insert(0,'K') # Not suggested

print index3.drop('A') # Support unique-value Index well

print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic

print index1.is_unique,index2.is_unique,index3.is_unique

Output:

Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

Index([u'A', u'B'], dtype='object')

Index([u'C', u'C'], dtype='object')

Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

[False False False  True  True]

Index([u'A', u'B', u'C', u'C'], dtype='object')

Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')

Index([u'B', u'C'], dtype='object')

True True False

False False True
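A sketch of the set-style operations on indexes with unique values, where their behaviour is easiest to predict (Python 3, recent pandas). Note that `is_monotonic` is no longer available in pandas 2.x; `is_monotonic_increasing` is used instead.

```python
import pandas as pd

left = pd.Index(['A', 'B', 'C'])
right = pd.Index(['C', 'D'])

print(left.union(right).tolist())         # ['A', 'B', 'C', 'D']
print(left.intersection(right).tolist())  # ['C']
print(left.difference(right).tolist())    # ['A', 'B']
print(left.isin(right).tolist())          # [False, False, True]
print(left.is_unique, left.is_monotonic_increasing)  # True True
```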

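Coming and going freely in the panda world: pandas I/O

It is an old topic, but at the basic level we still care about how pandas interacts with external data.

Structured data input and output

read_csv and to_csv are a pair of input and output tools: read_csv returns a pandas.DataFrame directly, and to_csv writes the file as soon as it is called
- read_table: similar functionality
- read_fwf: handles fixed-width files
read_excel and to_excel make it easy to exchange data with Excel
header indicates whether the data contains column names: if they are on row 0, write 0, and reading the data then starts after that row; if there are none, write None
names specifies the column names to use as the final column names
encoding is the character encoding of the dataset; data is usually encoded in utf-8 as a standard to make file transfer easier

Here I use a CSV file of my own, because I could not find the data used in the referenced PDF.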

cnames=['經度','緯度']

taxidata2 = pd.read_csv('20140401.csv',header = 4,names=cnames,encoding='utf-8')

taxidata2

For the full list of parameters, see the API:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

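Some commonly used parameters:

Reading:

skiprows: skip a given number of rows
nrows: read only a given number of rows
skipfooter: a fixed number of rows at the end that are never read
skip_blank_lines: skip blank lines

Content handling:

sep/delimiter: the separator matters; common ones are the comma, the space and Tab ('\t')
na_values: the values that should be treated as NA
thousands: thousands separators in numeric values are not uniform (both 1.234.567,89 and 1,234,567.89 are possible), so converting such strings to numbers requires stating the thousands separator

Finishing touches:

index_col: use an actual column (given by column number, or even by column name) as the index
squeeze: when only one column is read, return a pandas.Series instead of a pandas.DataFrame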
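Several of these parameters can be tried without a file on disk by reading from an in-memory string. The sketch below assumes Python 3 and a recent pandas; the tab-separated text, the column names and the population figures are all made up for illustration.

```python
import io
import pandas as pd

# Made-up text: one junk line, a header row, two records.
raw = "exported by some tool\ncity\tpopulation\nBeijing\t21,540,000\nTokyo\t13,960,000\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',
    skiprows=1,             # drop the junk line
    header=0,               # the next line holds the original column names
    names=['city', 'pop'],  # ...which are replaced by these
    thousands=',',          # parse "21,540,000" as a number
    index_col='city',       # use the city column as the index
)

print(df.loc['Beijing', 'pop'])  # 21540000
print(df.index.tolist())         # ['Beijing', 'Tokyo']
```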

Excel ... ?

For very regular data there is really no need to store it in Excel, although pandas kindly provides an I/O interface for it.

taxidata2.to_excel('t0401.xlsx',encoding='utf-8')

taxidata_from_excel = pd.read_excel('t0401.xlsx',header=0, encoding='utf-8')

taxidata_from_excel

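Note: when your xls file has more than 65536 rows you will hit an error; the fix is to write in xlsx format instead.

The one important parameter: sheetname=k (spelled sheet_name in newer pandas), which means the k-th sheet of the Excel file is read (counting from 0).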

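Semi-structured data

JSON: a data format commonly used for network transfer.

On closer inspection, it has the same character as the heterogeneous data we usually collect:

Column names do not fully match
Keys may not be unique
Metadata is stored inside the data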

import json

json_data = [{'name':'Wang','sal':50000,'job':'VP'},\

 {'name':'Zhang','job':'Manager','report':'VP'},\

 {'name':'Li','sal':5000,'report':'IT'}]

data_employee = pd.read_json(json.dumps(json_data))

data_employee_ri = data_employee.reindex(columns=['name','job','sal','report'])

data_employee_ri
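What happens to the keys that some records lack can be checked with a short sketch (Python 3, recent pandas). Wrapping the JSON text in `io.StringIO` is a choice made here so that `read_json` receives a file-like object.

```python
import io
import json
import pandas as pd

records = [{'name': 'Wang', 'sal': 50000, 'job': 'VP'},
           {'name': 'Zhang', 'job': 'Manager', 'report': 'VP'},
           {'name': 'Li', 'sal': 5000, 'report': 'IT'}]

df = pd.read_json(io.StringIO(json.dumps(records)))
df = df.reindex(columns=['name', 'job', 'sal', 'report'])

# Keys missing from a record show up as missing values in that row.
missing = df.isnull().sum()
print(missing)  # one missing value each in job, sal and report
```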


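Digging deeper into pandas data manipulation

Building on the earlier parts, there are more ways to manipulate data:

Select data by column name and row index, combining ix and iloc to flexibly take a subset of the data (covered in the first part)
Concatenate by record (like Union All) or relate tables (join)
Convenient statistical functions and mapping with custom functions
Sorting
Handling missing values
Pivot tables as flexible as Excel's (covered in more detail in part four)

Combining datasets

Record-wise concatenation (stacking rows): directly with DataFrame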

pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])

Record-wise concatenation (stacking rows): Concatenate

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])

Column-wise joining: Merge

Join on data columns with the on keyword

One or more columns can be specified
left_on and right_on can be used

pd.merge(data_employee_ri,data_employee_ri,on='name')

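To join on the index, use left_index and right_index directly

TIPS: add the how keyword and set it to one of

how = 'inner'
how = 'left'
how = 'right'
how = 'outer'

Combined with how, merge essentially reproduces what SQL joins offer while keeping the code tidy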
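A compact sketch of both operations, for Python 3 and a recent pandas. The two small tables `staff` and `pay` are made up for illustration; only `pd.concat` and `pd.merge` with `on` and `how` come from the text above.

```python
import pandas as pd

# Made-up tables for illustration.
staff = pd.DataFrame({'name': ['Wang', 'Zhang', 'Li'], 'job': ['VP', 'Manager', None]})
pay = pd.DataFrame({'name': ['Wang', 'Li', 'Zhao'], 'sal': [50000, 5000, 7000]})

# Rows appended one after another, like UNION ALL.
stacked = pd.concat([staff, staff], ignore_index=True)

inner = pd.merge(staff, pay, on='name', how='inner')  # names present in both tables
outer = pd.merge(staff, pay, on='name', how='outer')  # names present in either table

print(len(stacked), len(inner), len(outer))  # 6 2 4
```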

Mapping with custom functions

dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF32 = pd.DataFrame(dataNumPy32,columns=['nation','capital','GDP'])

DF32

map: apply one mapping to a column of data by the same rule, that is, run every value through the same function

def GDP_Factorize(v):
    # GDP values are stored as strings here, so convert before comparing
    fv = np.float64(v)
    if fv > 6000.0:
        return 'High'
    elif fv < 2000.0:
        return 'Low'
    else:
        return 'Medium'



DF32['GDP_Level'] = DF32['GDP'].map(GDP_Factorize)

DF32['NATION'] = DF32.nation.map(str.upper)

DF32
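The expected result of the two `map` calls can be sketched as below (Python 3, recent pandas). The function `gdp_level` is a renamed copy of GDP_Factorize using the same thresholds; GDP is kept as strings to mirror DF32.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'S.Korea', 'China'],
                   'GDP': ['4000', '1300', '9100']})  # strings, as in DF32

def gdp_level(v):  # same thresholds as GDP_Factorize
    fv = float(v)
    if fv > 6000.0:
        return 'High'
    elif fv < 2000.0:
        return 'Low'
    return 'Medium'

df['GDP_Level'] = df['GDP'].map(gdp_level)   # custom function, value by value
df['NATION'] = df['nation'].map(str.upper)   # built-in method works the same way

print(df['GDP_Level'].tolist())  # ['Medium', 'Low', 'High']
print(df['NATION'].tolist())     # ['JAPAN', 'S.KOREA', 'CHINA']
```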

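Sorting

sort: row-level sorting by the values of one or more columns (DataFrame.sort_values in newer pandas)
sort_index: sorts by the values in the index, and axis decides whether rows or columns are reordered

sort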

dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF33 = pd.DataFrame(dataNumPy33,columns=['nation','capital','GDP'])

DF33

DF33.sort(['capital','nation'],ascending=False)

ascending means ascending order; passing ascending=False, as here, sorts in descending order.

sort_index

DF33.sort_index(axis=1,ascending=True)

A handy feature: Rank

DF33.rank()
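The sorting and ranking calls can be sketched together for Python 3 and a recent pandas, where `sort_values` takes the place of `sort`. GDP is stored as integers here (an assumption of this sketch) so that the ordering is numeric.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'S.Korea', 'China'],
                   'GDP': [4000, 1300, 9100]})

by_gdp_desc = df.sort_values('GDP', ascending=False)  # largest GDP first
cols_sorted = df.sort_index(axis=1)                   # reorder columns by name

print(by_gdp_desc['nation'].tolist())  # ['China', 'Japan', 'S.Korea']
print(cols_sorted.columns.tolist())    # ['GDP', 'nation']
print(df['GDP'].rank().tolist())       # [2.0, 1.0, 3.0]
```

`rank` returns each value's position in ascending order, starting at 1.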

Handling missing data

Ignoring missing values:

DF34.mean(skipna=True)

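If missing values are not ignored (skipna=False), the mean of any column containing a missing value comes out as NaN.

If you do not want to ignore missing values, it is time to bring out fillna:

Note: I am guessing here that axis=1 means working along each row? I still need to read more and look things up.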
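Since DF34 is not defined in this post, the sketch below uses a small made-up frame to show `skipna`, `fillna` and the effect of `axis=1` (Python 3, recent pandas).

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})

print(df.mean(skipna=True).tolist())     # [2.0, 20.0]
print(df['a'].mean(skipna=False))        # nan

filled = df.fillna(0)                    # replace NaN with a constant
print(filled.mean().round(3).tolist())   # [1.333, 20.0]

# axis=1 computes across the columns, giving one result per row.
print(df.mean(axis=1).tolist())          # [5.5, 20.0, 16.5]
```

Filling with 0 changes the mean of column `a`, so the fill value is a modelling decision rather than a neutral default.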

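A "group" of pandas: pandas groupby

groupby works much like the group by keyword in SQL:

Split-Apply-Combine

Split: divide the data into groups by some rule
Apply: use an agg function that takes a pd.Series as input and returns a single value
Combine: collect the results

What makes pandas groupby flexible:

The grouping key can come from the index or from actual column data
The grouping rule can use one column or several

There is no concrete data here, so a screenshot is kept for future reference.

Grouping makes it quick to implement MapReduce logic

Map: specify the column label to group by; different values are sent to different groups for processing
Reduce: takes several values and returns one; this can usually be done with agg, which accepts a function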
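The split-apply-combine steps can be sketched as follows, assuming Python 3 and a recent pandas. The `dept` and `sal` columns and their values are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({'dept': ['IT', 'IT', 'HR', 'HR', 'HR'],
                   'sal': [5000, 7000, 4000, 6000, 5000]})

# Split by 'dept', apply a reducer to each group's 'sal', combine the results.
mean_sal = df.groupby('dept')['sal'].agg('mean')
spread = df.groupby('dept')['sal'].agg(lambda x: x.max() - x.min())  # custom reducer

print(mean_sal.index.tolist())  # ['HR', 'IT']
print(mean_sal.tolist())        # [5000.0, 6000.0]
print(spread.tolist())          # [2000, 2000]
```

The group keys end up as the index of the result, sorted by key by default.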

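References:

S1EP3_Pandas.pdf: material saved on my computer at some point, which I came across again today. Thanks to its author.

Blog address of this post: Introduction to Python Data Analysis: A Summary of pandas Basics
This article was written by Michael翔 and is licensed under the Creative Commons Attribution 3.0 China Mainland license. It may be freely reposted and quoted, provided the author is credited and the source is stated.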
