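pandas is a solid tool that Python offers for data analysis, and in many places it has a lot in common with R. In data analysis, a more sophisticated tool is not automatically a better one: Excel, R, and Python are different tools for different situations, each with its own strengths and weaknesses.
It is like building a shelf or making a piece of craftwork: some jobs call for a small hammer, others need a big axe.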
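Excel is in fact a powerful weapon for data analysis. With small volumes of data it has a natural advantage, while Python and R are more like high-performance data processing tools.
But knowing how to use all sorts of impressive tools is worth nothing if the data never lands in practice. Landing the analysis and fitting it to the business is always where data should end up. I recently moved to an R&D department, where I have fewer chances to work close to the business, and that has made me feel this even more strongly.
The points below mostly look at Python (pandas) from an Excel user's perspective.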
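1. A pandas index works like the x-axis (the row labels). It can also be treated like a list: values can be read by position or by slice, just as with an array, and values in a Series can be assigned through its labels. The Index object itself is immutable, so labels are changed by building a new Index, not by assigning into the old one.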
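2. Two statements for checking properties. in tests whether a given field exists in the columns or the index. is tests whether two names refer to the very same object (identity); it does not check the data type.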
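Points 1 and 2 can be tried out in a few lines. This is a minimal sketch, not taken from the original notes: the variable names (first_label, alias, clone and so on) are made up for illustration, and it assumes a reasonably recent pandas release.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Reading works like a sequence: by position or by slice.
first_label = index[0]
tail = index[1:]

# Writing does not: an Index is immutable, so item assignment raises TypeError.
try:
    index[1] = 'd'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# `in` tests whether a label exists.
has_b = 'b' in obj.index
has_z = 'z' in obj.index

# `is` tests identity: the same object, not merely equal contents.
alias = index          # a second name for the same object
clone = index.copy()   # a different object with equal contents
print(alias is index, clone is index, clone.equals(index))
```

The last line shows why is should not be used to compare contents: the copy holds the same labels but is a different object, so equals is the right check there.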
3. Another basic feature: reindex, the utility function for rearranging an index.
Basic form:
obj.reindex(['a','b','c','d','e'],fill_value = 0.0)
What it does: it can change the labels on both the row axis and the column axis.
frame = DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=['Ohio','Texas','California'])
frame.reindex(['a','b','c','d'])
frame.reindex(columns=['Ohio','Texas','California','NewYork'])
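To see what reindex does with labels that did not exist before, here is a small sketch built on the same objects. The result names (with_nan, with_zero, filled, more_rows, more_cols) are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# The new label 'e' has no value yet, so it comes back as NaN ...
with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])
# ... unless a fill value is supplied.
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

# On an ordered index, method='ffill' carries the last seen value forward.
colors = pd.Series(['blue', 'green', 'black'], index=[0, 2, 4])
filled = colors.reindex(np.arange(5), method='ffill')

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])

# A new row label adds a row of NaN; a new column label adds a column of NaN.
more_rows = frame.reindex(['a', 'b', 'c', 'd'])
more_cols = frame.reindex(columns=['Ohio', 'Texas', 'California', 'NewYork'])

print(with_nan, with_zero, filled, more_rows, more_cols, sep='\n\n')
```

reindex always returns a new object; obj and frame themselves are left unchanged.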
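4. Selecting a sub-matrix of the data. Older pandas releases did this with frame.ix, which could also act like reindex. .ix was removed in pandas 1.0; .loc now selects by label, and reindex covers the case where some of the requested labels do not exist yet.
frame.reindex(index=['a','b','d'],columns=states)
data.loc[['Colorado','Utah'],['three','four']]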
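On a current pandas release, selections of this kind are written with .loc (labels) and .iloc (positions). The sketch below is one way to do it, under the assumption that data is the 4x4 frame used later in these notes; the result names and the labels 'Texas' and 'five' are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: rows and columns are picked by name.
sub = data.loc[['Colorado', 'Utah'], ['three', 'four']]

# Position-based: the third row.
third_row = data.iloc[2]

# Label slices include the end label; position slices exclude the end.
by_label = data.loc[:'Utah', 'two']   # Ohio, Colorado, Utah
by_pos = data.iloc[:2]['two']         # Ohio, Colorado

# .loc raises KeyError for labels that do not exist;
# reindex inserts NaN rows and columns for them instead.
padded = data.reindex(index=['Ohio', 'Utah', 'Texas'], columns=['one', 'five'])

print(sub, third_row, by_label, by_pos, padded, sep='\n\n')
```

Keeping labels and positions in separate accessors avoids the ambiguity .ix had when an index contained integers.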
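5. By now you can start to sense that arithmetic between two pandas objects resembles linear algebra.
The result of combining two DataFrames looks much like adding two matrices, except that values are matched element by element on their row and column labels, not by position.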
df1+df2
df1.add(df2,fill_value=0)
df1.mul(df2,fill_value=0)
df1.div(df2,fill_value=0)
df1.sub(df2,fill_value=0)
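The difference between the plain operator and the method with fill_value shows up when the two frames only partly overlap. A minimal sketch, using two small frames like those in the full listing; the result names are illustrative. It also compares element-wise multiplication with a true matrix product, since the linear-algebra analogy only goes so far.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Plain +: the result covers the union of labels, and any cell
# missing on either side becomes NaN.
plain = df1 + df2

# add(..., fill_value=0): a cell missing on one side is treated as 0;
# a cell missing on both sides is still NaN.
filled = df1.add(df2, fill_value=0)

# Arithmetic is element-wise. A matrix product is a different operation.
elementwise = df1 * df1
matrix = df1.to_numpy() @ df1.to_numpy()

print(plain, filled, elementwise, matrix, sep='\n\n')
```

Here plain has values only where both the row label and the column label exist in both frames (Ohio and Texas, columns b and d), while filled keeps every value that exists in at least one frame.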
The full code is below
========================================================================================
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# pandas Index objects hold the axis labels, for example those of a Series
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
# an Index is immutable: the next line raises TypeError if uncommented
# index[1] = 'd'
# so build the index with a constructor instead
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2
# checking an index: use is for identity, in for membership
obj2.index is index
'Ohio' in frame3.columns   # frame3 is not defined in this listing; substitute any DataFrame
'2002' in obj2.index

# Essential functionality
# reindexing
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
# fill the missing data
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
# fill the missing data in order (forward fill)
obj3 = Series(['blue', 'green', 'black'], index=[0, 2, 4])
obj3.reindex(np.arange(5), method='ffill')
# reindex can alter the rows, the columns, or both in a DataFrame
frame = DataFrame(np.arange(9).reshape(3, 3), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'California'])
frame.reindex(['a', 'b', 'c', 'd'])
frame.reindex(columns=['Ohio', 'Texas', 'California', 'NewYork'])
months = ['APR', 'MAY', 'JUN', 'JUL', 'AUG']
frame.reindex(columns=months)
label = ['a', 'b', 'c', 'd', 'e']
states = ['Ohio', 'Texas', 'California', 'NewYork']
# here the forward fill is applied along the row axis only
frame.reindex(label, method='ffill')
# take a sub-matrix (.ix was removed in pandas 1.0; reindex tolerates missing labels)
frame.reindex(index=['a', 'b', 'd'], columns=states)

# dropping entries from an axis
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
# drop from a DataFrame
data = DataFrame(np.arange(16).reshape(4, 4), index=['Ohio', 'Colorado', 'Utah', 'NewYork'], columns=['one', 'two', 'three', 'four'])
# drop from the index
data.drop(['Colorado', 'Utah'])
# drop from the columns
data.drop('two', axis=1)

# indexing, selection, filtering
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
# a Series can be indexed like an array: by position or by label, one value or several
obj['b']
obj.iloc[1]
obj.iloc[1:2]
obj[['a', 'c', 'd']]
obj.iloc[[1, 3]]
obj[obj < 2]
obj['b':'c'] = 5
data = DataFrame(np.arange(16).reshape(4, 4), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
# plain [] selects by column, along one dimension only
data['two']
data[['three', 'one']]
data.loc['Ohio']
data[data['three'] > 5]
data[:2]
# set every value of data below 5 to 0
data[data < 5] = 0
# select values by label (.loc) or by position (.iloc)
data.loc['Colorado', 'two']
data.loc['Colorado', ['two', 'three']]
data.loc[['Colorado', 'Utah'], ['three', 'four']]
data.iloc[2]
data.loc[:'Utah', 'two']
data['two'].iloc[:2]
data.loc[data.three > 5, data.columns[:3]]
# reorder columns while selecting
data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

# arithmetic and data alignment
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
# labels that do not overlap give NaN
s1 + s2
# DataFrame
df1 = DataFrame(np.arange(9.).reshape(3, 3), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
# if either side is missing, the result is missing
df1 + df2
# if one side has a value, the missing side is treated as 0
df1.add(df2, fill_value=0)
# reindex
df1.reindex(columns=df2.columns, fill_value=0)
df1 = DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))
df1.add(df2, fill_value=0)
df1.mul(df2, fill_value=0)
df1.div(df2, fill_value=0)
df1.sub(df2, fill_value=0)