Python Data Analysis: Pandas (Series and DataFrame)

Date: 2019-12-08


Introduction to Pandas:

  pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

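Main features of Pandas:
  1) Purpose-built data structures: DataFrame and Series
  2) Integrated time-series functionality
  3) A rich set of mathematical operations
  4) Flexible handling of missing data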

Installing and importing it in Python:
  Install: pip install pandas
  Import: import pandas as pd

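Creating a Series:

Create an empty Series

import pandas as pd
s = pd.Series(dtype='float64')  # pass dtype explicitly; newer pandas versions no longer default an empty Series to float64
print(s)  # Series([], dtype: float64)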

Pass in a list

data = ['a', 'b', 'c', 'd']
res = pd.Series(data)
print(res)

'''Result
0    a
1    b
2    c
3    d
dtype: object

No index was passed, so by default the Series is assigned the index 0 to len(data)-1, i.e. 0 to 3.
'''

Pass in a dictionary

data = {'a': 0, 'b': 1, 'c': 2}
s = pd.Series(data)
print(s)
'''Result
a    0
b    1
c    2
dtype: int64

Note: the dictionary keys are used to build the index.

'''
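As a small illustration of how the dictionary keys become the index, the sketch below passes an explicit index alongside the dictionary. The label 'd' is an invented example that is deliberately absent from the dictionary. Values are looked up by key, reordered to match the requested index, and a missing key gives NaN.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}

# Request the labels in a different order, plus a label 'd' that is not in the dict
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than int64.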

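Create from a scalar:

If the data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

res = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(res)

'''Result
a    0
b    0
c    0
d    0
dtype: int64

'''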

Specifying custom index labels:

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])
print(res)

'''Result
a_index    a
b_index    b
c_index    c
d_index    d
dtype: object

'''

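Accessing data in a Series by position:

Key point: positions are counted from zero, so the first element is stored at position 0.

Viewing the index and the values:

# View the index of the Series
print(res.index)

# View the values of the Series
print(res.values)


# Get a value by position (counting starts at zero)
print(res.iloc[0])  # a  (res[0] on a string-labelled Series is deprecated in pandas 2.x, so iloc is used here)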

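Take the first three values (the end position is excluded)

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])

# Take the first three values (position 3 is excluded)
print(res[:3])  # the result is still a Series; use res[:3].values to get the underlying array
'''Result

a_index    a
b_index    b
c_index    c
dtype: object

'''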

Take the last three values:

print(res[-3:])

'''Result
b_index    b
c_index    c
d_index    d
dtype: object

'''

Retrieving and setting data by index label:

Modify a value

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])
print(res)
res['a_index'] = 'new_a'
print(res)

'''Result (after the assignment)

a_index    new_a
b_index        b
c_index        c
d_index        d
dtype: object

'''
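The positional and label-based access shown so far can be written explicitly with iloc and loc. The following is a minimal sketch using the same Series; the variable names are illustrative only.

```python
import pandas as pd

res = pd.Series(['a', 'b', 'c', 'd'],
                index=['a_index', 'b_index', 'c_index', 'd_index'])

# iloc always means "by position"
first_by_position = res.iloc[0]

# loc always means "by label"
first_by_label = res.loc['a_index']

# A list of labels selects several entries at once and returns a Series
picked = res.loc[['a_index', 'c_index']]

print(first_by_position, first_by_label)
print(picked)
```

Spelling out loc or iloc avoids any ambiguity about whether a key is a label or a position.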

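Copy the data, then modify it

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])

# copy() makes an independent copy of the slice, which can then be modified
sr3 = sr1[1:].copy()
print(sr3)

sr3.iloc[0] = 1888  # set by position; sr3[0] on a string-labelled Series is deprecated in pandas 2.x
print(sr3)

'''Result
a    13
d    14
dtype: int64

a    1888
d      14
dtype: int64
'''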
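The point of copy() is that changes to the copy leave the source untouched. A short sketch that checks this on the same data:

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])

# Take everything from position 1 onward as an independent copy
sr3 = sr1.iloc[1:].copy()
sr3.iloc[0] = 1888

print(sr3)  # the copy holds the new value
print(sr1)  # the original Series is unchanged
```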

Arithmetic:

Build two Series to start with

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])

sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])

print(sr1 + sr2)
'''Result
a    29
c    27
d    28
dtype: int64

'''

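Addition with automatic alignment

Pandas aligns the operands by index label: values that share a label are added together, and a label that is missing from either Series gives NaN.

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])
print(sr3)
# Compute sr1 + sr3
print(sr1 + sr3)
'''Result

a    23.0
b     NaN  # sr1 has no label 'b', so the result is NaN
c    32.0
d    25.0
dtype: float64

'''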
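When NaN is not wanted for unmatched labels, the Series.add method accepts a fill_value that stands in for the missing side. A minimal sketch with the same two Series, assuming 0 is an acceptable substitute for a missing entry:

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

# Treat a label that is missing from one operand as 0 instead of producing NaN
total = sr1.add(sr3, fill_value=0)
print(total)
```

Here the label 'b' exists only in sr3, so its result is 0 + 14 rather than NaN.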

For Series data, Pandas handles NaN values as follows:

# First build some data with a missing value
sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])

sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

# Adding the two produces a Series with a missing value
sr4 = sr1 + sr3
print(sr4)

'''Result

a    23.0
b     NaN
c    32.0
d    25.0
dtype: float64

'''

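Step 1: pd.isnull(series) and pd.notnull(series) are used to find and filter NaN values.

isnull returns a boolean Series in which missing values are True

# isnull returns a boolean Series; missing values map to True
res = pd.isnull(sr4)
print(res)

'''Result
a    False
b     True
c    False
d    False
dtype: bool

'''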

notnull returns a boolean Series in which missing values are False

# notnull returns a boolean Series; missing values map to False
res = pd.notnull(sr4)
print(res)
'''Result
a     True
b    False
c     True
d     True
dtype: bool

'''
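The boolean Series returned by isnull and notnull can be used directly as a mask. The sketch below counts the missing entries and keeps only the present ones; the variable names are illustrative.

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])
sr4 = sr1 + sr3

# True counts as 1, so summing the mask counts the missing values
missing_count = int(sr4.isnull().sum())

# Indexing with a boolean mask keeps only the rows where the mask is True
present = sr4[sr4.notnull()]

print(missing_count)
print(present)
```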

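Step 2: dropna removes the entries that contain NaN. For a Series, call it as a method, sr4.dropna(), or equivalently as pd.Series.dropna(sr4).

There is no top-level pd.dropna() function, so that form does not work.

dropna removes the NaN rows (a Series only has rows)

# dropna filters out the rows that contain NaN
res = sr4.dropna()  # same as pd.Series.dropna(sr4)
print(res)

'''Result
a    23.0
c    32.0
d    25.0
dtype: float64

'''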

Step 3: series.fillna(value), where value is what the missing entries are filled with

fillna fills in missing data

# fillna fills the missing NaN entries
res = sr4.fillna('filler for NaN')
print(res)

'''Result
a                23
b    filler for NaN
c                32
d                25
dtype: object

'''
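Filling with a string turns the whole Series into dtype object. If the data should stay numeric, a numeric fill value can be used instead. A small sketch, where filling with 0 or with the mean are just two example choices:

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])
sr4 = sr1 + sr3

# Fill with a constant; the Series stays float64
filled_zero = sr4.fillna(0)

# Fill with the mean of the non-missing values (mean() skips NaN by default)
filled_mean = sr4.fillna(sr4.mean())

print(filled_zero)
print(filled_mean)
```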

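Creating a DataFrame

A DataFrame is a two-dimensional, table-like data structure, very close to a spreadsheet or a MySQL table, and it holds an ordered set of columns.
A DataFrame can be viewed as a dict of Series that share one index.

Construction

Simple approach

data = {'name': ['google', 'baidu', 'yahoo'], 'marks': [100, 200, 300], 'price': [1, 2, 3]}
res = pd.DataFrame(data)
print(res)

'''Result (the default index starts at 0)

     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

'''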

Supplement: building a DataFrame from Series

# Building a DataFrame from Series
res = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})
print(res)

'''Result
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

'''
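The same NaN filling happens when a DataFrame is built row by row from a list of dictionaries, which is another common construction. In this sketch the second row intentionally omits the price key; the data is made up for illustration.

```python
import pandas as pd

rows = [
    {'name': 'google', 'marks': 100, 'price': 1},
    {'name': 'baidu', 'marks': 200},  # no 'price' key in this row
]

# Each dict becomes one row; keys become column labels
df = pd.DataFrame(rows)
print(df)
```

The price column becomes floating point because it has to hold NaN for the second row.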

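DataFrame attributes and methods

  1) index: get the row index
  2) T: transpose
  3) columns: get the column index
  4) values: get the array of values
  5) describe(): get quick summary statistics
  6) sort_index(axis, ..., ascending): sort by row or column index
  7) sort_values(by, axis, ascending): sort by values

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)
print(res)
''' The queries below are checked against this DataFrame
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

'''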

index: get the row index

# index: view the row index
print(res.index)    # RangeIndex(start=0, stop=3, step=1)

columns: view the column index

# columns: view the column index
print(res.columns)   # Index(['name', 'marks', 'price'], dtype='object')

values: get the array of values

# values: view the array of values
print(res.values)

'''Result

[['google' 100 1]
 ['baidu' 200 2]
 ['yahoo' 300 3]]

'''
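The T attribute from the list above is not demonstrated elsewhere, so here is a brief sketch of it on the same data. Transposing swaps rows and columns, so the column labels become the row index.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# T returns the transposed DataFrame: rows become columns and vice versa
transposed = res.T
print(transposed)
print(transposed.index)
```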

describe(): get quick summary statistics

# describe(): quick summary statistics for the numeric columns
print(res.describe())
'''Result
       marks  price
count    3.0    3.0
mean   200.0    2.0
std    100.0    1.0
min    100.0    1.0
25%    150.0    1.5
50%    200.0    2.0
75%    250.0    2.5
max    300.0    3.0

'''

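sort_index(): sort by row or column index

Parameters: axis=0 sorts by the row index and axis=1 by the column index; ascending=True sorts in ascending order and False in descending order (the default is True).

# axis=0: sort by the row index
res = res.sort_index(axis=0)
print(res)

'''Result of sorting by row index
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''


# axis=1: sort by the column index
res = res.sort_index(axis=1, ascending=True)
print(res)
'''Result of sorting by column index
   marks    name  price
0    100  google      1
1    200   baidu      2
2    300   yahoo      3

'''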

sort_values(by, axis, ascending): sort by values

# sort_values(by, axis, ascending): sort by values
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)
res = res.sort_values(by=['name'], axis=0)  # axis must be 0 when sorting by a column: whole rows are reordered according to the values in that column
print(res)

'''Result of sorting by value
     name  marks  price
1   baidu    200      2
0  google    100      1
2   yahoo    300      3

'''
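As the result above shows, the original row labels travel with the rows when sorting. The sketch below sorts in descending order and then renumbers the rows with reset_index, which is one common way to get a fresh 0-based index afterwards.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# Sort by the marks column, largest first
by_marks_desc = res.sort_values(by='marks', ascending=False)
print(by_marks_desc)

# drop=True discards the old labels instead of keeping them as a new column
renumbered = by_marks_desc.reset_index(drop=True)
print(renumbered)
```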

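Specifying index labels manually

# Specify the index labels manually (the labels '第一', '第二', '第三' mean first, second, third)
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])
print(res)

'''Result

      name  marks  price
第一  google    100      1
第二   baidu    200      2
第三   yahoo    300      3

'''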

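Selecting values (by row index and column label)

Get a single column, for example the name column

# Get a single column
# 1. Get the name column
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])
print(res['name'])  # printed directly so that res stays the full DataFrame for the next snippets

'''Result

第一    google
第二     baidu
第三     yahoo
Name: name, dtype: object

'''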

Get the price column

# 2. Get the price column (res is the full DataFrame with the name, marks and price columns)
print(res['price'])
'''Result
第一    1
第二    2
第三    3
Name: price, dtype: int64

'''

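Get two columns

Selecting two labels at once (note: the labels must be wrapped in double square brackets)

# Select two labels at once (note the double square brackets: a list of labels inside the indexer)
print(res[['name', 'price']])

'''Result
      name  price
第一  google      1
第二   baidu      2
第三   yahoo      3

'''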

Get a single value

# Take the name column first, then take the first value from that column
print(res['name'].iloc[0])  # google

Take the first two rows

# Take the first two rows
print(res[0:2])
'''Result
      name  marks  price
第一  google    100      1
第二   baidu    200      2
'''

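Take the first two rows, then pick specific columns from them

# Take the first two rows, then pick specific columns from them
print(res[0:2][['name', 'price']])

'''Result (note: selecting several labels needs double brackets)
      name  price
第一  google      1
第二   baidu      2

'''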

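ix: an older indexer that combined the behaviour of loc and iloc described below. It accepted either row/column labels or row/column positions, with the row selector first and the column selector second. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use loc (labels) or iloc (positions) instead. Take the following DataFrame as an example:

import pandas as pd
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=["a", "b", "c"])
data

    A   B   C
a   1   4   7
b   2   5   8
c   3   6   9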

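For example, to get the value 5

Method 1

data.iloc[1, 1]              # by position; formerly data.ix[1, 1] (ix was removed in pandas 1.0)

data.loc['b':'c', 'B':'C']   # by label, returns the 2x2 block that contains 5; formerly data.ix['b':'c', 'B':'C']

Method 2

data.iloc[1:3, 1:3]          # by position, the same 2x2 block; formerly data.ix[1:3, 1:3]
data.loc['b':'c', 'B':'C']   # by label; formerly data.ix['b':'c', 'B':'C']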
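The main thing ix allowed was mixing a label on one axis with a position on the other. With loc and iloc the same effect needs one small translation step. The sketch below shows two ways to do that on the example DataFrame; it is one possible approach rather than the only one.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                    index=['a', 'b', 'c'])

# Row by label, column by position: turn the column position into a label first
value_a = data.loc['b', data.columns[1]]

# Row by position, column by label: turn the column label into a position first
value_b = data.iloc[1, data.columns.get_loc('B')]

print(value_a, value_b)
```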

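loc: select by label

Select a specific set of labels

# Select a specific set of labels (data here is the name/marks/price dict used earlier)
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])
print(res.loc[:, ['name', 'marks']])
'''Result
      name  marks
第一  google    100
第二   baidu    200
第三   yahoo    300
'''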

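Select a range of labels

# Select a range of labels (both end labels are included; res is the full three-column DataFrame)
print(res.loc[:, 'name':'price'])
'''Result
      name  marks  price
第一  google    100      1
第二   baidu    200      2
第三   yahoo    300      3

'''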

Selecting by row index plus column label

# Select by row index plus column label
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)
print(res)

'''Initial result
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

'''

# Combine a row label with a list of column labels
print(res.loc[0, ['name']])

'''Result

name    google
Name: 0, dtype: object

'''

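Selecting by a row range together with column labels (note that 0:1 includes 1)

# Select by a row-label range and column labels (note that 0:1 includes 1)
print(res.loc[0:1, ['marks', 'price']])

'''Result

   marks  price
0    100      1
1    200      2

'''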
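The note that 0:1 includes 1 is specific to loc, because loc slices by label. The sketch below puts the same slice through loc and iloc on a default-indexed DataFrame to make the difference visible.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# loc slices by label and includes the end label: rows 0 and 1
by_label = res.loc[0:1]

# iloc slices by position and excludes the end position: row 0 only
by_position = res.iloc[0:1]

print(by_label)
print(by_position)
```

With the default integer index the labels and positions look identical, which is why this difference is easy to miss.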

iloc: select rows by position

Get a single row

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)
print(res)
'''Initial state
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3

'''

# Get the first row
print(res.iloc[0])

'''Result

name     google
marks       100
price         1
Name: 0, dtype: object
'''

Get a row, then a single value from it

# Get a row, then a single value
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)
print(res.iloc[1])  # the second row: baidu, 200, 2
print(res.iloc[1, 2])  # row position 1, column position 2 (the price column) -> 2

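Get rows and columns (by range; the example uses the default index)

# Get rows and columns by range
print(res.iloc[0:2, 0:2])
'''Result
     name  marks
0  google    100
1   baidu    200

1. The row slice [0:2] takes the first two rows, i.e. positions 0 and 1
2. The column slice [0:2] then takes the first two columns, i.e. positions 0 and 1
'''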

Take rows 2 and 3, then print all columns of those two rows

print(res.iloc[1:3, :])

'''Result
    name  marks  price
1  baidu    200      2
2  yahoo    300      3

'''

Take rows 1 and 2, then show only their 1st and 3rd columns

# Take rows 1 and 2, then show only their 1st and 3rd columns
print(res.iloc[[0, 1], [0, 2]])

'''Result
     name  price
0  google      1
1   baidu      2

'''
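Since iloc works purely by position, negative positions count from the end, just as they did for the Series slice res[-3:] earlier. A short sketch on the same DataFrame:

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# -1 is the last position on either axis
last_row = res.iloc[-1]        # the yahoo row, returned as a Series
last_column = res.iloc[:, -1]  # the price column, returned as a Series

print(last_row)
print(last_column)
```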
