Data Analysis --- 02. Pandas

Date: 2019-11-08


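I. The Series object (one-dimensional array)

A Series is an object similar to a one-dimensional array. It consists of two parts:

    values: a set of data (of ndarray type)
    index: the index labels associated with the data

    1. Creation

# Import the modules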

from pandas import Series,DataFrame
import pandas as pd
import numpy as np

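① Creating a Series from a list

# Create a Series from a list

# Default index (0, 1, 2)
Series(data=[1,2,3])

# Explicit index
Series(data=[1,2,3],index=['a','b','c'])

Result (for the explicit-index case):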

a    1
b    2
c    3
dtype: int64

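② Creating a Series from a NumPy array

s = Series(data=np.random.randint(0,100,size=(3,)),index=['a','b','c'])

Result (the values are random, and the integer dtype can depend on the platform and NumPy version):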

a    37
b    13
c    71
dtype: int32

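③ Creating a Series from a dict

When a Series is created from a dict, the keys become the index labels, so index is normally not passed (if it is, it selects and orders keys rather than relabeling the values). Positional access is still available.

Note: the data source must be one-dimensional.

# The keys are subject names: 語文 (Chinese), 數學 (Math), 英語 (English), 理綜 (Science)
dic = {
    '語文':99,
    '數學':100,
    '英語':88,
    '理綜':120
}
s = Series(data=dic)

Result (this output comes from an older pandas that sorted the dict keys; pandas 0.23 and later on Python 3.6+ keep the insertion order instead):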

數學    100
理綜    120
英語     88
語文     99
dtype: int64
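
As a small illustration of how index interacts with a dict source, the sketch below passes index together with the dict. The label '體育' is a hypothetical key added only to show what happens when a requested label is missing from the dict.

```python
import pandas as pd

scores = {'語文': 99, '數學': 100, '英語': 88, '理綜': 120}

# The dict keys become the index labels.
s_all = pd.Series(scores)

# Passing index selects and orders keys. '體育' is a made-up label that is
# not in the dict, so its value is NaN and the dtype becomes float64.
s_some = pd.Series(scores, index=['數學', '語文', '體育'])
print(s_some)
```

Because one entry is missing, the integers are upcast to floats so that NaN can be stored.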

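    2. Indexing and slicing

① Indexing

# Element at position 0; on older pandas versions s[0] does the same
s.iloc[0]

Result (with the ordering shown above, position 0 is 數學):

    100

② Slicing

s[0:2]

Result: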

數學    100
理綜    120
dtype: int64
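
The difference between positional and label-based access can be sketched as follows. The Series is rebuilt here with the same values so that the snippet runs on its own; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([100, 120, 88, 99], index=['數學', '理綜', '英語', '語文'])

first_by_position = s.iloc[0]     # element at position 0
first_by_label = s.loc['數學']    # element with label '數學'

# Positional slices exclude the end point; label slices include it.
by_position = s.iloc[0:2]         # 數學, 理綜
by_label = s.loc['數學':'英語']   # 數學, 理綜, 英語
print(by_position)
print(by_label)
```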

    3. Basic concepts

① Adding data

# Assigning to a new label appends an entry (毛概 is another subject name)
s['毛概'] = 111

Result:

數學    100
理綜    120
英語     88
語文     99
毛概    111
dtype: int64

② Inspecting attributes

shape: the shape
size: the total number of elements
index: the index
values: the values

Example:

s.values

array([100, 120,  88,  99, 111], dtype=int64)
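
For completeness, here is a short sketch that reads all four attributes at once. The Series is rebuilt with the values shown above so that the snippet is self-contained.

```python
import pandas as pd

s = pd.Series([100, 120, 88, 99, 111],
              index=['數學', '理綜', '英語', '語文', '毛概'])

print(s.shape)   # a one-element tuple: (5,)
print(s.size)    # total number of elements: 5
print(s.index)   # the index labels
print(s.values)  # the underlying NumPy array
```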

③ Viewing values

Use s.head(n) and s.tail(n) to view the first n and the last n values; n defaults to 5.

Example:

s.tail(2)

語文     99
毛概    111
dtype: int64

④ Removing duplicates

s1 = Series(data=[1,1,1,2,2,2,3,3,4,56,6,7,8,8,8,7])
s1.unique()

Result:

array([ 1,  2,  3,  4, 56,  6,  7,  8], dtype=int64)
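
unique() returns a plain NumPy array. If the index of the first occurrences or the number of repeats matters, two related methods may be useful; the sketch below is one possible way to use them, not part of the original walkthrough.

```python
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 4, 56, 6, 7, 8, 8, 8, 7])

distinct = s1.unique()          # NumPy array, in order of first appearance
deduped = s1.drop_duplicates()  # Series that keeps the index of each first occurrence
counts = s1.value_counts()      # how many times each value appears

print(deduped)
print(counts)
```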

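    4. Series arithmetic

Arithmetic automatically aligns the data by index label.
Where a label is present in only one operand, the result is NaN.

① Arithmetic

Example:

s1 = Series(data=[1,2,3],index=['a','b','c'])
s2 = Series(data=[1,2,3],index=['a','b','d'])
s = s1 + s2

Result: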

a    2.0
b    4.0
c    NaN
d    NaN
dtype: float64
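
If NaN is not wanted for labels that exist on only one side, the add() method accepts a fill_value. The sketch below reuses s1 and s2 from the example above and treats a missing entry as 0, which is an assumption that suits sums but not every calculation.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['a', 'b', 'd'])

plain = s1 + s2                    # c and d become NaN
filled = s1.add(s2, fill_value=0)  # a missing entry is treated as 0

print(filled)
```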

② Use pd.isnull() and pd.notnull(), or the methods s.isnull() and s.notnull(), to detect missing data.
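
A minimal sketch of missing-data detection follows. The Series is written out by hand to match the result of s1 + s2 above.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, np.nan], index=['a', 'b', 'c', 'd'])

missing = s.isnull()            # True where the value is NaN
present = pd.notnull(s)         # the function form gives the opposite mask
n_missing = int(missing.sum())  # True counts as 1 when summed

print(missing)
print(n_missing)
```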

③ A boolean Series can be used as an index into a Series: only the elements where the mask is True are kept, and those where it is False are dropped.
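
Boolean indexing can be sketched like this, using a hand-written Series with two missing values. The threshold 3 is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, np.nan], index=['a', 'b', 'c', 'd'])

clean = s[s.notnull()]  # keep only the non-missing entries
big = s[s > 3]          # comparisons with NaN are False, so NaN rows drop out

print(clean)
print(big)
```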

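II. DataFrame

A DataFrame is a tabular data structure made up of multiple columns of data arranged in a fixed order.
It was designed to extend the Series use case from one dimension to several. A DataFrame has both a row index and a column index.

    Row index: index
    Column index: columns
    Values: values

    1. Creation

① Creating a DataFrame from an ndarray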
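
The original gives no code for this step, so the following is only a sketch with made-up data: a 3x4 array of consecutive integers, with row and column labels chosen for illustration.

```python
import numpy as np
import pandas as pd

data = np.arange(12).reshape(3, 4)  # illustrative values 0..11

# index and columns are optional; without them pandas uses 0, 1, 2, ...
df = pd.DataFrame(data, index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
```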

② Creating a DataFrame from a dict
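
A possible example, with invented scores: each dict key becomes a column, and each list supplies that column's values. The index argument labels the rows.

```python
import pandas as pd

# Made-up scores for two students across four subjects.
dic = {
    '張三': [77, 88, 99, 90],
    '李四': [67, 88, 99, 78],
}
df = pd.DataFrame(dic, index=['語文', '數學', '英語', '理綜'])
print(df)
```

All lists must have the same length, and that length must match the index.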

    2. Attributes

values: the values
columns: the column index
index: the row index
shape: the shape
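
The four attributes listed above can be read as follows. The small two-by-two table uses invented scores and exists only to make the snippet runnable.

```python
import pandas as pd

df = pd.DataFrame({'張三': [77, 88], '李四': [67, 88]}, index=['語文', '數學'])

print(df.values)   # 2-D NumPy array of the data
print(df.columns)  # column labels
print(df.index)    # row labels
print(df.shape)    # (number of rows, number of columns)
```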

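    3. Indexing and slicing

df[0]  # select a column by label (here the columns carry the default integer labels)
df.iloc[0] # select a row by position
df.iloc[1,2] # locate a single element (row 1, column 2)

df[0:2] # slice rows
df.iloc[:,0:2] # slice columns

Indexing rows

- Use .loc[] with index labels (when an index has been set)
- Use .iloc[] with integer positions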

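① Changing the index

② Getting the first two columns

③ Getting the first two rows

④ Locating an element

⑤ Slicing out the first two rows

⑥ Slicing out the first two columns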
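
The original leaves items ① to ⑥ without code. One way they might look, on a made-up 3x4 table with illustrative labels, is sketched here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# ① change the row index
df.index = ['a', 'b', 'c']

# ② first two columns, selected by label
first_two_cols = df[['A', 'B']]

# ③ first two rows, selected by label
first_two_rows = df.loc[['a', 'b']]

# ④ a single element: row 'b', column 'C' (same cell as df.iloc[1, 2])
elem = df.loc['b', 'C']

# ⑤ slice out the first two rows by position
rows_slice = df.iloc[0:2]

# ⑥ slice out the first two columns by position
cols_slice = df.iloc[:, 0:2]
```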

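    4. Arithmetic

As with Series:

    Arithmetic automatically aligns the data on both the row labels and the column labels.
    Where a label is present in only one operand, the result is NaN.

Example: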
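
The original example is missing, so here is a sketch with two small invented tables that share only row 'x' and column 'A'.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = pd.DataFrame({'A': [10, 20], 'C': [30, 40]}, index=['x', 'z'])

# Only the cell (x, A) exists in both tables; every other cell is NaN.
total = df1 + df2

# With fill_value, a cell missing from one table is treated as 0.
# Cells missing from both tables (for example z, B) are still NaN.
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```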

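    5. Exercises

============================================

Exercise 1:

Suppose ddd holds the midterm exam scores and ddd2 holds the final exam scores. Create ddd2 however you like, add it to ddd, and compute the average of the midterm and final scores.

Suppose 張三 (Zhang San) was caught cheating on the midterm math exam, so that score must be recorded as 0. How can this be done?

李四 (Li Si) earned credit for reporting the cheating, so all of his midterm subjects get 100 extra points. How can this be done?

Later the teacher found an error in one of the questions and, to placate the students, gave every student 10 extra points in every subject. How can this be done?

============================================

①

# Create the data (midterm and final use the same values)

dic = {
    '張三':[150,150,150,150],
    '李四':[0,0,0,0]
}
df = DataFrame(data=dic,index=['語文','數學','英語','理綜'])
# qizhong (midterm) and qimo (final) play the roles of ddd and ddd2.
# copy() gives each table its own data, so editing one does not change the other.
qizhong = df.copy()
qimo = df.copy()

# Compute the average:

(qizhong+qimo)/2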

②

③

④
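
Parts ② to ④ are left blank in the original. A possible solution, assuming the same starting scores as in part ① and applying each adjustment to the midterm table only, could look like this.

```python
import pandas as pd

dic = {'張三': [150, 150, 150, 150], '李四': [0, 0, 0, 0]}
df = pd.DataFrame(dic, index=['語文', '數學', '英語', '理綜'])
qizhong = df.copy()  # midterm scores; df itself stays untouched

# ② set 張三's midterm math score to 0
qizhong.loc['數學', '張三'] = 0

# ③ add 100 points to every midterm subject for 李四
qizhong['李四'] += 100

# ④ add 10 points to every student in every subject
qizhong += 10

print(qizhong)
```

Using .loc with a row label and a column label changes exactly one cell, while arithmetic on a whole column or on the whole table is applied element by element.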

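Exercise 2:
Use the tushare package to fetch the historical market data of a stock.
Output every date on which the stock closed more than 3% above its open.
Output every date on which the stock opened more than 2% below the previous day's close.
Suppose that, starting on 1 January 2010, I buy 1 lot (100 shares) of the stock on the first trading day of every month and sell all of it on the last trading day of every year. What is my profit or loss to date?

Install the tushare module

    pip install tushare

①

import tushare as ts
df = ts.get_k_data(code='600519',start='2000-01-01')

# Save the fetched data to disk
df.to_csv('./600519.csv')

# Read 600519.csv back into df, using the date column as a datetime index
df = pd.read_csv('./600519.csv',index_col='date',parse_dates=['date'])
# Drop the unneeded column (axis=1 means columns; inplace=True modifies df itself)
df.drop(labels='Unnamed: 0',axis=1,inplace=True)

# Show the first five rows

df.head(5)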

②
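
Part ② has no code in the original. The sketch below shows one way to find the days that closed more than 3% above the open. It runs on a tiny invented table instead of the downloaded data, and it assumes the price columns are named open and close.

```python
import pandas as pd

# Invented prices standing in for the real market data.
df = pd.DataFrame(
    {'open': [100.0, 102.0, 110.0, 105.0],
     'close': [101.0, 108.0, 109.0, 100.0]},
    index=pd.to_datetime(['2019-01-02', '2019-01-03', '2019-01-04', '2019-01-07']),
)

# Relative change from open to close on the same day.
rise = (df['close'] - df['open']) / df['open']

# Dates on which the close is more than 3% above the open.
up_days = df.loc[rise > 0.03].index
print(up_days)
```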

③
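
Part ③ is also blank in the original. A possible approach uses shift(1) to line up each day's open with the previous day's close. As before, the data is invented and the column names open and close are assumptions.

```python
import pandas as pd

# Invented prices standing in for the real market data.
df = pd.DataFrame(
    {'open': [100.0, 102.0, 110.0, 105.0],
     'close': [101.0, 108.0, 109.0, 100.0]},
    index=pd.to_datetime(['2019-01-02', '2019-01-03', '2019-01-04', '2019-01-07']),
)

# shift(1) moves every close down one row, so each row sees the previous close.
prev_close = df['close'].shift(1)
gap = (df['open'] - prev_close) / prev_close

# Dates on which the open is more than 2% below the previous close.
# The first row has no previous close, so its gap is NaN and it is excluded.
gap_down_days = df.loc[gap < -0.02].index
print(gap_down_days)
```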

④

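# Restrict the data to the trading period of interest
df = df.loc['2010-01':'2019-06']
df  # all trading data from 2010-01 through 2019-06

# pandas provides resample() for conveniently resampling a time series.
# Moving to a coarser time granularity is downsampling; moving to a finer one is upsampling.
# (pandas 2.2 and later prefer the aliases 'ME' and 'YE' over 'M' and 'A'.)

# Row of the first trading day of every month
df_monthly = df.resample('M').first()

# Row of the last trading day of every year; [:-1] drops the final, incomplete year
df_yearly = df.resample('A').last()[:-1]

# Most recent opening price, used to value the shares still held in 2019
price_last = df['open'].iloc[-1]

cost_money = 0  # running cash balance: buying subtracts, selling adds
hold = 0        # number of shares currently held
for year in range(2010, 2020):
    # Money spent on buying; .loc[str(year)] selects all rows of that year
    cost_money -= df_monthly.loc[str(year)]['open'].sum()*100
    hold += len(df_monthly.loc[str(year)]['open']) * 100

    if year != 2019:
        # Money received from selling everything on the last trading day of the year
        cost_money += df_yearly.loc[str(year)]['open'].iloc[0] * hold
        hold = 0  # all shares have been sold

# Value the shares bought in 2019 at the latest price
cost_money += hold * price_last

print(cost_money)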
