Data Analysis with NumPy, pandas, and Matplotlib

Date: 2019-12-06

Tags: data analysis, numpy, pandas, matplotlib


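Quantitative investing

Writing it yourself: NumPy + pandas + Matplotlib + ...
Online platforms: JoinQuant (聚寬), Uqer (優礦), RiceQuant (米筐), Quantopian, ...
Open-source frameworks: RQAlpha, QUANTAXIS, ...
NumPy: batch computation on arrays
pandas: flexible table computation
Matplotlib: data visualization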

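IPython

IPython: an interactive Python command line
Install: pip3 install ipython
Run: ipython
Shortcuts:
1. TAB: auto-completion
2. ?: introspection and namespace search, e.g. a? a.append? a.*pp*?
3. !: run a system command, e.g. !cd
Magic commands: commands that start with %
1. %run: execute the code in a file
2. %paste: execute the code on the clipboard
3. %timeit: measure running time, e.g. %timeit a.sort()
4. %pdb: automatic debugging, e.g. %pdb on n %pdb off
Command history: _ (the previous output)
Retrieving input and output results
Directory bookmark system
IPython Notebook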

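The NumPy module

NumPy is the foundational package for high-performance scientific computing and data analysis. Tools such as pandas are built on top of it.

Install: pip3 install numpy
Import: import numpy as np
Main features of NumPy:
1. ndarray, a multidimensional array structure that is fast and space-efficient
2. Mathematical functions that operate on whole arrays without explicit loops
3. Tools for reading and writing data on disk and for working with memory-mapped files
4. Linear algebra, random number generation, and Fourier transforms
5. Tools for integrating code written in C, C++, and similar languages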

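1. ndarray: the multidimensional array object

Create an ndarray: np.array(array_like)
An ndarray is a multidimensional array structure. It differs from a list in two ways:
1. All elements of an array must have the same type
2. The size of an array is fixed once it is created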

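(1) Common ndarray attributes

T        transpose of the array (meaningful for arrays with two or more dimensions)
dtype    data type of the elements
size     number of elements
ndim     number of dimensions
shape    size of each dimension, as a tuple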

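(2) ndarray data types

Boolean: bool_
Integer: int_ int8 int16 int32 int64
Unsigned integer: uint8 uint16 uint32 uint64
Floating point: float_ float16 float32 float64
Complex: complex_ complex64 complex128
Note: the float_ and complex_ aliases were removed in NumPy 2.0; use float64 and complex128 there.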

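(3) Creating ndarrays

array()       convert a list to an array; dtype can be given explicitly
arange()      NumPy version of range; also accepts floating-point steps
linspace()    similar to arange(), but the third argument is the number of samples
zeros()       array of zeros with the given shape and dtype
ones()        array of ones with the given shape and dtype
empty()       uninitialized array with the given shape and dtype (contents are whatever was in memory)
eye()         identity matrix with the given side length and dtype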

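(4) Batch operations on ndarrays

Array with scalar:
1. a+1 a*3 1//a a**0.5 a>5
Arrays of the same size:
1. a+b a/b a**b a%b a==b
Indexing:
1. One-dimensional array: a[5]
2. Multidimensional array:
  1. List style: a[2][3]
  2. Tuple style: a[2,3] (recommended)
Slicing:
1. One-dimensional array: a[5:8] a[4:] a[2:10] = 1
2. Multidimensional array: a[1:2, 3:4] a[:,3:5] a[:,1]
3. Unlike list slices, array slices are not copied automatically: a slice is a view, and modifying it changes the original array.
4. To get an independent copy of a slice, call its copy() method.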
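The difference between a view and a copy can be checked directly. The sketch below is only an illustration: the variable names are made up, and np.shares_memory is used here as a convenient check that the notes above do not mention.

```python
import numpy as np

a = np.arange(10)

view = a[2:5]            # basic slicing returns a view onto a's data
copied = a[2:5].copy()   # copy() allocates independent storage

view[0] = 100            # writes through to a[2]
copied[1] = -1           # leaves a untouched

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copied)
print(a)
print(shares_view, shares_copy)
```

Writing to view changes a, while writing to copied does not.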

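(5) Boolean indexing

Question: given an array, select all elements greater than 5.
Answer: a[a>5]
How it works:
a>5 tests every element of a and returns a boolean array.
Boolean indexing: passing a boolean array of the same size as the index returns an array of the elements at the positions that are True.

Question 2: given an array, select all even numbers greater than 5.
Question 3: given an array, select all numbers that are greater than 5 or even.
Answers:
a[(a>5) & (a%2==0)]
a[(a>5) | (a%2==0)]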

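(6) Fancy indexing*

Question 1: given a one-dimensional array, select the elements at indices 1, 3, 4, 6, and 7 to form a new one-dimensional array.
Answer: a[[1,3,4,6,7]]

Question 2: given a two-dimensional array, select the columns at indices 1 and 3 to form a new two-dimensional array.
Answer: a[:,[1,3]]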
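When both the row part and the column part are index lists, NumPy pairs them up element by element instead of taking every combination. The sketch below shows the three cases side by side; np.ix_ is an addition not used elsewhere in these notes, and the sample array is arbitrary.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)

cols = a[:, [1, 3]]                  # every row, columns 1 and 3 -> shape (4, 2)
pairs = a[[1, 3], [1, 3]]            # elements (1, 1) and (3, 3) -> shape (2,)
block = a[np.ix_([1, 3], [1, 3])]    # rows 1 and 3 crossed with columns 1 and 3 -> shape (2, 2)

print(cols.shape, pairs, block, sep="\n")
```

np.ix_ gives the same result as the two-step form a[[1,3],:][:,[1,3]].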

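2. NumPy: universal functions

ceil: round up (toward positive infinity): 3.6 -> 4, 3.1 -> 4, -3.1 -> -3

floor: round down (toward negative infinity): 3.6 -> 3, 3.1 -> 3, -3.1 -> -4

round: round to the nearest integer: 3.6 -> 4, 3.1 -> 3, -3.6 -> -4

trunc (int): round toward zero (drop the fractional part): 3.6 -> 3, 3.1 -> 3, -3.1 -> -3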
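The four rounding functions differ mainly on negative numbers and on exact halves. A small sketch on one arbitrary array; the value 2.5 is added here to show that np.round sends exact halves to the nearest even integer.

```python
import numpy as np

x = np.array([3.6, 3.1, -3.1, -3.6, 2.5])

up = np.ceil(x)         # toward positive infinity
down = np.floor(x)      # toward negative infinity
nearest = np.round(x)   # nearest integer; exact halves go to the even neighbour
toward0 = np.trunc(x)   # drop the fractional part

for name, values in [("ceil", up), ("floor", down), ("round", nearest), ("trunc", toward0)]:
    print(name, values)
```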

arr = np.arange(10)
arr.sum()     # 45, sum
arr.mean()    # 4.5, mean
arr.cumsum()  # array([ 0,  1,  3,  6, 10, 15, 21, 28, 36, 45]), running (prefix) sums
arr.std()     # standard deviation

Universal function: a function that operates on every element of an array at once
Common universal functions:
Unary: abs (absolute value), sqrt (square root), exp, log, ceil, floor, rint, trunc, modf (returns the fractional and integer parts separately), isnan, isinf, cos, sin, tan
Binary: add, subtract, multiply, divide, power, mod, maximum, minimum

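3. Background: special floating-point values

Floating-point type: float
There are two special floating-point values:
nan (Not a Number): not equal to any float, including itself (nan != nan)
inf (infinity): greater than every finite float

Creating them in NumPy: np.nan    np.inf
In data analysis, nan is commonly used to represent missing values.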
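Because nan never compares equal to anything, missing values have to be found with a test function instead of ==. A minimal sketch with made-up data; np.isfinite is an addition to the functions listed above and rejects both nan and inf.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0, np.inf])

is_nan = np.isnan(data)             # elementwise test for nan
finite = data[np.isfinite(data)]    # keeps only ordinary numbers
mean_finite = finite.mean()

print(np.nan == np.nan)             # the comparison itself is False
print(is_nan, finite, mean_finite)
```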

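4. Mathematical and statistical methods

sum       sum
cumsum    prefix (cumulative) sums
mean      mean
std       standard deviation
var       variance
min       minimum
max       maximum
argmin    index of the minimum
argmax    index of the maximum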
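On a two-dimensional array these methods work on all elements by default and accept an axis argument to work per column or per row. The sketch below uses an arbitrary 2 x 3 array; the axis argument and np.unravel_index are additions not covered in the list above.

```python
import numpy as np

m = np.array([[3, 7, 1],
              [9, 2, 5]])

total = m.sum()              # over all elements
col_sum = m.sum(axis=0)      # one value per column
row_mean = m.mean(axis=1)    # one value per row
running = m.cumsum()         # prefix sums over the flattened array
flat_pos = m.argmax()        # position of the maximum in the flattened array
row, col = np.unravel_index(flat_pos, m.shape)   # convert back to (row, column)

print(total, col_sum, row_mean, running, (row, col))
```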

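5. Random number generation

The random number functions live in the np.random subpackage.
Common functions:
rand       random array of the given shape, values in [0, 1)
randint    random integers of the given shape
choice     random selections, of the given shape, from a given sequence
shuffle    same as random.shuffle: shuffles the sequence in place
uniform    random array of the given shape, drawn uniformly from a given range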
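The five functions above can be tried together as follows. This is only a sketch: the seed, shapes, and ranges are arbitrary choices, and np.random.seed is used here just to make repeated runs produce the same numbers.

```python
import numpy as np

np.random.seed(0)   # fixed seed, chosen only so that the run is repeatable

r = np.random.rand(2, 3)                 # floats in [0, 1), shape (2, 3)
i = np.random.randint(0, 10, (3, 5))     # integers from 0 to 9, shape (3, 5)
c = np.random.choice([10, 20, 30], 4)    # 4 draws, with replacement
u = np.random.uniform(10.0, 20.0, 5)     # 5 floats between 10 and 20

deck = np.arange(5)
np.random.shuffle(deck)                  # shuffles in place and returns None

print(r, i, c, u, deck, sep="\n")
```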

import numpy as np
import random
a=[random.uniform(10.0,20.0) for i in range(3)]
print(a)
a=np.array(a)  # convert the list to an array
print(a)
x=10
print(a*x) # multiply every element of the array by 10
print(a.size,a.dtype) # size: number of elements; dtype: data type of the elements


b=np.array([[1,2],[11,22],[99,99]])
print(b.shape) # shape: size of each dimension, as a tuple
print(b.T) # T: transpose of the array


c=np.array([[[1,2,3],[666,777,888]],[[11,22,33],[55,66,77]]])
print(c.shape)
print(c.ndim) # ndim: number of dimensions


d=np.zeros(10)
print(d)
d=np.zeros(10,dtype='int') # dtype can be specified; the default is float
print(d)
d=np.ones(10)
print(d)


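d=np.empty(10)  # uninitialized: contains whatever was left in memory
print(d)
d=np.arange(2,10,3)
print(d)
d=np.arange(2,10,0.3) # the step can be fractional
print(d)

d=np.linspace(1,10,3) # the last argument is the number of samples (the array length); they are evenly spaced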
print(d)

d=np.eye(10)
print(d)

a=np.arange(10)
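print(a,a+1,a*3) # add 1 to every element; multiply every element by 3
b=np.arange(10,20)
print(b,a+b) # elementwise addition; the arrays must have the same length
a[0]=100
print(a>b) # elementwise comparison; returns a boolean array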

a=np.arange(15)
print(a)
a=np.arange(15).reshape((3,5))  # a two-dimensional array with 3 rows and 5 columns
print(a,a.shape)
print(a[0][0],a[0,0]) # a[0,0] is the recommended form


a=np.arange(15)
b=np.arange(10)
c=a[0:4]
d=b[0:4].copy() # a separate copy, so changes to d do not affect b
c[0]=20  # c is a view of the first 4 elements of a (no data is copied, to save memory), so this also changes a
d[0]=20
print(a,b,c,d)


a=np.arange(15).reshape((3,5))
print(a[0][0:2],a[1,0:2])
print(a[0:2]) # a single slice selects rows
print(a[0:2,0:2]) # rows before the comma, columns after it; the result is still two-dimensional

print(a[1:4,2:4])
print(a[1:,2:4])


a=[random.randint(0,10) for i in range(20)]
print(list(filter(lambda x:x>5,a)))
a=np.array(a)
print(a>5) #[False False ... True  True]
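print(a[a>5])  # a>5 first turns every element into True or False

# Approach 1:
b=a[a>5]
print(b[b%2==0],'all even numbers greater than 5')
# Approach 2:
print(a[(a>5) & (a%2==0)]) # the parentheses are required: & binds more tightly than > and ==

print(3 and 4,3 & 4)
print(a[(a>5) | (a%2==0)]) # greater than 5, or even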

a=np.arange(4)
print(a)
print(a[[True,True,False,False]]) # only the positions that are True are kept


a=np.arange(10,20)
print(a)
print(a[[1,3,4]],'fancy indexing')

a=np.arange(20).reshape(4,5)
print(a[0,2:4])
print(a[0,a[0]>2])

print(a[[1,3],[1,3]]) # note: this picks the elements at positions (1,1) and (3,3)
print(a[[1,3],:][:,[1,3]]) # to get rows 1,3 crossed with columns 1,3, index in two steps

a=np.arange(-5,5)
print(np.abs(a))
print(abs(a))
# print(np.sqrt(a))

import math
a=1.6
print(math.trunc(a),math.ceil(a),round(a),math.floor(a))
print(np.trunc(a),np.ceil(a),np.round(a),np.floor(a))

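print(np.modf(a)) # splits the number into its fractional and integer parts; returns a tuple (fractional part, integer part)

# nan (Not a Number): not equal to any float (nan != nan)
# inf (infinity): greater than every finite float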
print(float('nan'),float('inf')) #nan inf
print(np.nan == np.nan) #False
print(np.isnan(np.nan)) #True
print(~ np.isnan(np.nan)) #False
print(not np.isnan(np.nan)) #False

print(np.inf == np.inf) #True


a=np.arange(0,5)
b=5/a # division by zero at index 0 gives inf (with a RuntimeWarning)
c=a/a # 0/0 at index 0 gives nan
print(b,)
print(c[~np.isnan(c)])  # drop nan
print(b[~np.isinf(b)])  # drop inf


a=np.array([3,4,5,6,7])
b=np.array([2,5,3,7,4])
print(np.maximum(a,b))
print(a.sum())
print(a.mean())

"""
Variance measures how spread out a random variable or a set of data is.
1,2,3,4,5
Variance: ((1-3)**2+(2-3)**2+(3-3)**2+(4-3)**2+(5-3)**2)/5
Standard deviation: the square root of the variance
sqrt(((1-3)**2+(2-3)**2+(3-3)**2+(4-3)**2+(5-3)**2)/5)
"""



a=np.arange(0,10,0.2)
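print(a.var(),a.std()) # variance, standard deviation
print(a.mean()-a.std(),a.mean()+a.std())
print(a.mean()-2*a.std(),a.mean()+2*a.std())

print(a.argmax()) # index of the maximum

a=np.random.randint(0,10,10) # the third argument is the array length
a=np.random.randint(0,10,(3,5)) # the third argument is a shape: 3 rows, 5 columns
a=np.random.randint(0,10,(3,5,4)) # a three-dimensional shape: 3 x 5 x 4
a=np.random.rand(10) # the argument is the number of values; this is np.random.rand, not random.random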
print(a)

# x=np.linspace(-10,10,10000)
# y=x**2
# import matplotlib.pyplot as plt
# plt.plot(x,y)
# plt.show()


pandas

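pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

Install: pip3 install pandas
Import: import pandas as pd
Main features of pandas:
1. The DataFrame and Series data structures, which align data by label
2. Integrated time-series functionality
3. A rich set of mathematical operations
4. Flexible handling of missing data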

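1. Series

A Series (array + dictionary) is an object similar to a one-dimensional array. It consists of a set of data and an associated set of data labels (the index).

(1) Creating a Series

pd.Series([4,7,-5,3])
pd.Series([4,7,-5,3],index=['a','b','c','d'])
pd.Series({'a':1, 'b':2})
pd.Series(0, index=['a','b','c','d'])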

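(2) Series features

Array-like features of a Series:
Create a Series from an ndarray: Series(arr)
Operations with a scalar: sr*2
Operations between two Series: sr1+sr2
Indexing: sr[0], sr[[1,2,4]]
Slicing: sr[0:2] (a slice is still a view)
Universal functions: np.abs(sr)
Boolean filtering: sr[sr>0]
Statistical functions: mean() sum() cumsum()

Dictionary-like features of a Series (labels):
Create a Series from a dictionary: Series(dic)
The in operator: 'a' in sr tests the labels; for x in sr iterates over the values
Key indexing: sr['a'], sr[['a', 'b', 'd']]
Key slicing: sr['a':'c'] (the end label is included)
Other methods: get('a', default=0), etc.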

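(3) Integer indexes on a Series

pandas objects with an integer index tend to drive beginners crazy.
Example:
sr = pd.Series(np.arange(4.))
sr[-1]    # raises KeyError: there is no label -1

When the index has an integer type, indexing with an integer is always interpreted as a label.
loc attribute     interprets the key as a label
iloc attribute    interprets the key as a position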
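The pitfall and its fix can be reproduced in a few lines. A minimal sketch using the same four-element Series; the try/except is only there to record that the lookup fails.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(4.0))   # default integer labels 0..3

try:
    sr[-1]                       # -1 is looked up as a label, which does not exist
    raised = False
except KeyError:
    raised = True

last = sr.iloc[-1]               # position-based: the last element

tail = sr.iloc[2:]               # keeps labels 2 and 3
by_label = tail.loc[2]           # label 2
by_pos = tail.iloc[0]            # first position, which is the same element

print(raised, last, by_label, by_pos)
```

Using loc and iloc explicitly removes the ambiguity whenever the index is made of integers.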

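(4) Data alignment in a Series

When operating on two Series, pandas aligns them by index before computing. If the indexes differ, the index of the result is the union of the two operands' indexes.

Example:
sr1 = pd.Series([12,23,34], index=['c','a','d'])
sr2 = pd.Series([11,20,10], index=['d','c','a'])
sr1+sr2
sr3 = pd.Series([11,20,10,14], index=['d','c','a','b'])
sr1+sr3

How can a value that is missing from one of the two Series be treated as 0 when adding them?
sr1.add(sr3, fill_value=0)
Flexible arithmetic methods: add, sub, div, mul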
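The effect of fill_value is easiest to see next to the plain + operator. A short sketch using a three-label Series and a four-label Series with the values from the example above:

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

plain = sr1 + sr3                      # 'b' exists only in sr3, so the result there is NaN
filled = sr1.add(sr3, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```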

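(5) Missing data in a Series

Missing data is represented by NaN (Not a Number), which is the value np.nan. The built-in None is also treated as NaN.

Methods for handling missing data:
dropna()     drop the entries whose value is NaN
fillna()     fill in missing values
isnull()     boolean array, True where the value is missing
notnull()    boolean array, False where the value is missing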

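2. DataFrame

(1) Creating a DataFrame

A DataFrame is a two-dimensional data object.
A DataFrame is a table-like data structure containing an ordered set of columns.
A DataFrame can be viewed as a dictionary of Series that share a single index.

Ways to create one:
pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
......

Reading and writing CSV files:
pd.read_csv('filename.csv')
df.to_csv('filename.csv')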

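(2) Common DataFrame attributes and methods

index         the row index
T             transpose
columns       the column index
values        the values as an array
describe()    quick summary statistics

The name attribute of each column of a DataFrame is the column name.
rename(columns={})    rename columns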

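(3) Indexing and slicing a DataFrame

A DataFrame is two-dimensional, so it has both a row index and a column index.

Like a Series, a DataFrame can be indexed and sliced either by label or by position.

The loc and iloc attributes
Usage: separate the two parts with a comma; the row index comes first, the column index second.
Each part can be a plain index, a slice, a boolean index, or a fancy index, in any combination.

Indexing and slicing a DataFrame:

Method 1: two pairs of brackets, column first and then row.    df['A'][0]
Method 2 (recommended): the loc/iloc attributes, one pair of brackets with a comma, row first and then column.
loc attribute: keys are interpreted as labels
iloc attribute: keys are interpreted as positions
When writing values into a DataFrame, use only method 2.
(Note: when both parts are fancy indexes, the result may differ from what you expect.)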

Selecting by label:
df['A']
df[['A', 'B']]
df['A'][0]
df[0:10][['A', 'C']]
df.loc[:,['A','B']]
df.loc[:,'A':'C']
df.loc[0,'A']
df.loc[0:10,['A','C']]

Selecting by position:
df.iloc[3]
df.iloc[3,3]
df.iloc[0:3,4:6]
df.iloc[1:5,:]
df.iloc[[1,2,4],[0,3]]

Filtering with booleans:
df[df['A']>0]
df[df['A'].isin([1,3,5])]
df[df<0] = 0
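The selection forms listed above need a concrete DataFrame to be tried out. The sketch below builds a small one; the column names follow the examples, while the data and the row labels 'w' to 'z' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, -2, 3, -4], 'B': [10, 20, 30, 40], 'C': [5, 6, 7, 8]},
    index=['w', 'x', 'y', 'z'],          # hypothetical row labels
)

cell = df.loc['x', 'B']                  # row label, then column label
block = df.loc['w':'y', ['A', 'C']]      # label slices include the end label
same_cell = df.iloc[1, 1]                # row position, then column position
head = df.iloc[0:2, 0:2]                 # positional slices exclude the end
positive = df[df['A'] > 0]               # boolean filter on rows

clipped = df.copy()
clipped[clipped < 0] = 0                 # boolean mask assignment over all cells

print(cell, same_cell)
print(block, head, positive, clipped, sep="\n")
```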

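(4) Data alignment and missing data in a DataFrame

DataFrame operations also align data: the row index and the column index are aligned separately.
The row and column indexes of the result are the unions of the operands' row and column indexes.

DataFrame methods for handling missing data:
dropna(axis=0,how='any',…)
fillna()
isnull()
notnull()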

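3. Other common pandas methods

Common methods (for both Series and DataFrame):
mean(axis=0,skipna=False)           skipna defaults to True; pass False to let NaN propagate
sum(axis=1)
sort_index(axis, …, ascending)      sort by row or column index
sort_values(by, axis, ascending)    sort by value
NumPy universal functions also work on pandas objects

apply(func, axis=0)    apply a custom function to each column or each row; func can return a scalar or a Series
applymap(func)         apply a function to every element of a DataFrame (deprecated in recent pandas releases in favor of DataFrame.map)
map(func)              apply a function to every element of a Series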
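A small sketch of apply and map on a three-row DataFrame; the lambda functions are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

# axis=0 (the default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row['one'] + row['two'], axis=1)

# map on a Series: the function receives one element at a time
doubled = df['one'].map(lambda v: v * 2)

print(col_range, row_total, doubled, sep="\n")
```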

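4. Handling time objects in pandas

Kinds of time-series data:
Timestamp: a specific moment
Fixed period: e.g. July 2017
Interval: a start time and an end time
Python standard library: datetime
date time datetime timedelta
dt.strftime()
strptime()
Flexible parsing of time objects: the dateutil package
dateutil.parser.parse()
Handling time objects in bulk: pandas
pd.to_datetime(['2001-01-01', '2002-02-02'])

Generating an array of time objects: date_range
start      start time
end        end time
periods    number of periods
freq       frequency, default 'D'; other options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), … (recent pandas releases prefer the lowercase aliases h, min, and s)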

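5. pandas time series

A time series is a Series or DataFrame whose index consists of time objects.

When datetime objects are used as an index, they are stored in a DatetimeIndex object.

Special time-series features:
Slice by passing a year or a year and month
Slice by passing a date range
Rich function support: resample(), strftime(), …
Bulk conversion to datetime objects: to_pydatetime()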
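These features can be tried on a short daily series. A sketch with 60 made-up data points starting on 2010-01-01; loc is used for the string selections to keep the lookups explicit.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(60), index=pd.date_range('2010-01-01', periods=60))

january = sr.loc['2010-01']                   # a year-month string selects the whole month
window = sr.loc['2010-01-30':'2010-02-02']    # date-range slice, both ends included
weekly = sr.resample('W').sum()               # one total per week
as_py = sr.index.to_pydatetime()              # array of datetime.datetime objects

print(len(january), window.tolist())
print(weekly)
print(as_py[0])
```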

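6. Reading files with pandas

Reading: load data from a file name, a URL, or a file object
read_csv      default separator is a comma
read_table    default separator is \t
read_excel    read an Excel file (pip3 install xlrd)
Main parameters of the reading functions:
sep            separator; may be a regular expression such as '\s+'
header=None    the file has no header row
names          column names
index_col      column to use as the index
skiprows       rows to skip
na_values      strings to treat as missing values
parse_dates    whether to parse columns as dates; a boolean or a list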
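Several of these parameters can be combined in one call. The sketch below reads from an in-memory string so that it runs without a file on disk; the two data rows, the column names, and the marker string 'missing' are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
raw = io.StringIO(
    "2007-03-01,21.878,missing\n"
    "2007-03-02,20.565,20.307\n"
)

df = pd.read_csv(
    raw,
    header=None,                        # the data has no header row
    names=['date', 'open', 'close'],    # so the column names are supplied here
    index_col='date',                   # use the date column as the index
    parse_dates=True,                   # parse the index as dates
    na_values=['missing'],              # read this string as NaN
)

print(df)
print(df.index)
```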

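7. Writing files with pandas

Writing to a file:
to_csv
Main parameters of the writing functions:
sep             separator
na_rep          string written for missing values; default is the empty string
header=False    do not write the header row
index=False     do not write the row index column
columns         columns to write, given as a list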
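The writing parameters can be checked without touching the disk by writing into a string buffer. A sketch with a made-up DataFrame that contains one real missing value (np.nan, not the string 'NaN'):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

buf = io.StringIO()
df.to_csv(buf, header=False, index=False, na_rep='null', columns=['one'])
text = buf.getvalue()

print(text)
```

Only the column 'one' is written, with no header and no index, and the missing value appears as null.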

Other file types: JSON, XML, HTML, databases
Binary (pickle) format:
to_pickle
read_pickle

import pandas as pd
import numpy as np
print(pd.Series([2,3,4,5]))
print(pd.Series(np.arange(5)))

sr=pd.Series({'a':1,'b':2})
print(sr)

sr = pd.Series([2,3,4,5],index=['a','b','c','d'])
print(sr,sr[0],sr['a'])
"""
a    2
b    3
c    4
d    5
dtype: int64 2 2
"""

print(sr * 2)
print(sr+sr) # elementwise addition

print(sr[0:2])
print(sr['a':'c']) # slice by label
print(sr[[1,2]]) # fancy indexing
print(sr[sr > 2]) # boolean indexing
print(sr[['a','b']]) # key indexing
print('a' in sr) #True

for i in sr:
    print(i)  # iterates over the values, not the index

print(sr.index,sr.values)

###################################### Integer index
sr=pd.Series(np.arange(20))
sr2=sr[10:].copy()
print(sr2,sr2[10])  # 10 is interpreted as a label, not a position
# print(sr2[-1]) # raises KeyError
print(sr2.loc[10]) # label
print(sr2.iloc[-1]) # position

sr1=pd.Series([1,2,3],index=['c','b','a'])
sr2=pd.Series([10,20,30],index=['a','b','c'])
print(sr1+sr2) # computed by label

sr1=pd.Series([1,2,3],index=['c','b','a'])
sr2=pd.Series([10,20,30,40],index=['a','b','c','d'])
print(sr1+sr2)

sr1=pd.Series([1,2,3],index=['c','d','a'])
sr2=pd.Series([10,20,30],index=['a','b','c'])
print(sr1.add(sr2,fill_value=0))
sr=sr1+sr2
print(sr.isnull()) # the method is isnull, not isnan
print(sr.notnull())
print(sr[sr.notnull()])
print(sr.dropna()) # drop NaN

print(sr.fillna(0))
print(sr.fillna(sr.mean()))

################################
df = pd.DataFrame({'one':[1,2,3],'two':[4,5,6]})
print(df)
df = pd.DataFrame({'one':[1,2,3],'two':[4,5,6]},index=['a','b','c'])
print(df)
"""
   one  two
0    1    4
1    2    5
2    3    6

   one  two
a    1    4
b    2    5
c    3    6
"""

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
                   'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
print(df)
"""
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
"""

print('####################################')
# NaN is excluded from sorting and always placed last
print(df.sort_values(by='two')) # sort by a column
"""
   one  two
b  2.0    1
a  1.0    2
c  3.0    3
d  NaN    4
"""
print(df.sort_values(by='two',ascending=False)) # sort by a column, descending
print(df.sort_values(by='a',axis=1)) # sort the columns by the values in row 'a'; rarely used


print(df.sort_index(),'sorted by index') # sort by index
"""
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4 sorted by index
"""
print(df.sort_index(ascending=False))
print(df.sort_index(ascending=False,axis=1))



print(df['one']['a'],'column first, then row')  # note the order
print(df.loc['a','one']) # recommended: row first, then column

print(df.loc['a',:])
print(df.loc['a',])
# each part can be a plain index, a slice, a boolean index, or a fancy index, in any combination
print(df.loc[['a','c'],:])
print(df.loc[['a','c'],'two'])

print(df.mean()) # mean of each column
print(df.mean(axis=1),'mean of each row') # mean of each row
"""
one    2.0
two    2.5
dtype: float64

a    1.5
b    1.5
c    3.0
d    4.0
dtype: float64 mean of each row
"""
print(df.sum())
print(df.sum(axis=1))
"""
one     6.0
two    10.0
dtype: float64

a    3.0
b    3.0
c    6.0
d    4.0
dtype: float64
"""


df2=pd.DataFrame({'two':[1,2,3,4],'one':[4,5,6,7]},index=['c','b','a','d'])
print(df2,'=========')
"""
   one  two
c    4    1
b    5    2
a    6    3
d    7    4 =========
"""
print(df+df2) # added row by row and column by column, matched by label
"""
   one  two
a  7.0    5
b  7.0    3
c  7.0    4
d  NaN    8
"""
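print(df.fillna(0)) # fill NaN with 0

print(df.dropna()) # a row is dropped if it contains any NaN
print(df.dropna(how='any')) # the default: a row is dropped if it contains any NaN
print(df.dropna(how='all')) # a row is dropped only if all of its values are NaN

print(df.dropna(axis=1),) # axis=0 (the default) means rows; axis=1 means columns: a column is dropped if it contains any NaN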


"""
Contents of a.csv:
a,b,c
1,2,3
2,4,6
3,6,9
"""
df = pd.read_csv('a.csv')
print(df)
"""
   a  b  c
0  1  2  3
1  2  4  6
2  3  6  9
"""

df = pd.DataFrame({'one':[1,2,3],'two':[4,5,6]},index=['a','b','c'])
df.to_csv('b.csv')

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
                   'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

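print(df.index) # row index
print(df.columns) # column index
print(df.T) # transpose (each column of a DataFrame holds a single dtype)
print(df.values) # the values, as a two-dimensional array
print(df.describe()) # summary statistics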


##################################
import datetime
d=datetime.datetime.strptime('2010-01-01','%Y-%m-%d')
print(d,type(d)) #2010-01-01 00:00:00 <class 'datetime.datetime'>

import dateutil.parser
d=dateutil.parser.parse('2010-01-01')
print(d,type(d)) #2010-01-01 00:00:00 <class 'datetime.datetime'>
d=dateutil.parser.parse('02/03/2010')
print(d,type(d))#2010-02-03 00:00:00 <class 'datetime.datetime'>
d=dateutil.parser.parse('2010-JAN-01')
print(d,type(d))#2010-01-01 00:00:00 <class 'datetime.datetime'>

print('################################')
# note: pandas 2.0 and later may need format='mixed' to parse a list with differing date formats
a=pd.to_datetime(['2010-01-01','02/03/2010','2010-JAN-01'])
print(a) #DatetimeIndex(['2010-01-01', '2010-02-03', '2010-01-01'], dtype='datetime64[ns]', freq=None)
a=pd.date_range('2010-01-01','2010-05-01') # daily
print(a)
a=pd.date_range('2010-01-01',periods=30) # daily; periods=30 produces 30 entries
print(a)
a=pd.date_range('2010-01-01',periods=30,freq='H') # hourly
print(a)
a=pd.date_range('2010-01-01',periods=30,freq='W') # weekly
print(a)
a=pd.date_range('2010-01-01',periods=30,freq='W-MON') # weekly, on Mondays
print(a)
a=pd.date_range('2010-01-01',periods=30,freq='B') # business days
print(a)
a=pd.date_range('2010-01-01',periods=30,freq='1h20min') # every 1 hour 20 minutes
print(a)

sr=pd.Series(np.arange(500),index=pd.date_range('2010-01-01',periods=500))
print(sr)
print(sr['2011-03'])
print(sr['2011'])
print(sr['2010':'2011-03'])
print(sr['2010-12-10':'2011-03-11'])
aa=sr.resample('W').sum() # weekly sum
aa=sr.resample('M').sum() # monthly sum
aa=sr.resample('M').mean() # monthly mean
print(aa)

gp=pd.read_csv('601318.csv',index_col=0)
gp=pd.read_csv('601318.csv',index_col='date',parse_dates=True)
gp=pd.read_csv('601318.csv',index_col='date',parse_dates=['date'])
# parse_dates=True parses the index as dates; parse_dates=['date'] parses the named column
print(gp,gp.index)

# When the CSV has no header row:
#gp=pd.read_csv('601318.csv',header=None,names=['ID','name','age'])


# na_values=['None'] specifies which strings are read as NaN
gp=pd.read_csv('601318.csv',header=None, na_values=['None','nan','null'])
print("################")

df = pd.DataFrame({'one':[1,'NaN',3],'two':[4,5,6]},index=['a','b','c'])
print(df)
"""
   one  two
a    1    4
b  NaN    5
c    3    6

header=False: do not write column names
index=False: do not write the index
na_rep='null': write missing values as null
columns=['one']: which columns to write
"""
df.to_csv('c.csv',header=False,index=False,)

"""
Contents of c.csv:
1,4
NaN,5
3,6
"""
df = pd.DataFrame({'one':[1,'NaN',3],'two':[4,5,6]},index=['a','b','c'])
df.to_csv('d.csv',header=False,index=False,na_rep='null',columns=['one'])
df.to_html('aa.html') # an HTML table
df.to_json('aa.json')


,date,open,close,high,low,volume,code
0,2007-03-01,21.878,20.473,22.302,20.04,1977633.51,601318
1,2007-03-02,20.565,20.307,20.758,20.075,425048.32,601318
2,2007-03-05,20.119,19.419,20.202,19.047,419196.74,601318
3,2007-03-06,19.253,19.8,20.128,19.143,297727.88,601318
4,2007-03-07,19.817,20.338,20.522,19.651,287463.78,601318
5,2007-03-08,20.171,20.093,20.272,19.988,130983.83,601318
6,2007-03-09,20.084,19.922,20.171,19.559,160887.79,601318
7,2007-03-12,19.821,19.563,19.821,19.471,145353.06,601318
8,2007-03-13,19.607,19.642,19.804,19.524,102319.68,601318
9,2007-03-14,19.384,19.664,19.734,19.161,173306.56,601318
10,2007-03-15,19.918,19.673,20.342,19.603,152521.9,601318

601318.csv

Matplotlib

Matplotlib is a powerful Python toolkit for plotting and data visualization.

Install: pip install matplotlib
Import: import matplotlib.pyplot as plt
Plotting function: plt.plot()
Show the figure: plt.show()

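1. The plot function

plot: draws a line chart
Line style, linestyle (-, -., --, :)
Marker style, marker (v, ^, s, *, H, +, x, D, o, …)
Color, color (b, g, r, y, k, w, …)

Drawing several curves with plot
Plotting support in pandas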

2. Annotating a figure

Title: plt.title()
x-axis label: plt.xlabel()
y-axis label: plt.ylabel()
x-axis range: plt.xlim()
y-axis range: plt.ylim()
x-axis ticks: plt.xticks()
y-axis ticks: plt.yticks()
Legend: plt.legend()

3. Figures and subplots

Figure (canvas): figure
fig = plt.figure()
Subplot: subplot
ax1 = fig.add_subplot(2,2,1)
Adjusting the spacing between subplots:
subplots_adjust(left, bottom, right, top, wspace, hspace)
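The pieces above fit together as in the following sketch, which puts two subplots on one figure and widens the gap between them. The Agg backend and the in-memory PNG buffer are assumptions made here so that the example runs without opening a window; the data values are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)   # 2 rows, 1 column, first slot
ax2 = fig.add_subplot(2, 1, 2)   # 2 rows, 1 column, second slot

ax1.plot([1, 2, 3, 4], [2, 3, 1, 7], "o--", color="red", label="Line A")
ax1.set_title("Top")
ax1.legend()

ax2.bar([0, 1, 2, 3], [5, 6, 7, 8])
ax2.set_xlabel("x")

fig.subplots_adjust(hspace=0.5)  # extra vertical space between the two subplots

png = io.BytesIO()
fig.savefig(png, format="png")   # render into memory instead of calling plt.show()
```

In an interactive session, plt.show() would take the place of the savefig call.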

4. Supported chart types

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# plt.plot([1,2,3,4],[2,3,1,7]) # line chart: x values, y values
# plt.plot([1,2,3,4],[2,3,1,7],"o") # "o": round markers
# plt.plot([1,2,3,4],[2,3,1,7],"o-") # "o-": round markers joined by a solid line
# plt.plot([1,2,3,4],[2,3,1,7],"o--") # "o--": round markers joined by a dashed line
plt.plot([1,2,3,4],[2,3,1,7],"o--",color='red',label='Line A') # "o--": round markers joined by a dashed line
plt.plot([1,2,3,7],[3,5,6,9],marker='o',color='black',label='Line B')
plt.title("Matplotlib Test Plot")
plt.xlabel("Xlabel")
plt.ylabel("Ylabel")
plt.xlim(0,5) # x-axis range
plt.ylim(0,10)
plt.xticks([0,2,4]) # x-axis ticks
plt.xticks(np.arange(0,11,2))
plt.xticks(np.arange(0,11,2),['a','b','c','d','e','f'])
plt.legend() # uses the label='Line A' passed to plot
plt.show()


df=pd.read_csv('601318.csv',parse_dates=['date'],index_col='date')[['open','close','high','low']]
df.plot() # the index becomes the x axis
plt.show()


x=np.linspace(-1000,1000,10000)
y1=x
y2=x**2
y3=3*x**3+5*x**2+2*x+1
plt.plot(x,y1,color='red',label="Line y=x")
plt.plot(x,y2,color='blue',label="Line y=x^2")
plt.plot(x,y3,color='black',label="Line y=3x^3+5x^2+2x+1")
plt.legend()
plt.xlim(-1000,1000)
plt.ylim(-1000,1000)
plt.show()


##################### Figure
fig=plt.figure()
ax1=fig.add_subplot(2,2,1)  # split the figure into 2 rows and 2 columns; this subplot takes the first slot
ax1.plot([1,2,3,4],[2,3,1,7])
plt.show()


fig=plt.figure()
ax1=fig.add_subplot(2,1,1)
ax1.plot([1,2,3,4],[2,3,1,7])

ax2=fig.add_subplot(2,1,2)  # second slot of the same figure
ax2.plot([1,2,3,4],[2,3,1,7])
plt.show()

##################### Bar chart
plt.bar([0,1,2,4],[5,6,7,8])
plt.show()


data=[32,48,21,100]
labels=['Jan','Feb','Mar','Apr']
# plt.bar(np.arange(len(data)),data,color='red',width=0.3)
plt.bar(np.arange(len(data)),data,color='red',width=[0.1,0.2,0.3,0.4])
plt.xticks(np.arange(len(data)),labels=labels)
plt.show()


##################### Pie chart
plt.pie([10,20,30,40],labels=['a','b','c','d'],autopct="%.2f%%",explode=[0.1,0,0.1,0])
plt.axis('equal')
plt.show()

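##################### Candlestick (K-line) chart
# matplotlib.finance has been removed from Matplotlib; the same functions are
# available in the separate mplfinance package (pip install mplfinance)
import mplfinance.original_flavor as fin
from matplotlib.dates import date2num
df=pd.read_csv('601318.csv',parse_dates=['date'],index_col='date')[['open','close','high','low']]
df['time']=date2num(df.index.to_pydatetime())
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
arr=df[['time','open','close','high','low']].values
fin.candlestick_ochl(ax,arr)
ax.grid()  # grid is a method of the axes, not of the figure
plt.show()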