Ref: Pandas and NumPy arrays explained
Ref: pandas: powerful Python data analysis toolkit (developer documentation)
df = df.values

import pandas as pd
df = pd.DataFrame(df)
Convert the dataset directly into DataFrame format.
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
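As a quick round-trip check, here is a minimal sketch (assuming scikit-learn's load_iris is available; the names are illustrative):

import pandas as pd
from sklearn.datasets import load_iris  # assumption: scikit-learn is installed

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)  # ndarray -> DataFrame
arr = iris_df.values                                           # DataFrame -> ndarray
print(type(arr), arr.shape)                                    # <class 'numpy.ndarray'> (150, 4)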
COMP9318/L1 - Pandas-1.ipynb
COMP9318/L1 - Pandas-2.ipynb
COMP9318/L1 - numpy-fundamentals.ipynb
Similar to an inverted index: each column corresponds to a word, and the index is the doc id.
df = pd.DataFrame([10, 20, 30, 40], columns=['numbers'], index=['a', 'b', 'c', 'd'])
df
Output:
| | numbers |
|---|---|
| a | 10 |
| b | 20 |
| c | 30 |
| d | 40 |
Using "month" as the interval unit.
dates = pd.date_range('2015-1-1', periods=9, freq='M')
df.index = dates
df
Output:
DatetimeIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30'], dtype='datetime64[ns]', freq='M')
| | No1 | No2 | No3 | No4 |
|---|---|---|---|---|
| 2015-01-31 | -0.173893 | 0.744792 | 0.943524 | 1.423618 |
| 2015-02-28 | -0.388310 | -0.494934 | 0.408451 | -0.291632 |
| 2015-03-31 | 0.675479 | 0.256953 | -0.458723 | 0.858815 |
| 2015-04-30 | -0.046759 | -2.548551 | 0.454668 | -1.011647 |
| 2015-05-31 | -0.938467 | 0.636606 | -0.237240 | 0.854314 |
| 2015-06-30 | 0.134884 | -0.650734 | 0.213996 | -1.969656 |
| 2015-07-31 | 1.046851 | -0.016665 | -0.488270 | 1.377827 |
| 2015-08-31 | 0.482625 | 0.176105 | -0.681728 | -1.057683 |
| 2015-09-30 | -1.675402 | 0.364292 | 0.897240 | -0.629711 |
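The 9x4 frame shown above can be reproduced with a sketch like the following (assuming standard-normal random data; the exact values differ on every run):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal((9, 4)),
                  columns=['No1', 'No2', 'No3', 'No4'])
df.index = pd.date_range('2015-1-1', periods=9, freq='M')
df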
Adding a column, similar to adding a key to a dict.
# (1) assigned positionally to every row
df['floats'] = (1.5, 2.5, 3.5, 4.5)

# (2) with a custom index; rows not covered become NaN
df['names'] = pd.DataFrame(['Yves', 'Felix', 'Francesc'], index=['a', 'b', 'c'])
Adding a row, similar to appending to a list.
df = df.append(pd.DataFrame({'numbers': 100, 'floats': 5.75, 'names': 'Henry'}, index=['z',]))
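Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same row can be added with pd.concat(). A minimal sketch reusing the row above:

new_row = pd.DataFrame({'numbers': 100, 'floats': 5.75, 'names': 'Henry'}, index=['z'])
df = pd.concat([df, new_row])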
# 1.1.1 Use vstack to add one row of missing values (nan, nan, nan, nan);
#       reshape(1, -1) turns the 1-D array into a 2-D row
from numpy import array, nan, vstack, hstack
from numpy.random import choice

nan_tmp = array([nan, nan, nan, nan]).reshape(1, -1)
print(nan_tmp)

# 1.1.2 Stack the two arrays vertically
iris.data = vstack((iris.data, array([nan, nan, nan, nan]).reshape(1, -1)))

# 1.2.1 Use hstack to add a column for flower colour (0 = white, 1 = yellow, 2 = red);
#       the colour is random, meaning it has no effect on the flower's class
random_feature = choice([0, 1, 2], size=iris.data.shape[0]).reshape(-1, 1)

# 1.2.2 Concatenate the two arrays horizontally
iris.data = hstack((random_feature, iris.data))
df = pd.read_csv('./asset/lecture_data.txt', sep='\t')  # to read an excel file, use read_excel()
df.head()      # first five rows
df.describe()  # summary statistics
Rows are located by "key", not by row number.
The lookup pattern is: row number --> key --> row content --> target.
.index, .loc, .iloc
df.index               # the index values
df.loc[['a', 'd']]     # selection of multiple indices
df.loc[df.index[1:3]]  # get the index keys first, then the requested rows
The above selects rows directly; below, we first pick a column, then locate a row within it.
df['No2'].iloc[3]
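The same cell can also be reached in a single step; a small sketch (equivalent for this frame):

df.loc[df.index[3], 'No2']              # label-based, same cell as df['No2'].iloc[3]
df.iloc[3, df.columns.get_loc('No2')]   # purely positional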
Two options: itertuples() should be faster than iterrows().
import pandas as pd
inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)
print(df)
-------------------------------------------------------------------
# iterate over the rows; the two approaches below are equivalent
for index, row in df.iterrows():
    print(row["c1"], row["c2"])

for row in df.itertuples(index=True, name='Pandas'):
    print(getattr(row, "c1"), getattr(row, "c2"))
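If performance matters, a rough timing sketch supports the claim above (exact numbers depend on the data and machine):

import timeit

def via_iterrows():
    return [row["c1"] + row["c2"] for _, row in df.iterrows()]

def via_itertuples():
    return [row.c1 + row.c2 for row in df.itertuples(index=False)]

print(timeit.timeit(via_iterrows, number=1000))    # typically the slower of the two
print(timeit.timeit(via_itertuples, number=1000))  # typically noticeably faster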
Ref: Iterate over rows in a dataframe in Pandas
Displaying the values of one column.
In [9]: df.loc[df.index[:]]
Out[9]:
   c1   c2
0  10  100
1  11  110
2  12  120

In [10]: df.loc[df.index[:], 'c2']
Out[10]:
0    100
1    110
2    120
Name: c2, dtype: int64
(1) Value counts for a column.
# permit_status - Outcome
status_cts = df_train.permit_status.value_counts(dropna=False)
print(status_cts)
# Complete      358
# Cancelled      31
# In Process      7
# Comments:
# - Complete v not (Cancelled or In Process) as binary outcome
(2) Summary statistics for a column.
# attendance
attendance_null_ct = df_train.attendance.isnull().sum()
print(attendance_null_ct)  # 3
print(df_train.attendance.describe())
# count       393.000000
# mean       3716.913486
# std       16097.152814
# min          15.000000
# 25%         200.000000
# 50%         640.000000
# 75%        1800.000000
# max      204000.000000
# rows matching a single condition
df[df['location'] == 'Vancouver'].head()

# a more complex condition
df[(df['location'] == 'Vancouver') & (df['time'] != 'Q1') & (df['dollars_sold'] > 500)]
Ref: The difference between JOIN and UNION
Ref: What is the difference between join and merge in Pandas?
This is very similar to a SQL JOIN; how = 'inner' / 'left' / 'right' / 'outer'.
df.join(pd.DataFrame([1, 4, 9, 16, 25],
                     index=['a', 'b', 'c', 'd', 'y'],
                     columns=['squares',]),
        how='inner')
JOIN combines two tables according to the ON condition. There are four main types:

| Join type | Behaviour |
|---|---|
| INNER JOIN | Returns a row only when at least one row in both tables satisfies the join condition; records that do not meet the ON condition are excluded from the result set. |
| LEFT JOIN / LEFT OUTER JOIN | Outer join that includes every record from the left table. If a left-table record has no match in the right table, the right-table columns in the result are NULL; left-table records are shown even when they do not satisfy the ON condition. |
| RIGHT JOIN / RIGHT OUTER JOIN | Outer join that includes every record from the right table; simply the mirror image of LEFT JOIN. |
| FULL JOIN / FULL OUTER JOIN | Full outer join returning all rows from both tables, i.e. the combination of LEFT JOIN and RIGHT JOIN. |
Two tables: msp and party.
-- inner join
SELECT msp.name, party.name FROM msp JOIN party ON party=code
SELECT msp.name, party.name FROM msp INNER JOIN party ON party=code

-- left join
SELECT msp.name, party.name FROM msp LEFT JOIN party ON party=code

-- right join
SELECT msp.name, party.name FROM msp RIGHT JOIN party ON msp.party=party.code

-- full join
SELECT msp.name, party.name FROM msp FULL JOIN party ON msp.party=party.code
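In pandas the same four behaviours map onto the how= argument of merge() (and join()); a minimal sketch with two hypothetical frames left and right:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'l_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'r_val': [20, 30, 40]})

pd.merge(left, right, on='key', how='inner')  # only keys b, c
pd.merge(left, right, on='key', how='left')   # all left keys; NaN where right has no match
pd.merge(left, right, on='key', how='right')  # all right keys
pd.merge(left, right, on='key', how='outer')  # union of keys, like FULL OUTER JOIN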
Concatenating two tables is like a SQL UNION; drop_duplicates() removes the repeated rows.
pd.concat([df2, df3])
pd.concat([df2, df3]).drop_duplicates()
Ref: Pandas' apply(), one of the most useful functions in Pandas
Suppose we want the time difference between the PublishedTime and ReceivedTime columns of the table; this can be done with the function below:
import pandas as pd
import datetime  # used to compute date differences


def dataInterval(data1, data2):
    # parse the date strings into datetime objects so they can be subtracted
    d1 = datetime.datetime.strptime(data1, '%Y-%m-%d')
    d2 = datetime.datetime.strptime(data2, '%Y-%m-%d')
    delta = d1 - d2
    return delta.days


def getInterval(arrLike):  # called once per row to compute the interval in days
    PublishedTime = arrLike['PublishedTime']
    ReceivedTime = arrLike['ReceivedTime']
    days = dataInterval(PublishedTime.strip(), ReceivedTime.strip())  # strip surrounding whitespace
    return days


if __name__ == '__main__':
    fileName = "NS_new.xls"
    df = pd.read_excel(fileName)
    df['TimeInterval'] = df.apply(getInterval, axis=1)
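For plain date columns, a vectorized alternative is usually much faster than apply(); a sketch, assuming the two columns are strings that parse cleanly with pd.to_datetime:

published = pd.to_datetime(df['PublishedTime'].str.strip())
received = pd.to_datetime(df['ReceivedTime'].str.strip())
df['TimeInterval'] = (published - received).dt.days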
df.sum()       # column sums
df.mean()      # column means
df.cumsum()    # cumulative sums
df.describe()  # summary statistics
np.sqrt(abs(df))        # NumPy ufuncs work element-wise on DataFrames
np.sqrt(abs(df)).sum()
| | No1 | No2 | No3 | No4 | Quarter |
|---|---|---|---|---|---|
| 2015-01-31 | -0.173893 | 0.744792 | 0.943524 | 1.423618 | Q1 |
| 2015-02-28 | -0.388310 | -0.494934 | 0.408451 | -0.291632 | Q1 |
| 2015-03-31 | 0.675479 | 0.256953 | -0.458723 | 0.858815 | Q1 |
| 2015-04-30 | -0.046759 | -2.548551 | 0.454668 | -1.011647 | Q2 |
| 2015-05-31 | -0.938467 | 0.636606 | -0.237240 | 0.854314 | Q2 |
| 2015-06-30 | 0.134884 | -0.650734 | 0.213996 | -1.969656 | Q2 |
| 2015-07-31 | 1.046851 | -0.016665 | -0.488270 | 1.377827 | Q3 |
| 2015-08-31 | 0.482625 | 0.176105 | -0.681728 | -1.057683 | Q3 |
| 2015-09-30 | -1.675402 | 0.364292 | 0.897240 | -0.629711 | Q3 |
Every three rows form one group (one quarter); statistics are then computed per group.
groups = df.groupby('Quarter')
groups.mean()
Grouping by two columns.
df['Odd_Even'] = ['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd']
groups = df.groupby(['Quarter', 'Odd_Even'])
groups.mean()
Another example:
type_cat_cts = (
    df_train
    .groupby([df_train.permit_type, df_train.event_category.isnull()])
    .size())
print(type_cat_cts)
# permit_type     event_category
# Charter Vessel  True               10
# Special Event   False             325
# Valet Parking   True               61
# Comments:
# - present iff Special Event
We use agg() to apply multiple functions at once, and pass a list of columns to groupby() to group by multiple columns.
df.groupby(['location','item']).agg({'dollars_sold': [np.mean,np.sum]})
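Since pandas 0.25 the same aggregation can also be written with named aggregation, which yields flat, readable column names; a sketch:

df.groupby(['location', 'item']).agg(
    mean_sold=('dollars_sold', 'mean'),
    total_sold=('dollars_sold', 'sum'))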
For more, see: https://github.com/DBWangGroupUNSW/COMP9318/blob/master/L1%20-%20Pandas-2.ipynb
If you want a second, more fine-grained breakdown, pivot_table can help.
table = pd.pivot_table(df, index = 'location', columns = 'time', aggfunc=np.sum)
pd.pivot_table(df, index = ['location', 'item'], columns = 'time', aggfunc=np.sum, margins=True)
%matplotlib inline
df.cumsum().plot(lw=2.0, grid=True)  # line width is 2.0
# tag: dataframe_plot
# title: Line plot of a DataFrame object
Ref: https://github.com/yhilpisch/py4fi/blob/master/jupyter36/source/tr_eikon_eod_data.csv
# data from Thomson Reuters Eikon API
raw = pd.read_csv('source/tr_eikon_eod_data.csv', index_col=0, parse_dates=True)
raw.info()
# (1) pick out the column of interest
data = pd.DataFrame(raw['.SPX'])
data.columns = ['Close']

# (2) inspect a few rows first, then view the whole series via a figure (too many points to read otherwise)
data.tail()
data['Close'].plot(figsize=(8, 5), grid=True);
# tag: dax
# title: Historical DAX index levels
shift(1) shifts the values down by one row, i.e. each row now refers to the "previous value".
%time data['Return'] = np.log(data['Close'] / data['Close'].shift(1))
data['Return'].plot(figsize=(8, 5), grid=True);
Computing summary statistics over a rolling window.
data['42d'] = data['Close'].rolling(window=42).mean()
data['252d'] = data['Close'].rolling(window=252).mean()
data[['Close', '42d', '252d']].plot(figsize=(8, 5), grid=True)
# tag: dax_trends
# title: The S&P index and moving averages
Plotting the two columns as stacked subplots.
data[['Close', 'Return']].plot(subplots=True, style='b', figsize=(8, 5), grid=True);
# tag: dax_returns
# title: The S&P 500 index and daily log returns
/* omitted */
The usual choice here would be scikit-learn; NumPy, of course, also provides some basic functionality such as polyfit.
xdat = rets['.SPX'].values
ydat = rets['.VIX'].values

reg = np.polyfit(x=xdat, y=ydat, deg=1)

plt.plot(xdat, ydat, 'r.')
ax = plt.axis()  # grab axis values
x = np.linspace(ax[0], ax[1] + 0.01)
------------------------------------------
# plot the fitted trend line
plt.plot(x, np.polyval(reg, x), 'b', lw=2)
plt.grid(True)
plt.axis('tight')
plt.xlabel('S&P 500 returns')
plt.ylabel('VIX returns')
# tag: scatter_rets
# title: Scatter plot of log returns and regression line
Question: what problems does high-frequency data raise in the first place?
The time index still contains many data points at sub-second resolution, but they add little value to the current analysis.
Bid Ask Mid
2017-11-10 13:59:59.716 1.16481 1.16481 1.164810
2017-11-10 13:59:59.757 1.16481 1.16482 1.164815
2017-11-10 14:00:00.005 1.16482 1.16482 1.164820
2017-11-10 14:00:00.032 1.16482 1.16483 1.164825
2017-11-10 14:00:00.131 1.16483 1.16483 1.164830
The idea is: segment the data first, then replace the dense raw data in each segment with a summary statistic; this differs slightly from the rolling-window strategy.
eur_usd_resam = eur_usd.resample(rule='1min', label='right').last()  # label each bar with its right edge
eur_usd_resam.head()
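Other aggregations follow the same pattern; for example, a sketch (assuming the Mid column shown above) that keeps one open/high/low/close bar per minute:

bars = eur_usd['Mid'].resample('1min').ohlc()
bars.head()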
End.