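This post is based mainly on the article A Beginner’s Guide to Optimizing Pandas Code for Speed, which gives an introductory look at how to speed up processing of a Pandas DataFrame.

The test data is a DataFrame with 50,000 rows. Its head looks like this: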
                                   d  m1        m2
    0               GbGXR/7198718882  66  0.670074
    1         ylaMAz/121108977765122  74  0.497126
    2         TmMGuz/841097771117122  39  0.360868
    3        RkzCzz/8210712267122122  76  0.293050
    4  sWxCNIji/11587120677873106105  14  0.893429
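The post does not show how this frame was built. As a minimal sketch, a comparable `origin_df` could be generated as below, assuming that `d` is a random letter string followed by `/` and the concatenated character codes of those letters (this matches the rows above, for example G=71, b=98), that `m1` is an integer below 100, and that `m2` is a uniform float. The helper name `make_origin_df`, the string lengths, and the seed are assumptions made for illustration.

```python
import random
import string

import numpy as np
import pandas as pd


def make_origin_df(n_rows=50_000, seed=0):
    """Build a test frame shaped like the one above (hypothetical helper)."""
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)

    def random_label():
        # Assumed format: letters, then "/", then each letter's code point.
        letters = "".join(rng.choices(string.ascii_letters, k=rng.randint(5, 8)))
        return letters + "/" + "".join(str(ord(ch)) for ch in letters)

    return pd.DataFrame({
        "d": [random_label() for _ in range(n_rows)],
        "m1": np_rng.integers(0, 100, size=n_rows),  # assumed range 0..99
        "m2": np_rng.random(n_rows),                 # assumed uniform in [0, 1)
    })


origin_df = make_origin_df()
print(origin_df.head())
```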
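A function that takes a single argument:

    # The function must accept either a pd.Series or a NumPy array (np.ndarray),
    # so it uses only ordinary arithmetic operations.
    def simple_function(v):
        return (v**2 - v) // 2 + (v**0.5) // 2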
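As a quick check of the requirement stated in the comment, the sketch below calls the function on a scalar, a `pd.Series`, and a NumPy array. The sample values 66, 74, and 39 are the first three `m1` values from the head shown earlier.

```python
import numpy as np
import pandas as pd


def simple_function(v):
    return (v**2 - v) // 2 + (v**0.5) // 2


values = [66, 74, 39]

scalar_result = simple_function(66)
series_result = simple_function(pd.Series(values))
array_result = simple_function(np.array(values))

print(scalar_result)           # 2149.0
print(series_result.tolist())  # [2149.0, 2705.0, 744.0]
print(array_result.tolist())   # [2149.0, 2705.0, 744.0]
```

For 66, the first term is (4356 - 66) // 2 = 2145 and the second is 8.124... // 2 = 4.0, which gives 2149.0.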
The goal is to apply simple_function to the values in the DataFrame's m1 column and store the results in a new m3 column.
Plain for loop with iloc: about 7.23 s per loop

    %%timeit
    m3 = []
    df = origin_df.copy(deep=True)
    for i in range(0, len(df)):
        m3.append(simple_function(df.iloc[i]['m1']))
    df['m3'] = m3

    1 loop, best of 3: 7.23 s per loop
iterrows: about 3.27 s per loop

    %%timeit
    m3 = []
    df = origin_df.copy(deep=True)
    for _, row in df.iterrows():
        m3.append(simple_function(row['m1']))
    df['m3'] = m3

    1 loop, best of 3: 3.27 s per loop
apply: about 29.5 ms per loop

    %%timeit
    df = origin_df.copy(deep=True)
    df['m3'] = df['m1'].apply(simple_function)

    10 loops, best of 3: 29.5 ms per loop
Passing the Pandas Series directly: about 7.88 ms per loop

    %%timeit
    df = origin_df.copy(deep=True)
    df['m3'] = simple_function(df['m1'])

    100 loops, best of 3: 7.88 ms per loop
Passing the underlying NumPy array: about 5.31 ms per loop

    %%timeit
    df = origin_df.copy(deep=True)
    df['m3'] = simple_function(df['m1'].values)

    100 loops, best of 3: 5.31 ms per loop
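The `%%timeit` cells above need IPython or Jupyter. As a rough sketch for running the same comparison as a plain script, the following uses the standard `timeit` module and also checks that all five approaches produce the same `m3` values. The 2,000-row stand-in frame, the seed, and the `with_*` helper names are assumptions for illustration, and absolute timings will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def simple_function(v):
    return (v**2 - v) // 2 + (v**0.5) // 2


# Assumed stand-in for origin_df: only the m1 column matters here.
rng = np.random.default_rng(0)
origin_df = pd.DataFrame({"m1": rng.integers(0, 100, size=2_000)})


def with_iloc(df):
    return [simple_function(df.iloc[i]["m1"]) for i in range(len(df))]


def with_iterrows(df):
    return [simple_function(row["m1"]) for _, row in df.iterrows()]


def with_apply(df):
    return df["m1"].apply(simple_function)


def with_series(df):
    return simple_function(df["m1"])


def with_numpy(df):
    return simple_function(df["m1"].values)


approaches = [with_iloc, with_iterrows, with_apply, with_series, with_numpy]

# All five approaches should agree on the computed values.
reference = np.asarray(with_numpy(origin_df), dtype=float)
for func in approaches:
    assert np.allclose(np.asarray(func(origin_df), dtype=float), reference)

# Best of 3 runs, one call per run; numbers depend on the machine.
timings = {}
for func in approaches:
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(origin_df), number=1, repeat=3)
    )
    print(f"{func.__name__:>14}: {timings[func.__name__] * 1000:.2f} ms")
```

In the timings quoted above, the NumPy version is roughly 1,360 times faster than the iloc loop (7.23 s versus 5.31 ms).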