pandas

Date: 2019-11-12

Tags: pandas


Pandas provides the DataFrame.describe method for viewing a statistical summary of the data.

res.describe()
[10 rows x 24 columns]
       askPrice1  askPrice2  ...    bidVolume4     bidVolume5
count       10.0       10.0  ...     10.000000      10.000000
mean     10425.0    10425.5  ...  51242.200000  268655.800000
std          0.0        0.0  ...   2552.554224    2552.554224
min      10425.0    10425.5  ...  49265.000000  265690.000000
25%      10425.0    10425.5  ...  49265.000000  265690.000000
50%      10425.0    10425.5  ...  49265.000000  270633.000000
75%      10425.0    10425.5  ...  54208.000000  270633.000000
max      10425.0    10425.5  ...  54208.000000  270633.000000
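To make the summary easier to work with, here is a minimal sketch on a small made-up table. The values below are hypothetical stand-ins for `res`, not the original data. It shows that the result of describe() is itself a DataFrame, so a single statistic can be looked up by row label and column name.

```python
import pandas as pd

# Hypothetical sample standing in for the order-book table `res`
res = pd.DataFrame({
    "askPrice1": [10425.0, 10425.0, 10425.0, 10425.0],
    "bidVolume4": [49265, 49265, 54208, 54208],
})

# describe() returns a DataFrame indexed by statistic name
summary = res.describe()
print(summary)

# Pick out one statistic for one column
mean_bid_volume = summary.loc["mean", "bidVolume4"]
print(mean_bid_volume)
```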

The DataFrame.isnull() method shows which values in the table are null; its opposite is DataFrame.notnull().

df.dropna(axis=1, how='all') # drop only the columns whose values are all null
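A minimal sketch of the null checks and the column removal, on a hypothetical three-column table (the column names `a`, `b`, `c` are made up for illustration). It also shows why the `axis` and `how` arguments matter: without them, dropna() removes every row that contains any null.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column "b" is entirely null, column "c" is partly null
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

null_mask = df.isnull()        # True where a value is missing
null_counts = null_mask.sum()  # number of missing values per column

# Drop only the columns in which every value is null
cleaned = df.dropna(axis=1, how="all")

# Default behaviour: drop every row that contains at least one null
dropped_any = df.dropna()

print(null_counts)
print(cleaned.columns.tolist())
print(len(dropped_any))
```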

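In my tests, using an empty string in DataFrame.replace() saved a little space compared with the default null value NaN. For the CSV file as a whole, though, an empty column only stores one extra "," per row, so removing 98 million rows x 6 columns saved only about 200 MB. Further data cleaning still comes down to removing useless data and merging.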
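The "one extra comma per row" point can be checked on a tiny example. This is only a sketch with a hypothetical column named `empty`; it compares the CSV text written with and without an all-null column.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one all-null column
df = pd.DataFrame({"a": [1, 2, 3], "empty": [np.nan] * 3})

# to_csv writes NaN as an empty field by default
with_empty = df.to_csv(index=False)
without_empty = df.dropna(axis=1, how="all").to_csv(index=False)

# Extra characters: one comma per data row, plus ",empty" in the header
extra = len(with_empty) - len(without_empty)
print(extra)
```

Apart from the header, the empty column costs exactly one character per row, which is why dropping empty columns shrinks a large CSV file less than one might expect.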

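DataFrame.dtypes shows the data type of each column. By default Pandas can read int and float64 and treats everything else as object; the columns that need converting are usually dates and times. The DataFrame.astype() method converts the data type of a whole DataFrame or of a single column, and supports Python and NumPy data types.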

df['Name'] = df['Name'].astype('datetime64[ns]') # recent pandas versions reject a unit-less np.datetime64 here
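As a sketch of the dtype check and conversion described above, the example below builds a hypothetical table (the `price` and `volume` columns and the timestamp strings are made up). For the date column it uses pd.to_datetime with an explicit format, which is a swapped-in alternative to astype() rather than the method used in this post; astype() is shown on a numeric column.

```python
import pandas as pd

# Hypothetical table; "Name" holds timestamps stored as text
df = pd.DataFrame({
    "Name": ["2019-11-12 09:30:00", "2019-11-12 09:30:01"],
    "price": [10425.0, 10425.5],
    "volume": [49265, 54208],
})

# Inspect the type pandas inferred for each column
print(df.dtypes)

# Parse the text column into datetimes (alternative to astype)
df["Name"] = pd.to_datetime(df["Name"], format="%Y-%m-%d %H:%M:%S")

# astype() converts a single column to a Python or NumPy type
df["volume"] = df["volume"].astype("float64")

# Datetime columns support arithmetic once converted
elapsed = df["Name"].iloc[1] - df["Name"].iloc[0]
print(df.dtypes)
print(elapsed)
```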

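For aggregation I tested DataFrame.groupby, DataFrame.pivot_table, and pandas.merge. A groupby over 98 million rows x 3 columns took 99 seconds, joining the tables took 26 seconds, and building the pivot table was faster still at only 5 seconds.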

df.groupby(['NO','TIME','SVID']).count() # group
fullData = pd.merge(df, trancodeData)[['NO','SVID','TIME','CLASS','TYPE']] # join
actions = fullData.pivot_table('SVID', columns='TYPE', aggfunc='count') # pivot table
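The three operations can be tried on a small scale with the sketch below. The sample rows and the contents of `trancodeData` are hypothetical; only the column names come from the snippet above. Two details differ from it and are assumptions made for illustration: size() is used instead of count() so that the grouped result carries a value even when every column is a group key, and the pivot table gets an explicit `index="NO"`.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the snippet above
df = pd.DataFrame({
    "NO": [1, 1, 2, 2],
    "TIME": ["09:30", "09:30", "09:31", "09:31"],
    "SVID": ["a", "b", "a", "a"],
})

# Hypothetical lookup table mapping each SVID to a CLASS and TYPE
trancodeData = pd.DataFrame({
    "SVID": ["a", "b"],
    "CLASS": ["x", "y"],
    "TYPE": ["buy", "sell"],
})

# Group: number of rows for each (NO, TIME, SVID) combination
group_sizes = df.groupby(["NO", "TIME", "SVID"]).size()

# Join: attach CLASS and TYPE to every row through the shared SVID column
fullData = pd.merge(df, trancodeData, on="SVID")[["NO", "SVID", "TIME", "CLASS", "TYPE"]]

# Pivot table: count of SVID entries per NO, one column per TYPE
actions = fullData.pivot_table(
    values="SVID", index="NO", columns="TYPE", aggfunc="count", fill_value=0
)

print(group_sizes)
print(actions)
```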
