The sample data for this article is available for download; password: vwy3
```python
import pandas as pd

# The data is article metadata previously scraped from cnblogs
df = pd.read_csv('./data/SQL測試用數據_20200325.csv', encoding='utf-8')

# Sample two datasets for the demos below
df1 = df.sample(n=500, random_state=123)
df2 = df.sample(n=600, random_state=234)  # sized so the two samples overlap heavily

# random_state fixes the seed so the two samples are reproducible;
# without it, each run would draw different rows
```
pandas official tutorial
Key parameters of pd.concat:

- objs: the dataframes to concatenate must be wrapped in a list, e.g. [df1, df2, df3]
- To get a UNION ALL-style result, the dataframes being stacked must have consistent columns:
  - if column names differ, new columns are created (filled with NaN for the frames that lack them)
  - if dtypes differ, this does not necessarily raise an error; it depends on whether the types are compatible
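A minimal, self-contained sketch of these two rules; the toy frames `a`, `b` and `c` are made up for illustration:

```python
import pandas as pd

# Two toy frames: a shared 'x' column, but 'y' vs 'z' differ
a = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
b = pd.DataFrame({'x': [3, 4], 'z': ['c', 'd']})

# outer (the default): mismatched column names create new columns filled with NaN
outer = pd.concat([a, b], axis=0)
print(outer.columns.tolist())  # ['x', 'y', 'z']

# inner: only the shared columns survive
inner = pd.concat([a, b], axis=0, join='inner')
print(inner.columns.tolist())  # ['x']

# Mismatched dtypes do not raise; here int64 + float64 is upcast to float64
c = pd.DataFrame({'x': [0.5]})
print(pd.concat([a[['x']], c]).dtypes['x'])  # float64
```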
df2.columns
輸出:
Index(['href', 'title', 'create_time', 'read_cnt', 'blog_name', 'date', 'weekday', 'hour'], dtype='object')
Test:
```python
# Deliberately rename the 2nd column
df2.columns = ['href', 'title_2', 'create_time', 'read_cnt', 'blog_name', 'date', 'weekday', 'hour']
print(df1.shape, df2.shape)

# join='inner' drops the columns that cannot be matched
# the default concatenation direction is row-wise (axis=0)
df_m = pd.concat([df1, df2], axis=0, join='inner')
print(df_m.shape)
```
輸出:
(500, 8) (600, 8)
(1100, 7)
```python
# Size of the concatenated dataset after dropping duplicates
df_m.drop_duplicates(subset='href').shape
```
輸出:
(849, 7)
df.append compared with pd.concat:

Similarity: df1.append(df2) gives the same result as
pd.concat([df1,df2],axis=0,join='outer')
df1.append(df2).shape
輸出:
(1100, 9)
df1.append([df2,df2]).shape
輸出:
(1700, 9)
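Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `pd.concat` spelling above is the durable one. A small sketch of the equivalence, using made-up toy frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

# pd.concat([...], axis=0, join='outer') is what a.append(b) used to do
stacked = pd.concat([a, b], axis=0)
print(stacked.shape)  # (3, 1)

# appending a list of frames maps to putting them all in the concat list
stacked_twice = pd.concat([a, b, b], axis=0)
print(stacked_twice.shape)  # (4, 1)
```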
pd.concat can also perform joins, but the join key is the index rather than column values.
And because the join is index-based, pd.concat can join more than two dataframes in a single call.
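Because the key is the index, three or more frames can be joined in one call. A minimal sketch on made-up toy frames:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3]}, index=[1, 2, 3])
b = pd.DataFrame({'b': [4, 5]}, index=[2, 3])
c = pd.DataFrame({'c': [6, 7]}, index=[3, 4])

# inner: only index labels present in all three frames survive
inner = pd.concat([a, b, c], axis=1, join='inner')
print(inner.index.tolist())  # [3]

# outer: union of all index labels, gaps filled with NaN
outer = pd.concat([a, b, c], axis=1, join='outer')
print(outer.index.tolist())  # [1, 2, 3, 4]
```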
```python
# Concatenate by column: set axis=1
# this is an inner join on the index
print(df1.shape, df2.shape)
df_m_c = pd.concat([df1, df2], axis=1, join='inner')
print(df_m_c.shape)
```
輸出:
(500, 8) (600, 8)
(251, 16)
That gives 251 rows; we can verify by taking the intersection of the two dataframes' indexes:
```python
set1 = set(df1.index)
set2 = set(df2.index)
set_join = set1.intersection(set2)
print(len(set1), len(set2), len(set_join))
```
輸出:
500 600 251
Key parameters of pd.merge:
```python
print(df1.shape, df2.shape)
df_m = pd.merge(left=df1, right=df2,
                how='inner',
                on=['href', 'blog_name'])
print(df_m.shape)
```
輸出:
(500, 8) (600, 8)
(251, 14)
```python
print(df1.shape, df2.shape)
df_m = pd.merge(left=df1, right=df2,
                how='inner',
                left_on='href', right_on='href')
print(df_m.shape)
```
輸出:
(500, 8) (600, 8)
(251, 15)
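The column counts differ between the two merges above (14 vs. 15) because each key listed in `on` appears only once in the result, while colliding non-key columns are kept from both sides with suffixes: 8 + 8 - 2 = 14 when joining on both `href` and `blog_name`, but 8 + 8 - 1 = 15 when joining on `href` alone (the duplicated `blog_name` then survives twice, suffixed). A sketch of that arithmetic on made-up toy frames:

```python
import pandas as pd

left = pd.DataFrame({'k': [1, 2], 'v': [10, 20], 'w': [1, 1]})
right = pd.DataFrame({'k': [1, 2], 'v': [30, 40], 'u': [2, 2]})

# One key column: 3 + 3 - 1 = 5 columns;
# the colliding non-key column 'v' gets _x/_y suffixes
m1 = pd.merge(left, right, on='k')
print(m1.columns.tolist())  # ['k', 'v_x', 'w', 'v_y', 'u']

# Two key columns: 'v' is now part of the key, so 3 + 3 - 2 = 4 columns
m2 = pd.merge(left, right.assign(v=[10, 20]), on=['k', 'v'])
print(m2.columns.tolist())  # ['k', 'v', 'w', 'u']
```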
```python
# Compare the different join modes
print(df1.shape, df2.shape)

# inner join
df_inner = pd.merge(left=df1, right=df2, how='inner', on=['href', 'blog_name'])
# full outer join
df_full_outer = pd.merge(left=df1, right=df2, how='outer', on=['href', 'blog_name'])
# left outer join
df_left_outer = pd.merge(left=df1, right=df2, how='left', on=['href', 'blog_name'])
# right outer join
df_right_outer = pd.merge(left=df1, right=df2, how='right', on=['href', 'blog_name'])

print('inner join (left ∩ right): ' + str(df_inner.shape))
print('full outer join (left ∪ right): ' + str(df_full_outer.shape))
print('left outer join (keep all left rows): ' + str(df_left_outer.shape))
print('right outer join (keep all right rows): ' + str(df_right_outer.shape))
```
輸出:
(500, 8) (600, 8)
inner join (left ∩ right): (251, 14)
full outer join (left ∪ right): (849, 14)
left outer join (keep all left rows): (500, 14)
right outer join (keep all right rows): (600, 14)
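When comparing join modes it also helps to see where each row came from. pd.merge supports an `indicator=True` argument that adds a `_merge` column taking the values `left_only`, `right_only`, or `both`; the toy frames below are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({'k': [1, 2, 3], 'v': ['a', 'b', 'c']})
right = pd.DataFrame({'k': [2, 3, 4], 'w': ['x', 'y', 'z']})

# indicator=True tags each output row with its provenance
full = pd.merge(left, right, how='outer', on='k', indicator=True)
print(full[['k', '_merge']])
# k=2 and k=3 are 'both'; k=1 is 'left_only'; k=4 is 'right_only'
```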
Key parameters of df.join:
```python
print(df1.shape, df2.shape)
df_m = df1.join(df2, how='inner', lsuffix='1', rsuffix='2')
df_m.shape
```
輸出:
(500, 8) (600, 8)
(251, 16)
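We end up with 16 columns because df.join keeps every column from both sides, and since all 8 column names collide here, lsuffix/rsuffix are mandatory (omitting them raises a ValueError). A toy sketch of that behavior on made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'v': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'v': [3, 4]}, index=['y', 'z'])

# 'v' exists on both sides, so suffixes are required;
# a.join(b, how='inner') without them would raise ValueError
joined = a.join(b, how='inner', lsuffix='1', rsuffix='2')
print(joined.index.tolist())    # ['y'] - the only shared index label
print(joined.columns.tolist())  # ['v1', 'v2']
```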
```python
# Data preparation
import math

# Bucket the hour of day into 8-hour blocks
df['time_mark'] = df['hour'].apply(lambda x: math.ceil(int(x) / 8))
df_stat_raw = df.pivot_table(values=['read_cnt', 'href'],
                             index=['weekday', 'time_mark'],
                             aggfunc={'read_cnt': 'sum', 'href': 'count'})
df_stat = df_stat_raw.reset_index()
```
df_stat.head(3)
As shown above, df_stat has two dimensions (weekday, time_mark) and two measures (href, read_cnt).
```python
# In pivot, both index and columns are dimensions
res = df_stat.pivot(index='weekday', columns='time_mark', values='href').reset_index(drop=True)
res
```
```python
# The result produced by pandas.pivot_table
df_stat_raw
```
```python
# By default, unstack moves the innermost index level into the columns
# (it becomes the bottom column level)
df_stat_raw.unstack()

# unstack can also be given a specific level to move
df_stat_raw.unstack('weekday')
# equivalent: the level can be referenced by position
# df_stat_raw.unstack(0)
```
```python
# stack does the reverse: it moves the innermost column level into the index
df_stat_raw.unstack().stack().head(5)
```
```python
# After two stacks the table is fully "long" again:
# each stack peels the innermost column level off, onion-like,
# and appends it to the end of the index
df_stat_raw.unstack().stack().stack().head(5)
```
輸出:
weekday  time_mark
1        0          href            4
                    read_cnt     2386
         1          href           32
                    read_cnt    31888
         2          href           94
dtype: int64
pd.DataFrame(df_stat_raw.unstack().stack().stack()).reset_index().head(5)
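The unstack/stack mechanics are easier to see on a tiny two-level example; the series below is made up for illustration:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Mon', 'Tue'], [0, 1]],
                                 names=['weekday', 'time_mark'])
s = pd.Series([1, 2, 3, 4], index=idx)

# unstack: the innermost index level ('time_mark') becomes the columns
wide = s.unstack()
print(wide.columns.tolist())  # [0, 1]

# stack: the innermost column level goes back into the index - a round trip
tall = wide.stack()
print(tall.tolist())  # [1, 2, 3, 4]
```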
In melt, id_vars specifies which columns are kept as identifier (dimension) columns; all remaining columns are treated as values.
In addition, melt creates a new column named variable that records the name of each column that was unpivoted.
```python
print(df_stat.head(5))
df_stat.melt(id_vars=['weekday']).head(5)
```
df_stat.melt(id_vars=['weekday','time_mark']).head(5)
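The same melt behavior on a self-contained toy frame (no data file needed; the values are made up to mirror df_stat's shape):

```python
import pandas as pd

stat = pd.DataFrame({'weekday': ['Mon', 'Tue'],
                     'time_mark': [1, 2],
                     'href': [4, 32],
                     'read_cnt': [2386, 31888]})

tall = stat.melt(id_vars=['weekday', 'time_mark'])
print(tall.columns.tolist())  # ['weekday', 'time_mark', 'variable', 'value']
print(tall.shape)             # 2 rows x 2 melted columns -> (4, 4)
print(tall['variable'].unique().tolist())  # ['href', 'read_cnt']
```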