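dplyr is one of Hadley Wickham's packages, used mainly for data cleaning and wrangling. It works specifically with the data frame format, which makes data processing considerably faster, and it also provides interfaces to databases. This section covers the basic usage of dplyr's functions. dplyr code can use %>% (the pipe, or chaining operator), which passes the output of one function to the next function as its first argument. Note that because the value is passed as the first argument, that argument is not written out in the next function call.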
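As a minimal sketch of the pipe behaviour described above, the following compares a nested call with its piped equivalent on the built-in mtcars data. The variable names `nested` and `piped` are illustrative only.

```r
library(dplyr)

# Nested call: has to be read from the inside out.
nested <- head(filter(mtcars, cyl == 4), 3)

# Piped call: the result on the left becomes the first argument of the
# function on the right, so that argument is not written again.
piped <- mtcars %>%
  filter(cyl == 4) %>%
  head(3)

identical(nested, piped)  # TRUE: both forms evaluate the same calls
```

The piped form reads top to bottom in the order the steps are applied, which is the main reason to prefer it for longer chains.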
Contents:
Filtering: the filter() function
install.packages("dplyr")        # one-time installation
library(dplyr)
mtcars_df <- as_tibble(mtcars)   # as_tibble() replaces the deprecated tbl_df()
> filter(mtcars_df, hp < 110 & vs == 1)
# A tibble: 10 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
2  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
3  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
4  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
# ... (remaining rows omitted)
Sorting: arrange()
> a <- head(mtcars_df, 2)
> a
# A tibble: 2 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9 2.620 16.46     0     1     4     4
2    21     6   160   110   3.9 2.875 17.02     0     1     4     4
> arrange(a, desc(wt), desc(qsec))  # desc() takes one column at a time
# A tibble: 2 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9 2.875 17.02     0     1     4     4
2    21     6   160   110   3.9 2.620 16.46     0     1     4     4
> arrange(a, wt, qsec)
# A tibble: 2 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9 2.620 16.46     0     1     4     4
2    21     6   160   110   3.9 2.875 17.02     0     1     4     4
Selecting: select()
> mtcars_df %>% select(mpg, wt, qsec)
# A tibble: 32 × 3
     mpg    wt  qsec
*  <dbl> <dbl> <dbl>
1   21.0 2.620 16.46
2   21.0 2.875 17.02
# ... (remaining rows omitted)
Transforming: mutate()
> mutate(mtcars_df, NO = 1:dim(mtcars_df)[1])
# A tibble: 32 × 12
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    NO
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4     1
2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4     2
3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1     3
4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1     4
5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2     5
6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1     6
# ... (remaining rows omitted)
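The row-number column above can also be written with dplyr's row_number() helper, and mutate() can compute a new column from existing ones in the same call. The sketch below assumes the same mtcars data; the `hp_per_wt` column is a hypothetical example, not part of the original post.

```r
library(dplyr)

mtcars_df <- as_tibble(mtcars)

result <- mtcars_df %>%
  mutate(
    NO = row_number(),    # same values as 1:dim(mtcars_df)[1]
    hp_per_wt = hp / wt   # hypothetical derived column, for illustration only
  )

ncol(result)  # 13: the 11 original columns plus NO and hp_per_wt
```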
Summarizing: summarise()
> summarise(mtcars, mean(disp))
  mean(disp)
1   230.7219
> summarise(group_by(mtcars, cyl), mean(disp))
# A tibble: 3 × 2
    cyl `mean(disp)`
  <dbl>        <dbl>
1     4     105.1364
2     6     183.3143
3     8     353.1000
Grouping: group_by()
> cars <- group_by(mtcars_df, cyl)
> summarise(cars, count = n())  # n() counts the rows in each group
# A tibble: 3 × 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14
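Grouping and summarising are usually chained with %>%, and summarise() accepts several named summaries at once. The following is a sketch under that assumption; the names `by_cyl`, `count` and `mean_disp` are chosen here for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    count = n(),            # number of rows in each cyl group
    mean_disp = mean(disp)  # mean displacement in each cyl group
  )

by_cyl
```

Naming the summaries gives plain column names such as `mean_disp` instead of the backquoted `mean(disp)` seen in the summarise() output earlier.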
Combining data
bind
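mydf1 <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
mydf2 <- data.frame(x = c(5, 6), y = c(50, 60))
mydf3 <- data.frame(z = c(100, 200, 300, 400))
bind_rows(mydf1, mydf2)   # stack rows: mydf2 is appended below mydf1
bind_cols(mydf1, mydf3)   # place columns side by side: z is added next to x and y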
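To make the shapes of these results concrete, the sketch below rebuilds the same three data frames and checks the dimensions. It also shows one further case that is not in the original example: when bind_rows() receives data frames with different columns, the missing cells are filled with NA. The names `rows`, `cols` and `mixed` are illustrative.

```r
library(dplyr)

mydf1 <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
mydf2 <- data.frame(x = c(5, 6), y = c(50, 60))
mydf3 <- data.frame(z = c(100, 200, 300, 400))

rows <- bind_rows(mydf1, mydf2)   # 6 rows, columns x and y
cols <- bind_cols(mydf1, mydf3)   # 4 rows, columns x, y and z

# Different column sets: columns are matched by name and gaps become NA.
mixed <- bind_rows(mydf1, mydf3)  # 8 rows, columns x, y and z

dim(rows)
dim(cols)
dim(mixed)
```

Note that bind_cols() matches rows by position, so the inputs should have the same number of rows, as mydf1 and mydf3 do here.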