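In this article, I will demonstrate how to use the tidyr package for data manipulation. tidyr was written by Hadley Wickham and is often used together with dplyr.

Functions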
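This article demonstrates the following four functions from the tidyr package:

gather: converts wide data to long data. Similar to the melt function in the reshape2 package.
spread: converts long data to wide data. Similar to the cast functions in the reshape2 package.
unite: combines multiple columns into one.
separate: splits one column into multiple columns.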
The examples below use the mtcars dataset from the datasets package.

    library(tidyr)
    library(dplyr)
    head(mtcars)

                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
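To make the data easier to work with, add a car column to the dataset, taken from the row names:

    # copy the row names (car models) into a regular column
    mtcars$car <- rownames(mtcars)
    # move the new column (column 12) to the front
    mtcars <- mtcars[, c(12, 1:11)]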
The call signature of gather is:

    gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

Here, ... specifies the columns to gather.
As with the melt function in the reshape2 package, we get the following result:

    mtcarsNew <- mtcars %>% gather(attribute, value, -car)
    head(mtcarsNew)

                    car attribute value
    1         Mazda RX4       mpg  21.0
    2     Mazda RX4 Wag       mpg  21.0
    3        Datsun 710       mpg  22.8
    4    Hornet 4 Drive       mpg  21.4
    5 Hornet Sportabout       mpg  18.7
    6           Valiant       mpg  18.1

    tail(mtcarsNew)

                   car attribute value
    347  Porsche 914-2      carb     2
    348   Lotus Europa      carb     2
    349 Ford Pantera L      carb     4
    350   Ferrari Dino      carb     6
    351  Maserati Bora      carb     8
    352     Volvo 142E      carb     2
As you can see, every column except car has been gathered into two columns, named attribute and value.
A nice feature of tidyr is that you can gather only some of the columns and leave the rest untouched. If you want to gather all the columns from mpg through gear while keeping the carb and car columns as they are, you can do it like this:

    mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear)
    head(mtcarsNew)

                    car carb attribute value
    1         Mazda RX4    4       mpg  21.0
    2     Mazda RX4 Wag    4       mpg  21.0
    3        Datsun 710    1       mpg  22.8
    4    Hornet 4 Drive    1       mpg  21.4
    5 Hornet Sportabout    2       mpg  18.7
    6           Valiant    1       mpg  18.1
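The column selection passed through ... can be written in several ways. The following is a minimal sketch on a small made-up data frame (the names wide, id, x, y and z are hypothetical and not part of the article's data) comparing exclusion, a range, and explicit names, plus a partial gather that leaves one column alone.

```r
library(tidyr)

# Hypothetical toy data, used only for illustration
wide <- data.frame(
  id = c("a", "b"),
  x = c(1, 2),
  y = c(3, 4),
  z = c(5, 6),
  stringsAsFactors = FALSE
)

# Three ways to select the same columns x, y and z
by_exclusion <- gather(wide, attribute, value, -id)
by_range     <- gather(wide, attribute, value, x:z)
by_name      <- gather(wide, attribute, value, x, y, z)

# Gather only x and y; z stays as an ordinary column
partial <- gather(wide, attribute, value, x:y)

print(by_exclusion)
print(partial)
```

Columns that are not gathered keep their place at the front of the result, followed by the key and value columns.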
The call signature of spread is:

    spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
As with the cast functions in the reshape2 package, we get the following result:

    mtcarsSpread <- mtcarsNew %>% spread(attribute, value)
    head(mtcarsSpread)

                     car carb  mpg cyl disp  hp drat    wt  qsec vs am gear
    1        AMC Javelin    2 15.2   8  304 150 3.15 3.435 17.30  0  0    3
    2 Cadillac Fleetwood    4 10.4   8  472 205 2.93 5.250 17.98  0  0    3
    3         Camaro Z28    4 13.3   8  350 245 3.73 3.840 15.41  0  0    3
    4  Chrysler Imperial    4 14.7   8  440 230 3.23 5.345 17.42  0  0    3
    5         Datsun 710    1 22.8   4  108  93 3.85 2.320 18.61  1  1    4
    6   Dodge Challenger    2 15.5   8  318 150 2.76 3.520 16.87  0  0    3
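Since spread reverses what gather does, a round trip should give back the original values. Here is a small sketch of that idea on a made-up data frame (wide, id, x and y are hypothetical names chosen for illustration); it is only a sanity check, not a general guarantee about row order or column types.

```r
library(tidyr)

# Hypothetical toy data, used only for illustration
wide <- data.frame(
  id = c("a", "b"),
  x = c(1, 2),
  y = c(3, 4),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (id, attribute) pair
long <- gather(wide, attribute, value, x, y)

# Long back to wide: one column per distinct attribute
back <- spread(long, attribute, value)

print(long)
print(back)
```

Note that spread orders the rows by the remaining identifier columns, which is why the mtcars result above is sorted by car name rather than in the original row order.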
The call signature of unite is:

    unite(data, col, ..., sep = "_", remove = TRUE)

Here, ... represents the columns to unite and col represents the name of the combined column.
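To make the sep and remove arguments concrete, here is a minimal sketch on a made-up data frame (df, year, month and ym are hypothetical names, not taken from the article). With remove = FALSE the input columns are kept next to the new one.

```r
library(tidyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(
  year = c(2016, 2016),
  month = c(1, 2)
)

# Default behaviour: the input columns are dropped after uniting
united <- unite(df, ym, year, month, sep = "-")

# remove = FALSE keeps year and month alongside the new ym column
united_keep <- unite(df, ym, year, month, sep = "-", remove = FALSE)

print(united)
print(united_keep)
```

The values are pasted together as text, so the resulting column is a character vector and no zero-padding is applied.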
Let's first make up some data:

    set.seed(1)
    date <- as.Date('2016-01-01') + 0:14
    hour <- sample(1:24, 15)
    min <- sample(1:60, 15)
    second <- sample(1:60, 15)
    event <- sample(letters, 15)
    data <- data.frame(date, hour, min, second, event)
    data

             date hour min second event
    1  2016-01-01    7  30     29     u
    2  2016-01-02    9  43     36     a
    3  2016-01-03   13  58     60     l
    4  2016-01-04   20  22     11     q
    5  2016-01-05    5  44     47     p
    6  2016-01-06   18  52     37     k
    7  2016-01-07   19  12     43     r
    8  2016-01-08   12  35      6     i
    9  2016-01-09   11   7     38     e
    10 2016-01-10    1  14     21     b
    11 2016-01-11    3  20     42     w
    12 2016-01-12   14   1     32     t
    13 2016-01-13   23  19     52     h
    14 2016-01-14   21  41     26     s
    15 2016-01-15    8  16     25     o
Now we need to combine the date, hour, min and second columns into a new column, datetime. Typically, a date-time in R is written as "Year-Month-Day Hour:Min:Second".

    dataNew <- data %>%
      unite(datehour, date, hour, sep = ' ') %>%
      unite(datetime, datehour, min, second, sep = ':')
    dataNew

                  datetime event
    1   2016-01-01 7:30:29     u
    2   2016-01-02 9:43:36     a
    3  2016-01-03 13:58:60     l
    4  2016-01-04 20:22:11     q
    5   2016-01-05 5:44:47     p
    6  2016-01-06 18:52:37     k
    7  2016-01-07 19:12:43     r
    8   2016-01-08 12:35:6     i
    9   2016-01-09 11:7:38     e
    10  2016-01-10 1:14:21     b
    11  2016-01-11 3:20:42     w
    12  2016-01-12 14:1:32     t
    13 2016-01-13 23:19:52     h
    14 2016-01-14 21:41:26     s
    15  2016-01-15 8:16:25     o
The call signature of separate is:

    separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
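We can use the separate function to restore the data to the shape it had when it was first created, as shown below:

    data1 <- dataNew %>%
      separate(datetime, c('date', 'time'), sep = ' ') %>%
      separate(time, c('hour', 'min', 'second'), sep = ':')
    data1

             date hour min second event
    1  2016-01-01   07  30     29     u
    2  2016-01-02   09  43     36     a
    3  2016-01-03   13  59     00     l
    4  2016-01-04   20  22     11     q
    5  2016-01-05   05  44     47     p
    6  2016-01-06   18  52     37     k
    7  2016-01-07   19  12     43     r
    8  2016-01-08   12  35     06     i
    9  2016-01-09   11  07     38     e
    10 2016-01-10   01  14     21     b
    11 2016-01-11   03  20     42     w
    12 2016-01-12   14  01     32     t
    13 2016-01-13   23  19     52     h
    14 2016-01-14   21  41     26     s
    15 2016-01-15   08  16     25     o

Note that the values printed here are zero-padded and row 3 reads 13:59:00, which suggests the datetime column had been parsed as a date-time before it was separated. Applied directly to the dataNew built above, separate returns the unpadded text pieces (for example 7, 30, 29 in row 1 and 13, 58, 60 in row 3), and the new columns are character unless convert = TRUE is set.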
First, datetime is split into a date column and a time column. Then the time column is split into hour, min and second columns.
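The two-step split can also bring back numeric types by using the convert argument from the signature above. The following is a minimal sketch on a made-up two-row data frame (df and parts are hypothetical names; the two strings mimic the unpadded format produced by unite), not the article's exact data.

```r
library(tidyr)

# Hypothetical toy data in the same unpadded format as the united column
df <- data.frame(
  datetime = c("2016-01-01 7:30:29", "2016-01-08 12:35:6"),
  event = c("u", "i"),
  stringsAsFactors = FALSE
)

# Step 1: split on the space into date and time (both stay character)
step1 <- separate(df, datetime, c("date", "time"), sep = " ")

# Step 2: split time on ":" and let convert = TRUE turn the pieces into integers
parts <- separate(step1, time, c("hour", "min", "second"),
                  sep = ":", convert = TRUE)

str(parts)
```

The date column is still text after these two steps; turning it back into a Date would need a separate call such as as.Date.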
This article was translated and edited by 雪晴數據網. For the original, see "Data manipulation with tidyr" by Teja Kodali. When reposting, please cite the original link: http://www.xueqing.cc/cms/article/105