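Compared with the dplyr package, the data.table package can speed up data processing considerably. This post gives a brief introduction to using data.table.
data.table: built for fast processing of large data sets.
The data-reading function in the data.table package is fread().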
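As a minimal sketch of fread() in use (the temporary file and its contents are made up for illustration, they are not from the original post), a small CSV can be written and read back like this:

```r
library(data.table)

# Write a small CSV to a temporary file so the example is self-contained.
# The file contents are hypothetical sample data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y,v", "a,1,1", "a,3,2", "b,6,3"), tmp)

# fread() detects the separator and the column types on its own
# and returns a data.table.
dt_csv <- fread(tmp)
class(dt_csv)
names(dt_csv)

unlink(tmp)  # remove the temporary file
```

The result is already a data.table, so no conversion step is needed before using the syntax shown in the rest of this post.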
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT
#    x y v
# 1: a 1 1
# 2: a 3 2
# 3: a 6 3
# 4: b 1 4
# 5: b 3 5
# 6: b 6 6
# 7: c 1 7
# 8: c 3 8
# 9: c 6 9
Row extraction comes in two forms: single-row and multi-row.
DT[2]    # 2nd row
#    x y v
# 1: a 3 2
DT[2,]   # same
#    x y v
# 1: a 3 2
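Here DT[2] and DT[2,] are exactly equivalent. The trailing comma only signals that further arguments could be supplied, and those are left at their defaults. In the rest of this post, a trailing comma of this kind is left out.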
DT[1:2]
#    x y v
# 1: a 1 1
# 2: a 3 2
DT[c(2,5)]
#    x y v
# 1: a 3 2
# 2: b 3 5
DT[c(FALSE,TRUE)]   # even rows (usual recycling)
#    x y v
# 1: a 3 2
# 2: b 1 4
# 3: b 6 6
# 4: c 3 8
Here c(FALSE,TRUE) is recycled into a vector whose length equals the number of rows in DT.
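As far as I know, more recent data.table releases no longer recycle a logical row index that is shorter than the table and stop with an error instead. A sketch that does not depend on recycling builds the full-length index explicitly (the variable names keep and even_rows are only for illustration):

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Repeat the FALSE/TRUE pattern until it covers every row of DT.
keep <- rep(c(FALSE, TRUE), length.out = nrow(DT))

# A logical index of full length selects the even rows.
even_rows <- DT[keep]
even_rows$v
```

This gives the same even rows as the recycled form, while making the length of the index explicit.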
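As with rows, column extraction covers both single-column and multi-column cases.
When extracting by column number, the with argument must be set to FALSE (newer data.table releases also accept a bare column number such as DT[,2]).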
DT[,2,with=FALSE]   # 2nd column
#    y
# 1: 1
# 2: 3
# 3: 6
# 4: 1
# 5: 3
# 6: 6
# 7: 1
# 8: 3
# 9: 6
DT[,list(v)]        # v column (as data.table)
#    v
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
Column names can be changed with setnames(). This function also seems to be somewhat faster than names() and colnames(), which are used to rename columns of a data.frame.
dt = data.table(a=1:2,b=3:4,c=5:6)   # compare to data.table
try(tracemem(dt))                    # by reference, no deep or shallow copies
setnames(dt,"b","B")                 # by name, no match() needed (warning if "b" is missing)
setnames(dt,3,"C")                   # by position with warning if 3 > ncol(dt)
setnames(dt,2:3,c("D","E"))          # multiple
setnames(dt,c("a","E"),c("A","F"))   # multiple by name (warning if either "a" or "E" is missing)
setnames(dt,c("X","Y","Z"))          # replace all (length of names must be == ncol(DT))
As with the single-column extraction by number above, multi-column extraction by number also requires with = FALSE.
DT[,2:3,with=FALSE]
#    y v
# 1: 1 1
# 2: 3 2
# 3: 6 3
# 4: 1 4
# 5: 3 5
# 6: 6 6
# 7: 1 7
# 8: 3 8
# 9: 6 9
DT[,c(1,3),with=FALSE]
#    x v
# 1: a 1
# 2: a 2
# 3: a 3
# 4: b 4
# 5: b 5
# 6: b 6
# 7: c 7
# 8: c 8
# 9: c 9
DT[,list(y, v)]
#    y v
# 1: 1 1
# 2: 3 2
# 3: 6 3
# 4: 1 4
# 5: 3 5
# 6: 6 6
# 7: 1 7
# 8: 3 8
# 9: 6 9
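The with = FALSE form is also handy when the column names are held in a variable rather than typed into the call. A small sketch, where wanted and sub are hypothetical names chosen for this example:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Hypothetical variable holding the names of the columns to keep.
wanted <- c("y", "v")

# with = FALSE makes j be read as a vector of column names (or positions)
# instead of an expression evaluated inside DT.
sub <- DT[, wanted, with = FALSE]
names(sub)
```

The result stays a data.table, just as with the list(y, v) form.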
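When extracting by column name without wrapping the names in list(), the columns are still extracted, but the result is returned as a vector.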
DT[,v]         # v column (as vector)
# [1] 1 2 3 4 5 6 7 8 9
DT[,c(v)]      # same
# [1] 1 2 3 4 5 6 7 8 9
DT[, c(y, v)]
# [1] 1 3 6 1 3 6 1 3 6 1 2 3 4 5 6 7 8 9
DT
#    x y v
# 1: a 1 1
# 2: a 3 2
# 3: a 6 3
# 4: b 1 4
# 5: b 3 5
# 6: b 6 6
# 7: c 1 7
# 8: c 3 8
# 9: c 6 9
DT[, a := 'k']
DT
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 3 5 k
# 6: b 6 6 k
# 7: c 1 7 k
# 8: c 3 8 k
# 9: c 6 9 k
DT[,c:=8]       # add a numeric column, 8 for all rows
DT
#    x y v a c
# 1: a 1 1 k 8
# 2: a 3 2 k 8
# 3: a 6 3 k 8
# 4: b 1 4 k 8
# 5: b 3 5 k 8
# 6: b 6 6 k 8
# 7: c 1 7 k 8
# 8: c 3 8 k 8
# 9: c 6 9 k 8
DT[,d:=9L]      # add an integer column, 9L for all rows
DT[2,d:=10L]    # subassign by reference to column d
DT
#    x y v a c  d
# 1: a 1 1 k 8  9
# 2: a 3 2 k 8 10
# 3: a 6 3 k 8  9
# 4: b 1 4 k 8  9
# 5: b 3 5 k 8  9
# 6: b 6 6 k 8  9
# 7: c 1 7 k 8  9
# 8: c 3 8 k 8  9
# 9: c 6 9 k 8  9
DT[, e := d + 2]
DT
#    x y v a c  d  e
# 1: a 1 1 k 8  9 11
# 2: a 3 2 k 8 10 12
# 3: a 6 3 k 8  9 11
# 4: b 1 4 k 8  9 11
# 5: b 3 5 k 8  9 11
# 6: b 6 6 k 8  9 11
# 7: c 1 7 k 8  9 11
# 8: c 3 8 k 8  9 11
# 9: c 6 9 k 8  9 11
If the column name being added already exists in the data, the assignment modifies that column instead.
DT[, c('f', 'g') := list(d + 1, c)]
DT[, `:=`(f = d + 1, g = c)]   # same
DT
#    x y v a c  d  e  f g
# 1: a 1 1 k 8  9 11 10 8
# 2: a 3 2 k 8 10 12 11 8
# 3: a 6 3 k 8  9 11 10 8
# 4: b 1 4 k 8  9 11 10 8
# 5: b 3 5 k 8  9 11 10 8
# 6: b 6 6 k 8  9 11 10 8
# 7: c 1 7 k 8  9 11 10 8
# 8: c 3 8 k 8  9 11 10 8
# 9: c 6 9 k 8  9 11 10 8
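Note that a newly created column can only be computed from columns that already exist, not from other columns created in the same call. In this example g = c runs, whereas g = f raises an error.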
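A small sketch of this restriction and one possible way around it. The table is a cut-down, hypothetical version of the one above, and chaining two := calls is my suggestion rather than something the original post prescribes:

```r
library(data.table)

# Cut-down table with just the columns needed for the demonstration.
DT <- data.table(d = c(9L, 10L, 9L), c = 8)

# f does not exist yet when g is evaluated, so this call fails.
# try() keeps the script running and records the error.
res <- try(DT[, c("f", "g") := list(d + 1L, f)], silent = TRUE)
inherits(res, "try-error")

# Chaining two calls works: f has been created by the time g is computed.
DT[, f := d + 1L][, g := f * 2L]
DT$g
```

Each pair of brackets in the chain operates on the table as left by the previous one, which is why the second step can see f.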
DT[,c:=NULL]   # remove column c
DT
#    x y v a  d  e  f g
# 1: a 1 1 k  9 11 10 8
# 2: a 3 2 k 10 12 11 8
# 3: a 6 3 k  9 11 10 8
# 4: b 1 4 k  9 11 10 8
# 5: b 3 5 k  9 11 10 8
# 6: b 6 6 k  9 11 10 8
# 7: c 1 7 k  9 11 10 8
# 8: c 3 8 k  9 11 10 8
# 9: c 6 9 k  9 11 10 8
DT[, c('d', 'e', 'f', 'g'):=NULL]
DT
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 3 5 k
# 6: b 6 6 k
# 7: c 1 7 k
# 8: c 3 8 k
# 9: c 6 9 k
Simple operations mainly include the sum, mean, variance, and standard deviation.
DT[2:3,sum(v)]    # sum(v) over rows 2 and 3
# [1] 5
DT[2:3,mean(v)]   # mean(v) over rows 2 and 3
# [1] 2.5
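Variance and standard deviation follow the same pattern. A sketch that computes several statistics in one call (the result column names total, avg, variance and std are my own choice):

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)

# Wrapping the expressions in list() returns a one-row data.table
# with one named column per statistic.
stats <- DT[, list(total    = sum(v),
                   avg      = mean(v),
                   variance = var(v),
                   std      = sd(v))]
stats$total      # 45
stats$avg        # 5
stats$variance   # 7.5
```

Leaving out the row index applies the functions to all rows. Adding 2:3 in the first position would restrict them to rows 2 and 3, as in the calls above.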
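A key (index) is defined on columns. Once a key is set, the data is automatically re-sorted by the key values, so each table can have at most one key. A key can, however, consist of several columns, and these may be numeric, factor, character, or other types.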
## method 1
key(DT)
# NULL
setkey(DT,x)      # set a 1-column key. No quotes, for convenience.
key(DT)
# [1] "x"
DT
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 3 5 k
# 6: b 6 6 k
# 7: c 1 7 k
# 8: c 3 8 k
# 9: c 6 9 k
## method 2
setkeyv(DT,"y")   # same idea, column given as a character vector (v in setkeyv stands for vector)
key(DT)
# [1] "y"
Setting a new key on the data replaces the previous key.
## method 1
setkey(DT,x,v)              # set a 2-column key. No quotes, for convenience.
key(DT)
# [1] "x" "v"
DT
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 3 5 k
# 6: b 6 6 k
# 7: c 1 7 k
# 8: c 3 8 k
# 9: c 6 9 k
## method 2
setkeyv(DT,c("x", "v"))     # same (v in setkeyv stands for vector)
key(DT)
# [1] "x" "v"
DT
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 3 5 k
# 6: b 6 6 k
# 7: c 1 7 k
# 8: c 3 8 k
# 9: c 6 9 k
Extracting data by key speeds up the lookup.
Forward extraction
setkey(DT, x)
DT["a"]      # binary search (fast)
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[.("a")]   # same; i.e. binary search (fast)
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[x=="a"]   # same; i.e. binary search (fast)
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[!.("a")]   # not join
#    x y v a
# 1: b 1 4 k
# 2: b 3 5 k
# 3: b 6 6 k
# 4: c 1 7 k
# 5: c 3 8 k
# 6: c 6 9 k
DT[!"a"]      # same
#    x y v a
# 1: b 1 4 k
# 2: b 3 5 k
# 3: b 6 6 k
# 4: c 1 7 k
# 5: c 3 8 k
# 6: c 6 9 k
DT[!2:4]      # all rows other than 2:4
#    x y v a
# 1: a 1 1 k
# 2: b 3 5 k
# 3: b 6 6 k
# 4: c 1 7 k
# 5: c 3 8 k
# 6: c 6 9 k
setkey(DT, x, y)
## Method 1
DT["a"]                      # join to 1st column of key
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[.("a")]                   # same, .() is an alias for list()
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[.("a",3)]                 # join to 2 columns
#    x y v a
# 1: a 3 2 k
DT[.("a",3:6)]               # join 4 rows (2 missing)
#    x y  v  a
# 1: a 3  2  k
# 2: a 4 NA NA
# 3: a 5 NA NA
# 4: a 6  3  k
DT[.("a",3:6),nomatch=0]     # remove missing
#    x y v a
# 1: a 3 2 k
# 2: a 6 3 k
DT[.("a",3:6),roll=TRUE]     # rolling join (locf)
#    x y v a
# 1: a 3 2 k
# 2: a 4 2 k
# 3: a 5 2 k
# 4: a 6 3 k
## Method 2
DT[J("a")]
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[J("a",3)]                 # binary search (fast)
#    x y v a
# 1: a 3 2 k
DT[J("a",3:6)]               # same; i.e. binary search (fast)
#    x y  v  a
# 1: a 3  2  k
# 2: a 4 NA NA
# 3: a 5 NA NA
# 4: a 6  3  k
DT[J("a",3:6), nomatch = 0]
#    x y v a
# 1: a 3 2 k
# 2: a 6 3 k
DT[J("a",3:6), roll = TRUE]
#    x y v a
# 1: a 3 2 k
# 2: a 4 2 k
# 3: a 5 2 k
# 4: a 6 3 k
## Method 3
DT[list("a")]
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
DT[list("a",3)]
#    x y v a
# 1: a 3 2 k
DT[list("a", 3:6)]
#    x y  v  a
# 1: a 3  2  k
# 2: a 4 NA NA
# 3: a 5 NA NA
# 4: a 6  3  k
DT[list("a", 3:6), nomatch = 0]
#    x y v a
# 1: a 3 2 k
# 2: a 6 3 k
DT[list("a", 3:6), roll = TRUE]
#    x y v a
# 1: a 3 2 k
# 2: a 4 2 k
# 3: a 5 2 k
# 4: a 6 3 k
DT[x!="b" | y!=3]   # not yet optimized, currently vector scans
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 6 6 k
# 6: c 1 7 k
# 7: c 3 8 k
# 8: c 6 9 k
DT[!.("b",3)]       # same result but much faster
#    x y v a
# 1: a 1 1 k
# 2: a 3 2 k
# 3: a 6 3 k
# 4: b 1 4 k
# 5: b 6 6 k
# 6: c 1 7 k
# 7: c 3 8 k
# 8: c 6 9 k
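Grouped summarization means applying simple operations per category of some column. It is done with the by argument. Note also that the by argument and the key do not affect each other here.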
DT[,sum(v),by=x]
#    x V1
# 1: a  6
# 2: b 15
# 3: c 24
DT[,sum(v),by=y]
#    y V1
# 1: 1 12
# 2: 3 15
# 3: 6 18
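To make the independence of by and the key concrete, here is a small sketch. The keyby variant is not mentioned in the original post. I include it as the documented data.table argument for the case where a keyed result is wanted, and the names by_y and keyed are illustrative:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
setkey(DT, x)

# Group on y, which is not part of the key.
by_y <- DT[, list(total = sum(v)), by = y]

key(DT)      # the key of DT is still "x" after grouping
key(by_y)    # the grouped result carries no key

# keyby groups in the same way and also sets a key on the result.
keyed <- DT[, list(total = sum(v)), keyby = y]
key(keyed)
```

Grouping therefore works on any column, whether or not a key has been set, and leaves the original table's key as it was.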
DT[,list(sum.v.x = sum(v)),by=x]
#    x sum.v.x
# 1: a       6
# 2: b      15
# 3: c      24
DT[,list(sum.v.y = sum(v)),by=y]
#    y sum.v.y
# 1: 1      12
# 2: 3      15
# 3: 6      18
DT[,sum.v.y := sum(v) ,by=y]   # add the grouped sum as a new column
DT
#    x y v a sum.v.y
# 1: a 1 1 k      12
# 2: a 3 2 k      15
# 3: a 6 3 k      18
# 4: b 1 4 k      12
# 5: b 3 5 k      15
# 6: b 6 6 k      18
# 7: c 1 7 k      12
# 8: c 3 8 k      15
# 9: c 6 9 k      18
DT[,list(mean(v),sum(v)),by=list(x,y)]   # keyed by
#    x y V1 V2
# 1: a 1  1  1
# 2: a 3  2  2
# 3: a 6  3  3
# 4: b 1  4  4
# 5: b 3  5  5
# 6: b 6  6  6
# 7: c 1  7  7
# 8: c 3  8  8
# 9: c 6  9  9
DT[,list(mean.v = mean(v),sum.v = sum(v)),by=list(x,y)]   # keyed by
#    x y mean.v sum.v
# 1: a 1      1     1
# 2: a 3      2     2
# 3: a 6      3     3
# 4: b 1      4     4
# 5: b 3      5     5
# 6: b 6      6     6
# 7: c 1      7     7
# 8: c 3      8     8
# 9: c 6      9     9
DT[,c("mean.v", "sum.v.y") := list(mean(v),sum(v)) ,by=list(x,y)]
DT
#    x y v a sum.v.y mean.v
# 1: a 1 1 k       1      1
# 2: a 3 2 k       2      2
# 3: a 6 3 k       3      3
# 4: b 1 4 k       4      4
# 5: b 3 5 k       5      5
# 6: b 6 6 k       6      6
# 7: c 1 7 k       7      7
# 8: c 3 8 k       8      8
# 9: c 6 9 k       9      9
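The data.table format speeds up processing, while data.frame is the more basic structure. Conversion between the two is done with data.table(), setDT(), and setDF(): data.table() and setDT() turn a data.frame into a data.table, and setDF() turns a data.table into a data.frame. Note that setDT() and setDF() convert by reference, so the class of the argument itself changes, whereas data.table() returns a new object and leaves its argument as it was.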
class(DT)
# [1] "data.table" "data.frame"
class(setDF(DT))
# [1] "data.frame"
class(DT)
# [1] "data.frame"
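The following sketch contrasts the three conversion functions on a small, made-up data.frame. The class_after_* variables exist only to record the class of df at each step:

```r
library(data.table)

# Hypothetical sample data.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

dt_copy <- data.table(df)        # builds a new data.table from df
class_after_copy <- class(df)    # df itself is still a plain data.frame

setDT(df)                        # converts df in place, without copying
class_after_setDT <- class(df)

setDF(df)                        # converts back, again in place
class_after_setDF <- class(df)

class_after_copy
class_after_setDT
class_after_setDF
```

In-place conversion avoids copying a large table. If the original data.frame is still needed as it is, data.table() is the safer choice.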
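In addition, data.table objects work with the base functions duplicated(), unique(), and subset(). The package also offers replacements for some base functions: rbindlist(), an upgraded rbind(), can combine data with differing numbers of columns and differing column positions (with fill = TRUE), and setorder() sorts faster than the arrange() function in dplyr.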
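A minimal sketch of rbindlist() and setorder() on two small tables invented for this example. Matching columns by name and filling missing ones with NA is requested through the use.names and fill arguments:

```r
library(data.table)

# Two hypothetical tables: b has its columns in a different order
# and carries an extra column.
a <- data.table(id = 1:2, score = c(10, 20))
b <- data.table(score = 30, id = 3L, note = "late")

# Match columns by name and fill the missing note values with NA.
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
names(combined)

# setorder() sorts by reference; a minus sign means descending order.
setorder(combined, -score)
combined$id   # 3 2 1
```

Like setnames() and setkey(), setorder() changes the table in place, so its result does not need to be assigned.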
Source: http://xukuang.github.io/blog/2016/04/data-table-in-R/