> vars <- c("mpg", "hp", "wt")
> head(mtcars[vars])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
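Taking the dataset above as an example, users of a base R installation can obtain descriptive statistics with the summary() function. It reports the minimum, maximum, quartiles, and mean for numeric variables, and frequency counts for factors and logical vectors.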
> summary(mtcars[vars])
      mpg              hp              wt
 Min.   :10.40   Min.   : 52.0   Min.   :1.513
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581
 Median :19.20   Median :123.0   Median :3.325
 Mean   :20.09   Mean   :146.7   Mean   :3.217
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610
 Max.   :33.90   Max.   :335.0   Max.   :5.424
Functions for computing descriptive statistics by group
You can use the aggregate() function to obtain descriptive statistics by group.
> aggregate(mtcars[vars],by=list(am=mtcars$am),mean)
  am      mpg       hp       wt
1  0 17.14737 160.2632 3.768895
2  1 24.39231 126.8462 2.411000
> aggregate(mtcars[vars],by=list(am=mtcars$am),sd)
  am      mpg       hp        wt
1  0 3.833966 53.90820 0.7774001
2  1 6.166504 84.06232 0.6169816
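Note the use of list(). If you write list(mtcars$am) instead, the am column is labeled Group.1 rather than am. The assignment am=mtcars$am gives the column a more helpful label. If you have several grouping variables, use a statement of this form: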
by=list(name1=groupvar1, name2=groupvar2,...)
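As a minimal sketch of the multi-variable form, the call below groups mtcars by both am and cyl. Choosing cyl as the second grouping variable, and the name by_am_cyl, are assumptions made for illustration only.

```r
# Group by two variables; the names given inside list() become column labels.
vars <- c("mpg", "hp", "wt")
by_am_cyl <- aggregate(mtcars[vars],
                       by = list(am = mtcars$am, cyl = mtcars$cyl),
                       FUN = mean)
print(by_am_cyl)
```

Each row of the result corresponds to one observed combination of am and cyl.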
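However, aggregate() is best suited to single-value functions such as the mean or standard deviation in each call (a function that returns several values produces matrix-valued columns that are awkward to read). To return several statistics at once, you can use the by() function.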
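A minimal sketch of the by() approach follows. The helper dstats() is a hypothetical name introduced here, and reporting only the mean and standard deviation is an assumption; any function that returns several values per column could be substituted.

```r
vars <- c("mpg", "hp", "wt")

# Hypothetical helper: mean and standard deviation for every column of df.
dstats <- function(df) {
  sapply(df, function(x) c(mean = mean(x), sd = sd(x)))
}

# by() splits the data frame by am and applies dstats() to each piece.
res <- by(mtcars[vars], list(am = mtcars$am), dstats)
print(res)
```

The result holds one small matrix per group, with statistics in rows and variables in columns.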
In the examples below, assume that A, B, and C represent categorical variables.
> library(vcd)
> library(grid)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
1.1 One-way contingency tables
You can generate a simple table of frequency counts with the table() function.
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28
You can convert these frequencies to proportions with prop.table(), or to percentages with prop.table()*100:
> prop.table(mytable)
Improved
     None      Some    Marked
0.5000000 0.1666667 0.3333333
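The percentage form can be sketched as follows. To keep the example runnable without the vcd package, it uses mtcars$cyl as the categorical variable; that substitution, and the variable names, are assumptions for illustration.

```r
# Frequencies, proportions, and percentages for one categorical variable.
cyl_counts <- table(mtcars$cyl)
cyl_props  <- prop.table(cyl_counts)
cyl_pct    <- round(cyl_props * 100, 1)
print(cyl_pct)
```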
1.2 Two-way contingency tables
For two-way tables, the format for table() is:
mytable <- table(A, B)
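where A is the row variable and B is the column variable.
Alternatively, the xtabs() function lets you create a contingency table using formula-style input: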
mytable <- xtabs(~ A + B, data=mydata)
where mydata is a matrix or data frame. The variables to be cross-classified appear on the right side of the formula:
> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
You can generate marginal frequencies and proportions with the margin.table() and prop.table() functions, respectively.
> margin.table(mytable,1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable,1)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951
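The index 1 refers to the first variable in the table statement (the row variable). Column sums and column proportions can be computed as follows: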
> margin.table(mytable,2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable,2)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6904762 0.5000000 0.2500000
  Treated 0.3095238 0.5000000 0.7500000
Cell proportions can be obtained with this statement:
> prop.table(mytable)
         Improved
Treatment       None       Some     Marked
  Placebo 0.34523810 0.08333333 0.08333333
  Treated 0.15476190 0.08333333 0.25000000
You can use the addmargins() function to add marginal sums to these tables.
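A short sketch of addmargins() is given below. It uses the built-in mtcars data (am by gear) rather than Arthritis so that it runs without the vcd package; that choice and the variable names are assumptions for illustration.

```r
tab <- xtabs(~ am + gear, data = mtcars)

# Add both row and column sums to the table of counts.
with_sums <- addmargins(tab)
print(with_sums)

# Add only a Sum column to the row proportions, so that each row
# shows its proportions followed by their total.
row_props <- addmargins(prop.table(tab, 1), 2)
print(row_props)
```

Passing a margin as the second argument limits which sums are added; without it, sums are added along every dimension.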
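1.3 Multiway contingency tables
In addition to the methods introduced for two-way tables above, the ftable() function prints multiway contingency tables in a compact and attractive form.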
> mytable <- xtabs(~Treatment+Sex+Improved, data=Arthritis)
> mytable
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
> margin.table(mytable,1)
Treatment
Placebo Treated
     43      41
> margin.table(mytable,2)
Sex
Female   Male
    59     25
> margin.table(mytable,3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable,c(1,3))
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable,c(1,2)))
                 Improved       None       Some     Marked
Treatment Sex
Placebo   Female          0.59375000 0.21875000 0.18750000
          Male            0.90909091 0.00000000 0.09090909
Treated   Female          0.22222222 0.18518519 0.59259259
          Male            0.50000000 0.14285714 0.35714286
> ftable(addmargins(prop.table(mytable,c(1,2)),3))
                 Improved       None       Some     Marked        Sum
Treatment Sex
Placebo   Female          0.59375000 0.21875000 0.18750000 1.00000000
          Male            0.90909091 0.00000000 0.09090909 1.00000000
Treated   Female          0.22222222 0.18518519 0.59259259 1.00000000
          Male            0.50000000 0.14285714 0.35714286 1.00000000
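The same multiway operations can be sketched on a table that ships with base R. The example below uses the built-in Titanic table in place of Arthritis; that substitution and the variable names are assumptions made so the code runs without extra packages.

```r
# Titanic is a built-in 4-way table: Class x Sex x Age x Survived.
# Collapse it to Class x Survived by summing over Sex and Age.
class_surv <- margin.table(Titanic, c(1, 4))
print(class_surv)

# Survival proportions within each Class x Sex combination,
# printed compactly with ftable().
sex_class <- margin.table(Titanic, c(1, 2, 4))
props <- prop.table(sex_class, c(1, 2))
print(ftable(round(props, 3)))
```

The margin vector passed to prop.table() names the dimensions within which the proportions sum to 1.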
You can apply the chisq.test() function to a two-way table to run a chi-square test of independence between the row and column variables.
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)

        Pearson's Chi-squared test

data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463

> mytable <- xtabs(~Treatment+Sex, data=Arthritis)
> chisq.test(mytable)

        Pearson's Chi-squared test with Yates' continuity correction

data:  mytable
X-squared = 0.38378, df = 1, p-value = 0.5356
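Treatment and Improved appear to be related (p < 0.01), whereas there is no evidence of a relationship between Treatment and Sex (p > 0.05).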
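The decision described above can also be read off the test object programmatically. The sketch below uses the built-in HairEyeColor table instead of Arthritis, and the 0.05 threshold is an assumed significance level, not something fixed by chisq.test().

```r
# Hair x Eye counts, summed over Sex, from the built-in HairEyeColor table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))

res <- chisq.test(hair_eye)

# The returned object stores the pieces shown in the printed output.
res$statistic   # chi-square statistic
res$parameter   # degrees of freedom
res$p.value     # p-value

# Reject independence at the (assumed) 0.05 level?
alpha <- 0.05
related <- res$p.value < alpha
print(related)
```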
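Use the fisher.test() function to perform Fisher's exact test. Its null hypothesis is that the rows and columns of a contingency table with fixed marginals are independent.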
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

        Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided
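The mantelhaen.test() function performs the Cochran-Mantel-Haenszel chi-square test. Its null hypothesis is that two nominal variables are conditionally independent in each stratum of a third variable.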
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)

        Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
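To illustrate why stratifying matters, the sketch below compares a marginal chi-square test with the stratified test on the built-in UCBAdmissions table. Using this dataset is an assumption for illustration; it is not part of the Arthritis analysis.

```r
# UCBAdmissions is a built-in 2 x 2 x 6 table: Admit x Gender x Dept.
# Ignoring department, test whether admission and gender are independent.
marginal <- chisq.test(margin.table(UCBAdmissions, c(1, 2)))

# Stratifying by Dept tests conditional independence within departments.
cmh <- mantelhaen.test(UCBAdmissions)

print(marginal$p.value)
print(cmh$p.value)
```

The third dimension of the table is treated as the stratifying variable, which is why Sex is listed last in the xtabs() formula above.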
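The Pearson product-moment correlation coefficient measures the degree of linear relationship between two quantitative variables.
Spearman's rank-order correlation coefficient measures the degree of relationship between two rank-ordered variables.
Kendall's tau is also a nonparametric measure of rank correlation.
The cor() function computes all three correlation coefficients, and the cov() function computes covariances.
cor(x, use= , method= )
x is a matrix or data frame;
use specifies how missing data are handled ("all.obs", "everything", "complete.obs", or "pairwise.complete.obs"; the default is "everything");
method specifies the type of correlation ("pearson", "spearman", or "kendall"; the default is "pearson").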
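The effect of the use= and method= options can be sketched as follows. state.x77 contains no missing values, so the example introduces one artificially; that step and the variable names are assumptions for illustration.

```r
states <- state.x77[, 1:6]

# Kendall's tau instead of the default Pearson correlation.
tau <- cor(states, method = "kendall")

# Introduce a single missing value for illustration only.
states_na <- states
states_na[1, "Income"] <- NA

default_cor  <- cor(states_na)                        # use = "everything"
complete_cor <- cor(states_na, use = "complete.obs")  # drop incomplete rows

print(round(default_cor, 2))
print(round(complete_cor, 2))
```

With the default setting, every correlation involving Income is NA; with "complete.obs", the row containing the missing value is dropped before computing.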
> states <- state.x77[,1:6]
> cov(states)
              Population      Income   Illiteracy     Life Exp      Murder      HS Grad
Population 19931683.7588 571229.7796  292.8679592 -407.8424612 5663.523714 -3551.509551
Income       571229.7796 377573.3061 -163.7020408  280.6631837 -521.894286  3076.768980
Illiteracy      292.8680   -163.7020    0.3715306   -0.4815122    1.581776    -3.235469
Life Exp       -407.8425    280.6632   -0.4815122    1.8020204   -3.869480     6.312685
Murder         5663.5237   -521.8943    1.5817755   -3.8694804   13.627465   -14.549616
HS Grad       -3551.5096   3076.7690   -3.2354694    6.3126849  -14.549616    65.237894
> cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
> cor(states,method="spearman")
           Population     Income Illiteracy   Life Exp     Murder    HS Grad
Population  1.0000000  0.1246098  0.3130496 -0.1040171  0.3457401 -0.3833649
Income      0.1246098  1.0000000 -0.3145948  0.3241050 -0.2174623  0.5104809
Illiteracy  0.3130496 -0.3145948  1.0000000 -0.5553735  0.6723592 -0.6545396
Life Exp   -0.1040171  0.3241050 -0.5553735  1.0000000 -0.7802406  0.5239410
Murder      0.3457401 -0.2174623  0.6723592 -0.7802406  1.0000000 -0.4367330
HS Grad    -0.3833649  0.5104809 -0.6545396  0.5239410 -0.4367330  1.0000000
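By default, the result is a square matrix (every variable is correlated with every other variable). You can also compute nonsquare correlation matrices.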
> x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[,c("Life Exp", "Murder")]
> cor(x, y)
              Life Exp     Murder
Population -0.06805195  0.3436428
Income      0.34025534 -0.2300776
Illiteracy -0.58847793  0.7029752
HS Grad     0.58221620 -0.4879710
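A partial correlation is the correlation between two quantitative variables while controlling for one or more other quantitative variables. You can compute partial correlation coefficients with the pcor() function in the ggm package. In the call below, the first two indices (1 and 5, Population and Murder) are the variables to correlate, and the remaining indices (2, 3, and 6: Income, Illiteracy, and HS Grad) are the variables controlled for.
> library(ggm)
> pcor(c(1,5,2,3,6), cov(states))
[1] 0.3462724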
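For readers without the ggm package, the same quantity can be sketched in base R. This is an illustrative cross-check under that assumption, not a description of how pcor() is documented; the variable names are introduced here.

```r
states <- state.x77[, 1:6]
S <- cov(states)

# Columns 1 and 5 are Population and Murder; 2, 3, and 6 are the controls
# (Income, Illiteracy, HS Grad).
idx <- c(1, 5, 2, 3, 6)

# Route 1: invert the covariance submatrix and rescale the off-diagonal.
P <- solve(S[idx, idx])
r_inverse <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

# Route 2: correlate the residuals left after regressing each variable
# on the control variables.
controls   <- states[, c(2, 3, 6)]
res_pop    <- resid(lm(states[, 1] ~ controls))
res_murder <- resid(lm(states[, 5] ~ controls))
r_resid <- cor(res_pop, res_murder)

print(c(r_inverse, r_resid))
```

Both routes should agree with each other and with the value reported above.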
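A common null hypothesis is that the variables are uncorrelated (that is, the population correlation is 0). You can use the cor.test() function to test an individual Pearson, Spearman, or Kendall correlation coefficient. A simplified format is:
cor.test(x, y, alternative= , method= )
x and y are the variables whose correlation is tested;
alternative specifies a two-sided or one-sided test ("two.sided", "less", or "greater");
method specifies the type of correlation to compute ("pearson", "kendall", or "spearman").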
> cor.test(states[,3],states[,5])

        Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor
0.7029752
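The call above relies on the defaults (two-sided, Pearson). A one-sided variant might look like the sketch below; testing Life Exp against Murder is a choice made here for illustration.

```r
states <- state.x77[, 1:6]

# One-sided test: is the Life Exp / Murder correlation less than 0?
res <- cor.test(states[, "Life Exp"], states[, "Murder"],
                alternative = "less", method = "pearson")
print(res)

# Individual pieces of the result.
res$estimate   # sample correlation
res$p.value    # one-sided p-value
```

Use alternative = "less" when the research hypothesis is that the population correlation is below 0, and "greater" when it is above 0.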
However, cor.test() can test only one correlation at a time. The corr.test() function in the psych package can do more in a single call.
> library(psych)
> corr.test(states, use="complete")
Call:corr.test(x = states, use = "complete")
Correlation matrix
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

 To see confidence intervals of the correlations, print with the short=FALSE option
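The use= argument can be "pairwise" or "complete" (pairwise or listwise deletion of missing values, respectively).
The method= argument can be "pearson" (the default), "spearman", or "kendall".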
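A two-group independent-samples t test can be used to test the hypothesis that two population means are equal. It assumes that the two groups are independent and that the data are sampled from normal populations. In the format below, y is numeric and x is a dichotomous grouping variable:
t.test(y ~ x, data)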
> library(MASS)
> t.test(Prob ~ So, data=UScrime)

        Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
     0.03851265      0.06371269
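When the observations in the two groups are related, you have a dependent-groups design. The dependent-samples t test assumes that the differences between the groups are normally distributed.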
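A dependent-samples test can be sketched with t.test(y1, y2, paired=TRUE). The example below uses the built-in sleep dataset, in which the same 10 patients were measured under two drugs; the dataset choice and variable names are assumptions for illustration.

```r
# sleep: extra hours of sleep for the same 10 patients under two drugs.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# paired = TRUE tests whether the mean within-patient difference is 0.
res <- t.test(drug1, drug2, paired = TRUE)
print(res)

# The same test expressed directly on the differences.
res_diff <- t.test(drug1 - drug2)
```

The paired test is equivalent to a one-sample t test on the differences, which is why the normality assumption applies to the differences rather than to each group separately.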