Chapter 7: Basic Statistical Analysis

Date: 2019-12-08

Tags: basic statistical analysis

7.1 Descriptive statistics

> vars <- c("mpg", "hp", "wt")
> head(mtcars[vars])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

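Taking the dataset above as an example, users with a base R installation can call summary() to obtain descriptive statistics. For numeric variables it returns the minimum, maximum, quartiles (including the median), and mean; for factors and logical vectors it returns frequency counts.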

> summary(mtcars[vars])
      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424
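summary() does not report the sample size or the standard deviation. As a minimal sketch, sapply() can apply a user-defined function to each column. The helper name mystats and its na.omit argument are hypothetical choices made for this illustration, not part of base R.

```r
# Hypothetical helper: a few descriptive statistics for one numeric vector
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]   # optionally drop missing values first
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}

vars <- c("mpg", "hp", "wt")
# sapply() calls mystats() on each column and binds the results into a matrix
stats <- sapply(mtcars[vars], mystats)
print(round(stats, 3))
```

Each column of the result corresponds to one variable and each row to one statistic, so stats["sd", "mpg"] picks out a single value.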

Descriptive statistics by group

You can use aggregate() to obtain descriptive statistics by group.

> aggregate(mtcars[vars],by=list(am=mtcars$am),mean)
  am      mpg       hp       wt
1  0 17.14737 160.2632 3.768895
2  1 24.39231 126.8462 2.411000
> aggregate(mtcars[vars],by=list(am=mtcars$am),sd)
  am      mpg       hp        wt
1  0 3.833966 53.90820 0.7774001
2  1 6.166504 84.06232 0.6169816

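Note the use of list(). If you write list(mtcars$am), the am column is labeled Group.1 rather than am. Naming the list element gives the column a more helpful label. If you have more than one grouping variable, use a statement of the following form: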

by=list(name1=groupvar1, name2=groupvar2,...)
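As a concrete sketch of the two-variable form, the block below groups mtcars by transmission type (am) and number of cylinders (cyl). Choosing cyl as the second grouping variable is an assumption made for illustration only.

```r
vars <- c("mpg", "hp", "wt")

# Each named list element becomes a labeled grouping column in the result
agg <- aggregate(mtcars[vars],
                 by = list(am = mtcars$am, cyl = mtcars$cyl),
                 FUN = mean)
print(agg)
```

The result has one row per combination of am and cyl that actually occurs in the data.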

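aggregate() is most convenient with functions that return a single value, such as the mean or standard deviation, in each call. To return several statistics at once, you can use by().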
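A minimal sketch of the by() approach follows. The helper name dstats is hypothetical; it returns the mean and standard deviation of every column in the subset of rows it receives.

```r
vars <- c("mpg", "hp", "wt")

# Hypothetical helper: mean and sd for each column of a data frame
dstats <- function(df) {
  sapply(df, function(v) c(mean = mean(v), sd = sd(v)))
}

# by() splits the rows of mtcars[vars] by am and applies dstats() to each piece
res <- by(mtcars[vars], mtcars$am, dstats)
print(res)
```

Each element of res is a small matrix (statistics in rows, variables in columns), one per level of am.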

7.2 Frequency and contingency tables

In the examples below, A, B, and C stand for categorical variables.

> library(vcd)
> library(grid)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

1. Generating frequency tables

1.1 One-way tables

You can use table() to generate simple frequency counts.

> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked 
    42     14     28

Use prop.table() to convert these frequencies to proportions, or prop.table()*100 to convert them to percentages:

> prop.table(mytable)
Improved
     None      Some    Marked 
0.5000000 0.1666667 0.3333333
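As a small sketch of the percentage conversion, the block below rebuilds the one-way table from the counts printed above, so that it runs without the vcd package. Constructing the table by hand with as.table() is an assumption made for portability; with vcd loaded, the mytable object from the transcript works the same way.

```r
# Counts copied from the table() output above
mytable <- as.table(c(None = 42, Some = 14, Marked = 28))

# Proportions times 100, rounded to one decimal place
pct <- round(prop.table(mytable) * 100, 1)
print(pct)
```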

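1.2 Two-way tables

For a two-way table, the format of table() is:

mytable <- table(A, B)

where A is the row variable and B is the column variable.

Alternatively, xtabs() lets you create a contingency table using formula-style input.

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame, and the variables to be cross-classified appear on the right side of the formula: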

> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

Use margin.table() and prop.table() to generate marginal frequencies and proportions, respectively.

> margin.table(mytable,1)
Treatment
Placebo Treated 
     43      41 
> prop.table(mytable,1)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6744186 0.1627907 0.1627907
  Treated 0.3170732 0.1707317 0.5121951

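The index 1 refers to the first variable in the table statement (here Treatment, the row variable). Column sums and column proportions are computed in the same way with index 2: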

> margin.table(mytable,2)
Improved
  None   Some Marked 
    42     14     28 
> prop.table(mytable,2)
         Improved
Treatment      None      Some    Marked
  Placebo 0.6904762 0.5000000 0.2500000
  Treated 0.3095238 0.5000000 0.7500000

Cell proportions (each cell as a share of the grand total) are obtained with:

> prop.table(mytable)
         Improved
Treatment       None       Some     Marked
  Placebo 0.34523810 0.08333333 0.08333333
  Treated 0.15476190 0.08333333 0.25000000

Use addmargins() to add marginal sums to these tables.
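A brief sketch of addmargins() follows. To keep it runnable without vcd, the two-way table is rebuilt by hand from the Treatment by Improved counts printed earlier; that manual construction is an assumption of this example.

```r
# Treatment x Improved counts copied from the xtabs() output above
mytable <- as.table(matrix(
  c(29, 13, 7, 7, 7, 21), nrow = 2,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

# Add a Sum row and a Sum column to the counts
with_sums <- addmargins(mytable)
print(with_sums)

# Add only a Sum column (margin 2) to the row proportions
row_props <- addmargins(prop.table(mytable, 1), 2)
print(round(row_props, 3))
```

Passing a margin as the second argument restricts which sums are added, which is useful for confirming that each row of proportions totals 1.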

1.3 Multiway tables

In addition to the methods described for two-way tables, the ftable() function prints multiway tables in a compact and attractive form.

> mytable <- xtabs(~Treatment+Sex+Improved, data=Arthritis)
> mytable
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

> margin.table(mytable,1)
Treatment
Placebo Treated 
     43      41 
> margin.table(mytable,2)
Sex
Female   Male 
    59     25 
> margin.table(mytable,3)
Improved
  None   Some Marked 
    42     14     28 
> margin.table(mytable,c(1,3))
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable,c(1,2)))
                 Improved       None       Some     Marked
Treatment Sex                                             
Placebo   Female          0.59375000 0.21875000 0.18750000
          Male            0.90909091 0.00000000 0.09090909
Treated   Female          0.22222222 0.18518519 0.59259259
          Male            0.50000000 0.14285714 0.35714286
> ftable(addmargins(prop.table(mytable,c(1,2)),3))
                 Improved       None       Some     Marked        Sum
Treatment Sex                                                        
Placebo   Female          0.59375000 0.21875000 0.18750000 1.00000000
          Male            0.90909091 0.00000000 0.09090909 1.00000000
Treated   Female          0.22222222 0.18518519 0.59259259 1.00000000
          Male            0.50000000 0.14285714 0.35714286 1.00000000

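2. Tests of independence

2.1 Chi-square test of independence

Use chisq.test() to run a chi-square test of independence on the row and column variables of a two-way table.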

> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)

        Pearson's Chi-squared test

data:  mytable
X-squared = 13.055, df = 2, p-value = 0.001463

> mytable <- xtabs(~Treatment+Sex, data=Arthritis)
> chisq.test(mytable)

        Pearson's Chi-squared test with Yates' continuity correction

data:  mytable
X-squared = 0.38378, df = 1, p-value = 0.5356

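Treatment and Improved appear to be related (p < 0.01), whereas there is no evidence of a relationship between Treatment and Sex (p > 0.05).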
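The object returned by chisq.test() can also be inspected programmatically. The sketch below rebuilds the Treatment by Improved table from the counts printed earlier (an assumption made so that it runs without vcd) and extracts the statistic, the p-value, and the expected counts.

```r
# Treatment x Improved counts copied from the xtabs() output above
mytable <- as.table(matrix(
  c(29, 13, 7, 7, 7, 21), nrow = 2,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

result <- chisq.test(mytable)

result$statistic   # chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under independence
```

Checking result$expected is a common sanity step, because the chi-square approximation is usually considered less reliable when expected counts are small.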

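2.2 Fisher's exact test

Use fisher.test() to run Fisher's exact test. Its null hypothesis is that the rows and columns are independent in a contingency table with fixed marginals.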

> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

        Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided

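2.3 Cochran-Mantel-Haenszel test

The mantelhaen.test() function performs this test. Its null hypothesis is that two nominal variables are conditionally independent within each stratum of a third variable.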

> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)

        Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
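mantelhaen.test() treats the last dimension of a three-way table as the stratifying variable, which is why Sex comes last in the formula above. The sketch below makes that layout explicit by rebuilding the table as an array from the counts printed in the multiway-table output; the manual construction is an assumption made so that it runs without vcd.

```r
# Dimensions: Treatment x Improved x Sex (Sex is the stratum, so it goes last)
counts <- array(
  c(19, 6, 7, 5, 6, 16,    # Female: None, Some, Marked (Placebo, Treated within each)
    10, 7, 0, 2, 1, 5),    # Male:   None, Some, Marked (Placebo, Treated within each)
  dim = c(2, 3, 2),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"),
                  Sex       = c("Female", "Male"))
)

cmh <- mantelhaen.test(counts)
print(cmh)

# Summing over Sex should recover the two-way Treatment x Improved table
print(apply(counts, c(1, 2), sum))
```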

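7.3 Correlations

1. Types of correlations

Pearson, Spearman, and Kendall correlations

The Pearson product-moment correlation coefficient measures the degree of linear relationship between two quantitative variables.

Spearman's rank-order correlation coefficient measures the degree of relationship between two rank-ordered variables.

Kendall's tau is also a nonparametric measure of rank correlation.

The cor() function computes all three correlation coefficients, and the cov() function computes covariances.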

cor(x, use= ,method= )

x is a matrix or data frame;

use specifies how missing data are handled;

method specifies the type of correlation coefficient; the default is Pearson.
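The effect of the use argument is easiest to see with a missing value present. In the sketch below, one Income value is set to NA purely for illustration (state.x77 itself has no missing values).

```r
states <- state.x77[, 1:6]

# Hypothetical missing value, introduced only to demonstrate use=
states_na <- states
states_na[1, "Income"] <- NA

cor_default  <- cor(states_na)                                # use = "everything"
cor_complete <- cor(states_na, use = "complete.obs")          # drop rows with any NA
cor_pairwise <- cor(states_na, use = "pairwise.complete.obs") # drop per pair of variables

cor_default["Income", "Murder"]    # NA: the missing value propagates
cor_complete["Income", "Murder"]   # computed on the 49 complete rows
cor_pairwise["Illiteracy", "Murder"]  # unaffected pair still uses all 50 rows
```

Listwise deletion ("complete.obs") keeps every coefficient on the same set of rows, while pairwise deletion keeps as much data as possible for each pair.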

> states <- state.x77[,1:6]
> cov(states)
              Population      Income   Illiteracy     Life Exp      Murder      HS Grad
Population 19931683.7588 571229.7796  292.8679592 -407.8424612 5663.523714 -3551.509551
Income       571229.7796 377573.3061 -163.7020408  280.6631837 -521.894286  3076.768980
Illiteracy      292.8680   -163.7020    0.3715306   -0.4815122    1.581776    -3.235469
Life Exp       -407.8425    280.6632   -0.4815122    1.8020204   -3.869480     6.312685
Murder         5663.5237   -521.8943    1.5817755   -3.8694804   13.627465   -14.549616
HS Grad       -3551.5096   3076.7690   -3.2354694    6.3126849  -14.549616    65.237894
> cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
> cor(states,method="spearman")
           Population     Income Illiteracy   Life Exp     Murder    HS Grad
Population  1.0000000  0.1246098  0.3130496 -0.1040171  0.3457401 -0.3833649
Income      0.1246098  1.0000000 -0.3145948  0.3241050 -0.2174623  0.5104809
Illiteracy  0.3130496 -0.3145948  1.0000000 -0.5553735  0.6723592 -0.6545396
Life Exp   -0.1040171  0.3241050 -0.5553735  1.0000000 -0.7802406  0.5239410
Murder      0.3457401 -0.2174623  0.6723592 -0.7802406  1.0000000 -0.4367330
HS Grad    -0.3833649  0.5104809 -0.6545396  0.5239410 -0.4367330  1.0000000

By default, the result is a square matrix (every variable crossed with every other variable). You can also produce nonsquare correlation matrices.

> x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
> y <- states[,c("Life Exp", "Murder")]
> cor(x, y)
              Life Exp     Murder
Population -0.06805195  0.3436428
Income      0.34025534 -0.2300776
Illiteracy -0.58847793  0.7029752
HS Grad     0.58221620 -0.4879710

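Partial correlations

A partial correlation is the correlation between two quantitative variables, controlling for one or more other quantitative variables. Use the pcor() function in the ggm package to compute partial correlation coefficients. In pcor(u, S), the first two elements of u index the variables to be correlated, the remaining elements index the conditioning variables, and S is the covariance matrix of the variables. Below, the correlation between Population (1) and Murder (5) is computed while controlling for Income (2), Illiteracy (3), and HS Grad (6).

> library(ggm)
> pcor(c(1,5,2,3,6), cov(states))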
[1] 0.3462724
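The same partial correlation can be reproduced in base R, which helps show what the number means. This is a sketch of two standard equivalent formulations, not a description of how any particular package is implemented: correlating the residuals of two regressions, and reading the value off the inverse covariance matrix.

```r
states <- as.data.frame(state.x77[, 1:6])
names(states) <- make.names(names(states))   # "HS Grad" -> "HS.Grad", etc.

# 1) Residual method: remove the controls from both variables, then correlate
res_pop <- resid(lm(Population ~ Income + Illiteracy + HS.Grad, data = states))
res_mur <- resid(lm(Murder     ~ Income + Illiteracy + HS.Grad, data = states))
partial_resid <- cor(res_pop, res_mur)

# 2) Precision-matrix method on the five variables involved
S <- cov(states[, c("Population", "Murder", "Income", "Illiteracy", "HS.Grad")])
P <- solve(S)
partial_prec <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

c(residuals = partial_resid, precision = partial_prec)
```

Both values should agree with the pcor() result shown above.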

2. Testing correlations for significance

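The usual null hypothesis is that the variables are unrelated (that is, the population correlation is 0). You can use cor.test() to test an individual Pearson, Spearman, or Kendall correlation coefficient. A simplified format is:

cor.test(x, y, alternative= , method= )

x and y are the variables to be correlated;

alternative specifies a two-tailed or one-tailed test ("two.sided", "less", or "greater");

method specifies the type of correlation to compute ("pearson", "kendall", or "spearman").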

> cor.test(states[,3],states[,5])

        Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor 
0.7029752
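The pieces of a cor.test() result can be pulled out by name, and the alternative argument switches to a one-sided test. A short sketch, using column names instead of positions 3 and 5 for readability:

```r
states <- state.x77[, 1:6]

# Two-sided test (the default), as in the transcript above
two_sided <- cor.test(states[, "Illiteracy"], states[, "Murder"])

# One-sided test: is the population correlation greater than 0?
one_sided <- cor.test(states[, "Illiteracy"], states[, "Murder"],
                      alternative = "greater")

two_sided$estimate   # sample correlation
two_sided$p.value    # two-sided p-value
two_sided$conf.int   # 95 percent confidence interval
one_sided$p.value    # one-sided p-value
```

Because the observed correlation is positive, the one-sided p-value for "greater" is half the two-sided one.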

However, cor.test() can test only one correlation at a time. The corr.test() function in the psych package can do more in a single call.

> library(psych)
> corr.test(states, use="complete")
Call:corr.test(x = states, use = "complete")
Correlation matrix 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       1.00   0.21       0.11    -0.07   0.34   -0.10
Income           0.21   1.00      -0.44     0.34  -0.23    0.62
Illiteracy       0.11  -0.44       1.00    -0.59   0.70   -0.66
Life Exp        -0.07   0.34      -0.59     1.00  -0.78    0.58
Murder           0.34  -0.23       0.70    -0.78   1.00   -0.49
HS Grad         -0.10   0.62      -0.66     0.58  -0.49    1.00
Sample Size 
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           Population Income Illiteracy Life Exp Murder HS Grad
Population       0.00   0.59       1.00      1.0   0.10       1
Income           0.15   0.00       0.01      0.1   0.54       0
Illiteracy       0.46   0.00       0.00      0.0   0.00       0
Life Exp         0.64   0.02       0.00      0.0   0.00       0
Murder           0.01   0.11       0.00      0.0   0.00       0
HS Grad          0.50   0.00       0.00      0.0   0.00       0

 To see confidence intervals of the correlations, print with the short=FALSE option

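The use= argument can be "pairwise" or "complete" (for pairwise or listwise deletion of missing values, respectively).

The method= argument can be "pearson" (the default), "spearman", or "kendall".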
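To see roughly where the two halves of the probability matrix come from, the sketch below computes every pairwise p-value with cor.test() in base R and then adjusts them with p.adjust(). Holm's method is assumed here to match the adjustment psych applies by default; this is an illustration, not psych's actual implementation.

```r
states <- state.x77[, 1:6]

# All 15 unordered pairs of the 6 variables (one pair per column)
pairs <- combn(colnames(states), 2)

# Unadjusted p-value for each pair
p_raw <- apply(pairs, 2, function(p) {
  cor.test(states[, p[1]], states[, p[2]])$p.value
})

# Adjust for multiple testing (Holm's method, assumed)
p_holm <- p.adjust(p_raw, method = "holm")

result <- data.frame(var1 = pairs[1, ], var2 = pairs[2, ],
                     p_raw = round(p_raw, 2), p_holm = round(p_holm, 2))
print(result)
```

The p_raw column corresponds to the entries below the diagonal in the output above, and p_holm to the adjusted entries above it.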

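7.4 t-tests

1. Independent t-test

A two-group independent t-test can be used to test the hypothesis that the two population means are equal. It assumes that the two groups are independent and that the data are sampled from normal populations.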

t.test(y ~ x, data)

> library(MASS)
> t.test(Prob ~ So, data=UScrime)

        Welch Two Sample t-test

data:  Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1 
     0.03851265      0.06371269
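The output above is titled "Welch Two Sample t-test" because t.test() does not assume equal variances by default. The sketch below contrasts the default with the pooled-variance version via var.equal=TRUE. It uses the built-in mtcars data (mpg by transmission type) instead of UScrime so that it runs without MASS; that substitution is an assumption of this example.

```r
# Default: Welch test, variances not assumed equal
welch <- t.test(mpg ~ am, data = mtcars)

# Classical pooled-variance t-test
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

welch$parameter    # Welch degrees of freedom (fractional)
pooled$parameter   # n1 + n2 - 2 degrees of freedom
c(welch = welch$p.value, pooled = pooled$p.value)
```

The Welch version gives up some degrees of freedom in exchange for not relying on the equal-variance assumption.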