tidyverse | Routine data analysis: grouped summaries (summarise + group_by)

Date: 2020-07-07


This article was first published on 「生信補給站」: https://mp.weixin.qq.com/s/tQt0ezYJj3H7x3aWZmKVEQ

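Earlier posts on simple data processing with tidyverse:

A look at Tidyverse | Filtering rows and selecting columns: select, mastering column operations

A look at Tidyverse | Whatever you need, as long as I have it: filter for selecting rows

Tidyverse | Splitting and combining data columns: one into many, many into one

Tidyverse | XX_join: the various joins between multiple data tables (files)

This post covers summarising variables and grouped summaries.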

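1 Summarising with summarise

The summary function summarize() collapses a data frame into a single row. It is most often used together with group_by().

1.1 Summarising specified variables with `summarize`

Compute the mean, standard deviation, minimum, count, and a logical value: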

library(dplyr)
iris %>%
  summarise(mean(Petal.Length),                          # unnamed
            sd_pet_len = sd(Petal.Length, na.rm = TRUE), # named
            min_pet_len = min(Petal.Length),
            n = n(),
            any(Sepal.Length > 5))

# mean(Petal.Length) sd_pet_len min_pet_len   n any(Sepal.Length > 5)
#1             3.758   1.765298           1 150                 TRUE

Commonly used functions:

Center (measures of location): mean(), median()
Spread (measures of dispersion): sd(), IQR(), mad()
Range (measures of rank): min(), max(), quantile()
Position (measures of position): first(), last(), nth()
Count (counting): n(), n_distinct()
Logical (summaries of logical values): any(), all()
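
As a rough sketch of how the function families listed above fit into a single summarise() call, the following uses one function from each family on the built-in iris data. The column names on the left (med_pet_len and so on) and the object name common_summary are illustrative choices, not part of the original post.

```r
library(dplyr)

# One example per family of summary functions
common_summary <- iris %>%
  summarise(
    med_pet_len   = median(Petal.Length),         # center
    iqr_pet_len   = IQR(Petal.Length),            # spread
    q90_pet_len   = quantile(Petal.Length, 0.9),  # rank
    second_len    = nth(Petal.Length, 2),         # position
    n_species     = n_distinct(Species),          # count
    all_sepal_gt4 = all(Sepal.Length > 4)         # logical
  )

common_summary
```

Each expression returns a single value, so the result is still a one-row data frame.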

1.2 Summarising one type of variable with `summarise_if`

iris %>%
 summarise_if(is.numeric, ~ mean(., na.rm = TRUE))

# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     5.843333   3.057333       3.758   1.199333
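
In dplyr 1.0.0 and later, the scoped verbs such as summarise_if are superseded by across(). A minimal sketch of the same summary in that style, assuming dplyr >= 1.0.0 is installed (the object name numeric_means is illustrative):

```r
library(dplyr)

# across() + where() selects every numeric column, like summarise_if(is.numeric, ...)
numeric_means <- iris %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```

The result should have the same four columns as the summarise_if call above.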

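1.3 Summarising selected variables with `summarise_at`

Combined with vars(), summarise_at gives more flexible selection of the columns that match a condition, which are then summarised.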

iris %>%
 summarise_at(vars(ends_with("Length"),Petal.Width),
 list(~mean(.), ~median(.)))

# Sepal.Length_mean Petal.Length_mean Petal.Width_mean Sepal.Length_median Petal.Length_median
#1         5.843333             3.758         1.199333                 5.8               4.35
# Petal.Width_median
#1               1.3
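
The same selection can also be written with across(), assuming dplyr >= 1.0.0. In this sketch the functions are passed as a named list, so the output column names follow the default pattern {column}_{function}. The columns come out ordered by variable rather than by function, and at_summary is an illustrative name.

```r
library(dplyr)

# Select columns with tidyselect helpers, then apply several named functions
at_summary <- iris %>%
  summarise(across(
    c(ends_with("Length"), Petal.Width),
    list(mean = mean, median = median)
  ))

names(at_summary)
```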

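2 Summarising in combination with `group_by`

Together, group_by() and summarize() form one of the most commonly used operations in dplyr: the grouped summary.

2.1 Group by Species and summarise variables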

iris %>%
  group_by(Species) %>%
  summarise(avg_pet_len = mean(Petal.Length),
            sd_pet_len = sd(Petal.Length),
            min_pet_len = min(Petal.Length),
            first_pet_len = first(Petal.Length),
            n_pet_len = n())

# A tibble: 3 x 6
# Species   avg_pet_len sd_pet_len min_pet_len first_pet_len n_pet_len
# <fct>           <dbl>     <dbl>       <dbl>         <dbl>     <int>
#1 setosa           1.46     0.174         1             1.4       50
#2 versicolor       4.26     0.470         3             4.7       50
#3 virginica         5.55     0.552         4.5           6         50

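2.2 Counting

n(): takes no arguments and returns the size of the current group;
sum(!is.na(x)): returns the number of non-missing values;
n_distinct(x): returns the number of unique values.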

iris %>%
  group_by(Species) %>%
  summarise(n_pet_len = n(),
            noNA_n_pet_len = sum(!is.na(Petal.Length)),
            Petal.Length_uniq_n = n_distinct(Petal.Length))
# A tibble: 3 x 4
# Species   n_pet_len noNA_n_pet_len Petal.Length_uniq_n
# <fct>         <int>         <int>               <int>
#1 setosa           50             50                   9
#2 versicolor       50             50                 19
#3 virginica         50             50                 20
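
Because iris has no missing values, n() and sum(!is.na(x)) agree in the output above. The following sketch makes the difference visible by blanking three values in a copy of the data. The copy iris_na and the choice of rows 1 to 3 are assumptions made purely for illustration.

```r
library(dplyr)

# Work on a copy so the built-in data set is left untouched
iris_na <- iris
iris_na$Petal.Length[1:3] <- NA   # rows 1-3 belong to setosa

na_counts <- iris_na %>%
  group_by(Species) %>%
  summarise(
    n_rows        = n(),                                  # counts every row
    n_non_missing = sum(!is.na(Petal.Length)),            # skips the NAs
    n_unique      = n_distinct(Petal.Length, na.rm = TRUE) # unique non-missing values
  )

na_counts
```

Here setosa keeps 50 rows but only 47 non-missing values. Without na.rm = TRUE, n_distinct() would count NA as one more distinct value.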

In addition, dplyr's count() function can also be used for counting:

iris %>%
 count(Species)

# A tibble: 3 x 2
# Species       n
# <fct>     <int>
#1 setosa       50
#2 versicolor   50
#3 virginica     50
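
A small check, offered as a sketch, that count(Species) gives the same counts as the longer group_by() plus summarise(n = n()) form. The object names by_count and by_summarise are illustrative.

```r
library(dplyr)

# Short form
by_count <- iris %>%
  count(Species)

# Equivalent long form
by_summarise <- iris %>%
  group_by(Species) %>%
  summarise(n = n())

# Compare the count columns of the two results
identical(by_count$n, by_summarise$n)
```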

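2.3 Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0.

This makes sum() and mean() very useful with logical values: sum(x) gives the number of TRUE values in x, and mean(x) gives their proportion.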

iris %>%
  group_by(Species) %>%
  summarise(n_pet_len = n(),
            noNA_n_pet_len = sum(!is.na(Petal.Length)),
            Petal.Length_uniq_n = n_distinct(Petal.Length),
            Petal.Length_uniq_n2 = sum(n_distinct(Petal.Length) >= 20))

# A tibble: 3 x 5
# Species   n_pet_len noNA_n_pet_len Petal.Length_uniq_n Petal.Length_uniq_n2
# <fct>         <int>         <int>               <int>               <int>
#1 setosa           50             50                   9                   0
#2 versicolor       50             50                 19                   0
#3 virginica         50             50                 20                   1
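
To illustrate the sum()/mean() idiom on a row-level condition, here is a minimal sketch that counts, per species, how many flowers have a petal longer than 4.5 and what share of the group that is. The 4.5 threshold and the names long_petals, n_long and prop_long are arbitrary choices for the example.

```r
library(dplyr)

long_petals <- iris %>%
  group_by(Species) %>%
  summarise(
    n_long    = sum(Petal.Length > 4.5),   # number of TRUE values
    prop_long = mean(Petal.Length > 4.5)   # proportion of TRUE values
  )

long_petals
```

Since each species has 50 rows, prop_long is simply n_long divided by 50.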