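The dplyr package is mainly used for data cleaning and tidying. Coursera course link: Getting and Cleaning Data
You can also load the swirl package and install its Getting and Cleaning Data course to follow along,
as follows: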
- library(swirl)
- install_from_swirl("Getting and Cleaning Data")
- swirl()
This post mainly follows the vignette that ships with the package: Introduction to dplyr
1. Sample data
- > library(nycflights13)
- > dim(flights)
- [1] 336776 16
- > head(flights, 3)
- Source: local data frame [3 x 16]
-
- year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
- 1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227
- 2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227
- 3 2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160
- Variables not shown: distance (dbl), hour (dbl), minute (dbl)
2. Convert a long data frame into a print-friendly tbl_df
- > flights_df <- tbl_df(flights)
- > flights_df
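In recent dplyr releases tbl_df() is deprecated in favour of as_tibble() from the tibble package. A minimal sketch of the same conversion, using the built-in mtcars data frame as a stand-in for flights (an assumption, so that the example runs without nycflights13):

```r
library(dplyr)

# mtcars ships with R; it stands in for flights here
cars_tbl <- tibble::as_tibble(mtcars)

class(cars_tbl)  # the class vector includes "tbl_df"
cars_tbl         # prints only the first rows and the columns that fit the screen
```

With the real data, the equivalent call would be as_tibble(flights).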
3. Filter rows with filter()
- > filter(flights_df, month == 1, day == 1)
- Source: local data frame [842 x 16]
-
- year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
- 1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227
- 2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227
This keeps the rows where month == 1 and day == 1.
The base R equivalent:
- flights_df[flights_df$month == 1 & flights_df$day == 1, ]
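The two forms agree on flights because month and day have no missing values. They can differ when the condition evaluates to NA: filter() drops such rows, while base subsetting returns a row of NAs for each. A small sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data with one missing value in x
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

kept_dplyr <- filter(df, x > 1)  # the NA comparison is treated as "do not keep"
kept_base  <- df[df$x > 1, ]     # the NA comparison yields an all-NA row

nrow(kept_dplyr)
nrow(kept_base)
```

To get the filter() behaviour in base R, wrap the condition in which(), for example df[which(df$x > 1), ].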
4. Select rows by position with slice()
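A minimal sketch of slice(), which picks rows by their position rather than by a condition. The built-in mtcars data frame is used as a stand-in so the example runs on its own; with the flight data the call would look like slice(flights_df, 1:10).

```r
library(dplyr)

first_three <- slice(mtcars, 1:3)          # rows 1 to 3
picked      <- slice(mtcars, c(1, 5, 10))  # rows 1, 5 and 10

first_three
picked
```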
5. Sort rows with arrange()
- > arrange(flights_df, year, month, day)
This sorts flights_df by year, month and day in ascending order.
Use desc() to sort a column in descending order:
- > arrange(flights_df, year, desc(month), day)
The base R equivalents:
- flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
- flights_df[order(desc(flights_df$arr_delay)), ]
6. Select columns with select()
Select the columns you want by name:
- select(flights_df, year, month, day)
This picks three columns.
Use the : operator to select a range of adjacent columns:
- select(flights_df, year:day)
Use - to drop the columns you do not want:
- select(flights_df, -(year:day))
7. Add new columns with mutate()
mutate() creates new columns from existing ones:
- > mutate(flights_df,
- + gain = arr_delay - dep_delay,
- + speed = distance / air_time * 60)
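One detail worth showing: within a single mutate() call, a later expression can use a column created earlier in the same call. A sketch on hypothetical toy data (the three rows reuse the delay and air_time values printed in the sample output; gain_per_hour is an illustrative column name):

```r
library(dplyr)

# Toy rows: delays and air_time in minutes
toy <- data.frame(arr_delay = c(11, 20, 33),
                  dep_delay = c(2, 4, 2),
                  air_time  = c(227, 227, 160))

toy2 <- mutate(toy,
               gain = arr_delay - dep_delay,
               gain_per_hour = gain / (air_time / 60))  # uses gain, defined one line above

toy2
```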
8. Summarise with summarise()
- > summarise(flights,
- + delay = mean(dep_delay, na.rm = TRUE))
This computes the mean of dep_delay, ignoring missing values.
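The na.rm = TRUE argument matters because a single missing value makes mean() return NA. A small sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical departure delays with one missing value
toy <- data.frame(dep_delay = c(2, 4, NA, 10))

with_na    <- summarise(toy, delay = mean(dep_delay))                # NA propagates
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))  # NA is skipped

with_na
without_na
```

In both cases summarise() collapses the data to a single row.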
9. Random sampling
- sample_n(flights_df, 10)
This randomly selects 10 rows.
- sample_frac(flights_df, 0.01)
This randomly selects 1% of the rows.
10. Grouping with group_by()
- by_tailnum <- group_by(flights, tailnum)
- # group the flights by tailnum and store the result as by_tailnum
- delay <- summarise(by_tailnum,
- count = n(),
- dist = mean(distance, na.rm = TRUE),
- delay = mean(arr_delay, na.rm = TRUE))
- # for each tailnum group: the number of flights, and the mean distance and mean arr_delay
- delay <- filter(delay, count > 20, dist < 2000)
- library(ggplot2)  # needed for ggplot()
- ggplot(delay, aes(dist, delay)) +
- geom_point(aes(size = count), alpha = 1/2) +
- geom_smooth() +
- scale_size_area()
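To see what the grouped summary produces without loading the full data set, here is a sketch on hypothetical toy data with two planes (the tail numbers and values are made up for illustration):

```r
library(dplyr)

# Hypothetical flights for two planes
toy <- data.frame(tailnum   = c("N1", "N1", "N2", "N2", "N2"),
                  distance  = c(100, 300, 500, 500, 800),
                  arr_delay = c(10, NA, 0, 5, 10))

by_plane <- group_by(toy, tailnum)

# One output row per group
per_plane <- summarise(by_plane,
                       count = n(),                              # rows in the group
                       dist  = mean(distance, na.rm = TRUE),
                       delay = mean(arr_delay, na.rm = TRUE))

per_plane
```

summarise() returns one row per tailnum, and n() counts the rows in each group, including rows whose arr_delay is missing.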
When the operations are written step by step, every intermediate result has to be stored in a variable:
- a1 <- group_by(flights, year, month, day)
- a2 <- select(a1, arr_delay, dep_delay)
- a3 <- summarise(a2,
- arr = mean(arr_delay, na.rm = TRUE),
- dep = mean(dep_delay, na.rm = TRUE))
- a4 <- filter(a3, arr > 30 | dep > 30)
11. Chaining with the pipe operator %>%
Start with the name of the data, then apply the operations to it one after another:
- flights %>%
- group_by(year, month, day) %>%
- select(arr_delay, dep_delay) %>%
- summarise(
- arr = mean(arr_delay, na.rm = TRUE),
- dep = mean(dep_delay, na.rm = TRUE)
- ) %>%
- filter(arr > 30 | dep > 30)
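The data argument is omitted in every step, because the pipe passes the result of the previous step along as the first argument.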
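A sketch on hypothetical toy data, checking that the step-by-step form and the piped form give the same result:

```r
library(dplyr)

# Hypothetical delays for two days
toy <- data.frame(day       = c(1, 1, 2, 2),
                  arr_delay = c(40, 50, 5, 15),
                  dep_delay = c(10, 20, 0, 10))

# Step-by-step version with intermediate variables
s1 <- group_by(toy, day)
s2 <- summarise(s1,
                arr = mean(arr_delay, na.rm = TRUE),
                dep = mean(dep_delay, na.rm = TRUE))
stepwise <- filter(s2, arr > 30 | dep > 30)

# Piped version: each step receives the previous result as its first argument
piped <- toy %>%
  group_by(day) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)

all.equal(as.data.frame(stepwise), as.data.frame(piped))
```

Only day 1 survives the filter, since its mean arrival delay is above 30 minutes.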
To learn more about the package, see the reference manual that comes with it (about 60 pages): dplyr