The Internet climate in 2019 ("9102", as the meme goes) is not great, which makes it all the more important to drill the fundamentals. In data science, the fundamental skill is data processing, and the DataFrame sits at its core. So, in the year 9102, how does one use the DataFrame elegantly in R? Whether you work elegantly or not is a major source of productivity differences among data miners; this post surveys the main DataFrame developments of the past three years.
The progress below is organized by function: tabular data (tidyverse and sparklyr), spatial data (sf and stars), and graph data (tidygraph and GraphFrames).
Any discussion of data-processing fundamentals has to start with the DataFrame's core capability: manipulating data even more elegantly than SQL. tidyverse, the core of this fourth-generation data-processing language, is essential knowledge.
tidyverse unifies the core computations of data science, bundling its eight highest-frequency modules: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
For the individual tidyverse sub-modules, see R語言與DataFrame 2016版 (R and the DataFrame, 2016 edition), which covers each module in detail; rather than repeating that here, this post dwells on the newer trends. Among the eight modules, dplyr is the core of the core, because it separates the data-processing front end from the back end: the same R front end drives many structured-database back ends, such as RPostGIS, RMySQL, and sparklyr.
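To make the front-end/back-end separation concrete before bringing in Spark, here is a minimal sketch against an in-memory SQLite back end (the table name `iris_tbl` is made up for illustration); the same verbs are translated to SQL for whichever back end is attached:

```r
library(dplyr)

# Attach a throwaway SQLite back end and upload iris
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, iris, "iris_tbl")

# Ordinary dplyr verbs; show_query() reveals the SQL sent to the back end
tbl(con, "iris_tbl") %>%
  filter(Sepal.Length > 5) %>%
  count(Species) %>%
  show_query()

DBI::dbDisconnect(con)
```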
Here is an example of processing the iris data with sparklyr, to get a feel for modern data processing:
```r
library(sparklyr)
library(tidyverse)

# Connect to the distributed OLAP engine (Spark / Hive)
sc <- sparklyr::spark_connect(master = "localhost",
                              spark_home = "/data/FinanceR/Spark",
                              version = "2.2.0",
                              config = sparklyr::spark_config())

# Or connect to the distributed OLTP database TiDB
# sc <- RMySQL::dbConnect(RMySQL::MySQL(),
#                         user = "FinanceR", password = "FinanceR",
#                         dbname = "FinanceR", host = "192.168.1.100")

# Or the memory-indexed MPP database Impala
# impala <- implyr::src_impala(
#   drv = odbc::odbc(),
#   driver = "Cloudera ODBC Driver for Impala",
#   host = "host",
#   port = 21050,
#   database = "default",
#   uid = "username",
#   pwd = "password"
# )
```
```r
remote_df <- copy_to(sc, iris, "iris_tbl")         # upload a local table
remote_df <- dplyr::tbl(sc, from = "db.iris_tbl")  # or point at an existing source table
# or define the source from raw SQL:
# remote_df <- dplyr::tbl(sc, from = dplyr::sql("select * from db.iris_tbl limit 10"))

# Program closer to natural language than SQL: basic filter, aggregate, sort
remote_df %>%
  dplyr::mutate(a = Sepal_Length + 2) %>%  # Hive UDFs are supported inside mutate
  dplyr::filter(a > 2) %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(count = n()) %>%
  dplyr::select(cnt = count) %>%
  dplyr::arrange(desc(cnt)) %>%
  dplyr::mutate_if(is.double, as.character) ->  # powerful batch operations via mutate_if
  pipeline
```
## 一頓操做猛如虎以後,將數據操做的對應SQL執行過程打印出來 pipeline %>% dbplyr::sql_render() # <SQL> SELECT CAST(`cnt` AS STRING) AS `cnt` FROM (SELECT * FROM (SELECT `count` AS `cnt` FROM (SELECT `Species`, count(*) AS `count` FROM (SELECT * FROM (SELECT `Sepal_Length`, `Sepal_Width`, `Petal_Length`, `Petal_Width`, `Species`, `Sepal_Length` + 2.0 AS `a` FROM `iris_tbl`) `nmpfznbuzf` WHERE (`a` > 2.0)) `ygjaktrzzu` GROUP BY `Species`) `lbmpsjksox`) `tdrwlnidxw` ORDER BY `cnt` DESC) `mzuwuocxay`
```r
## Cache the pipeline result as the temporary table tmp_tbl
tmp_tbl <- pipeline %>% compute("tmp_tbl")

# Filter the temporary table and write it straight back to Hive,
# skipping the CREATE TABLE boilerplate
tmp_tbl %>%
  dplyr::filter(cnt > 10) %>%
  sparklyr::spark_write_table("db.financer_res")
```
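When a result is small enough, it can be pulled back into local R memory, and the connection closed when the session ends — a short sketch using standard sparklyr/dplyr verbs:

```r
# Bring a capped slice of the remote result into local R memory
local_df <- pipeline %>% head(100) %>% dplyr::collect()

# Shut down the Spark connection when finished
sparklyr::spark_disconnect(sc)
```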
As this example shows, working through sparklyr rather than querying Hive directly makes the analysis workflow markedly more efficient and convenient.
Thanks to the efforts of Edzer Pebesma and others, the spatial-computing community has gradually built up an elegant workflow centered on sf.
Its core components are introduced below. In spatial data processing, data usually falls into two broad categories — vector data and raster data — and we will walk through the tooling by spatial data type.
sf implements Simple Feature Access, a GIS model jointly specified by OGC and ISO for storing and accessing general two-dimensional geometries (points, lines, polygons, multipoints, multilines, and so on), aimed at standardizing how features are accessed; "sf" is short for Simple Features.
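As a quick sketch of what these building blocks look like (coordinates made up), sf geometries are constructed from plain coordinate matrices and collected into a geometry column:

```r
library(sf)

# Elementary simple-feature geometries
p  <- st_point(c(116.4, 39.9))
l  <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))

# A geometry column (sfc) with a CRS attached
st_sfc(p, l, pg, crs = 4326)
```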
The sf spatial data structure is the heart of the spatial tidy workflow. Taking a trip network as the example, here is the conversion from a plain data.frame to an sf spatial data frame:
```r
library(tidyverse)
library(stplanr)

data(cents, flow)

travel_network <- stplanr::od2line(flow = flow, zones = cents) %>%
  sf::st_as_sf() %>%
  sf::st_transform(4326)  # ensure WGS84 (EPSG:4326)
```
```r
travel_network %>% select(Area.of.residence)
```
```
Simple feature collection with 49 features and 1 field
geometry type:  LINESTRING
dimension:      XY
bbox:           xmin: -1.550806 ymin: 53.8041 xmax: -1.511861 ymax: 53.82887
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
First 10 features:
       Area.of.residence                       geometry
920573         E02002361 LINESTRING (-1.516734 53.82...
920575         E02002361 LINESTRING (-1.516734 53.82...
920578         E02002361 LINESTRING (-1.516734 53.82...
920582         E02002361 LINESTRING (-1.516734 53.82...
920587         E02002361 LINESTRING (-1.516734 53.82...
920591         E02002361 LINESTRING (-1.516734 53.82...
920601         E02002361 LINESTRING (-1.516734 53.82...
921220         E02002363 LINESTRING (-1.535617 53.82...
921222         E02002363 LINESTRING (-1.535617 53.82...
921225         E02002363 LINESTRING (-1.535617 53.82...
```
From the travel_network object we can see that sf marries the spatial information (CRS, geometry type, dimension, bbox, projection) with the DataFrame, and it composes seamlessly with dplyr operations, making spatial data processing simple and elegant.
Taking the distribution of bicycle trip volumes on the cycling network as an example, the spatial information is dropped with the sf::st_set_geometry(NULL) function:
```r
travel_network %>%
  dplyr::group_by(Bicycle) %>%
  dplyr::summarise(cnt = n()) %>%
  sf::st_set_geometry(NULL)
```
```
# A tibble: 9 x 2
  Bicycle   cnt
*   <int> <int>
1       0    24
2       1    11
3       2     8
4       3     1
5       4     1
6       5     1
7       6     1
8      10     1
9      12     1
```
A travel_network object can be visualized quickly and directly with leaflet; there is no need to pass longitude/latitude arguments:
```r
travel_network %>%
  leaflet::leaflet() %>%
  leaflet::addTiles() %>%
  leaflet::addPolylines()
```
Under the hood, sf is a wrapper around several C++ libraries bound via Rcpp — GDAL for reading and writing data, GEOS for geometrical operations, and PROJ for coordinate reference systems — aiming to provide one unified workflow interface for spatial data analysis. These libraries are the core building blocks of the sf architecture.
sf 的核心數據操做主要是基於 sp 包實現的,sp 包的api因爲歷史緣由實現的功能很是豐富可是命名並不規則,sf 在它的基礎上對接了 Simple Feature Access 標準。下面是 sp 操做 和 sf 操做的對比
功能 | sp 操做 | sf 操做 |
---|---|---|
讀取數據 | read.asciigrid |
st_read |
數據寫入 | write.asciigrid |
st_write |
生成空間網格 | as.SpatialPolygons.GridTopology |
st_make_grid |
投影係獲取/設置 | proj4string(x) <- p4s |
st_crs(x) <- p4s / st_set_crs(x, p4s) |
投影系轉化 | spTransform(x, CRSobj) |
st_transform(x, crs) |
座標系一致性判斷 | identicalCRS |
== method for crs objects |
是否經緯度座標系 | is.projected(x) |
!st_is_longlat(x) |
功能 | sp 操做 | sf 操做 |
---|---|---|
sfc 對象轉化 | Spatial* |
st_sfc |
數據格式轉化 | disaggregate |
st_cast |
空間數據框 | Spatial*DataFrame |
st_sf |
獲取/編輯座標系 | coordinates(x) |
st_coordinates(x) |
將 data.frame 轉爲地理對象 | coordinates(x)=~x+y |
st_as_sf(x, coords = c("x","y")) |
合併 | merge |
merge |
多邊形 | Polygon |
st_polygon , st_multipolygon |
邊界框 | bbox(x) |
st_bbox(x) / matrix(st_bbox(x), 2) |
捨去幾何信息 | x@data |
st_set_geometry(x, NULL) / st_drop_geometry(x) |
添加屬性爲幾何信息 | addAttrToGeom |
st_sf |
設置/獲取幾何要素 | geometry<- |
st_geometry<- |
按column合併數據框 | cbind.Spatial |
cbind |
按row合併數據框 | rbind.Spatial* |
rbind |
功能 | sp 操做 | sf 操做 |
---|---|---|
空間線段長度 | SpatialLinesLengths |
st_length |
交集運算 | over(x,y,returnList=TRUE) |
st_intersects(x,y) |
空間距離計算 | spDists(x,y) ,spDists ,spDistsN1 |
st_distance(x,y) |
空間插值 | aggregate(x,by,mean,areaWeighted = TRUE) |
st_interpolate_aw(x,by,mean) |
去重 | remove.duplicates |
st_union on multipoint |
採樣 | spsample |
st_sample , st_line_sample (partial support) |
lattice可視化 | plot(x) ,spplot |
plot(st_geometry(x)) |
The Dimensionally Extended nine-Intersection Model (DE-9IM) addresses topological relationships between sets in two-dimensional space. Given two independent polygons A and B, let I(A) and I(B) denote the interiors of A and B, B(A) and B(B) their boundaries, and E(A) and E(B) their exteriors; DE-9IM records, in a 3×3 matrix, the dimension of the intersection between each of these three parts of A and of B.
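sf exposes the raw DE-9IM pattern through st_relate; a small illustration with two made-up overlapping squares:

```r
library(sf)

# Two overlapping squares
a <- st_sfc(st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))))
b <- st_sfc(st_polygon(list(rbind(c(1, 1), c(3, 1), c(3, 3), c(1, 3), c(1, 1)))))

# 3x3 pattern over interiors/boundaries/exteriors of a vs. b:
# "212101212" is the classic pattern for two overlapping polygons
st_relate(a, b)
```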
On top of DE-9IM, sf supports the full set of 2D binary topological predicates: st_equals, st_disjoint, st_intersects, st_touches, st_crosses, st_within, st_contains, and st_overlaps. Beyond the predicates, sf also ships geometric operations; the main ones are:
功能 | sf 操做 |
---|---|
線段合併 | st_line_merge |
線段切分 | st_segmentize |
生成 voronoi 圖 | st_voronoi |
幾何中心點 | st_centroid |
凸包運算 | st_convex_hull |
三角形運算 | st_triangulate |
將LineString轉爲多邊形 | st_polygonize |
經過道格拉斯算法簡化圖形邊緣毛刺 | st_simplify |
切割多邊形 | st_split |
計算緩衝區 | st_buffer |
幾何有效性驗證 | st_make_valid |
幾何邊界提取 | st_boundary |
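A minimal sketch of a few of these operations on a made-up rectangle:

```r
library(sf)

pg <- st_sfc(st_polygon(list(rbind(c(0, 0), c(4, 0), c(4, 3), c(0, 3), c(0, 0)))))

st_centroid(pg)          # geometric center of the polygon
st_buffer(pg, dist = 1)  # polygon grown by a buffer of width 1
st_boundary(pg)          # boundary extracted as a LINESTRING
```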
更多空間計算操做 詳見 SF CheatSheet.
GeoSpark solves the spatial-indexing problem for large-scale vector data, which in turn yields scale-out counterparts of, for example, the DE-9IM predicates, removing the single-machine performance bottleneck that PostGIS/sf hit on large vector datasets.
Take the point-in-polygon join below, built on the ST_Contains predicate, as an example.
Connect to Spark:
```r
pak::pkg_install("harryprince/geospark")

library(sparklyr)
library(dplyr)
library(geospark)

conf <- spark_config()
conf$`sparklyr.instances.num` <- 50
conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
conf$spark.kryo.registrator <- "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator"

Sys.setenv(SPARK_HOME = "~/spark")
sc <- spark_connect(master = "yarn-client", config = conf)
```
Register the spatial operators:
```r
register_gis(sc)
```
Mock up some local data:
```r
polygons <- read.table(text = "california area|POLYGON ((-126.4746 32.99024, -126.4746 42.55308, -115.4004 42.55308, -115.4004 32.99024, -126.4746 32.99024))
new york area|POLYGON ((-80.50781 36.24427, -80.50781 41.96766, -70.75195 41.96766, -70.75195 36.24427, -80.50781 36.24427))
texas area |POLYGON ((-106.5234 25.40358, -106.5234 36.66842, -91.14258 36.66842, -91.14258 25.40358, -106.5234 25.40358))
dakota area|POLYGON ((-106.084 44.21371, -106.084 49.66763, -95.71289 49.66763, -95.71289 44.21371, -106.084 44.21371))
", sep = "|", col.names = c("area", "geom"))

points <- read.table(text = "New York|NY|POINT (-73.97759 40.74618)
New York|NY|POINT (-73.97231 40.75216)
New York|NY|POINT (-73.99337 40.7551)
West Nyack|NY|POINT (-74.06083 41.16094)
West Point|NY|POINT (-73.9788 41.37611)
West Point|NY|POINT (-74.3547 41.38782)
Westtown|NY|POINT (-74.54593 41.33403)
Floral Park|NY|POINT (-73.70475 40.7232)
Floral Park|NY|POINT (-73.60177 40.75476)
Elmira|NY|POINT (-76.79217 42.09192)
Elmira|NY|POINT (-76.75089 42.14728)
Elmira|NY|POINT (-76.84497 42.12927)
Elmira|NY|POINT (-76.80393 42.07202)
Elmira|NY|POINT (-76.83686 42.08782)
Elmira|NY|POINT (-76.75089 42.14728)
Alcester|SD|POINT (-96.63848 42.97422)
Aurora|SD|POINT (-96.67784 44.28706)
Baltic|SD|POINT (-96.74702 43.72627)
Beresford|SD|POINT (-96.79091 43.06999)
Brandon|SD|POINT (-96.58362 43.59001)
Minot|ND|POINT (-101.2744 48.22642)
Abercrombie|ND|POINT (-96.73165 46.44846)
Absaraka|ND|POINT (-97.21459 46.85969)
Amenia|ND|POINT (-97.25029 47.02829)
Argusville|ND|POINT (-96.95043 47.0571)
Arthur|ND|POINT (-97.2147 47.10167)
Ayr|ND|POINT (-97.45571 47.02031)
Barney|ND|POINT (-96.99819 46.30418)
Blanchard|ND|POINT (-97.25077 47.3312)
Buffalo|ND|POINT (-97.54484 46.92017)
Austin|TX|POINT (-97.77126 30.32637)
Austin|TX|POINT (-97.77126 30.32637)
Addison|TX|POINT (-96.83751 32.96129)
Allen|TX|POINT (-96.62447 33.09285)
Carrollton|TX|POINT (-96.89163 32.96037)
Carrollton|TX|POINT (-96.89773 33.00542)
Carrollton|TX|POINT (-97.11628 33.20743)
Celina|TX|POINT (-96.76129 33.32793)
Carrollton|TX|POINT (-96.89328 33.03056)
Carrollton|TX|POINT (-96.77763 32.76727)
Los Angeles|CA|POINT (-118.2488 33.97291)
Los Angeles|CA|POINT (-118.2485 33.94832)
Los Angeles|CA|POINT (-118.276 33.96271)
Los Angeles|CA|POINT (-118.3076 34.07711)
Los Angeles|CA|POINT (-118.3085 34.05891)
Los Angeles|CA|POINT (-118.2943 34.04835)
Los Angeles|CA|POINT (-118.2829 34.02645)
Los Angeles|CA|POINT (-118.3371 34.00975)
Los Angeles|CA|POINT (-118.2987 33.78659)
Los Angeles|CA|POINT (-118.3148 34.06271)", sep = "|", col.names = c("city", "state", "geom"))
```
Quick visualization of the data frames via sf and mapview:
```r
M1 <- polygons %>% sf::st_as_sf(wkt = "geom") %>% mapview::mapview()
M2 <- points %>% sf::st_as_sf(wkt = "geom") %>% mapview::mapview()
M1 + M2
```
Copy the data to the Spark cluster:
```r
polygons_tbl <- copy_to(sc, polygons)
points_tbl   <- copy_to(sc, points)
```
Run the spatial join with Spark GIS SQL:
```r
ex2 <- copy_to(sc,
               tbl(sc, sql("
  SELECT area, state, count(*) cnt FROM
    (SELECT area, ST_GeomFromWKT(polygons.geom, '4326') AS y FROM polygons) polygons,
    (SELECT ST_GeomFromWKT(points.geom, '4326') AS x, state, city FROM points) points
  WHERE ST_Contains(polygons.y, points.x)
  GROUP BY area, state")),
               "test2")

Res <- collect(ex2)
```
Join the computed counts back and display with leaflet:
```r
Idx_df <- polygons %>%
  left_join(Res, by = c("area" = "area")) %>%
  sf::st_as_sf(wkt = "geom")

Idx_df %>%
  leaflet::leaflet() %>%
  leaflet::addTiles() %>%
  leaflet::addPolygons(popup = ~as.character(cnt),
                       color = ~colormap::colormap_pal()(cnt))
```
stars builds on sf, raster, and the tidyverse, and aims to streamline the processing of multidimensional arrays. Taking traffic-flow statistics on a kilometer grid as an example, we realize a multidimensional data structure of the form space × space × mode × time × time × metrics:
```r
library(stars)
## Loading required package: abind
## Loading required package: sf
## Linking to GEOS 3.5.0, GDAL 2.2.2, PROJ 4.8.0
library(dplyr)

nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
```
```r
to <- from <- st_geometry(nc)     # 100 polygons: origin (O) and destination (D) regions
mode <- c("car", "bike", "foot")  # travel mode
day <- 1:100                      # arbitrary days

library(units)
units(day) <- make_unit("days since 2015-01-01")
hour <- set_units(0:23, h)        # hour of day

dims <- st_dimensions(origin = from, destination = to,
                      mode = mode, day = day, hour = hour)
```
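The dimensions object only describes the cube; to obtain an actual stars array we still need values. A minimal sketch, following the upstream stars OD example, filling the cube with fabricated Poisson counts:

```r
# Fill the origin x destination x mode x day x hour cube with fake counts
n <- dim(dims)
a <- array(rpois(prod(n), 10), dim = n)  # fabricated trip counts

# Wrap values + dimensions into a stars object
(OD <- st_as_stars(list(N = a), dimensions = dims))
```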
Graph data is a rather special data structure: a graph G(V, E) consists of vertices (V) and edges (E).
tidygraph, written by Thomas Lin Pedersen, provides a clean interface for manipulating graph/network data and ends the chaos of igraph and networkx. It lets a graph switch back and forth between a node-table/edge-table view and a graph view. On one hand it integrates mainstream graph models such as PageRank and Louvain; on the other it supports mainstream graph formats, including network, phylo, dendrogram, data.tree, and graph — an indispensable tool for knowledge-graph and risk-control work.
tidygraph 中核心操做見以下:
Using the travel_network from earlier as an example, we convert the spatial structure into a graph and compute node centrality with PageRank:
```r
library(tidygraph)
library(ggraph)

# Convert the sf linestrings into a tidygraph-compatible network
# (sfnetworks package; as_sfnetwork() is its current entry point)
travel_graph <- sfnetworks::as_sfnetwork(travel_network) %>%
  tidygraph::activate(nodes) %>%
  mutate(degree = tidygraph::centrality_pagerank())

travel_graph
```
We can see that travel_graph is generated as a Node DataFrame plus an Edge DataFrame, recorded separately. Under activate(nodes), node attributes are displayed first.
Visualize the result with ggraph:
```r
ggraph(travel_graph, layout = 'kk') +
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = degree))
```
The supported algorithms are listed below.
Centrality measures:
centrality_alpha()
centrality_authority()
centrality_betweenness()
centrality_power()
centrality_closeness()
centrality_eigen()
centrality_hub()
centrality_pagerank()
centrality_subgraph()
centrality_degree()
centrality_edge_betweenness()
Local measures:
local_size(order = 1, mindist = 0)
local_members(order = 1, mindist = 0)
local_triangles()
local_ave_degree()
local_transitivity()
Graph (community detection) models:
group_components()
group_edge_betweenness()
group_fast_greedy()
group_infomap()
group_label_prop()
group_leading_eigen()
group_louvain()
group_optimal()
group_spinglass()
group_walktrap()
group_biconnected_component()
On large, complex networks, GraphFrames is a good alternative to tidygraph. Again taking PageRank as the example:
Connect to Spark:
```r
library(graphframes)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3.0")
```
Load the example graph:
```r
g <- gf_friends(sc)
```
Compute PageRank:
```r
results <- gf_pagerank(g, tol = 0.01, reset_probability = 0.15)
results
```
The GraphFrame result:
```
## GraphFrame
## Vertices:
##   $ id       <chr> "f", "b", "g", "a", "d", "c", "e"
##   $ name     <chr> "Fanny", "Bob", "Gabby", "Alice", "David", "Charlie",...
##   $ age      <int> 36, 36, 60, 34, 29, 30, 32
##   $ pagerank <dbl> 0.3283607, 2.6555078, 0.1799821, 0.4491063, 0.3283607...
## Edges:
##   $ src          <chr> "b", "c", "d", "e", "a", "a", "e", "f"
##   $ dst          <chr> "c", "b", "a", "f", "e", "b", "d", "c"
##   $ relationship <chr> "follow", "follow", "friend", "follow", "friend",...
##   $ weight       <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 1.0
```
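To pull the scores back into R as an ordinary tibble, the vertex table can be collected — a minimal sketch:

```r
# Extract the vertex table (including the pagerank column) and sort locally
results %>%
  graphframes::gf_vertices() %>%
  dplyr::collect() %>%
  dplyr::arrange(dplyr::desc(pagerank))
```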
The past three years have seen data science change at a dizzying pace. After the ebb and flow of capital, big data and AI must return to their essence: drilling the fundamentals of data science — the efficient cleaning and understanding of datasets — is the key to winning at data science.