GraphX Study Notes: Programming Guide

Date: 2019-12-04

The material studied here is the Programming Guide on the official site:

https://spark.apache.org/docs/latest/graphx-programming-guide.html

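First, a brief introduction to GraphX.

GraphX is the Spark component dedicated to graphs and graph-parallel computation.

GraphX extends the Spark RDD by introducing a graph abstraction: a directed multigraph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of operators such as subgraph, joinVertices, and aggregateMessages, as well as the Pregel API. It also ships a collection of graph algorithms and builders that simplify graph analytics tasks.

Building vertices (Vertex) and edges (Edge)

1. If a vertex needs to be defined as a class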

package graphx

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

/**
  * Created by common on 18-1-22.
  */

// Abstract base vertex property
class VertexProperty()
// User vertex
case class UserProperty(name: String) extends VertexProperty
// Product vertex
case class ProductProperty(name: String, price: Double) extends VertexProperty

object GraphxLearning {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("GraphX").setMaster("local")
    val sc = new SparkContext(conf)

    // The graph might then have the type:
    var graph: Graph[VertexProperty, String] = null

  }
}

Like vertices, edges can also be defined as a class. The type parameters of Graph must match the vertex and edge types you defined.

abstract class Graph[VD, ED] {    // VD is the vertex attribute type, ED is the edge attribute type
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}

2. If the vertex type is simple, for example just a String or a (String, String), there is no need to define a class

package graphx

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

/**
  * Created by common on 18-1-22.
  */
object GraphxLearning {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("GraphX").setMaster("local")
    val sc = new SparkContext(conf)

    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
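    // Define a default user in case a relationship references a missing user
    val defaultUser = ("John Doe", "Missing")

    // Build a Graph from the two RDDs. Its type parameters are the vertex attribute type and
    // the edge attribute type. A vertex that appears only in an edge (no entry in `users`)
    // gets defaultUser as its attribute.
    val srcGraph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)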

  }
}
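
To make the role of the default vertex attribute concrete, here is a minimal sketch, assuming Spark with GraphX on the classpath and a local master. The object name `DefaultVertexSketch`, the helper `missingVertexAttr`, and the extra edge to vertex 0 are illustrative and not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object DefaultVertexSketch {
  // Returns the attribute of vertex 0, which has no entry in the vertex RDD.
  def missingVertexAttr(sc: SparkContext): (String, String) = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (5L, ("franklin", "prof"))))
    // The second edge points at vertex 0, which is absent from `users`.
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(5L, 3L, "advisor"), Edge(5L, 0L, "colleague")))
    val defaultUser = ("John Doe", "Missing")
    val graph = Graph(users, relationships, defaultUser)
    // Vertex 0 exists in the graph and carries the default attribute.
    graph.vertices.filter { case (id, _) => id == 0L }.first()._2
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DefaultVertexSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(missingVertexAttr(sc)) finally sc.stop()
  }
}
```

The vertex that only shows up as an edge endpoint is still part of the graph; it simply carries `defaultUser` instead of real data.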

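Some graph operators

Graph information
`numEdges: Long`	Number of edges in the graph
`numVertices: Long`	Number of vertices in the graph
`inDegrees: VertexRDD[Int]`	In-degree of each vertex; vertices with no incoming edges do not appear in the result
`outDegrees: VertexRDD[Int]`	Out-degree of each vertex; as with inDegrees, vertices with no outgoing edges do not appear in the result
`degrees: VertexRDD[Int]`	Sum of in-degree and out-degree of each vertex; isolated vertices (with no edges at all) do not appear in the result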
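
The note about vertices missing from the degree results is easy to check on the four-user graph from these notes. The following is a minimal sketch, assuming Spark with GraphX on the classpath; `GraphInfoSketch` and `buildGraph` are hypothetical names introduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object GraphInfoSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphInfoSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      println(s"edges = ${graph.numEdges}, vertices = ${graph.numVertices}")
      // Vertex 2 has no incoming edge, so inDegrees has no entry for it.
      println("inDegrees:  " + graph.inDegrees.collect().sortBy(_._1).mkString(", "))
      // Vertex 7 has no outgoing edge, so outDegrees has no entry for it.
      println("outDegrees: " + graph.outDegrees.collect().sortBy(_._1).mkString(", "))
      println("degrees:    " + graph.degrees.collect().sortBy(_._1).mkString(", "))
    } finally sc.stop()
  }
}
```
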
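Views of the graph as collections
`vertices: VertexRDD[VD]`	The vertices, as a `VertexRDD`
`edges: EdgeRDD[ED]`	The edges, as an `EdgeRDD`
`triplets: RDD[EdgeTriplet[VD, ED]]`	The triplets (each edge together with the attributes of its source and destination vertices), as an RDD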
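
The triplet view is the least obvious of the three, so here is a minimal sketch of reading it, assuming Spark with GraphX on the classpath. `TripletViewSketch`, `buildGraph`, and `describe` are hypothetical names; the sentence format is only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object TripletViewSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // Each triplet exposes srcAttr, attr (the edge attribute) and dstAttr together.
  def describe(graph: Graph[(String, String), String]): Array[String] =
    graph.triplets
      .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TripletViewSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try describe(buildGraph(sc)).foreach(println) finally sc.stop()
  }
}
```
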
Caching graphs
`persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]`
`cache(): Graph[VD, ED]`
`unpersistVertices(blocking: Boolean = true): Graph[VD, ED]`
Partitioning operators
`partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]`
Operators that transform vertex and edge attributes to produce a new Graph
`mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]`
`mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]`
`mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]`
`mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]`
`mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2]) : Graph[VD, ED2]`
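
As a rough illustration of the attribute-transforming operators, the sketch below changes the vertex type with `mapVertices` and rewrites edge attributes with `mapEdges`, leaving the graph structure untouched. It assumes Spark with GraphX on the classpath; `MapOperatorsSketch`, `buildGraph`, `namesOnly`, and `upperCaseEdges` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object MapOperatorsSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // mapVertices: keep only the user name, so VD changes from (String, String) to String.
  def namesOnly(graph: Graph[(String, String), String]): Graph[String, String] =
    graph.mapVertices((id, attr) => attr._1)

  // mapEdges: rewrite each edge attribute; source and destination IDs stay the same.
  def upperCaseEdges(graph: Graph[String, String]): Graph[String, String] =
    graph.mapEdges(e => e.attr.toUpperCase)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapOperatorsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val mapped = upperCaseEdges(namesOnly(buildGraph(sc)))
      mapped.triplets.collect().foreach(t => println(s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}"))
    } finally sc.stop()
  }
}
```
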
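Operators that modify the graph structure
`reverse: Graph[VD, ED]`	Reverses the direction of every edge
`subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true), vpred: (VertexId, VD) => Boolean = ((v, d) => true)): Graph[VD, ED]`	Returns the subgraph of vertices and edges that satisfy the predicates
`mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]`
`groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]`	GraphX allows multiple edges between the same pair of vertices; this operator merges such parallel edges into one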
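
The structural operators can be sketched as follows, assuming Spark with GraphX on the classpath. All object and helper names are hypothetical, and the small integer-weighted graph in `mergeParallelEdges` is made up for illustration. The sketch calls `partitionBy` before `groupEdges` so that parallel edges end up in the same partition, which `groupEdges` relies on.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object StructuralSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // subgraph: keep non-professors; edges touching a removed vertex are dropped too.
  def withoutProfs(graph: Graph[(String, String), String]): Graph[(String, String), String] =
    graph.subgraph(vpred = (id, attr) => attr._2 != "prof")

  // groupEdges: sum the weights of parallel edges after co-locating them with partitionBy.
  def mergeParallelEdges(sc: SparkContext): Array[Edge[Int]] = {
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(1L, 2L, 2), Edge(2L, 3L, 5)))
    Graph.fromEdges(edges, 0)
      .partitionBy(PartitionStrategy.EdgePartition2D)
      .groupEdges(_ + _)
      .edges.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StructuralSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      println("reversed: " + graph.reverse.edges.collect().mkString(", "))
      println("no profs: " + withoutProfs(graph).triplets.collect().mkString(", "))
      println("merged:   " + mergeParallelEdges(sc).mkString(", "))
    } finally sc.stop()
  }
}
```
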
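Join operators
`joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]`	Produces new vertex attributes from update data. The graph's vertices are inner-joined with the input RDD, input entries whose vertex ID is not in the graph are dropped, and the user-defined function is applied to each match; vertices with no entry in the input keep their old attribute
`outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]`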
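
The difference between the two joins may be easier to see in code. This is a minimal sketch, assuming Spark with GraphX on the classpath; `JoinSketch`, `buildGraph`, `withOutDegree`, and `updateRoles` are hypothetical names, and the update data (including the unknown vertex ID 99) is invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object JoinSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // outerJoinVertices: every vertex is kept; the joined value is None when the
  // other RDD has no entry (here: vertices without outgoing edges). VD changes to Int.
  def withOutDegree(graph: Graph[(String, String), String]): Graph[Int, String] =
    graph.outerJoinVertices(graph.outDegrees) { (id, attr, degOpt) => degOpt.getOrElse(0) }

  // joinVertices: only matched vertices are rewritten, the rest keep their old
  // attribute, and the vertex type must stay the same.
  def updateRoles(graph: Graph[(String, String), String],
                  updates: RDD[(VertexId, String)]): Graph[(String, String), String] =
    graph.joinVertices(updates) { (id, attr, newRole) => (attr._1, newRole) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      withOutDegree(graph).vertices.collect().sortBy(_._1).foreach(println)
      // ID 99 is not in the graph, so that update is silently ignored.
      val updates = sc.parallelize(Seq((3L, "postdoc"), (99L, "ghost")))
      updateRoles(graph, updates).vertices.collect().sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```
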
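Aggregation operators
`collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]`	Collects the IDs of each vertex's neighbors; edgeDirection controls the direction in which neighbors are collected
`collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]`	Collects the (ID, attribute) pairs of each vertex's neighbors; this can take a lot of memory when vertices have high degree. edgeDirection controls the direction in which neighbors are collected
`aggregateMessages[Msg: ClassTag](sendMsg: EdgeContext[VD, ED, Msg] => Unit, mergeMsg: (Msg, Msg) => Msg, tripletFields: TripletFields = TripletFields.All): VertexRDD[Msg]`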
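
A minimal sketch of `aggregateMessages`, assuming Spark with GraphX on the classpath: each edge sends the message 1 to its destination and the messages are summed, which recomputes the in-degree. `AggregateSketch`, `buildGraph`, and `inDegreeViaMessages` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object AggregateSketch {
  // Rebuilds the four-user example graph used earlier in these notes.
  def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    Graph(users, relationships, ("John Doe", "Missing"))
  }

  // sendMsg runs once per edge; mergeMsg combines the messages arriving at a vertex.
  // Vertices that receive no message are absent from the result.
  def inDegreeViaMessages(graph: Graph[(String, String), String]): VertexRDD[Int] =
    graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AggregateSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try inDegreeViaMessages(buildGraph(sc)).collect().sortBy(_._1).foreach(println)
    finally sc.stop()
  }
}
```

Because the messages here do not read any vertex or edge attribute, passing `TripletFields.None` as the third argument would also work; the default is kept to stay close to the signature listed above.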
Operators for iterative graph-parallel computation
`pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)( vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId,A)], mergeMsg: (A, A) => A) : Graph[VD, ED]`
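
As one possible use of `pregel`, the sketch below computes single-source shortest paths, following the pattern shown in the official guide. It assumes Spark with GraphX on the classpath; `PregelSketch`, `weightedGraph`, `shortestDistances`, and the small weighted graph are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object PregelSketch {
  // A tiny weighted graph: 1 -> 2 (1.0), 2 -> 3 (2.0), 1 -> 3 (5.0), 3 -> 4 (1.0).
  def weightedGraph(sc: SparkContext): Graph[Double, Double] = {
    val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
    Graph.fromEdges(edges, 0.0)
  }

  def shortestDistances(graph: Graph[Double, Double], sourceId: VertexId): Graph[Double, Double] = {
    // The source starts at distance 0, every other vertex at infinity.
    val initial = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    initial.pregel(Double.PositiveInfinity)(
      // vprog: keep the smaller of the current distance and the incoming one.
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: offer a shorter distance to the destination when one exists.
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      // mergeMsg: of several offers, keep the smallest.
      (a, b) => math.min(a, b))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PregelSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try shortestDistances(weightedGraph(sc), 1L).vertices.collect().sortBy(_._1).foreach(println)
    finally sc.stop()
  }
}
```

The iteration stops on its own once no vertex receives a message, so `maxIterations` is left at its default here.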
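Basic graph algorithms
`pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]`
`connectedComponents(): Graph[VertexId, ED]`	Connected components: vertices that are connected when edge direction is ignored get the same VertexId as their attribute
`triangleCount(): Graph[Int, ED]`
`stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]`	Strongly connected components: vertices that can reach each other along directed edges get the same VertexId as their attribute
`LabelPropagation`	Label propagation algorithm. Termination condition: every node's label must be the most frequent label among its neighbors (or one of the most frequent, in case of a tie). This means that, for each node, the number of neighbors in its own community is at least the number of neighbors in any other community
`ShortestPaths`	Shortest-path algorithm
`SVDPlusPlus`	SVD++ algorithm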
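
To show how the built-in algorithms are called, here is a minimal sketch, assuming Spark with GraphX on the classpath. `AlgorithmsSketch`, its helpers, and the five-vertex graph with two components are hypothetical; the PageRank tolerance of 0.0001 is an arbitrary choice.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object AlgorithmsSketch {
  // Two separate components: {1, 2, 3} and {4, 5}.
  def twoComponentGraph(sc: SparkContext): Graph[Int, Int] = {
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 5L, 1)))
    Graph.fromEdges(edges, 0)
  }

  // Each vertex is labelled with the smallest vertex ID in its component.
  def componentLabels(graph: Graph[Int, Int]): Map[VertexId, VertexId] =
    graph.connectedComponents().vertices.collect().toMap

  // Runs PageRank until the ranks change by less than the tolerance.
  def ranks(graph: Graph[Int, Int]): Map[VertexId, Double] =
    graph.pageRank(0.0001).vertices.collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AlgorithmsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = twoComponentGraph(sc)
      println("components: " + componentLabels(graph).toSeq.sortBy(_._1).mkString(", "))
      println("ranks:      " + ranks(graph).toSeq.sortBy(_._1).mkString(", "))
    } finally sc.stop()
  }
}
```
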
