Spark Data Mining: GraphX in Depth (1)

Date: 2019-11-15


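1 Network datasets

When a graph is used to describe the interactions between the components of a system, it can represent any system. Graph theory provides a common language and a set of tools for representing and analyzing complex systems. Put simply, a graph consists of a set of vertices and edges, where each edge connects two vertices and expresses some relationship between them. The datasets used to build the example graphs in this article are listed below: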

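Graph name	Dataset URL	Download file	Description
Email communication network	https://snap.stanford.edu/data/email-Enron.html	email-Enron.txt.gz	Email exchanges among 158 Enron employees, forming a directed email communication network
Food flavor network	http://yongyeol.com/2011/12/15/paper-flavor-network.html	ingr_comp.zip	Ingredients of the foods collected from three food websites and the chemical compounds found in each ingredient, forming a network
Personal social network	http://snap.stanford.edu/data/egonets-Gplus.html	gplus.tar.gz	User circles that form personal social networks; the dataset also includes personal attribute information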

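2 Ways to build a graph in GraphX

GraphX offers four ways to build a property graph. Each has its own requirements on the format of the input data. They are covered one by one below.

2.1 Using the factory method of object Graph

object Graph is the companion object of class Graph. It defines the apply method that creates a Graph instance, declared as follows: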

def apply[VD: ClassTag, ED: ClassTag](
  vertices: RDD[(VertexId, VD)],
  edges: RDD[Edge[ED]],
  defaultVertexAttr: VD = null.asInstanceOf[VD]
  // storage-level parameters omitted
  ): Graph[VD, ED]

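This method builds a graph from the vertices, RDD[(VertexId, VD)], and the edges, RDD[Edge[ED]]. Note that defaultVertexAttr sets the attribute of any vertex that is referenced by an edge but missing from the supplied vertex collection, so its type must match the attribute type of the supplied vertices.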
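The role of defaultVertexAttr is easiest to see on a tiny graph. The following is a minimal sketch, assuming Spark and GraphX are on the classpath; the object name GraphApplySketch, the vertex names, and the "unknown" default are illustrative choices, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical object name, used only for this sketch.
object GraphApplySketch {
  def build(sc: SparkContext): Graph[String, Int] = {
    // Two vertices with explicit attributes.
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
    // The second edge points at vertex 3, which is absent from `vertices`.
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20)))
    // Vertex 3 is created with the default attribute "unknown".
    Graph(vertices, edges, "unknown")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphApplySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    build(sc).vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

Because the default has to be a String here, passing a value of another type would not compile, which is the type constraint described above.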

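2.2 Using edgeListFile

A very common situation is that the dataset only describes relationships between vertices, that is, it contains only edges. For this case GraphX provides GraphLoader.edgeListFile, which builds the graph automatically. It is declared as follows: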

def edgeListFile(
  sc: SparkContext,
  path: String,
  canonicalOrientation: Boolean = false,
  numEdgePartitions: Int = -1)
  : Graph[Int, Int]

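sc and path need little explanation. The points to watch are:

path points to a file or directory containing the edges. Each line must hold two integer vertex IDs separated by whitespace, for example: srcId dstId. Lines starting with # are skipped.
canonicalOrientation controls edge orientation. If true, every edge is stored with the smaller ID as the source (an edge with srcId > dstId is reversed, not dropped); otherwise edges are loaded as written.
After all edges are loaded, the vertices are derived from them. Every vertex and every edge gets the default attribute 1.
numEdgePartitions is the number of edge partitions. By default it follows the partitioning of the input file, but it can be set explicitly.

The key part of the source code is: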

val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
  val builder = new EdgePartitionBuilder[Int, Int]
  iter.foreach { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split("\\s+")
      if (lineArray.length < 2) {
        throw new IllegalArgumentException("Invalid line: " + line)
      }
      val srcId = lineArray(0).toLong
      val dstId = lineArray(1).toLong
      if (canonicalOrientation && srcId > dstId) {
        builder.add(dstId, srcId, 1)
      } else {
        builder.add(srcId, dstId, 1)
      }
    }
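  }
  // Emit one (partition id, edge partition) pair per input partition
  Iterator((pid, builder.toEdgePartition))
}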
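The per-line logic can be tried without Spark. The sketch below mirrors the parsing and orientation rule of the excerpt in plain Scala; the object name EdgeLineSketch and the helper parseEdge are hypothetical names for this illustration and are not part of GraphX.

```scala
// Hypothetical helper that mirrors the per-line logic of edgeListFile.
object EdgeLineSketch {
  // Returns None for blank lines and '#' comments, otherwise the (src, dst) pair.
  def parseEdge(line: String, canonicalOrientation: Boolean): Option[(Long, Long)] = {
    if (line.isEmpty || line(0) == '#') {
      None
    } else {
      val fields = line.split("\\s+")
      if (fields.length < 2) {
        throw new IllegalArgumentException("Invalid line: " + line)
      }
      val srcId = fields(0).toLong
      val dstId = fields(1).toLong
      // With canonical orientation the pair is swapped so that srcId <= dstId;
      // no edge is dropped.
      if (canonicalOrientation && srcId > dstId) Some((dstId, srcId))
      else Some((srcId, dstId))
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("# FromNodeId\tToNodeId", "0\t1", "1\t0", "1\t2")
    // prints List((0,1), (0,1), (1,2))
    println(lines.flatMap(parseEdge(_, canonicalOrientation = true)))
  }
}
```

Note that with canonicalOrientation = true the lines "0 1" and "1 0" both yield the edge (0, 1), so a file that stores each connection in both directions produces duplicate edges.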

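2.3 Using fromEdges

edgeListFile can be thought of as doing the same thing internally. The graph is built from the edges alone, RDD[Edge[ED]]: the vertex set consists of every vertex that appears in some edge, and the caller supplies the default vertex attribute. It is declared as follows: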

def fromEdges[VD: ClassTag, ED: ClassTag](
    edges: RDD[Edge[ED]],
    defaultValue: VD): Graph[VD, ED]
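As a minimal sketch of fromEdges, assuming an existing SparkContext is passed in: the object name FromEdgesSketch, the edge weights, and the "n/a" default attribute are illustrative assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical object name, used only for this sketch.
object FromEdgesSketch {
  // Builds a graph from weighted edges only; vertices 1, 2 and 3 are derived
  // from the edges and each receives the default attribute "n/a".
  def build(sc: SparkContext): Graph[String, Double] = {
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0.5),
      Edge(2L, 3L, 1.5)
    ))
    Graph.fromEdges(edges, "n/a")
  }
}
```

Calling FromEdgesSketch.build(sc).vertices.collect() should list three vertices even though no vertex RDD was ever supplied.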

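2.4 Using fromEdgeTuples

This works like fromEdges, except that the edges are given as bare ID pairs, RDD[(VertexId, VertexId)], with no edge attributes at all. Every edge gets the attribute 1; if uniqueEdges is set, duplicate edges are merged and the attribute becomes the number of duplicates. The vertex set again consists of every vertex that appears in some edge, and the caller supplies the default vertex attribute. It is declared as follows: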

def fromEdgeTuples[VD](
  rawEdges: RDD[(VertexId, VertexId)],
  defaultValue: VD,
  uniqueEdges: Option[PartitionStrategy] = None)
  : Graph[VD, Int]
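The effect of uniqueEdges shows up when the input contains a repeated pair. This is a minimal sketch under the assumption that an existing SparkContext is passed in; the object name FromEdgeTuplesSketch and the sample pairs are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Hypothetical object name, used only for this sketch.
object FromEdgeTuplesSketch {
  def build(sc: SparkContext): Graph[Int, Int] = {
    // The pair (1, 2) appears twice on purpose.
    val rawEdges = sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
    // With uniqueEdges set, the graph is repartitioned and duplicate edges are
    // merged; the merged edge attribute is the number of duplicates.
    Graph.fromEdgeTuples(rawEdges, defaultValue = 0,
      uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
  }
}
```

With uniqueEdges left at None, the two (1, 2) pairs would instead stay as two separate edges, each with attribute 1.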

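The last three methods never specify vertices explicitly; they derive them from the edges, which suits graphs without vertex attributes. The first two methods are the most commonly used. You can also write your own file reader on top of the third method, for example when the file has more than two columns, such as attribute columns. This is straightforward.

3 Building graphs in GraphX: hands-on examples

3.1 Building a bidirectional graph

Start with the first dataset in the table above. After decompression the file is named Email-Enron.txt. Its first lines are: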

# Directed graph (each unordered pair of nodes is saved once): Email-Enron.txt
# Enron email network (edge indicated that email was exchanged, undirected edges)
# Nodes: 36692 Edges: 367662
# FromNodeId	ToNodeId
0	1
1	0
1	2
1	3

This dataset is a natural fit for the edgeListFile method described above. The code is:

val emailGraph = GraphLoader.edgeListFile(sc, projectDir + "Email-Enron.txt")

Look at the first 5 vertices and edges of the graph:

emailGraph.vertices.take(5).foreach(println)
(19021,1)
(28730,1)
(23776,1)
(34207,1)
(31037,1)
emailGraph.edges.take(5).foreach(println)
Edge(0,1,1)
Edge(1,0,1)
Edge(1,2,1)
Edge(1,3,1)
Edge(1,4,1)

Check whether the graph is bidirectional (any two connected vertices must point at each other). Here only the vertex with ID 19021 is checked:

emailGraph.edges.filter(_.srcId == 19021).map(_.dstId).collect().foreach(println)
696
4232
6811
8315
26007
emailGraph.edges.filter(_.dstId == 19021).map(_.srcId).collect().foreach(println)
696
4232
6811
8315
26007
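The check above covers a single vertex. The same property can be stated for a whole edge set: every (src, dst) pair must also appear as (dst, src). The following plain-Scala sketch shows that definition on a small in-memory set; the object name BidirectionalSketch and the sample pairs are illustrative assumptions.

```scala
// Hypothetical object name, used only for this sketch.
object BidirectionalSketch {
  // True when every (src, dst) pair also appears reversed as (dst, src).
  def isBidirectional(edges: Set[(Long, Long)]): Boolean =
    edges.forall { case (src, dst) => edges.contains((dst, src)) }

  def main(args: Array[String]): Unit = {
    val symmetric = Set((0L, 1L), (1L, 0L), (1L, 2L), (2L, 1L))
    val oneWay = symmetric + ((1L, 3L)) // (3, 1) is missing
    println(isBidirectional(symmetric)) // prints true
    println(isBidirectional(oneWay))    // prints false
  }
}
```

On the full graph, one possible way to apply the same idea would be to map emailGraph.edges to (srcId, dstId) pairs and check that subtracting the swapped pairs leaves nothing behind; that variant is not run here.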

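3.2 Building a bipartite graph

What is a bipartite graph? Put simply, its vertices are split into two sets such that no edge joins two vertices in the same set; the two endpoints of every edge lie in different sets. See Wikipedia for details.
The second dataset, the relationship between food ingredients and compounds, is a bipartite graph. Decompress the download and look at the first lines of each raw file:

File 1: ingr_info.tsv. As the name suggests, it is a tab-separated file describing the food ingredients.

The three columns are: ingredient ID	ingredient name	category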
# id	ingredient name	category
0	magnolia_tripetala	flower
1	calyptranthes_parriculata	plant
2	chamaecyparis_pisifera_oil	plant derivative
3	mackerel	fish/seafood
4	mimusops_elengi_flower	flower
5	hyssop	herb
6	buchu	plant
7	black_pepper	spice
8	eryngium_poterium_oil	plant derivative
9	peanut_butter	plant derivative

File 2: comp_info.tsv. This holds basic information about the compounds.

The three columns are: compound ID	compound name	CAS number
# id	Compound name	CAS number
0	jasmone	488-10-8
1	5-methylhexanoic_acid	628-46-6
2	l-glutamine	56-85-9
3	1-methyl-3-methoxy-4-isopropylbenzene	1076-56-8
4	methyl-3-phenylpropionate	103-25-3
5	3-mercapto-2-methylpentan-1-ol_(racemic)	227456-27-1
6	ethyl-3-hydroxybutyrate	5405-41-4
7	cyclohexyl_butyrate	1551-44-6
8	methyl_dihydrojasmonate	24851-98-7
9	methyl_2-methylthiobutyrate	42075-45-6

File 3: ingr_comp.tsv. This records which ingredient contains which compound.

# ingredient id	compound id
1392	906
1259	861
1079	673
22	906
103	906
1005	906
1005	278
1005	171

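With the data in hand, blindly building the graph straight from the third file with one of the edge-only methods above would be a serious mistake. The IDs in the first column and the IDs in the second column refer to different kinds of things, yet their values overlap. A simple fix is to replace each second-column value with (maximum first-column ID + 1) plus the value itself, which guarantees that the two ID ranges do not overlap. See the code below: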

package clebeg.spark.graph

import org.apache.spark.graphx.{EdgeTriplet, VertexId, Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


// The classes below give ingredients and compounds a common representation. Note that the parent class must be serializable.
class FoodNode(val name: String) extends Serializable
case class Ingredient(override val name: String, val cat: String) extends FoodNode(name)
case class Compound(override val name: String, val cas: String) extends FoodNode(name)
/**
  * Created by clebegxie on 2015/11/25.
  */
object Graph1Food {
  val projectDir = "your_data_dir/"
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkInAction").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val ingredients: RDD[(VertexId, FoodNode)] = sc.textFile(projectDir + "ingr_info.tsv").filter {
      !_.startsWith("#")
    }.map {
      line =>
        val array = line.split("\t")
        (array(0).toLong, Ingredient(array(1), array(2)))
    }
    // Largest ingredient ID plus 1, used as the offset for compound IDs
    val maxIngrId = ingredients.keys.max() + 1
    val compounds: RDD[(VertexId, FoodNode)] = sc.textFile(projectDir + "comp_info.tsv").filter {
      !_.startsWith("#")
    }.map {
      line =>
        val array = line.split("\t")
        (maxIngrId + array(0).toLong, Compound(array(1), array(2)))
    }
    // Build the edges from ingr_comp.tsv; every second-column vertex ID is shifted by maxIngrId
    val links = sc.textFile(projectDir + "ingr_comp.tsv").filter {
      !_.startsWith("#")
    }.map {
      line =>
        val array = line.split("\t")
        Edge(array(0).toLong, maxIngrId + array(1).toLong, 1)
    }
    // Merge the two vertex sets
    val vertices = ingredients ++ compounds
    val foodNetWork = Graph(vertices, links)
    //foodNetWork.vertices.take(10).foreach(println)
    // Print the first 5 triplets of the network
    foodNetWork.triplets.take(5).foreach(showTriplet _ andThen println _)
  }

  def showTriplet(t: EdgeTriplet[FoodNode, Int]): String =
    "The ingredient " ++ t.srcAttr.name ++ " contains " ++ t.dstAttr.name
}

Output:

The ingredient calyptranthes_parriculata contains citral_(neral)
The ingredient chamaecyparis_pisifera_oil contains undecanoic_acid
The ingredient hyssop contains myrtenyl_acetate
The ingredient hyssop contains 4-(2,6,6-trimethyl-cyclohexa-1,3-dienyl)but-2-en-4-one
The ingredient buchu contains menthol
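The ID shift used in the program can be isolated in a few lines of plain Scala. This is a minimal sketch; the object name IdOffsetSketch, the helper offsetCompoundId, and the small ID lists are illustrative values chosen so that the two raw ID sets overlap, not the full dataset.

```scala
// Hypothetical object name, used only for this sketch.
object IdOffsetSketch {
  // Shifts a compound ID past the largest ingredient ID so the two ranges are disjoint.
  def offsetCompoundId(compoundId: Long, maxIngredientId: Long): Long =
    maxIngredientId + 1 + compoundId

  def main(args: Array[String]): Unit = {
    val ingredientIds = Seq(0L, 22L, 906L, 1392L) // illustrative subset
    val compoundIds = Seq(0L, 171L, 906L)         // 0 and 906 collide with ingredient IDs
    val maxIngr = ingredientIds.max
    val shifted = compoundIds.map(offsetCompoundId(_, maxIngr))
    println(shifted) // prints List(1393, 1564, 2299)
    println(ingredientIds.toSet.intersect(shifted.toSet).isEmpty) // prints true
  }
}
```

Since every compound ID is at least 0, every shifted ID is strictly greater than the largest ingredient ID, which is why the two vertex sets can be merged safely.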

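3.3 Building a graph weighted by similarity between people

This example uses the Google+ personal relationship data introduced above. After decompression there are 792 files. Each file name without its extension is a network ID, and each network ID has 6 files, so there are 132 personal networks. The network with ID 100129275726588145876 is used below to explain what the files contain:

.edges records the edges, that is, which user IDs are connected. Sample data: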

116374117927631468606 101765416973555767821
112188647432305746617 107727150903234299458
116719211656774388392 100432456209427807893
117421021456205115327 101096322838605097368
116407635616074189669 113556266482860931616
105706178492556563330 111169963967137030210
107527001343993112621 110877363259509543172
105513412023818293063 115710735637044108808
108736646334864181044 112393248315358692010
108683283643126638695 107111579950257773726

.feat records the features of each user ID. Every dimension takes the value 0 or 1. Sample data:

# Note: the following is a single row of data
114985346359714431656 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

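.featnames records what each dimension of feat means. (Every dimension is 0 or 1 because the features are categorical variables that have been 1-of-n encoded.) Sample data:

// The gender entries below show the 1-of-n encoding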
0 gender:1
1 gender:2
2 gender:3
3 institution:
4 institution:AMC Theatres
5 institution:AOL
6 institution:AT&T
7 institution:Aardvark
8 institution:Accenture
9 institution:Adobe Systems
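Because every feature is a 0/1 flag, the dot product of two users' feature vectors counts the features they share. The plain-Scala sketch below illustrates 1-of-n encoding and that count; the object name OneHotSketch, the helpers, and the three-institution category list are illustrative assumptions.

```scala
// Hypothetical object name, used only for this sketch.
object OneHotSketch {
  // Encodes a categorical value as a 0/1 vector with a single 1 at its category index.
  def oneHot(value: String, categories: Seq[String]): Vector[Int] =
    categories.map(c => if (c == value) 1 else 0).toVector

  // For 0/1 vectors the dot product counts the positions where both vectors hold 1.
  def commonFeatures(a: Vector[Int], b: Vector[Int]): Int =
    a.zip(b).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val genders = Seq("1", "2", "3")
    val institutions = Seq("AOL", "AT&T", "Accenture")
    // Two users with different gender codes but the same institution.
    val userA = oneHot("1", genders) ++ oneHot("AOL", institutions)
    val userB = oneHot("2", genders) ++ oneHot("AOL", institutions)
    println(userA)                        // prints Vector(1, 0, 0, 1, 0, 0)
    println(commonFeatures(userA, userB)) // prints 1
  }
}
```

The real .feat rows are long and mostly zero, which is why a sparse vector representation is a reasonable fit for them.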

The graph-building code follows directly; the comments explain its intent:

import scala.io.Source
import scala.math.abs

import breeze.linalg.SparseVector
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

// Object name chosen for illustration
object Graph2Ego {
  val projectDir = "your_data_dir/"
  val id = "100129275726588145876" // build the social graph for this ID only
  type Feature = SparseVector[Int]

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkInAction").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // Read each vertex's feature vector from the .feat file
    val featureMap: Map[Long, Feature] = Source.fromFile(projectDir + id + ".feat").getLines().
      map {
        line =>
          val row = line.split(" ")
          // When an ID cannot be used directly as a Long, its hash code is often used instead
          val key = abs(row.head.hashCode.toLong)
          val feat: Feature = SparseVector(row.tail.map(_.toInt))
          (key, feat)
      }.toMap

    // Read user pairs from the .edges file and count the features they share
    val edges = sc.textFile(projectDir + id + ".edges").map {
      line =>
        val row = line.split(" ")
        val srcId = abs(row(0).hashCode.toLong)
        val dstId = abs(row(1).hashCode.toLong)
        val srcFeat = featureMap(srcId)
        val dstFeat = featureMap(dstId)
        val numCommonFeats: Int = srcFeat dot dstFeat
        Edge(srcId, dstId, numCommonFeats)
    }

    // Build the graph with fromEdges
    val egoNetwork = Graph.fromEdges(edges, 1)

    // Count the user pairs that share exactly 3 features
    println(egoNetwork.edges.filter(_.attr == 3).count())
  }
}

Two points need attention here: