Parallel K-Frequent Itemset Mining (Implementing the Apriori Algorithm on Spark)

Date: 2019-11-11


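Hello everyone. The hands-on case study I am sharing here is a parallel algorithm for K-frequent itemset mining. Anyone working in data mining is probably familiar with frequent itemset algorithms; here we implement the task with the Apriori algorithm plus a few optimizations.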

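First, the experimental result: on a 2 GB data set with about 18 million records, mining the frequent 1-itemsets through 8-itemsets took only 18 seconds, which is not bad.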

The problem statement:

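This problem is harder than Problem 4. When writing the program, make sure that what you write is actually parallelized rather than a single-machine program that runs only on the client; otherwise its performance will be shockingly poor. The algorithm also needs some optimization. Here I mainly want to discuss the algorithmic approach and those optimizations.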

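I will not go into detail on how Apriori itself works; a quick search turns up plenty of material. The Spark implementation has two phases. The first is the outer loop that computes the frequent itemsets of each size in turn. The second, given the frequent i-itemsets, generates the candidate (i+1)-itemsets.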
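The two phases can be sketched on plain in-memory collections, without Spark. This is only an illustration under stated assumptions: the names `AprioriSketch`, `candidates`, and `mine` are hypothetical and do not appear in the original program, itemsets are modelled as ascending `Vector[Int]` values, and support is counted with a simple local scan.

```scala
object AprioriSketch {
  type Itemset = Vector[Int] // items are kept in ascending order

  // Phase 2: build the candidate (k+1)-itemsets from the frequent k-itemsets.
  def candidates(frequent: Set[Itemset]): Set[Itemset] = {
    // Join step: combine two itemsets that differ only in their last item.
    val joined = for {
      a <- frequent
      b <- frequent
      if a.init == b.init && a.last < b.last
    } yield a :+ b.last
    // Prune step: every k-subset of a candidate must itself be frequent.
    joined.filter(c => c.indices.forall(i => frequent.contains(c.patch(i, Nil, 1))))
  }

  // Phase 1: the outer loop, one pass per itemset size up to maxK.
  def mine(transactions: Seq[Set[Int]], minSupport: Int, maxK: Int): Map[Itemset, Int] = {
    def support(c: Itemset): Int = transactions.count(t => c.forall(t.contains))
    var level: Map[Itemset, Int] = transactions.flatten.distinct
      .map(item => Vector(item) -> support(Vector(item)))
      .filter(_._2 >= minSupport)
      .toMap
    var all = level
    var k = 2
    while (k <= maxK && level.nonEmpty) {
      level = candidates(level.keySet)
        .map(c => c -> support(c))
        .filter(_._2 >= minSupport)
        .toMap
      all ++= level
      k += 1
    }
    all
  }
}
```

In the Spark version the support counting is the part that has to be distributed, whereas candidate generation works on the comparatively small set of frequent itemsets.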

The algorithm can be optimized in the following ways:

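Observe! This matters a great deal. Looking at the data shows that it contains a large number of duplicate records. As the saying goes, effort in the wrong direction is wasted, so the first step is to compress the duplicates; otherwise a lot of work is done for nothing.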

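When designing the algorithm, make sure it really is parallel. You may wonder: isn't Spark parallel by definition? Yet with a little carelessness you can end up writing an algorithm that runs only on the client side.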

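Because the data set is fairly large, remember to persist data that is reused and to use Broadcast variables for intermediate data where appropriate.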

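Optimize the data structures. BitSet is an excellent data structure: it needs only one bit to record whether a given integer is present, which suits this data set particularly well because every item is an integer.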
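The duplicate-compression and BitSet ideas above can be tried out locally before bringing Spark into the picture. The following is a minimal sketch, not the author's code: `BitSetSketch` and its helper names are hypothetical, and it assumes each input line is a space-separated list of non-negative integer item IDs with the transaction ID already removed.

```scala
import java.util.BitSet

object BitSetSketch {
  // Encode one transaction (a list of non-negative item IDs) as a BitSet.
  def encode(items: Seq[Int]): BitSet = {
    val bits = new BitSet()
    items.foreach(i => bits.set(i))
    bits
  }

  // Merge identical transactions, keeping how many times each one occurred.
  // Grouping happens after encoding, so "1 2 3" and "3 2 1" count as the same transaction.
  def compress(lines: Seq[String]): Map[BitSet, Int] =
    lines
      .map(line => encode(line.trim.split(" ").map(_.toInt).toSeq))
      .groupBy(identity)
      .map { case (bits, same) => bits -> same.size }

  // A candidate is contained in a transaction when all of its bits are set there.
  def contains(transaction: BitSet, candidate: Seq[Int]): Boolean =
    candidate.forall(i => transaction.get(i))

  // Support is the sum of the multiplicities of the transactions containing the candidate.
  def support(data: Map[BitSet, Int], candidate: Seq[Int]): Int =
    data.collect { case (t, n) if contains(t, candidate) => n }.sum
}
```

Because each distinct transaction carries its multiplicity, a candidate is tested once per distinct transaction instead of once per raw record, which is where the saving on heavily duplicated data comes from.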

The source code of the implementation follows:

import scala.util.control.Breaks._
import scala.collection.mutable.ArrayBuffer
import java.util.BitSet
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object FrequentItemset {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      println("Usage: <Datapath> <Output>")
      sys.exit(1) // stop here instead of failing later on args(0)
    }
    // Initialize the SparkContext (configuration is expected to come from spark-submit)
    val sc = new SparkContext()
    val SUPPORT_NUM = 15278611 // Total number of transactions: num = 17974836; SUPPORT_NUM = num * 0.85
    val TRANSACITON_NUM = 17974836.0
    val K = 8
    // Strip the transaction ID from every line and merge identical transactions,
    // keeping a count; each distinct transaction is then encoded as a BitSet.
    val transactions = sc.textFile(args(0))
      .map(line => line.substring(line.indexOf(" ") + 1).trim)
      .map((_, 1))
      .reduceByKey(_ + _)
      .map { line =>
        val bitSet = new BitSet()
        val ss = line._1.split(" ")
        for (i <- 0 until ss.length) {
          bitSet.set(ss(i).toInt, true)
        }
        (bitSet, line._2)
      }.cache()
    // Compute the frequent 1-itemsets; fi stands for "frequent itemsets"
    var fi = transactions.flatMap { line =>
      val tmp = new ArrayBuffer[(String, Int)]
      for (i <- 0 until line._1.size()) {
        if (line._1.get(i)) tmp += ((i.toString, line._2))
      }
      tmp
    }.reduceByKey(_ + _).filter(line1 => line1._2 >= SUPPORT_NUM).cache()
    val result = fi.map(line => line._1 + ":" + line._2 / TRANSACITON_NUM)
    result.saveAsTextFile(args(1) + "/result-1")
    for (i <- 2 to K) {
      // Candidates are generated on the driver and broadcast to the executors
      val candiateFI = getCandiateFI(fi.map(_._1).collect(), i)
      val bccFI = sc.broadcast(candiateFI)
      // Count the support of every candidate to get the frequent i-itemsets
      fi = transactions.flatMap { line =>
        val tmp = new ArrayBuffer[(String, Int)]()
        // Check whether each candidate itemset is contained in this transaction
        bccFI.value.foreach { itemset =>
          val itemArray = itemset.split(",")
          var count = 0
          for (item <- itemArray) if (line._1.get(item.toInt)) count += 1
          if (count == itemArray.size) tmp += ((itemset, line._2))
        }
        tmp
      }.reduceByKey(_ + _).filter(_._2 >= SUPPORT_NUM).cache()
      val result = fi.map(line => line._1 + ":" + line._2 / TRANSACITON_NUM)
      result.saveAsTextFile(args(1) + "/result-" + i)
      bccFI.unpersist()
    }
  }

  // Generate the candidate k-itemsets (k = tag) from the frequent (k-1)-itemsets
  def getCandiateFI(f: Array[String], tag: Int) = {
    val separator = ","
    val frequent = f.toSet // hash lookups instead of scanning the array for every subset
    val arrayBuffer = ArrayBuffer[String]()
    for (i <- 0 until f.length; j <- i + 1 until f.length) {
      var tmp = ""
      if (2 == tag) tmp = (f(i) + "," + f(j)).split(",").sortWith((a, b) => a.toInt < b.toInt).reduce(_ + "," + _)
      else {
        // Join two itemsets only when they agree on everything except the last item
        if (f(i).substring(0, f(i).lastIndexOf(',')).equals(f(j).substring(0, f(j).lastIndexOf(',')))) {
          tmp = (f(i) + f(j).substring(f(j).lastIndexOf(','))).split(",").sortWith((a, b) => a.toInt < b.toInt).reduce(_ + "," + _)
        }
      }
      var hasInfrequentSubItem = false // Used to filter out candidates that have an infrequent subset
      if (!tmp.equals("")) {
        val arrayTmp = tmp.split(separator)
        breakable {
          for (i <- 0 until arrayTmp.size) {
            var subItem = ""
            for (j <- 0 until arrayTmp.size) {
              if (j != i) subItem += arrayTmp(j) + separator
            }
            // Remove the trailing separator ","
            subItem = subItem.substring(0, subItem.lastIndexOf(separator))
            if (!frequent.contains(subItem)) {
              hasInfrequentSubItem = true
              break()
            }
          }
        } // breakable
      }
      else hasInfrequentSubItem = true
      // Keep the candidate only if none of its subsets is infrequent
      if (!hasInfrequentSubItem) arrayBuffer += tmp
    } // for
    arrayBuffer.toArray
  }
}

I will stop here for now; suggestions and comments are welcome.

(By Lao Yang. Please credit the source when reposting.)