Squared Euclidean Distance and Sum of Squared Errors in Spark KMeans, with Source Code Analysis

Date: 2020-07-10

Tags: spark, kmeans, squared Euclidean distance, deviation, source code analysis. Category: Spark

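1. Euclidean distance
d(x,y) = √( (x[1]-y[1])^2 + (x[2]-y[2])^2 + … + (x[n]-y[n])^2 )
2. Squared Euclidean distance
Spark KMeans uses the squared Euclidean distance as its distance formula. The squared Euclidean distance is simply the square of the Euclidean distance, i.e. the same sum without the square root.
d(x,y)^2 = (x[1]-y[1])^2 + (x[2]-y[2])^2 + … + (x[n]-y[n])^2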
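The two formulas above can be sketched in plain Scala as follows. This is only an illustration: the object name DistanceSketch and both function names are made up for this example and are not part of Spark.

```scala
object DistanceSketch {
  // Squared Euclidean distance: sum of squared coordinate differences, no square root.
  // (Hypothetical helper for illustration; not a Spark API.)
  def squaredDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < x.length) {
      val diff = x(i) - y(i)
      sum += diff * diff
      i += 1
    }
    sum
  }

  // Euclidean distance: square root of the squared Euclidean distance.
  def euclideanDistance(x: Array[Double], y: Array[Double]): Double =
    math.sqrt(squaredDistance(x, y))
}
```

For x = (0, 0) and y = (3, 4), squaredDistance gives 25.0 and euclideanDistance gives 5.0. Because the square root is monotonic, both distances pick the same nearest center, so skipping the square root changes the cost value but not the assignment.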
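3. Sum of squared errors (SSE)
The error metric Spark KMeans uses to evaluate a clustering is the sum of squared errors.
Formula: SSE = ∑ (actual - predicted)²
Note: for K-means this is the sum of the squared Euclidean distances from each point to its cluster center.
4. Relevant Spark code
Located in spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala: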
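Written out for clustering, with the "predicted" value of a point taken to be its nearest cluster center, the formula can be expressed as below. The symbols x_i (data points), c_j (cluster centers), N (number of points), k (number of clusters) and n (dimension) are notation introduced here for illustration and do not come from the Spark source.

```latex
\mathrm{SSE}
  = \sum_{i=1}^{N} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
  = \sum_{i=1}^{N} \min_{1 \le j \le k} \sum_{m=1}^{n} \left( x_i[m] - c_j[m] \right)^2
```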

  /**
   * Return the K-means cost (sum of squared distances of points to their nearest center) for this
   * model on the given data.
   */
  @Since("0.8.0")
  def computeCost(data: RDD[Vector]): Double = {
    val bcCentersWithNorm = data.context.broadcast(clusterCentersWithNorm) // broadcast the cluster centers
    val cost = data.map(p =>
      distanceMeasureInstance.pointCost(bcCentersWithNorm.value, new VectorWithNorm(p)))
      .sum() // sum the squared distances from each point to its nearest cluster center
    bcCentersWithNorm.destroy()
    cost
  }

Here distanceMeasureInstance is defined in spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala:

  /**
   * @return the index of the center closest to the given point, as well as the cost.
   */
  def findClosest(
      centers: Array[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    while (i < centers.length) {
      val center = centers(i)
      val currentDistance = distance(center, point) // squared Euclidean distance is used here
      if (currentDistance < bestDistance) {
        bestDistance = currentDistance
        bestIndex = i
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  /**
   * @return the k-means cost of a given point against the given cluster centers.
   */
  def pointCost(
      centers: Array[VectorWithNorm],
      point: VectorWithNorm): Double = {
    findClosest(centers, point)._2
  }
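
To see how computeCost, pointCost and findClosest fit together without a Spark cluster, here is a minimal sketch in plain Scala on local collections. It assumes the squared Euclidean distance throughout and mirrors only the structure of the Spark code above: the object KMeansCostSketch, the Point alias and the signatures are hypothetical, and there is no RDD, broadcast or VectorWithNorm involved.

```scala
object KMeansCostSketch {
  // Hypothetical alias for illustration; Spark uses VectorWithNorm instead.
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum
  }

  // Index of the nearest center and the squared distance to it.
  // minBy keeps the first minimum, so ties go to the lower index.
  def findClosest(centers: Seq[Point], point: Point): (Int, Double) = {
    require(centers.nonEmpty, "at least one center is required")
    centers.zipWithIndex
      .map { case (center, i) => (i, squaredDistance(center, point)) }
      .minBy(_._2)
  }

  // Cost of a single point: squared distance to its nearest center.
  def pointCost(centers: Seq[Point], point: Point): Double =
    findClosest(centers, point)._2

  // SSE of the whole data set: sum of the per-point costs.
  def computeCost(centers: Seq[Point], data: Seq[Point]): Double =
    data.map(p => pointCost(centers, p)).sum
}
```

With centers (0, 0) and (10, 10) and the points (0, 0), (1, 0), (10, 10) and (9, 10), the per-point costs are 0, 1, 0 and 1, so computeCost returns 2.0.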

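Summary: Spark's sum-of-squared-errors code reuses the squared Euclidean distance computed while finding each point's nearest cluster center, which is why the sum of squared errors is exactly the sum of the squared Euclidean distances from each point to its cluster center.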