Secondary Sort in Spark

Date: 2019-11-08


1. Sample data:

1 5
2 4
3 6
1 3
2 1
1 14
2 45
4 11
3 23
5 12
6 13

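2. Sort rule: sort by the first field; when the first fields are equal, sort by the second field. Both fields are compared as integers, not as strings.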

3. Sorted result:

1 3
1 5
1 14
2 1
2 4
2 45
3 6
3 23
4 11
5 12
6 13
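
The rule can be checked against the sample data with plain Scala collections, without Spark. This is a minimal sketch: the names `SortRuleCheck`, `sample`, and `parse` are hypothetical, and the input is assumed to be well-formed, space-separated integer pairs. It relies on the built-in Ordering for tuples, which compares the first element and then the second.

```scala
object SortRuleCheck {
  // Sample rows copied from the sample data section.
  val sample: Seq[String] = Seq(
    "1 5", "2 4", "3 6", "1 3", "2 1", "1 14",
    "2 45", "4 11", "3 23", "5 12", "6 13")

  // Parse "a b" into an (Int, Int) pair; assumes two space-separated integers.
  def parse(line: String): (Int, Int) = {
    val parts = line.trim.split("\\s+")
    (parts(0).toInt, parts(1).toInt)
  }

  // The implicit Ordering for (Int, Int) compares the first element, then the second.
  val sorted: Seq[String] =
    sample.map(parse).sorted.map { case (a, b) => s"$a $b" }

  def main(args: Array[String]): Unit = sorted.foreach(println)
}
```

Because the fields are parsed to Int, "1 14" sorts after "1 5"; comparing them as strings would put "14" before "5".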

4. Implementing the secondary sort in Spark

4.1 Custom sort key

package com.test.spark

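/**
  * @author admin
  * Custom key class used for the secondary sort in Scala.
  */
class SecondSortByKey(val first: Int, val second: Int) extends Ordered[SecondSortByKey] with Serializable {
  def compare(other: SecondSortByKey): Int = {
    // `this` is optional here; it is only needed to disambiguate when names clash.
    // Integer.compare is used instead of subtraction, which can overflow for values far apart.
    val byFirst = Integer.compare(this.first, other.first)
    if (byFirst != 0) byFirst
    else Integer.compare(this.second, other.second)
  }

  // Optional toString override
  /*override def toString(): String = {
    "first:" + first + " second:" + second
  }*/
}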
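
A compare method built on subtraction (`a - b`) is a common shortcut. It is fine for small values such as the sample data, but it can overflow when the two values are far apart, and `Integer.compare` avoids that. The sketch below illustrates the difference; `CompareOverflowDemo`, `bySubtraction`, and `bySafeCompare` are hypothetical names introduced only for this illustration.

```scala
object CompareOverflowDemo {
  // Subtraction shortcut: correct only while a - b fits in an Int.
  def bySubtraction(a: Int, b: Int): Int = a - b

  // Overflow-safe comparison from the Java standard library.
  def bySafeCompare(a: Int, b: Int): Int = Integer.compare(a, b)

  def main(args: Array[String]): Unit = {
    val small = Int.MinValue
    val large = 1
    // small < large, so a correct comparison must return a negative number.
    println(bySubtraction(small, large) < 0) // prints false: Int.MinValue - 1 wraps to Int.MaxValue
    println(bySafeCompare(small, large) < 0) // prints true
  }
}
```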

4.2 Driver program for the secondary sort

package com.test.spark

import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author admin
  * Steps for implementing a secondary sort in Spark:
  * Step 1: define a custom key that implements scala.math.Ordered and Serializable
  * Step 2: load the data to be sorted into an RDD of <key, value> pairs
  * Step 3: call sortByKey to sort by the custom key
  * Step 4: drop the key and keep only the sorted values
  */
object SparkSecondSortApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SortSecond").setMaster("local[1]")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Load the input file as an RDD of lines
    val lines = sc.textFile("D:\\SparkDataTest\\sort.txt")
    // Map each line to a <key, value> pair: the custom key and the original line
    val pairs = lines.map { line =>
      val spl = line.split(" ")
      (new SecondSortByKey(spl(0).toInt, spl(1).toInt), line)
    }
    // Sort by the custom key with sortByKey; true: ascending, false: descending
    val sortPair = pairs.sortByKey(true)

    // Keep only the value (the original line) of each sorted pair
    val sortResult = sortPair.map(line => line._2)

    sortResult.collect().foreach { x => println(x) }

    // Stop the SparkContext
    sc.stop()
  }
}
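
As an alternative to a custom key class, the same ordering can be obtained by sorting on an (Int, Int) tuple with `RDD.sortBy`, since Scala provides an Ordering for tuples. This is a sketch of a different technique, not the approach used above. `TupleKeySecondSort` and `sortLines` are hypothetical names, and the input is assumed to be well-formed, space-separated integer pairs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TupleKeySecondSort {
  // Sort lines of the form "a b" by (a, b) as integers, ascending.
  def sortLines(sc: SparkContext, lines: Seq[String]): Array[String] = {
    sc.parallelize(lines)
      .sortBy { line =>
        val parts = line.trim.split("\\s+")
        // The tuple key is ordered by its first element, then its second.
        (parts(0).toInt, parts(1).toInt)
      }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TupleKeySecondSort").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      sortLines(sc, Seq("2 4", "1 14", "1 3", "3 6")).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The tuple key keeps the code shorter. A custom key class is still useful when the comparison logic is more involved than field-by-field ordering.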
