Spark SQL Programming: DataSet
Author: Yin Zhengjie
Copyright notice: This is an original work. Reproduction is prohibited; violators will be held legally responsible.
1. Creating a DataSet
Tip: A Dataset is a strongly typed collection of data, so the corresponding type information must be supplied. A concrete example follows.

scala> case class Person(name: String, age: Long)                      # Define a case class
defined class Person

scala> val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()         # Create a DataSet
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> caseClassDS.show                                                # Note that DataSet methods are used much like DataFrame methods.
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala> caseClassDS.createTempView("person")

scala> spark.sql("select * from person").show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala>
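The spark-shell predefines the SparkSession as `spark` and already imports `spark.implicits._`; in a compiled application both must be set up by hand. Below is a minimal, illustrative sketch of the same steps as a standalone program (the object name, appName, and local master are assumptions, not from the original):

import org.apache.spark.sql.SparkSession

object DataSetDemo {
  // Define the case class outside main so the encoder machinery can see it.
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSetDemo")      // hypothetical app name
      .master("local[*]")          // local mode, for testing only
      .getOrCreate()

    import spark.implicits._       // brings toDS() and case-class encoders into scope

    val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()
    caseClassDS.show()

    caseClassDS.createTempView("person")
    spark.sql("select * from person").show()

    spark.stop()
  }
}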
2. Converting an RDD to a DataSet
scala> case class Person(name: String, age: Long)                      # Define a case class
defined class Person

scala> val listRDD = sc.makeRDD(List(("YinZhengjie",18),("Jason Yin",20),("Danny",28)))        # Create an RDD
listRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[84] at makeRDD at <console>:27

scala> val mapRDD = listRDD.map( t => { Person( t._1, t._2) })         # Use the map operator to turn each element of listRDD into a Person object
mapRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[102] at map at <console>:30

scala> val ds = mapRDD.toDS                                            # Convert the RDD to a DataSet
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala>
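Besides calling `toDS` on the RDD, `spark.createDataset` accepts an RDD directly whenever an implicit Encoder for the element type is in scope, which `import spark.implicits._` provides for case classes. A small sketch, assuming the Person case class from above is already defined:

// Alternative conversion via spark.createDataset; requires an
// implicit Encoder[Person], supplied by import spark.implicits._
import spark.implicits._

val listRDD = sc.makeRDD(List(("YinZhengjie", 18), ("Jason Yin", 20), ("Danny", 28)))
val personRDD = listRDD.map { case (name, age) => Person(name, age) }
val ds = spark.createDataset(personRDD)
ds.show()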
3. Converting a DataSet to an RDD
scala> ds.show                                                         # View the DataSet's contents
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala> ds
res6: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.rdd                                                          # Convert the DataSet to an RDD
res7: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[26] at rdd at <console>:29

scala> res7.collect                                                    # View the RDD's contents
res8: Array[Person] = Array(Person(YinZhengjie,18), Person(Jason Yin,20), Person(Danny,28))

scala>
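Because `ds.rdd` returns an RDD[Person] rather than an untyped RDD of rows, plain Scala field access works on the recovered elements. An illustrative continuation of the session above (the filter threshold is arbitrary):

// The recovered RDD keeps its element type, so ordinary
// RDD transformations with field access compile as-is.
val adults = ds.rdd.filter(_.age >= 20)                  // RDD[Person]
adults.map(p => s"${p.name} is ${p.age}").collect().foreach(println)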