Getting Started with Apache Spark DataFrames: Working with DataFrames


2. Working with DataFrames

  In the previous article we covered how to create a DataFrame. This article covers how to work with the data inside a DataFrame and how to print its schema.

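Printing the schema of a DataFrame

  After creating a DataFrame, we usually want to inspect the schema of its data, which we can do with the printSchema function. It prints the name and type of each column: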

students.printSchema

root

 |-- id: string (nullable = true)

 |-- studentName: string (nullable = true)

 |-- phone: string (nullable = true)

 |-- email: string (nullable = true)

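If the DataFrame was created with the load approach (see the earlier article on creating a DataFrame), the output of students.printSchema looks like this instead: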

root

 |-- id|studentName|phone|email: string (nullable = true)

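Sampling the data in a DataFrame

  After printing the schema, the second thing to do is check that the data loaded into the DataFrame is correct. There are several ways to sample data from a newly created DataFrame, and we go through them below.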

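  The simplest is the show method, which has four versions:
  (1) The first takes the number of rows to sample: def show(numRows: Int);
  (2) The second takes no arguments, in which case show displays 20 rows by default: def show();
  (3) The third takes a Boolean that says whether column values longer than 20 characters should be truncated: def show(truncate: Boolean);
  (4) The last takes both the number of rows and the truncation flag: def show(numRows: Int, truncate: Boolean). In fact, the first three versions are all implemented by calling this one.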
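The truncation rule mentioned in version (3) can be sketched in plain Scala, without Spark. This is only an illustration inferred from the sample output in this article: the object and function names are made up, and the exact rule (keep the first 17 characters and append "..." when a cell is longer than 20 characters) is an assumption based on that output rather than a quotation of Spark's source.

```scala
// Plain-Scala sketch (no Spark required) of how show appears to shorten long cells.
// ShowTruncationSketch and truncateCell are hypothetical names used for illustration.
object ShowTruncationSketch {

  // Assumed rule: a cell longer than 20 characters keeps its first 17 characters
  // and gets "..." appended, so the displayed cell is exactly 20 characters wide.
  def truncateCell(cell: String, truncate: Boolean): String =
    if (truncate && cell.length > 20) cell.substring(0, 17) + "..." else cell

  def main(args: Array[String]): Unit = {
    val longEmail  = "ullamcorper.velit.in@ametnullaDonec.co.uk"
    val shortEmail = "turpis@felisorci.com" // exactly 20 characters, so it is left alone

    println(truncateCell(longEmail, truncate = true))   // ullamcorper.velit...
    println(truncateCell(shortEmail, truncate = true))  // turpis@felisorci.com
    println(truncateCell(longEmail, truncate = false))  // full address, untouched
  }
}
```

This matches what the tables in this article show: with truncation on, the first email is cut to `ullamcorper.velit...`, while a 20-character address such as `turpis@felisorci.com` is printed in full.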

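  Unlike the other functions, show prints not only the requested rows but also a header, and it writes directly to the default output stream (the console). Let's see how to use it:

students.show()  // prints 20 rows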

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

|  1|      Burke|1-300-746-8446|ullamcorper.velit...|

|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|

|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|

|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...|

| 11|        Emi|1-467-270-1337|        est@nunc.com|

| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|

| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|

| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|

| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|

| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|

| 17|    Blossom|1-154-406-9596|Nunc.commodo.auct...|

| 18|        Guy|1-869-521-3230|senectus.et.netus...|

| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|

| 20|     Edward|1-711-710-6552|lectus@aliquetlib...|

+---+-----------+--------------+--------------------+

only showing top 20 rows

students.show(15)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

|  1|      Burke|1-300-746-8446|ullamcorper.velit...|

|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|

|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|

|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...|

| 11|        Emi|1-467-270-1337|        est@nunc.com|

| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|

| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|

| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|

| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|

+---+-----------+--------------+--------------------+

only showing top 15 rows

 

students.show(true)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

|  1|      Burke|1-300-746-8446|ullamcorper.velit...|

|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|

|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|

|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...|

| 11|        Emi|1-467-270-1337|        est@nunc.com|

| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|

| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|

| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|

| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|

| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|

| 17|    Blossom|1-154-406-9596|Nunc.commodo.auct...|

| 18|        Guy|1-869-521-3230|senectus.et.netus...|

| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|

| 20|     Edward|1-711-710-6552|lectus@aliquetlib...|

+---+-----------+--------------+--------------------+

only showing top 20 rows

 

students.show(false)

+---+-----------+--------------+-----------------------------------------+

|id |studentName|phone         |email                                    |

+---+-----------+--------------+-----------------------------------------+

|1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|

|2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |

|3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |

|4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |

|5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |

|6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |

|7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |

|8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |

|9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |

|10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |

|11 |Emi        |1-467-270-1337|est@nunc.com                             |

|12 |Caleb      |1-683-212-0896|Suspendisse@Quisque.edu                  |

|13 |Florence   |1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca   |

|14 |Anika      |1-856-828-7883|euismod@ligulaelit.co.uk                 |

|15 |Tarik      |1-398-171-2268|turpis@felisorci.com                     |

|16 |Amena      |1-878-250-3129|lorem.luctus.ut@scelerisque.com          |

|17 |Blossom    |1-154-406-9596|Nunc.commodo.auctor@eratSed.co.uk        |

|18 |Guy        |1-869-521-3230|senectus.et.netus@lectusrutrum.com       |

|19 |Malachi    |1-608-637-2772|Proin.mi.Aliquam@estarcu.net             |

|20 |Edward     |1-711-710-6552|lectus@aliquetlibero.co.uk               |

+---+-----------+--------------+-----------------------------------------+

only showing top 20 rows

 

students.show(10,false)

 

+---+-----------+--------------+-----------------------------------------+

|id |studentName|phone         |email                                    |

+---+-----------+--------------+-----------------------------------------+

|1  |Burke      |1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk|

|2  |Kamal      |1-668-571-5046|pede.Suspendisse@interdumenim.edu        |

|3  |Olga       |1-956-311-1686|Aenean.eget.metus@dictumcursusNunc.edu   |

|4  |Belle      |1-246-894-6340|vitae.aliquet.nec@neque.co.uk            |

|5  |Trevor     |1-300-527-4967|dapibus.id@acturpisegestas.net           |

|6  |Laurel     |1-691-379-9921|adipiscing@consectetueripsum.edu         |

|7  |Sara       |1-608-140-1995|Donec.nibh@enimEtiamimperdiet.edu        |

|8  |Kaseem     |1-881-586-2689|cursus.et.magna@euismod.org              |

|9  |Lev        |1-916-367-5608|Vivamus.nisi@ipsumdolor.com              |

|10 |Maya       |1-271-683-2698|accumsan.convallis@ornarelectusjusto.edu |

+---+-----------+--------------+-----------------------------------------+

only showing top 10 rows

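  We can also sample data with the head(n: Int) method. It likewise takes the number of rows to sample, but it returns an array of Row objects, so we have to iterate over the array to print them. We can also call head() with no arguments, which returns just the first row, again of type Row.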

students.head(5).foreach(println)

[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

[2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]

[3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]

[4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]

[5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]

println(students.head())

[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

Besides show and head, we can also use first and take, which call head() and head(n) respectively:

println(students.first())

[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

students.take(5).foreach(println)

[1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

[2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]

[3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]

[4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]

[5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]
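The relationship between head, first and take described above can be sketched with a tiny stand-in class in plain Scala. MiniFrame is a hypothetical type invented for this illustration, not a Spark class. The way head() is derived from head(1) is also an assumption about how such delegation might look.

```scala
// Plain-Scala sketch (no Spark required) of how first and take can delegate to head.
// MiniFrame is a hypothetical stand-in for a DataFrame: just an ordered list of rows.
object HeadTakeSketch {

  final case class MiniFrame[A](rows: Seq[A]) {
    // First n rows (fewer if the frame is shorter than n).
    def head(n: Int): Seq[A] = rows.take(n)

    // First row only; throws NoSuchElementException when the frame is empty.
    def head(): A = head(1).head

    // first and take add nothing of their own: they simply forward to head.
    def first(): A = head()
    def take(n: Int): Seq[A] = head(n)
  }

  def main(args: Array[String]): Unit = {
    val students = MiniFrame(Seq("Burke", "Kamal", "Olga", "Belle", "Trevor", "Laurel"))
    students.take(5).foreach(println) // same rows as students.head(5)
    println(students.first())         // same row as students.head()
  }
}
```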

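Selecting columns from a DataFrame

  As you have seen, every column in a DataFrame has a name. The select function lets us pick the columns we need from a DataFrame and returns a brand-new DataFrame, as described below.

  1. Selecting a single column. Suppose we only want the email column from the DataFrame. Because a DataFrame is immutable, this operation returns a new DataFrame: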

val emailDataFrame: DataFrame = students.select("email")

We now have a brand-new DataFrame named emailDataFrame that contains only the email column. Let's use show to check that this is the case:

emailDataFrame.show(3)

+--------------------+

|               email|

+--------------------+

|ullamcorper.velit...|

|pede.Suspendisse@...|

|Aenean.eget.metus...|

+--------------------+

only showing top 3 rows

  2. Selecting multiple columns. The select function also accepts several columns:

val studentEmailDF = students.select("studentName", "email")

studentEmailDF.show(5)

+-----------+--------------------+

|studentName|               email|

+-----------+--------------------+

|      Burke|ullamcorper.velit...|

|      Kamal|pede.Suspendisse@...|

|       Olga|Aenean.eget.metus...|

|      Belle|vitae.aliquet.nec...|

|     Trevor|dapibus.id@acturp...|

+-----------+--------------------+

only showing top 5 rows

  Note that the columns passed to select must be valid. In other words, they must be among the columns printed by printSchema. If a column name is invalid, an org.apache.spark.sql.AnalysisException is thrown, as shown below:

val studentEmailDF = students.select("studentName", "iteblog")

studentEmailDF.show(5)

 

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'iteblog' given input columns id, studentName, phone, email;

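Filtering data by condition

  Now that we know how to select the columns we need from a DataFrame, let's see how to filter its data by condition. Since the data is Row-based, we can treat the DataFrame much like an ordinary Scala collection and filter it with whatever condition we need. For clarity, I append a call to show after each statement to display the filtered result.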

students.filter("id > 5").show(10)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...|

| 11|        Emi|1-467-270-1337|        est@nunc.com|

| 12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|

| 13|   Florence|1-603-575-2444|sit.amet.dapibus@...|

| 14|      Anika|1-856-828-7883|euismod@ligulaeli...|

| 15|      Tarik|1-398-171-2268|turpis@felisorci.com|

+---+-----------+--------------+--------------------+

only showing top 10 rows

 

students.filter("studentName =''").show(7)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 21|           |1-598-439-7549|consectetuer.adip...|

| 32|           |1-184-895-9602|accumsan.laoreet@...|

| 45|           |1-245-752-0481|Suspendisse.eleif...|

| 83|           |1-858-810-2204|sociis.natoque@eu...|

| 94|           |1-443-410-7878|Praesent.eu.nulla...|

+---+-----------+--------------+--------------------+

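  Look at the first filter statement: although id was parsed as a String, the comparison is still carried out correctly, because Spark SQL implicitly casts the string to a numeric type when it is compared with a number. We can also filter on multiple conditions: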

students.filter("studentName ='' OR studentName = 'NULL'").show(7)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 21|           |1-598-439-7549|consectetuer.adip...|

| 32|           |1-184-895-9602|accumsan.laoreet@...|

| 33|       NULL|1-105-503-0141|Donec@Inmipede.co.uk|

| 45|           |1-245-752-0481|Suspendisse.eleif...|

| 83|           |1-858-810-2204|sociis.natoque@eu...|

| 94|           |1-443-410-7878|Praesent.eu.nulla...|

+---+-----------+--------------+--------------------+
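To see why it matters that the `id > 5` filter compares numbers rather than text, here is a plain-Scala sketch (no Spark involved) that contrasts the two kinds of comparison on string ids. The object and function names are made up for this illustration, and the sketch assumes every id is a valid integer string.

```scala
// Plain-Scala sketch contrasting numeric and textual comparison of string ids.
// All names here are hypothetical; this is not Spark code.
object StringVsNumericComparison {

  // Numeric comparison: convert each id to an Int first.
  // Assumes every id parses as an integer; toInt throws otherwise.
  def numericGreaterThan(ids: Seq[String], threshold: Int): Seq[String] =
    ids.filter(id => id.toInt > threshold)

  // Textual comparison: ids are compared character by character.
  def textGreaterThan(ids: Seq[String], threshold: String): Seq[String] =
    ids.filter(id => id > threshold)

  def main(args: Array[String]): Unit = {
    val ids = Seq("4", "6", "10", "21")
    println(numericGreaterThan(ids, 5)) // List(6, 10, 21)
    println(textGreaterThan(ids, "5"))  // List(6)
  }
}
```

A purely textual comparison would drop ids such as 10 and 21, because the character '1' or '2' sorts before '5'. The filter output earlier in this section includes ids 10 and above, which is consistent with a numeric comparison.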

We can also filter the data with SQL-like syntax:

students.filter("SUBSTR(studentName,0,1) ='M'").show(7)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 10|       Maya|1-271-683-2698|accumsan.convalli...|

| 19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|

| 24|    Marsden|1-477-629-7528|Donec.dignissim.m...|

| 37|      Maggy|1-910-887-6777|facilisi.Sed.nequ...|

| 61|     Maxine|1-422-863-3041|aliquet.molestie....|

| 77|      Maggy|1-613-147-4380| pellentesque@mi.net|

| 97|    Maxwell|1-607-205-1273|metus.In@musAenea...|

+---+-----------+--------------+--------------------+

only showing top 7 rows

Sorting the data in a DataFrame

With the sort function we can sort a DataFrame by the specified columns:

students.sort(students("studentName").desc).show(7)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 50|      Yasir|1-282-511-4445|eget.odio.Aliquam...|

| 52|       Xena|1-527-990-8606|in.faucibus.orci@...|

| 86|     Xandra|1-677-708-5691|libero@arcuVestib...|

| 43|     Wynter|1-440-544-1851|amet.risus.Donec@...|

| 31|    Wallace|1-144-220-8159| lorem.lorem@non.net|

| 66|      Vance|1-268-680-0857|pellentesque@netu...|

| 41|     Tyrone|1-907-383-5293|non.bibendum.sed@...|

+---+-----------+--------------+--------------------+

only showing top 7 rows

We can also sort by multiple columns:

students.sort("studentName", "id").show(10)

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 21|           |1-598-439-7549|consectetuer.adip...|

| 32|           |1-184-895-9602|accumsan.laoreet@...|

| 45|           |1-245-752-0481|Suspendisse.eleif...|

| 83|           |1-858-810-2204|sociis.natoque@eu...|

| 94|           |1-443-410-7878|Praesent.eu.nulla...|

| 91|       Abel|1-530-527-7467|    urna@veliteu.edu|

| 69|       Aiko|1-682-230-7013|turpis.vitae.puru...|

| 47|       Alma|1-747-382-6775|    nec.enim@non.org|

| 26|      Amela|1-526-909-2605| in@vitaesodales.edu|

| 16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|

+---+-----------+--------------+--------------------+

only showing top 10 rows

The results above show that the sort is ascending by default. The statement above can also be written as follows:

students.sort(students("studentName").asc, students("id").asc).show(10)

Both statements produce the same result.
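For readers more familiar with Scala collections, the two sort calls in this section have a rough plain-Scala analogue: sorting by a tuple of keys for the multi-column ascending case, and reversing the ordering for the descending case. The Student case class and function names below are hypothetical stand-ins, and both fields are strings, as in the schema printed earlier.

```scala
// Plain-Scala sketch (no Spark required) of multi-key ascending and single-key descending sorts.
// Student and the function names are hypothetical stand-ins for rows of the students DataFrame.
object MultiColumnSortSketch {

  final case class Student(id: String, studentName: String)

  // Ascending by studentName, with ties broken by id.
  // Both keys are strings, so id is compared as text, not as a number.
  def sortAscending(rows: Seq[Student]): Seq[Student] =
    rows.sortBy(s => (s.studentName, s.id))

  // Descending by studentName only.
  def sortByNameDescending(rows: Seq[Student]): Seq[Student] =
    rows.sortBy(_.studentName)(Ordering[String].reverse)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Student("16", "Amena"),
      Student("21", ""),
      Student("91", "Abel"),
      Student("32", "")
    )
    sortAscending(rows).foreach(println)        // empty names first, then Abel, then Amena
    sortByNameDescending(rows).foreach(println) // Amena, Abel, then the empty names
  }
}
```

As in the DataFrame output above, rows with an empty studentName come first in ascending order because the empty string sorts before any letter.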

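Renaming columns

  If we are not happy with the default column names in a DataFrame, we can rename them with as inside a select. The example below renames studentName to name and leaves the name of the email column unchanged: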

students.select(students("studentName").as("name"), students("email")).show(10)

+--------+--------------------+

|    name|               email|

+--------+--------------------+

|   Burke|ullamcorper.velit...|

|   Kamal|pede.Suspendisse@...|

|    Olga|Aenean.eget.metus...|

|   Belle|vitae.aliquet.nec...|

|  Trevor|dapibus.id@acturp...|

|  Laurel|adipiscing@consec...|

|    Sara|Donec.nibh@enimEt...|

|  Kaseem|cursus.et.magna@e...|

|     Lev|Vivamus.nisi@ipsu...|

|    Maya|accumsan.convalli...|

+--------+--------------------+

only showing top 10 rows

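Treating a DataFrame as a relational table

  One of the strengths of a DataFrame is that we can treat it as a relational table and run SQL queries against it. It takes just two steps:
  (1) Register the DataFrame as a table named students: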

students.registerTempTable("students")

  (2) Then query it with standard SQL:

sqlContext.sql("select * from students where studentName!='' order by email desc").show(7)

 

+---+-----------+--------------+--------------------+

| id|studentName|         phone|               email|

+---+-----------+--------------+--------------------+

| 87|      Selma|1-601-330-4409|vulputate.velit@p...|

| 96|   Channing|1-984-118-7533|viverra.Donec.tem...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|

| 78|       Finn|1-213-781-6969|vestibulum.massa@...|

| 53|     Kasper|1-155-575-9346|velit.eget@pedeCu...|

| 63|      Dylan|1-417-943-8961|vehicula.aliquet@...|

| 35|     Cadman|1-443-642-5919|ut.lacus@adipisci...|

+---+-----------+--------------+--------------------+

only showing top 7 rows

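Joining two DataFrames

  We already know how to register a DataFrame as a table. Now let's see how to perform ordinary SQL-style joins on two DataFrames.

  1. Inner join: the inner join is the default join type. It returns only the rows that match in both DataFrames. Consider the following example: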

val students1 = sqlContext.csvFile(filePath = "E:\\StudentPrep1.csv", useHeader = true, delimiter = '|')

val students2 = sqlContext.csvFile(filePath = "E:\\StudentPrep2.csv", useHeader = true, delimiter = '|')

val studentsJoin = students1.join(students2, students1("id") === students2("id"))

studentsJoin.show(studentsJoin.count.toInt)

 

+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+

| id|studentName|         phone|               email| id|       studentName|         phone|               email|

+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+

|  1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|

|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|

|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|              Olga|1-956-311-1686|Aenean.eget.metus...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|

|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|            Trevor|1-300-527-4967|dapibusDifferentE...|

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|LaurelInvalidPhone|     000000000|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...| 10|              Maya|1-271-683-2698|accumsan.convalli...|

+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+

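  2. Right outer join: in addition to the inner-join result, it includes every row of the right table that has no match, with the left table's columns filled with NULL. Here is an example: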

val studentsRightOuterJoin = students1.join(students2, students1("id") === students2("id"), "right_outer")

studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)

+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+

|  id|studentName|         phone|               email| id|         studentName|         phone|               email|

+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+

|   1|      Burke|1-300-746-8446|ullamcorper.velit...|  1|  BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|

|   2|      Kamal|1-668-571-5046|pede.Suspendisse@...|  2|  KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|

|   3|       Olga|1-956-311-1686|Aenean.eget.metus...|  3|                Olga|1-956-311-1686|Aenean.eget.metus...|

|   4|      Belle|1-246-894-6340|vitae.aliquet.nec...|  4|  BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|

|   5|     Trevor|1-300-527-4967|dapibus.id@acturp...|  5|              Trevor|1-300-527-4967|dapibusDifferentE...|

|   6|     Laurel|1-691-379-9921|adipiscing@consec...|  6|  LaurelInvalidPhone|     000000000|adipiscing@consec...|

|   7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|  7|                Sara|1-608-140-1995|Donec.nibh@enimEt...|

|   8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|  8|              Kaseem|1-881-586-2689|cursus.et.magna@e...|

|   9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|  9|                 Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

|  10|       Maya|1-271-683-2698|accumsan.convalli...| 10|                Maya|1-271-683-2698|accumsan.convalli...|

|null|       null|          null|                null|999|LevUniqueToSecondRDD|1-916-367-5608|Vivamus.nisi@ipsu...|

+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+

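  3. Left outer join: in addition to the inner-join result, it includes every row of the left table that has no match, with the right table's columns filled with NULL. Again, here is an example: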

val studentsLeftOuterJoin = students1.join(students2, students1("id") === students2("id"), "left_outer")

studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)

+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+

| id|studentName|         phone|               email|  id|       studentName|         phone|               email|

+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+

|  1|      Burke|1-300-746-8446|ullamcorper.velit...|   1|BurkeDifferentName|1-300-746-8446|ullamcorper.velit...|

|  2|      Kamal|1-668-571-5046|pede.Suspendisse@...|   2|KamalDifferentName|1-668-571-5046|pede.Suspendisse@...|

|  3|       Olga|1-956-311-1686|Aenean.eget.metus...|   3|              Olga|1-956-311-1686|Aenean.eget.metus...|

|  4|      Belle|1-246-894-6340|vitae.aliquet.nec...|   4|BelleDifferentName|1-246-894-6340|vitae.aliquet.nec...|

|  5|     Trevor|1-300-527-4967|dapibus.id@acturp...|   5|            Trevor|1-300-527-4967|dapibusDifferentE...|

|  6|     Laurel|1-691-379-9921|adipiscing@consec...|   6|LaurelInvalidPhone|     000000000|adipiscing@consec...|

|  7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|   7|              Sara|1-608-140-1995|Donec.nibh@enimEt...|

|  8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|   8|            Kaseem|1-881-586-2689|cursus.et.magna@e...|

|  9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|   9|               Lev|1-916-367-5608|Vivamus.nisi@ipsu...|

| 10|       Maya|1-271-683-2698|accumsan.convalli...|  10|              Maya|1-271-683-2698|accumsan.convalli...|

| 11|    iteblog|        999999| iteblog@iteblog.com|null|              null|          null|                null|

+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
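The three join types above differ only in which unmatched rows they keep. The following plain-Scala sketch (no Spark involved) models each table as a Map from id to studentName and uses Option to stand in for NULL. The object, type alias and function names are hypothetical, and modelling a table as a Map assumes ids are unique within each table.

```scala
// Plain-Scala sketch (no Spark required) of inner, left outer and right outer join semantics.
// All names are hypothetical; Option.None plays the role of NULL.
object JoinSemanticsSketch {

  // id -> studentName; assumes each id appears at most once per table.
  type Table = Map[String, String]

  // Inner join: only ids present in both tables.
  def innerJoin(left: Table, right: Table): Seq[(String, String, String)] =
    left.keys.toSeq
      .filter(id => right.contains(id))
      .map(id => (id, left(id), right(id)))

  // Left outer join: every left id; the right side is None when there is no match.
  def leftOuterJoin(left: Table, right: Table): Seq[(String, String, Option[String])] =
    left.keys.toSeq.map(id => (id, left(id), right.get(id)))

  // Right outer join: every right id; the left side is None when there is no match.
  def rightOuterJoin(left: Table, right: Table): Seq[(String, Option[String], String)] =
    right.keys.toSeq.map(id => (id, left.get(id), right(id)))

  def main(args: Array[String]): Unit = {
    val students1: Table = Map("9" -> "Lev", "10" -> "Maya", "11" -> "iteblog")
    val students2: Table = Map("9" -> "Lev", "10" -> "Maya", "999" -> "LevUniqueToSecondRDD")

    println("inner:");       innerJoin(students1, students2).foreach(println)
    println("left outer:");  leftOuterJoin(students1, students2).foreach(println)
    println("right outer:"); rightOuterJoin(students1, students2).foreach(println)
  }
}
```

With this sample data, id 11 survives only in the left outer join and id 999 only in the right outer join, mirroring the null-filled rows in the outputs above.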

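Saving a DataFrame to a file

  Next, I'll show how to save a DataFrame to a file. Earlier we used the load function to load a CSV file. Its counterpart for saving is the save function. It takes the following two steps:

  1. First create a Map object that holds the options the save function needs. Here I specify the output path and whether to write a CSV header.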

val saveOptions = Map("header" -> "true", "path" -> "iteblog.csv")

  For the sake of learning, we select the studentName and email columns from the DataFrame and rename the studentName column to name.

val copyOfStudents = students.select(students("studentName").as("name"), students("email"))

  2. Then call the save function to write the DataFrame above to the iteblog.csv directory:

import org.apache.spark.sql.SaveMode

copyOfStudents.write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).options(saveOptions).save()

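  The mode function accepts Overwrite, Append, Ignore and ErrorIfExists. The names are largely self-explanatory: Overwrite replaces any data that already exists in the directory; Append adds the data to the specified directory; Ignore does nothing if the directory already contains files; ErrorIfExists throws an exception if the target directory already contains files.

  Note that the path option above specifies the output directory, not the name of the final output file.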
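The four save modes described above can be summarised as a small decision function. The sketch below is plain Scala and does not touch Spark or the file system: the Mode type merely mirrors the names of Spark's save modes, the directory is simplified to a list of records (an empty list meaning that nothing has been saved there yet), and IllegalStateException is a stand-in for whatever exception Spark actually raises.

```scala
// Plain-Scala sketch (no Spark, no file I/O) of what each save mode does to existing data.
// Mode and its cases are hypothetical and only mirror the names of Spark's save modes.
object SaveModeSketch {

  sealed trait Mode
  case object Overwrite extends Mode
  case object Append extends Mode
  case object Ignore extends Mode
  case object ErrorIfExists extends Mode

  // Returns the contents of the target directory after the save.
  // Simplification: an empty `existing` means the directory holds no data yet.
  def save(existing: Seq[String], incoming: Seq[String], mode: Mode): Seq[String] =
    mode match {
      case Overwrite => incoming                                       // old data is replaced
      case Append    => existing ++ incoming                           // new data is added after the old
      case Ignore    => if (existing.isEmpty) incoming else existing   // silently keep the old data
      case ErrorIfExists =>
        if (existing.isEmpty) incoming
        else throw new IllegalStateException("target already contains data")
    }

  def main(args: Array[String]): Unit = {
    val old = Seq("old-record")
    val fresh = Seq("new-record")
    println(save(old, fresh, Overwrite)) // List(new-record)
    println(save(old, fresh, Append))    // List(old-record, new-record)
    println(save(old, fresh, Ignore))    // List(old-record)
  }
}
```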
