A Spark WordCount Example

It took me a few days to set up a Scala development environment for Spark, but it finally works; the details are at http://www.cnblogs.com/ljy2013/p/4964201.html. The next step is to verify the environment with a test program, so I used the classic big-data example: WordCount. The details follow.

1. With the development environment from the previous post in place, create a Maven project to hold the Scala code. The project directory looks like this:
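(A sketch of the layout, inferred from the pom.xml below and the package name in the code; the exact tree may differ:)

Spark-demo/
├── pom.xml
└── src/
    ├── main/
    │   └── scala/
    │       └── com/yiban/datacenter/Spark_demo/App.scala
    └── test/
        └── scala/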

2. Write the code:

package com.yiban.datacenter.Spark_demo


import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

/**
 * @author ${user.name}
 */
object App {

  def foo(x: Array[String]) = x.foldLeft("")((a, b) => a + b)

  def main(args: Array[String]) {

    // Hadoop configuration -- without this, running in local mode throws an error
    val hadoopconf = new Configuration()
    hadoopconf.setBoolean("fs.hdfs.impl.disable.cache", true)
    val fileSystem = FileSystem.get(hadoopconf)

    // Spark configuration -- submit to a YARN cluster
    val conf = new SparkConf().setAppName("wordcount").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)

    // WordCount: split each line on spaces, map each word to (word, 1),
    // sum the counts per key, and write the result to HDFS
    sc.textFile("/user/liujiyu/input", 1)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("/user/liujiyu/sparkwordcountoutput")

    // Sanity check: parallelize a local collection and save it to HDFS
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    distData.saveAsTextFile("/user/liujiyu/spark-demo")
  }

}
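Before submitting to YARN, it can be convenient to smoke-test the same logic in local mode. The following is only a sketch, not part of the original project: it assumes Spark 1.5.x is on the classpath, and input.txt is a placeholder for a local text file.

import org.apache.spark.{SparkConf, SparkContext}

// Local-mode sketch (assumptions: Spark 1.5.x on the classpath; input.txt is a placeholder)
object LocalWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("wordcount-local").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()          // bring the counts back to the driver
      .foreach(println)   // print each (word, count) pair
    sc.stop()
  }
}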

3. The contents of the pom.xml file are as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.yiban.datacenter</groupId>
  <artifactId>Spark-demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderful Scala app</description>
  <inceptionYear>2015</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.10.5</scala.version>
    <scala.compat.version>2.10</scala.compat.version>
  </properties>
  <repositories>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>
  
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0-cdh5.4.4</version>
    </dependency>
  	
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-core_${scala.compat.version}</artifactId>
      <version>2.4.16</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.compat.version}</artifactId>
      <version>2.2.4</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.18.1</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issue like NoDefClassError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

4. Run mvn clean package to build and package the project.
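For reference (the jar path follows from the artifactId and version in the pom.xml):

mvn clean package
# the build produces target/Spark-demo-0.0.1-SNAPSHOT.jar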

5. Copy the packaged jar to the cluster and run it.

Run it with the following command:

spark-submit --class "com.yiban.datacenter.Spark_demo.App" --master yarn-cluster Spark-demo-0.0.1-SNAPSHOT.jar

When the job finishes, the results are written to the output paths given in the code; check the corresponding HDFS paths to verify them.
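The standard HDFS shell commands work for this; since the job reads the input with a single partition, the word counts typically land in part-00000:

hdfs dfs -ls /user/liujiyu/sparkwordcountoutput
hdfs dfs -cat /user/liujiyu/sparkwordcountoutput/part-00000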
