Environment: CDH 5.13.0 (for setup details, see 5. Word Count)
For example, suppose we have the following two documents:

Based on these two text documents, we construct a dictionary:
Dictionary = {1: "Bob", 2: "like", 3: "to", 4: "play", 5: "basketball", 6: "also", 7: "football", 8: "games", 9: "Jim", 10: "too"}.
This dictionary contains 10 distinct words. Using the dictionary's index numbers, each of the two documents above can be represented by a 10-dimensional vector (an integer 0 to n, n a positive integer, records how many times a given word occurs in the document):

Each element of the vector is the number of times the corresponding dictionary word occurs in the document (below, this is referred to as the word histogram).

Note, however, that in building the document vectors this way we do not retain the order in which the words appeared in the original sentences.
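To make the vector construction concrete, here is a minimal Java sketch of the idea (not part of the original post): the BowExample class name and the sample sentence are illustrative assumptions, using the 10-word dictionary above.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class BowExample {
    public static void main(String[] args) {
        // The 10-word dictionary from the text, mapping word -> vector index (0-based).
        String[] dict = {"bob", "like", "to", "play", "basketball",
                         "also", "football", "games", "jim", "too"};
        Map<String, Integer> index = new HashMap<String, Integer>();
        for (int i = 0; i < dict.length; i++) {
            index.put(dict[i], i);
        }

        // Hypothetical sample document; any other text would be handled the same way.
        String doc = "Bob like to play basketball Jim like too";

        int[] histogram = new int[dict.length];
        StringTokenizer tokenizer = new StringTokenizer(doc);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken().trim().toLowerCase();
            Integer i = index.get(word);
            if (i != null) {
                histogram[i]++;  // count occurrences; word order is discarded
            }
        }
        System.out.println(Arrays.toString(histogram));
        // Prints a 10-dimensional count vector, e.g. [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
    }
}

The positions of the vector follow the dictionary order, and, as noted above, the original word order is lost.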
Based on the BOW model and a sentiment dictionary, every word can be classified as positive, negative, or neutral. From the word counts over the whole document, we then judge the document's overall sentiment.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SentimentAnalysis {

    public static class Map extends Mapper<Object, Text, Text, Text> {

        // word -> sentiment label, loaded once per mapper from the dictionary file
        private HashMap<String, String> emotionDict = new HashMap<String, String>();

        @Override
        public void setup(Context context) throws IOException {
            Configuration configuration = context.getConfiguration();
            String dict = configuration.get("dict", "");
            BufferedReader reader = new BufferedReader(new FileReader(dict));
            String line = reader.readLine();
            while (line != null) {
                String[] word_emotion = line.split("\t");
                emotionDict.put(word_emotion[0].trim().toLowerCase(), word_emotion[1].trim());
                line = reader.readLine();
            }
            reader.close();
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the input file name as the key so sentiment counts are grouped per document.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String line = value.toString().trim();
            StringTokenizer tokenizer = new StringTokenizer(line);
            Text filename = new Text(fileName);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().trim().toLowerCase();
                if (emotionDict.containsKey(word)) {
                    context.write(filename, new Text(emotionDict.get(word)));
                }
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count how many times each sentiment label was emitted for this document.
            HashMap<String, Integer> map = new HashMap<String, Integer>();
            for (Text value : values) {
                String emotion = value.toString();
                int count = map.containsKey(emotion) ? map.get(emotion) : 0;
                map.put(emotion, count + 1);
            }
            StringBuilder builder = new StringBuilder();
            Iterator<String> iterator = map.keySet().iterator();
            while (iterator.hasNext()) {
                String emotion = iterator.next();
                int count = map.get(emotion);
                builder.append(emotion).append("\t").append(count);
                context.write(key, new Text(builder.toString()));
                builder.setLength(0);
            }
        }
    }

    // args[0] : input path
    // args[1] : output path
    // args[2] : path of the emotion dictionary
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // args[2] : path of the emotion dictionary (read locally in Mapper.setup)
        configuration.set("dict", args[2]);

        Job job = Job.getInstance(configuration);
        job.setJarByClass(SentimentAnalysis.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] : input path, args[1] : output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
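The mapper's setup() reads the dictionary as one word/label pair per line, separated by a tab character. Purely as an illustration (these specific words and labels are assumptions, not the contents of the actual emotionDict.txt), the file could look like:

good	positive
happy	positive
bad	negative
terrible	negative
okay	neutral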
Download the four sample documents:
Put the sample documents into the input/ folder.
Copy SentimentAnalysis.java, input, and emotionDict.txt into the same folder. On the command line, compile and package the files, upload the input to HDFS, and run:
hdfs dfs -put input /.                    # upload input/ to HDFS
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main *.java    # compile the Java files
jar cf SentimentAnalysis.jar *.class      # package the class files into a jar
chmod 777 emotionDict.txt                 # open up permissions on the dictionary file

# hadoop runs the jar program on the cluster
# "jar" tells hadoop what kind of program this is
# SentimentAnalysis.jar is the name of the jar file
# SentimentAnalysis is the name of the entry-point class
# /input is the HDFS input path, i.e. args[0]
# /output is the HDFS output path, i.e. args[1]
# ${PWD}/emotionDict.txt is the local emotionDict file, i.e. args[2]
hadoop jar SentimentAnalysis.jar SentimentAnalysis /input /output ${PWD}/emotionDict.txt

hdfs dfs -cat /output/*                   # view the output (the /output directory also contains an _SUCCESS marker)
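The reducer emits one line per (document, sentiment label) pair: the file name, the label, and its count, separated by tabs. Purely as an illustration (the file names and numbers are hypothetical), the output of hdfs dfs -cat /output/* could look like:

sample1.txt	positive	12
sample1.txt	negative	7
sample2.txt	positive	4
sample2.txt	neutral	3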
The sentiment score is computed as:
sentiment = (positive - negative) / (positive + negative)
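As a purely illustrative calculation using the hypothetical counts above (12 positive and 7 negative words), sentiment = (12 - 7) / (12 + 7) ≈ 0.26, a mildly positive document; the score ranges from -1 (all sentiment-bearing words negative) to +1 (all positive).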
The code above has been slightly modified. The original code comes from Cloudera's official example, Example: Sentiment Analysis Using MapReduce Custom Counters.