Part I: Word-frequency count that returns the top N
Text data to count:
what do you do
how do you do
how do you do
how are you
from operator import add

from pyspark import SparkContext


def sort_t():
    sc = SparkContext(appName="testWC")
    data = sc.parallelize(["what do you do", "how do you do", "how do you do", "how are you"])
    # Split into words, count each word, sort by count descending, keep the top 3.
    result = data.flatMap(lambda x: x.split(" ")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add) \
        .sortBy(lambda x: x[1], False) \
        .take(3)
    for k, v in result:
        print(k, v)


if __name__ == '__main__':
    sort_t()
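As a sanity check on the Spark job above, the same count can be reproduced locally without a cluster. This is only a minimal sketch using the standard library's `collections.Counter`; the helper name `top_n_words` is hypothetical and not part of the original post.

```python
from collections import Counter


def top_n_words(lines, n):
    """Return the n most frequent words as (word, count) pairs.

    Hypothetical helper: mirrors flatMap -> map -> reduceByKey -> sortBy -> take
    from the Spark version, but runs in a single local process.
    """
    counts = Counter()
    for line in lines:
        # Same tokenisation as the Spark job: split on single spaces.
        counts.update(line.split(" "))
    return counts.most_common(n)


lines = ["what do you do", "how do you do", "how do you do", "how are you"]
for word, count in top_n_words(lines, 3):
    print(word, count)
```

For this input the three most frequent words are `do` (6), `you` (4) and `how` (3), which is what the Spark version should also report.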
Part II: Calling the sort operation and returning the top N
Sample data, numbers_data.txt:
15561 112 -40 51467112 234 8561 112 -34 53467111 121 2345 789 34 14561 -21 12112 101 100 -4 23 51467111 2434 15567 132 -14 51467111 237
from pyspark import SparkContext


def solve():
    sc = SparkContext(appName="Sort_test_example")
    lines = sc.textFile("../input/numbers_data.txt")
    # Parse every token as an integer key, sort keys in descending order, keep the top 3.
    results = lines.flatMap(lambda x: x.split(" ")) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey(ascending=False) \
        .take(3)
    for key, value in results:
        print(key)


if __name__ == '__main__':
    solve()
Note: duplicates are not collapsed. If several values tie, each tied occurrence is returned as a separate entry.
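The tie behaviour in the note can be illustrated locally. The sketch below is an assumption-laden illustration, not the post's method: `top_n` is a hypothetical helper, the `sample` list is made-up data rather than the contents of numbers_data.txt, and `heapq.nlargest` stands in for the Spark sort. In Spark itself, calling `distinct()` on the RDD before sorting should give the de-duplicated variant.

```python
import heapq


def top_n(numbers, n, distinct=False):
    """Return the n largest values in descending order.

    Hypothetical helper. With distinct=False, tied values each take a slot,
    as in the sortByKey(...).take(n) pipeline. With distinct=True, duplicates
    are removed first, so every slot holds a different value.
    """
    values = set(numbers) if distinct else numbers
    return heapq.nlargest(n, values)


# Made-up sample in which the largest value appears three times.
sample = [7, 9, 9, 3, 9, 8]
print(top_n(sample, 3))                 # ties fill all three slots
print(top_n(sample, 3, distinct=True))  # one slot per distinct value
```

Whether ties should occupy several slots depends on what "top N" is meant to answer: the N largest records, or the N largest distinct values.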