Word frequency counting with PySpark and returning the top N

Part I: Word frequency count and top-N algorithm

Text data to count:

what do you do
how do you do
how do you do
how are you

from operator import add

from pyspark import SparkContext


def sort_t():
    sc = SparkContext(appName="testWC")
    data = sc.parallelize(["what do you do", "how do you do", "how do you do", "how are you"])
    # Split each line into words, count every word, then sort by count
    # in descending order and keep the first 3 (word, count) pairs.
    result = data.flatMap(lambda x: x.split(" ")) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add) \
        .sortBy(lambda x: x[1], ascending=False) \
        .take(3)
    for k, v in result:
        print(k, v)
    sc.stop()


if __name__ == '__main__':
    sort_t()
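
On small inputs the Spark result can be cross-checked without a cluster. The following is a minimal plain-Python sketch, not part of the original job: `top_n_words` is a hypothetical helper name, and it relies on `collections.Counter` from the standard library. Words with equal counts may come back in a different order than Spark returns them.

```python
from collections import Counter


def top_n_words(lines, n):
    """Count space-separated words across lines and return the n most frequent."""
    counts = Counter()
    for line in lines:
        # Same tokenization as the Spark job: split on a single space.
        counts.update(line.split(" "))
    return counts.most_common(n)


lines = ["what do you do", "how do you do", "how do you do", "how are you"]
for word, count in top_n_words(lines, 3):
    print(word, count)
```

For the four sample lines this prints do 6, you 4 and how 3, which is what the Spark job should also report.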

Part II: Calling the sort algorithm and returning the top N

Sample data, numbers_data.txt:

15561
112
-40
51467112
234
8561
112
-34
53467111 121
2345 789 34
14561 -21
12112 101 100
-4 23
51467111
2434
15567
132
-14
51467111
237

from pyspark import SparkContext


def solve():
    sc = SparkContext(appName="Sort_test_example")
    lines = sc.textFile("../input/numbers_data.txt")
    # A line may hold several numbers. split() with no argument splits on
    # any whitespace and skips empty strings, so int() never sees "".
    results = lines.flatMap(lambda x: x.split()) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey(ascending=False) \
        .take(3)
    for key, value in results:
        print(key)
    sc.stop()


if __name__ == '__main__':
    solve()

Note: when ties occur, multiple tied numbers are returned.
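
One reading of this note is that every value tied with the N-th largest should be returned. `take(3)` stops after three elements, so a tie that straddles the cutoff would be cut short. The sketch below illustrates that reading in plain Python on a few lines of the sample data. It is an assumption about the intended behavior, not the original job's logic, and `top_n_with_ties` is a hypothetical helper name.

```python
def top_n_with_ties(numbers, n):
    """Return the n largest values plus any further values equal to the n-th one."""
    ordered = sorted(numbers, reverse=True)
    if n <= 0 or not ordered:
        return []
    if n >= len(ordered):
        return ordered
    # Everything at or above the n-th largest value is kept,
    # so duplicates of the cutoff value are not dropped.
    cutoff = ordered[n - 1]
    return [x for x in ordered if x >= cutoff]


# A few lines taken from numbers_data.txt; a line may hold several numbers.
sample_lines = ["51467112", "53467111 121", "51467111", "51467111", "-4 23"]
numbers = [int(token) for line in sample_lines for token in line.split()]
print(top_n_with_ties(numbers, 3))
```

Here 51467111 appears twice and ties for third place, so four values come back instead of three.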
