Tokenizing on punctuation and counting word frequencies

Date: 2019-12-08


Suppose we have the following passage of text:

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

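I just want to see which words in it occur most often and which occur least often.

I need to do two things:

1. First, tokenize. We simply split on punctuation and whitespace.

2. Then count how often each word occurs.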
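The two steps above can be tried on a short string first. This is only a minimal sketch: `sample` is a made-up sentence, not the passage quoted earlier.

```python
import re
from collections import Counter

sample = "The cat sat. The dog sat, too!"  # hypothetical sample text

# Step 1: split on runs of non-word characters (punctuation and whitespace).
tokens = re.split(r"\W+", sample.lower())
print(tokens)  # the trailing "!" leaves an empty string at the end

# Drop empty strings so they are not counted as words.
tokens = [t for t in tokens if t]

# Step 2: count how often each token occurs.
counts = Counter(tokens)
print(counts["the"], counts["sat"], counts["dog"])
```

Note that `re.split` returns an empty string whenever the text starts or ends with a separator, which is why the sketch filters the token list before counting.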

import re
from collections import Counter


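def count_words(text):
    """Count how many times each word occurs in text (case-insensitive)."""
    # Convert to lower case so "The" and "the" count as the same word.
    text_lower = text.lower()
    # Split on runs of non-word characters (punctuation and whitespace);
    # drop the empty strings re.split leaves at the start or end.
    tokens = [t for t in re.split(r'\W+', text_lower) if t]
    return Counter(tokens)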


def test_run():
    # Read the passage; assumes text.txt is saved as UTF-8 next to this script.
    with open("text.txt", "r", encoding="utf-8") as f:
        text = f.read()

    counts = count_words(text)
    # Sort (word, count) pairs by count, highest first.
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))

    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))


if __name__ == '__main__':
    test_run()
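
The manual sort above could probably also be written with `Counter.most_common`, which returns the pairs ordered from highest to lowest count. A small sketch, where `top_and_bottom` is a hypothetical helper name and the input list is made up:

```python
from collections import Counter


def top_and_bottom(counts, n=10):
    """Return the n most common and n least common (word, count) pairs."""
    ranked = counts.most_common()  # all pairs, highest count first
    return ranked[:n], ranked[-n:]


counts = Counter(["a", "b", "a", "c", "a", "b"])  # hypothetical tokens
most, least = top_and_bottom(counts, n=1)
print(most)
print(least)
```

With short texts the two slices can overlap, since a passage with fewer than 2n distinct words does not have n separate entries at each end.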