Text to process
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
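Requirements
- Read the input file
- Remove all punctuation and newline characters, and convert every uppercase letter to lowercase
- Merge identical words, count how often each word occurs, and sort the words by frequency from highest to lowest
- Write the result to the file out.txt, one word per line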
Code implementation
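import re


def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split the text into a list of words
    word_list = text.split(' ')
    # Drop the empty strings produced by consecutive spaces
    word_list = filter(None, word_list)
    # Build a dictionary that maps each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_word_cnt


# Read the whole input file into memory
with open('in.txt', 'r', encoding='utf-8') as fin:
    text = fin.read()

word_and_freq = parse(text)

# Write one "word frequency" pair per line
with open('out.txt', 'w', encoding='utf-8') as fout:
    for word, freq in word_and_freq:
        fout.write(f'{word} {freq}\n')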
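
The counting and sorting steps can also be handed to the standard library. The following is a minimal sketch, not a replacement for the script above: it assumes `collections.Counter` is acceptable for the task, and the name `count_words` and the `sample` string are made up for illustration. It works on an in-memory string so that it runs without in.txt.

```python
import re
from collections import Counter


def count_words(text):
    # Replace every character that is neither a word character nor
    # whitespace with a space, then lowercase the result.
    cleaned = re.sub(r'[^\w\s]', ' ', text).lower()
    # split() with no argument splits on any run of whitespace (including
    # newlines), so no separate filter for empty strings is needed.
    words = cleaned.split()
    # most_common() returns (word, count) pairs ordered by count, descending.
    return Counter(words).most_common()


# Hypothetical sample input: the closing line of the text above.
sample = "Free at last! Free at last! Thank God Almighty, we are free at last!"
for word, freq in count_words(sample):
    print(word, freq)
```

As in the original script, an apostrophe is treated as punctuation, so "God's" is counted as the two words "god" and "s". Whether that is acceptable depends on how strictly the requirement "remove all punctuation" is read.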