English stemming. NLTK ships the Porter stemmer; for example, 'lying' stems to 'lie':

    import nltk

    porter = nltk.PorterStemmer()
    print(porter.stem('lying'))  # lie
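Part-of-speech tagger. pos_tag takes a sequence of words and picks each tag from the sentence context, so the same word can receive different tags in different sentences:

    import nltk

    # Needs the tokenizer and tagger models, installed through nltk.download()
    text = nltk.word_tokenize("And now for something completely different")
    print(nltk.pos_tag(text))

Common tags: CC coordinating conjunction, RB adverb, IN preposition, NN noun, JJ adjective.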
Building your own POS-tagged corpus. nltk.tag.str2tuple('fly/NN') converts a 'word/tag' string into a (word, tag) tuple. The Brown corpus exposes its tagged words through nltk.corpus.brown.tagged_words().
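As a small illustration of the conversion above, the sketch below turns a hand-tagged string into a list of tuples and back. The sample sentence and its tags are made up for the example; str2tuple and tuple2str are the NLTK helpers.

```python
import nltk

# A hand-tagged sentence in "word/tag" form (illustrative data, not from a corpus)
tagged_text = "The/AT grand/JJ jury/NN commented/VBD ./."

# Split on whitespace, then convert each "word/tag" string to a (word, tag) tuple
tagged_tokens = [nltk.tag.str2tuple(t) for t in tagged_text.split()]
print(tagged_tokens)

# tuple2str performs the reverse conversion
round_trip = " ".join(nltk.tag.tuple2str(pair) for pair in tagged_tokens)
print(round_trip)
```

Neither call needs any downloaded corpus data, so this is a convenient way to prepare a small training set by hand.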
Chinese corpora in NLTK. Run nltk.download() and, under Corpora, install sinica_treebank, the treebank from Academia Sinica in Taiwan.
    import nltk

    # Print every word in the Sinica Treebank together with its tag
    for word, tag in nltk.corpus.sinica_treebank.tagged_words():
        print(word, tag)
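Chinese word segmentation with jieba (https://github.com/fxsjy/jieba): it segments your own Chinese text into words and can tag parts of speech automatically.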
Automatic POS tagging. The default tagger, DefaultTagger, gives every token the same tag, normally the most frequent tag in the corpus.
    import nltk

    # Every token receives the tag 'NN'
    default_tagger = nltk.DefaultTagger('NN')
    raw = '我 好 想 你'
    tokens = nltk.word_tokenize(raw)
    tags = default_tagger.tag(tokens)
    print(tags)
Regular-expression tagger. RegexpTagger assigns a tag to each word that matches a given regular expression.
    import nltk

    # Words ending in 們 are tagged PRO; words matching no pattern are tagged None
    pattern = [(r'.*們$', 'PRO')]
    tagger = nltk.RegexpTagger(pattern)
    print(tagger.tag(nltk.word_tokenize('咱們 一塊兒 去 大家 和 他們 去過 的 地方')))
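Lookup tagger. Record the most likely tag for each of the most frequent words in a corpus, tag new text by looking words up in that table, and let a default tagger handle the remaining words (backoff).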
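A minimal sketch of a lookup tagger, assuming a tiny hand-made list of tagged words in place of a real corpus. The names tagged_words, likely_tags and lookup_tagger are chosen for this example; passing model= to UnigramTagger makes it use the table directly instead of training.

```python
import nltk

# Toy tagged data standing in for a real corpus (illustrative only)
tagged_words = [
    ("the", "DT"), ("dog", "NN"), ("the", "DT"),
    ("run", "VB"), ("run", "NN"), ("run", "VB"), ("cat", "NN"),
]

# Find the two most frequent words and the most likely tag for each
fd = nltk.FreqDist(word for word, _ in tagged_words)
cfd = nltk.ConditionalFreqDist(tagged_words)
most_freq_words = [word for word, _ in fd.most_common(2)]
likely_tags = {word: cfd[word].max() for word in most_freq_words}

# Look words up in the table; anything else falls back to the default tag 'NN'
lookup_tagger = nltk.UnigramTagger(model=likely_tags,
                                   backoff=nltk.DefaultTagger("NN"))
result = lookup_tagger.tag(["the", "cat", "run"])
print(result)
```

With a real corpus the table would typically hold the top few hundred words, which trades a small amount of memory for a large gain over the default tagger alone.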
Unigram tagging. Train a model on an already-tagged corpus, then use the model to tag new text.
    import nltk

    tagged_sents = [[('我', 'PRO'), ('小兔', 'NN')]]
    unigram_tagger = nltk.UnigramTagger(tagged_sents)
    sents = [['我', '你', '小兔']]
    # To train on the Brown news category instead:
    # brown_tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')
    # unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
    # sents = nltk.corpus.brown.sents(categories='news')
    tags = unigram_tagger.tag(sents[0])
    print(tags)  # words never seen in training are tagged None
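Bigram and n-gram tagging. UnigramTagger looks only at the current word and ignores context. BigramTagger also takes the tag of the preceding word into account, and TrigramTagger the tags of the two preceding words.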
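The price of using context is sparse data, which the sketch below tries to show with a one-sentence training set (made up for this example). A BigramTagger with no backoff only recognizes (previous tag, word) pairs it has seen, so one unknown word also breaks the context for the word after it.

```python
import nltk

# One hand-tagged training sentence (illustrative data)
train_sents = [[("I", "PRO"), ("saw", "VB"), ("rabbits", "NN")]]

# A bigram tagger on its own, with no backoff tagger behind it
bigram_only = nltk.BigramTagger(train_sents)

# "like" was never seen, so it gets None; "rabbits" then follows a None tag,
# a context that never occurred in training, so it gets None as well
new_sentence = ["I", "like", "rabbits"]
bigram_tags = bigram_only.tag(new_sentence)
print(bigram_tags)
```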
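Combined taggers. Chaining several taggers through backoff improves both accuracy and coverage.
Storing a tagger. Once trained, a tagger can be persisted to disk and loaded again later.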
    import nltk
    from pickle import dump, load

    train_sents = [[('我', 'PRO'), ('小兔', 'NN')]]
    # Backoff chain: bigram -> unigram -> default
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    sents = [['我', '你', '小兔']]
    tags = t2.tag(sents[0])
    print(tags)

    # Save the trained tagger to disk
    print(t2)
    with open('t2.pkl', 'wb') as output:
        dump(t2, output, -1)

    # Load it back
    with open('t2.pkl', 'rb') as infile:
        tagger = load(infile)
    print(tagger)
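Machine learning has two phases. Training builds a model by statistical learning from known data; using the model applies it to compute results for unknown data. In supervised learning the training samples carry known labels, and the model uses them to judge new data. In unsupervised learning the training samples carry no labels, and the model arrives at conclusions on its own. The hardest part is choosing the algorithm.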
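Bayes. In probability theory, Bayes' theorem relates the conditional probabilities of random events. Formula: P(B|A) = P(A|B)P(B)/P(A). Given P(A|B), P(A) and P(B), we can compute P(B|A). A naive Bayes classifier: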
    import nltk

    # Each sample is (feature dict, label)
    my_train_set = [
        ({'feature1': 'a'}, '1'),
        ({'feature1': 'a'}, '2'),
        ({'feature1': 'a'}, '3'),
        ({'feature1': 'a'}, '3'),
        ({'feature1': 'b'}, '2'),
        ({'feature1': 'b'}, '2'),
        ({'feature1': 'b'}, '2'),
        ({'feature1': 'b'}, '2'),
        ({'feature1': 'b'}, '2'),
        ({'feature1': 'b'}, '2'),
    ]
    classifier = nltk.NaiveBayesClassifier.train(my_train_set)
    print(classifier.classify({'feature1': 'a'}))
    print(classifier.classify({'feature1': 'b'}))
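To connect the formula to the data, the sketch below applies Bayes' theorem by hand to the same ten samples, using raw counts and exact fractions. The function name posterior is made up for this example. NLTK's classifier smooths its estimates, so its internal probabilities will not match these numbers exactly.

```python
from collections import Counter
from fractions import Fraction

# The ten (feature value, label) pairs from the toy training set
samples = [("a", "1"), ("a", "2"), ("a", "3"), ("a", "3")] + [("b", "2")] * 6


def posterior(feature_value):
    """Return P(label | feature_value) for every label, via Bayes' theorem."""
    n = len(samples)
    label_counts = Counter(label for _, label in samples)
    value_count = sum(1 for value, _ in samples if value == feature_value)
    p_value = Fraction(value_count, n)  # P(A)
    result = {}
    for label, count in label_counts.items():
        p_label = Fraction(count, n)  # P(B)
        joint = sum(1 for value, lab in samples
                    if value == feature_value and lab == label)
        p_value_given_label = Fraction(joint, count)  # P(A|B)
        result[label] = p_value_given_label * p_label / p_value  # P(B|A)
    return result


post_a = posterior("a")
post_b = posterior("b")
print(post_a)
print(post_b)
print(max(post_a, key=post_a.get), max(post_b, key=post_b.get))
```

For feature value a, label 3 has the highest unsmoothed posterior (1/2); for b, every sample is labelled 2.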
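Classification. The most important step is feature selection: knowing which features best reflect what distinguishes the classes. For document classification these are the words most characteristic of each category. Feature extraction, finding the most informative features: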
    import random

    import nltk
    from nltk.corpus import movie_reviews

    # One (word list, category) pair per review, in random order
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    # Use the 2000 most frequent words as candidate features
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [word for (word, freq) in all_words.most_common(2000)]

    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features

    featuresets = [(document_features(d), c) for (d, c) in documents]
    # classifier = nltk.NaiveBayesClassifier.train(featuresets)
    # classifier.classify(document_features(d))
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)
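Other tasks can be framed as classification. POS tagging: classify each word using its surrounding context. Sentence segmentation: classify each punctuation mark as sentence-ending or not; take text that is already split into sentences, merge it into one list of tokens, and extract features from it. Identifying dialogue acts: greeting, question, answer, assertion, clarification. Recognizing textual entailment: decide whether one sentence supports the conclusion of another, with true/false labels.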
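A minimal sketch of sentence segmentation as punctuation classification, assuming a toy token list with hand-marked boundaries. The feature names, the helper punct_features and the data are illustrative choices, and a set this small only demonstrates the mechanics.

```python
import nltk

# Toy token stream; the period inside "3 . 5" does not end a sentence
tokens = "He left . She stayed at 3 . 5 percent . Then she left .".split()
boundaries = {2, 10, 14}  # hand-marked indices of sentence-ending periods


def punct_features(tokens, i):
    """Describe the punctuation mark at position i by its neighbours."""
    return {
        "next-word-capitalized": tokens[i + 1][0].isupper(),
        "prev-word": tokens[i - 1].lower(),
        "punct": tokens[i],
        "prev-word-is-one-char": len(tokens[i - 1]) == 1,
    }


# One labelled sample per punctuation mark that has a neighbour on both sides
candidates = [i for i in range(1, len(tokens) - 1) if tokens[i] in {".", "?", "!"}]
featuresets = [(punct_features(tokens, i), i in boundaries) for i in candidates]
labels = [label for _, label in featuresets]

classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Classify the period in an unseen token list
unseen = "It rained . We stayed".split()
print(classifier.classify(punct_features(unseen, 2)))
```

The final token is left out of the candidates because it has no following word to inspect; a fuller version would treat the end of the text as a boundary by definition.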