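The previous post was "How do you extract feature information from text?". The first problem in text analysis is turning unstructured text into structured information, and the key step is feature extraction. In that post we used the fit and transform methods of scikit-learn to extract features from text data.
Still, fit and transform may remain a little fuzzy. I recently reread Applied Text Analysis with Python (don't be surprised, 82 pages go by quickly; I had always thought the book was 82 pages long and only today found out that the full version runs to more than 400 pages). Drawing mainly on the book's code and my own understanding, I implemented fit and transform myself, to make feature extraction in text analysis easier to understand.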
1. scikit-learn code examples
What fit does: it builds a vocabulary (a dictionary of feature words) from the text data.
1.1 A fit example
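    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["Hey hey hey lets go get lunch today :)",
              "Did you go home?",
              "Hey!!! I need a favor"]

    vectorize = CountVectorizer()
    # fit learns every word in the corpus and builds the vocabulary
    vectorize.fit(corpus)
    # Inspect the "vocabulary", i.e. the feature set (11 feature words).
    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions older than 1.0 call get_feature_names() instead.
    print(vectorize.get_feature_names_out().tolist())

    ['did', 'favor', 'get', 'go', 'hey', 'home', 'lets', 'lunch', 'need', 'today', 'you']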
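1.2 A transform example
Using the fitted vectorize object, we encode corpus against the vocabulary. To make the result easier to read, we wrap the output in a pandas DataFrame.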
    import pandas as pd

    # Encode each document as a row of word counts
    dtm = vectorize.transform(corpus)
    columns_name = vectorize.get_feature_names_out()
    series = dtm.toarray()
    print(pd.DataFrame(series, columns=columns_name))
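In the DataFrame above, each row represents a document and each column represents a feature word. For example, in the first row the cell under the hey column is 3, which means that hey occurs three times in the first document of corpus (Hey hey hey lets go get lunch today :)).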
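To make the two steps concrete, here is a minimal pure-Python sketch of what fit and transform do on the same corpus. It assumes CountVectorizer's default behaviour (lowercasing, plus a token pattern that keeps only tokens of two or more word characters), which is why "I" and "a" are missing from the 11 features. The names tokenize, toy_fit and toy_transform are hypothetical and are not part of scikit-learn.

```python
import re

# Assumption: mirrors CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(doc.lower())

def toy_fit(corpus):
    """Hypothetical helper: collect the sorted list of distinct tokens."""
    return sorted({token for doc in corpus for token in tokenize(doc)})

def toy_transform(corpus, vocab):
    """Hypothetical helper: one row of counts per document, one column per word."""
    rows = []
    for doc in corpus:
        tokens = tokenize(doc)
        rows.append([tokens.count(word) for word in vocab])
    return rows

corpus = ["Hey hey hey lets go get lunch today :)",
          "Did you go home?",
          "Hey!!! I need a favor"]

vocab = toy_fit(corpus)
matrix = toy_transform(corpus, vocab)
print(vocab)
print(matrix[0])  # counts for the first document
```

The real CountVectorizer returns a sparse matrix rather than nested lists, so this sketch only illustrates the counting logic.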
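2. Implementing fit and transform
Approach:
1. Tokenize the input text (we assume English here).
2. Check whether each token is punctuation, and guard against tokens such as "good_enough" that contain non-letter characters in the middle.
3. Remove stop words such as "a" and "the".
4. Reduce each word to its stem or lemma.
5. After the cleaning in steps 1-4, output a clean list of words.
6. From that word list, build the vocabulary of feature words. This needs a container that stores each word the first time it appears.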
2.1 Tokenization
Here we use the word_tokenize function from nltk.tokenize directly.

    from nltk.tokenize import word_tokenize
    word_tokenize("Today is a beautiful day!")

    ['Today', 'is', 'a', 'beautiful', 'day', '!']

The result above contains "!", so the next step is to check whether each token is actually a word.
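2.2 Detecting punctuation
In Applied Text Analysis with Python, the code that decides whether a token is punctuation is:

    import unicodedata

    def is_punct(token):
        return all(unicodedata.category(char).startswith('P') for char in token)

A quick test shows that unicodedata.category returns "Po" for a punctuation mark such as "!".

    import unicodedata
    # Test with "!"
    unicodedata.category('!')

    Po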
all(data) is a Python built-in. It returns True when every element of data is truthy (or when data is empty) and False otherwise.

    print(all([True, False]))
    print(all([True, True]))

    False
    True
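A few quick checks, using only the standard library, show how this test behaves at the edges. Two things are worth noting: currency and math symbols belong to the S categories and so are not treated as punctuation, and an empty string passes because all() over an empty iterable is True. The sample tokens are chosen for illustration and are not from the book.

```python
import unicodedata

def is_punct(token):
    # True only when every character has a Unicode category starting with 'P'
    return all(unicodedata.category(char).startswith('P') for char in token)

print(is_punct('!'))            # '!' is in category 'Po'
print(is_punct(':)'))           # ':' is 'Po' and ')' is 'Pe'
print(is_punct('good_enough'))  # the letters are 'Ll', so not all punctuation
print(is_punct('$'))            # '$' is 'Sc', a symbol rather than punctuation
print(is_punct(''))             # all() of an empty iterable is True
```

If empty tokens can reach this function, adding a check such as "token and ..." in front of the all() call is one way to reject them.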
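2.3 Stop words
nltk provides a rich set of text analysis tools, including a stop word list. The list is entirely lowercase, so the token has to be lowercased before the lookup.

    import nltk

    # Requires the stop word corpus; run nltk.download('stopwords') once
    def is_stopword(token):
        stopwords = nltk.corpus.stopwords.words('english')
        return token.lower() in stopwords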
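2.4 Stemming and lemmatization
Variant forms of a word (singular and plural, different tenses, different voices, and so on) are merged into one canonical form. There are two ways to do this, stem and lemmatize. Let's look at each in turn.
2.4.1 stem

    import nltk

    def stem(token):
        stemmer = nltk.stem.SnowballStemmer('english')
        return stemmer.stem(token)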
2.4.2 lemmatize

    from nltk.corpus import wordnet as wn
    from nltk.stem.wordnet import WordNetLemmatizer

    def lemmatize(token, pos_tag):
        lemmatizer = WordNetLemmatizer()
        # Map the first letter of the POS tag to a WordNet part of speech
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(pos_tag[0])
        if tag:
            return lemmatizer.lemmatize(token.lower(), tag)
        else:
            return None

    print(stem('better'))
    print(lemmatize('better', 'JJ'))

    better
    good

In this example lemmatize is the more accurate of the two: it maps "better" to "good", while stem leaves it unchanged. For small data sets, where precision matters most, I personally recommend lemmatize.
2.5 Cleaning the data

    def clean(document):
        return [lemmatize(token, tag)
                for (token, tag) in nltk.pos_tag(word_tokenize(document))
                if not is_punct(token) and not is_stopword(token)]

    print(clean('He was a soldier 20 years ago!'))

    ['soldier', None, 'year', 'ago']

The result contains None, which is not acceptable. The cause appears to be lemmatize, which returns None whenever the POS tag does not start with N, V, R or J (the number 20 is presumably such a case). So we add one more condition:

    def clean(document):
        return [lemmatize(token, tag)
                for (token, tag) in nltk.pos_tag(word_tokenize(document))
                if not is_punct(token)
                and not is_stopword(token)
                and lemmatize(token, tag)]

    print(clean('He was a soldier 20 years ago!'))

    ['soldier', 'year', 'ago']
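The version above calls lemmatize twice per token, once in the filter and once for the result. A minimal sketch of computing it once and filtering afterwards follows. Here toy_lemmatize is a hypothetical stand-in that only lowercases the token, and the tagged pairs are hand-written rather than produced by nltk.pos_tag, so the example runs without NLTK.

```python
def toy_lemmatize(token, pos_tag):
    """Hypothetical stand-in for lemmatize(): None for unsupported tags."""
    supported = {'N', 'V', 'R', 'J'}
    return token.lower() if pos_tag[0] in supported else None

def clean_tagged(tagged_tokens):
    """Lemmatize each (token, tag) pair once and drop the None results."""
    lemmas = (toy_lemmatize(token, tag) for token, tag in tagged_tokens)
    return [lemma for lemma in lemmas if lemma is not None]

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output
tagged = [('soldier', 'NN'), ('20', 'CD'), ('years', 'NNS'), ('ago', 'RB')]
print(clean_tagged(tagged))
```

Because the stand-in does not really lemmatize, "years" stays plural here; with the real lemmatize function the same structure would apply unchanged.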
2.6 Building the vocabulary: fit
We need to extract every feature word from the text to be analysed and store it in a vocabulary list. The idea: whenever a word appears that is not yet in the list vocab, append it to vocab.

    def fit(X, y=None):
        vocab = []
        for doc in X:
            for token in clean(doc):
                # Only words not seen before are added
                if token not in vocab:
                    vocab.append(token)
        return vocab

    X = ["The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes",
         "Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats",
         "Wondering, she opened the door to the studio.\nHaha!good"]

    print(fit(X))

    ['elephant', 'sneeze', 'sight', 'potatoes.its', 'interesting', 'thing', 'potato', 'bat', 'see', 'echolocation', 'wondering', 'open', 'door', 'studio', 'haha', 'good']

The vocabulary is now built.
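Checking "token not in vocab" against a list scans the whole list each time, which gets slow as the vocabulary grows. One possible alternative, sketched here under the assumption that the documents have already been cleaned into token lists, is a dict that maps each token to its column index. The function name fit_tokens and the sample tokens are hypothetical.

```python
def fit_tokens(token_lists):
    """Map each distinct token to its column index, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault only assigns an index the first time a token is seen
            vocab.setdefault(token, len(vocab))
    return vocab

# Hand-written token lists standing in for the output of clean()
cleaned = [['elephant', 'sneeze', 'sight', 'sight'],
           ['bat', 'see', 'see', 'sight', 'sneeze']]
vocab = fit_tokens(cleaned)
print(vocab)
```

Since Python 3.7 a dict keeps insertion order, so list(vocab) still gives the words in the order they were first seen.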
2.7 Encoding the text: transform
With the vocabulary list in place, we can start encoding the text. The idea is simple: compare each document's tokens against the vocabulary one word at a time, and record how many times each feature word occurs.

    def transform(documents):
        vocab = fit(documents)
        for doc in documents:
            result = []
            tokens = clean(doc)
            # One count per vocabulary word, in vocabulary order
            for va in vocab:
                result.append(tokens.count(va))
            yield result

    documents = ["The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes",
                 "Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats",
                 "Wondering, she opened the door to the studio.\nHaha!good"]

    print(list(transform(documents)))

    [[1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 1, 0, 0, 0, 0, 3, 2, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]
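Note that transform above rebuilds the vocabulary by calling fit, and that tokens.count rescans the token list once per vocabulary word. A minimal sketch of an alternative, assuming the vocabulary is passed in and the documents are already cleaned into token lists, counts each document once with collections.Counter. The name transform_tokens and the sample data are hypothetical.

```python
from collections import Counter

def transform_tokens(token_lists, vocab):
    """Yield one count vector per document, with columns ordered as in vocab."""
    for tokens in token_lists:
        counts = Counter(tokens)  # count every token in a single pass
        # Counter returns 0 for missing keys, so absent words need no special case
        yield [counts[word] for word in vocab]

vocab = ['elephant', 'sneeze', 'sight', 'bat', 'see']
cleaned = [['elephant', 'sneeze', 'sight', 'sight'],
           ['bat', 'see', 'see', 'sight', 'sneeze']]
rows = list(transform_tokens(cleaned, vocab))
print(rows)
```

Passing the vocabulary in also means new documents can be encoded against a vocabulary learned earlier, which is how fit and transform are separated in scikit-learn.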
3. Full version
Now we merge the code above into a TextExtractFeature class.
    import nltk
    import unicodedata
    from collections import defaultdict
    from nltk.corpus import wordnet as wn
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tokenize import word_tokenize


    class TextExtractFeature(object):

        def __init__(self, language='english'):
            # Load the stop words once and keep them in a set for fast lookup
            self.stopwords = set(nltk.corpus.stopwords.words(language))
            self.lemmatizer = WordNetLemmatizer()

        def is_punct(self, token):
            return all(unicodedata.category(char).startswith('P') for char in token)

        def is_stopword(self, token):
            return token.lower() in self.stopwords

        def lemmatize(self, token, pos_tag):
            tag = {
                'N': wn.NOUN,
                'V': wn.VERB,
                'R': wn.ADV,
                'J': wn.ADJ
            }.get(pos_tag[0])
            if tag:
                return self.lemmatizer.lemmatize(token.lower(), tag)
            else:
                return None

        def clean(self, document):
            return [self.lemmatize(token, tag).lower()
                    for (token, tag) in nltk.pos_tag(word_tokenize(document))
                    if not self.is_punct(token)
                    and not self.is_stopword(token)
                    and self.lemmatize(token, tag)]

        def fit(self, X, y=None):
            self.y = y
            self.vocab = []
            self.feature_names = defaultdict(int)
            for doc in X:
                for token in self.clean(doc):
                    if token not in self.vocab:
                        # The index of a new word is the current vocabulary size
                        self.feature_names[token] = len(self.vocab)
                        self.vocab.append(token)

        def get_feature_names(self):
            return self.feature_names

        def transform(self, documents):
            for idx, doc in enumerate(documents):
                result = []
                tokens = self.clean(doc)
                for va in self.vocab:
                    result.append(tokens.count(va))
                # If labels were given to fit, append the label as the last column
                if self.y:
                    result.append(self.y[idx])
                yield result
    documents = [
        "The elephant sneezed at the sight of potatoes.Its very interesting thing.\nBut at the sight of potatoes",
        "Bats can see via echolocation. See the bat sight sneeze!\nBut it is a bats",
        "Wondering, she opened the door to the studio.\nHaha!good",
    ]
    y = [1, 1, 1]

    tef = TextExtractFeature(language='english')
    # Build the vocabulary
    tef.fit(documents, y)
    # Print the vocabulary mapping, i.e. the feature words
    print(tef.get_feature_names())
    for s in tef.transform(documents):
        print(s)

    defaultdict(<class 'int'>, {'elephant': 0, 'sneeze': 1, 'sight': 2, 'potatoes.its': 3, 'interesting': 4, 'thing': 5, 'potato': 6, 'bats': 7, 'see': 8, 'echolocation': 9, 'bat': 10, 'wondering': 11, 'open': 12, 'door': 13, 'studio': 14, 'haha': 15, 'good': 16})
    [1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    [0, 1, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, 1]
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

The last element of each row is the label taken from y; the 17 values before it are the word counts.