Python Library Usage Notes

Date: 2019-12-07


1. Environment setup

- Python installer: www.python.org
- Microsoft Visual C++ Compiler for Python
- pip (get-pip.py): pip.pypa.io/en/latest/installing.html
  - pip install PACKAGE -- install a package (.whl, .tar.gz, .zip)
  - pip uninstall PACKAGE -- uninstall a package
  - pip show --files PACKAGE -- show details of an installed package
  - pip list --outdated -- list packages that have updates available
  - pip install -U PACKAGE -- upgrade a package
  - pip search PACKAGE -- search for a package (PyPI has since disabled the search API this command relies on)
  - pip help -- show help
- Anaconda: continuum.io/downloads
  - conda install PACKAGE
  - conda uninstall PACKAGE
- Some commonly used packages: www.lfd.uci.edu/~gohlke/pythonlibs/

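  - re: regular-expression matching; os: file operations; random: random numbers; time: timestamps
  - requests: HTTP interaction (GET & POST)
  - beautifulsoup4: parsing of well-formed static web pages
  - mysqlclient, PyMySQL: MySQL database interfaces
  - numpy+mkl: matrix operations
  - scipy: scientific computing (interpolation, integration, optimization, image processing, special functions)
  - matplotlib: plotting
  - scikit-learn: machine learning library
  - jieba, smallseg: Chinese word segmentation https://pypi.python.org/pypi/jieba/
  - pandas: large-scale data processing
  - nltk: provides more than 50 corpora and lexical resources (classification, tokenization, stemming, parsing, semantic reasoning)
  - Pattern: a collection of natural language processing tools: part-of-speech tagger, n-gram search, sentiment analysis, WordNet; it also supports machine learning with a vector space model, clustering, and support vector machines.
  - TextBlob: text processing (part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation)
  - Gensim: topic modeling, document indexing, and similarity retrieval over large corpora; it can handle input larger than memory
  - PyNLPl
  - spaCy: commercial open-source software; an NLP toolkit that combines Python and Cython
  - Polyglot: processing for large-scale multilingual applications (tokenization for 165 languages, language detection for 196, named entity recognition for 40, part-of-speech tagging for 16, sentiment analysis for 136, embeddings for 137, morphological analysis for 135, and transliteration for 69)
  - MontyLingua: a free, powerful, end-to-end English text processor, suited to information retrieval and extraction, request processing, and question answering; it can extract semantic information such as part-of-speech tags and named entities.
  - BLLIP Parser: a statistical natural language parser that combines a generative constituent parser with a maximum-entropy reranker
  - Quepy: transforms natural language questions into queries in a database query language
  - PIL: image processing
  - xgboost (eXtreme Gradient Boosting): gradient boosting algorithm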

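2. Common commands

- python XXX.py -- run a program
- python setup.py build -- build; must be run in the directory that contains setup.py
- python setup.py install -- install; same as above
- python setup.py sdist -- create a source distribution; same as above
- python setup.py bdist_wininst -- create a Windows installer distribution; same as above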

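3. Standard definitions

#!/usr/bin/python
# -*- coding: utf-8 -*-
from distutils.core import setup
import sys

# Python 2 only: reset the default string encoding to UTF-8.
# Python 3 strings are already Unicode, so these two lines must be dropped there.
reload(sys)
sys.setdefaultencoding('utf8')

# Packaging
setup(name="example", version="v1.0.0", description="setup_examples", author="SweetYu", py_modules=['folder1.module1', 'folder1.module2', 'folder2.module3'])

# Class definition
class A:
    __count = 0  # private class variable

    def __init__(self, name):
        self.name = name  # public instance attribute
        A.__count += 1

    @classmethod
    def getCount(cls):  # class method
        return cls.__count

    def getName(self):  # instance method
        return self.name

if __name__ == '__main__':
    main()  # main() is assumed to be defined elsewhere in the script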
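The class definition above depends on Python's name mangling for the double-underscore class variable. A minimal Python 3 sketch of the same pattern follows; the class name `Counter` and the sample instance names are illustrative assumptions, not part of the original notes.

```python
class Counter:
    """Counts how many instances have been created."""

    # Stored on the class as _Counter__count because of name mangling.
    __count = 0

    def __init__(self, name):
        self.name = name        # public instance attribute
        Counter.__count += 1    # inside the class body the mangled name is resolved automatically

    @classmethod
    def get_count(cls):
        return cls.__count

    def get_name(self):
        return self.name


first = Counter("alice")
second = Counter("bob")
print(Counter.get_count())
print(hasattr(Counter, "__count"), hasattr(Counter, "_Counter__count"))
```

The variable is private by convention only: it stays reachable through the mangled name, which is why the last line checks both spellings.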

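4. json, string, random, re

Regular expressions (REs, regex patterns)

- Metacharacters
  .    matches any character (by default, any except a newline)
  [ ]  matches one character from the specified character class
  ^    negates a character class when it is the first character inside [ ]
  *    repeats the preceding character zero or more times
  +    repeats the preceding character one or more times
  ?    repeats the preceding character zero or one time
  {m}  repeats the preceding character exactly m times
- Special sequences
  \d  matches any decimal digit; equivalent to the class [0-9]
  \D  matches any non-digit character; equivalent to the class [^0-9]
  \s  matches any whitespace character; equivalent to the class [ \t\n\r\f\v]
  \S  matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
  \w  matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_]
  \W  matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_]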
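The metacharacters and special sequences listed here can be tried out with the standard `re` module. The sketch below is only an illustration; the sample strings and variable names are made up for this example.

```python
import re

# '+' repeats the preceding character one or more times.
plus_hits = re.findall(r"ab+", "a ab abb")

# '$' anchors the end of the string; it is not a repetition operator.
ends_with_digits = bool(re.search(r"\d+$", "item42"))

# A leading '^' inside brackets negates the character class.
digits_only = re.sub(r"[^0-9]", "", "tel: 010-1234")

# \s covers space, tab, newline, carriage return, form feed and vertical tab.
fields = re.split(r"\s+", "a \t b\nc")

# \w matches letters, digits and the underscore; everything else is \W.
words = re.findall(r"\w+", "user_1@example.com")

print(plus_hits, ends_with_digits, digits_only, fields, words)
```

Raw strings (the `r` prefix) keep the backslashes intact, so `\d` and `\s` reach the regex engine unchanged.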

import json, os, re, time
import string, random

obj = [[1,2,3],123,123.123,'abc',{'key1':(1,2,3),'key2':(4,5,6)}]
encode_obj = json.dumps(obj, skipkeys=True, sort_keys=True)  # serialize the object to a string
decode_obj = json.loads(encode_obj)  # parse the string back into an object (tuples come back as lists)

random.randint(0,255)  # random integer in the range 0..255, both ends included
field = string.ascii_letters + string.digits  # upper- and lower-case letters plus digits
''.join(random.sample(field,5))  # random 5-character string drawn from field without repetition

# File handling (filename is assumed to be defined)
fp = open(filename, "w", encoding='utf-8-sig')  # modes: 'w', 'w+', 'r', 'r+', 'a', 'a+'
# 'utf-8-sig' writes the UTF-8 BOM at the start of the file automatically,
# marking the file as UTF-8, so the BOM does not have to be written by hand
fp.close()

m = re.search("ab+", "asdfabbbb")  # m.group() -> 'abbbb' (with a leading ^ nothing would match)
re.findall(r"^a\w+", "abcdfa\na1b2c3", re.MULTILINE)  # ['abcdfa', 'a1b2c3']
re.findall("a{2,4}", "aaaaaaaa")  # ['aaaa', 'aaaa']
re.findall("a{2,4}?", "aaaaaaaa")  # ['aa', 'aa', 'aa', 'aa']
re.split("[a-zA-Z]+", "0A3b9z")  # ['0', '3', '9', '']
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "sam lee")
m.group("first_name")  # 'sam'
m.group("last_name")  # 'lee'
re.sub('[abc]', 'o', 'caps rock')  # 'oops rook'
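Two details of the json and random calls are easy to miss: a JSON round trip turns tuples into lists, and `random.sample` never repeats a character. The following sketch shows both under Python 3; the variable names and the seed are assumptions made only to keep the example repeatable.

```python
import json
import random
import string

obj = {"key": (1, 2, 3), "n": 123.5, "s": "abc"}
encoded = json.dumps(obj, sort_keys=True)   # object -> JSON text
decoded = json.loads(encoded)               # JSON text -> object; the tuple is now a list

random.seed(0)  # fixed seed, used here only so repeated runs behave the same
alphabet = string.ascii_letters + string.digits
token = "".join(random.sample(alphabet, 5))       # 5 distinct characters
token_rep = "".join(random.choices(alphabet, k=5))  # 5 characters, repeats allowed

print(encoded)
print(decoded["key"], token, token_rep)
```

`random.choices` is the better fit when the string may be longer than the alphabet or when repeated characters are acceptable.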

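5. os, time

Python usually represents time in one of these ways:

1) Timestamp: an offset in seconds counted from 00:00:00 on January 1, 1970, e.g. time(), clock() (removed in Python 3.8), mktime()
2) Formatted time string, e.g. strftime()
3) struct_time tuple, e.g. gmtime() (UTC), localtime() (local time zone), strptime()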
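The three representations convert into one another. The sketch below makes the round trip in UTC so that the result does not depend on the machine's time zone; it reuses the timestamp that appears in these notes, and `calendar.timegm` stands in as the UTC counterpart of `time.mktime`.

```python
import calendar
import time

ts = 1304575584                                    # 1) timestamp: seconds since the epoch
st = time.gmtime(ts)                               # 3) struct_time, expressed in UTC
text = time.strftime("%Y-%m-%d %H:%M:%S", st)      # 2) formatted time string
parsed = time.strptime(text, "%Y-%m-%d %H:%M:%S")  # string -> struct_time
back = calendar.timegm(parsed)                     # struct_time (UTC) -> timestamp

print(text, st.tm_yday, back == ts)
```

With `time.localtime` and `time.mktime` the same round trip works in the local time zone instead.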

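import os, time

time.strptime('2011-05-05 16:37:06', '%Y-%m-%d %X')
time.strftime("%Y-%m-%d %X", time.localtime())
time.localtime(1304575584.1361799)  # in a UTC+8 time zone: (tm_year=2011, tm_mon=5, tm_mday=5, tm_hour=14, tm_min=6, tm_sec=24, tm_wday=3, tm_yday=125, tm_isdst=0)
time.sleep(s)  # suspend the current thread for s seconds

# File creation time (on Unix this is the last metadata change), e.g. 1483882912.37 -> Sun Jan 08 21:41:52 2017
time.ctime(os.path.getctime(fileName))

os.path.exists(fileName)  # whether the file exists
os.chdir(dir)  # change the current directory
os.getcwd()  # get the current directory
os.listdir(dir)  # names of all files and directories in the given directory
os.remove(fileName)  # delete a file
os.makedirs(path)  # create nested directories recursively
os.rmdir(dir)  # remove a single empty directory
os.rename(oldName, newName)  # rename a file
os.sep  # path separator on the current platform
os.linesep  # line terminator used by the current platform
os.environ  # system environment variables
os.path.abspath(path)  # absolute form of the path
os.path.dirname(path)  # parent directory of the path
os.path.isfile(path)  # True if path is a file
os.path.isdir(path)  # True if path is a directory
os.path.splitext(fileName)  # returns (root, extension)
os.stat(path)  # information about a file or directory
os.path.join(dir, fileName)  # join a directory with a file or directory name; the result is dir/fileName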
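Several of the os calls listed here can be exercised safely inside a temporary directory. This is a minimal sketch; the directory and file names (`a`, `b`, `notes.txt`) are placeholders chosen for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b")
    os.makedirs(nested)                        # creates both directory levels
    path = os.path.join(nested, "notes.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("hello")

    stem, ext = os.path.splitext(os.path.basename(path))
    listing = os.listdir(nested)
    is_file, is_dir = os.path.isfile(path), os.path.isdir(nested)

    renamed = os.path.join(nested, "notes.bak")
    os.rename(path, renamed)
    exists_after = os.path.exists(path)        # the old name is gone after the rename

    os.remove(renamed)
    os.rmdir(nested)                           # rmdir only works on an empty directory
    remaining = os.listdir(os.path.join(root, "a"))

print(stem, ext, listing, is_file, is_dir, exists_after, remaining)
```

Building paths with `os.path.join` instead of string concatenation keeps the code portable across platforms with different separators.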

6. requests

import json
import requests

session = requests.session()
url = 'https://api.github.com/some/endpoint'
params = {'some': 'data'}
headers = {'content-type': 'application/json'}
cookies = dict(cookies_are='working')

# POST a JSON body. timeout is in seconds; a value as small as 0.001 will almost always raise a Timeout.
r = session.post(url, data=json.dumps(params), headers=headers, cookies=cookies, allow_redirects=False, timeout=0.001)

# File upload: pass files= in a separate request (requests rejects files= combined with a string body)
files = {'file': open('report.xls', 'rb')}
r = session.post(url, files=files)

# The same pattern applies to get, put, delete, options, head

r.status_code  # compare with requests.codes.ok
r.url
r.encoding
r.text
r.content
r.headers  # r.headers['Content-Type'] or r.headers.get('content-type')
r.history

7. beautifulsoup4

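BeautifulSoup converts an HTML document into a tree in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Current node:
  .name: tag name
  .attrs: the tag's attributes (dict)
  .string: the tag's content
  .strings: all strings contained in the tag (generator)
  .stripped_strings: the same strings with surrounding whitespace removed (generator)

Direct / all child nodes:
  .contents
  .children  .descendants

Parent / all ancestor nodes:
  .parent  .parents

Sibling nodes:
  .next_sibling
  .previous_sibling

Next and previous nodes (regardless of tree level):
  .next_element  .next_elements
  .previous_element  .previous_elements

Searching the tree

find_all, find
find_parents, find_parent
find_next_siblings, find_next_sibling
find_previous_siblings, find_previous_sibling
find_all_next, find_next
find_all_previous, find_previous
select

name argument: a tag name, regular expression, list, True (matches anything), or function
keyword arguments: tag attributes such as class_, id, href
text argument: tag content
limit argument: maximum size of the returned list
recursive argument: whether to search all descendants (True) or only direct children (False)

For select (CSS selectors): tag names are written bare, class names are prefixed with a dot, and ids are prefixed with #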

import re
from bs4 import BeautifulSoup, Comment

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(html, 'html.parser')  # html is assumed to hold the page source
soup.prettify()
soup.find_all(has_class_but_no_id)
soup.find(re.compile("^b"))
soup.select('a[class="sister"]')

# Print the string of a tag only when it is not a comment
if not isinstance(soup.a.string, Comment):
    print(soup.a.string)

for s in soup.stripped_strings:
    print(repr(s))

8. numpy

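Common attributes: itemsize (bytes per element), size (number of elements), shape, ndim (number of dimensions), dtype; axis=0 runs down the rows (column-wise), axis=1 runs across the columns (row-wise)
Common functions: max, min, sum, sin, floor (round down), dot (matrix product), exp, vstack (stack vertically), hstack (stack horizontally)
Matrix operations: transpose, trace, eig (eigenvalues and eigenvectors), inner (inner product), outer (outer product)

import numpy as np
import numpy.linalg as nplg

a = np.array([[1,2],[3,4]], dtype=np.int32)  # [[1 2][3 4]]
nplg.eig(a)  # eigenvalues and eigenvectors of matrix a
b = np.arange(6).reshape(2,3)  # [[0 1 2][3 4 5]]
c = np.linspace(1,3,9)  # [1. 1.25 1.5 1.75 2. 2.25 2.5 2.75 3.]
d = np.zeros((1,3))  # ones: all-ones matrix; eye: identity matrix; zeros: all-zeros matrix
e = d.repeat(2,axis=0)  # [[0. 0. 0.][0. 0. 0.]]
np.hstack((a, b))  # merge the arrays side by side (np.merage does not exist in NumPy)

a.tofile("a.bin")
f = np.fromfile("a.bin", dtype=a.dtype)
f.shape = a.shape
np.save("a.npy", a)
g = np.load("a.npy")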

9. scipy

10. sklearn

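1) Scikit-learn itself does not support deep learning or reinforcement learning, does not support GPU acceleration, and does not support graphical models or sequence prediction;
2) Scikit-learn never extends beyond the field of machine learning;
3) Scikit-learn never adopts algorithms that have not been widely validated;

Data preprocessing

Feature extraction: converting text or image data into numeric variables usable for machine learning
Normalization: transforming input data into new variables with zero mean and unit variance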

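import numpy as np
from sklearn.preprocessing import *
from sklearn.impute import SimpleImputer

# z-score standardization: subtract the mean, divide by the standard deviation
X = StandardScaler().fit_transform(X)
# Min-max scaling: rescale to a given range
X = MinMaxScaler().fit_transform(X)
X = MaxAbsScaler().fit_transform(X)
# Normalization of each sample to unit norm
X = Normalizer(norm='l2').fit_transform(X)  # norm can be 'l1', 'l2' or 'max'
# Binarize numeric features
X = Binarizer(threshold=1.1).fit_transform(X)  # threshold is the cut-off value
# Categorical data encoding, one-hot encoding: OneHotEncoder
# Label binarization: turn class labels into binary indicator features, one column per class
label_binarize([1, 6], classes=[1, 2, 4, 6])  # output [[1, 0, 0, 0], [0, 0, 0, 1]]
# Label encoding
LabelEncoder().fit_transform(['A','A','b','c'])  # output [0, 0, 1, 2]
# Missing-value imputation (SimpleImputer replaces the Imputer class of older scikit-learn releases)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform([[1, 2], [np.nan, 3], [7, 6]])  # [[1, 2], [4, 3], [7, 6]]
# Polynomial features, e.g. [a, b] -> [1, a, b, a^2, ab, b^2]
PolynomialFeatures(2).fit_transform(X)
# Custom transformation through an arbitrary function
FunctionTransformer(np.log1p).fit_transform([[0, 1], [2, 3]])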
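To make the scaling steps concrete, here is a standard-library sketch of what z-score standardization, min-max scaling, and binarization compute for a single feature column. It is not scikit-learn's implementation; the function names are hypothetical, and the column is assumed to be non-constant so that no division by zero occurs.

```python
from statistics import mean, pstdev

def standardize(column):
    """z-score: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def min_max(column, low=0.0, high=1.0):
    """Rescale linearly so the smallest value maps to low and the largest to high."""
    lo, hi = min(column), max(column)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

def binarize(column, threshold):
    """Values strictly above the threshold become 1, all others 0."""
    return [1 if v > threshold else 0 for v in column]

column = [1.0, 2.0, 3.0, 4.0, 5.0]
z = standardize(column)
scaled = min_max(column)
bits = binarize(column, threshold=2.5)
print(z, scaled, bits)
```

After standardization the column has mean 0 and standard deviation 1, which is the property the normalization definition in this section refers to.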

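Classification algorithms

Linear: Naive Bayes (NB), K-nearest neighbors (KNN), logistic regression (LR)
  - Training and prediction are efficient, but the result depends heavily on the features
  - Feature engineering should select, transform, or combine features as far as possible so that they become linearly separable

Nonlinear: random forest (RF), decision tree (DT), gradient boosting (GBDT), support vector machine with cross-validation (SVM-CV), multilayer perceptron (MLP) neural network
  - Can model complex decision boundaries and fit the data better.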

# NB(Multinomial Naive Bayes) Classifier

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB(alpha=0.01)

# KNN Classifier

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

# LR(Logistic Regression) Classifier

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(penalty='l2')

# RF(Random Forest) Classifier

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=8)

# DT(Decision Tree) Classifier

from sklearn import tree

model = tree.DecisionTreeClassifier()

# GBDT(Gradient Boosting Decision Tree) Classifier

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=200)

# SVM Classifier

from sklearn.svm import SVC

model = SVC(kernel='rbf', probability=True)

# SVM Classifier using CV(Cross Validation)
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases
from sklearn.svm import SVC
model = SVC(kernel='rbf', probability=True)
param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
grid_search = GridSearchCV(model, param_grid, n_jobs = 1, verbose=1)
grid_search.fit(train_x, train_y)
best_parameters = grid_search.best_estimator_.get_params()
for para, val in best_parameters.items():
    print(para, val)
model=SVC(kernel='rbf',C=best_parameters['C'],gamma=best_parameters['gamma'],probability=True)

# MLP Classifier
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(solver='lbfgs',alpha=1e-5,hidden_layer_sizes=(5,2),random_state=1)

# Train the model (model is any one of the classifiers above)
model.fit(train_x, train_y)

# Predict and evaluate
from sklearn import metrics
predict = model.predict(test_x)
precision = metrics.precision_score(test_y, predict)
recall = metrics.recall_score(test_y, predict)
accuracy = metrics.accuracy_score(test_y, predict)
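The three scores used for evaluation reduce to simple counts over the predictions. The sketch below computes them for a binary problem with the standard library only; the function name and the label vectors are made up for illustration.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives that were found
    accuracy = correct / len(pairs)                 # share of all predictions that are right
    return precision, recall, accuracy

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]
precision, recall, accuracy = binary_scores(y_true, y_pred)
print(precision, recall, accuracy)
```

Precision and recall can diverge sharply on imbalanced data, which is why accuracy alone is rarely enough to judge a classifier.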

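Regression analysis

Support vector regression (SVR)
Bayesian regression, kernel ridge regression (KR), Gaussian process regression
Ridge regression: adds a penalty term to detect and reduce collinearity between features
Lasso regression (least absolute shrinkage and selection operator)
Elastic Net: a linear regression model that uses both L1 and L2 priors as regularizers
Least angle regression (LARS): can be used for feature selection and yields a sparse coefficient vector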

# Generate a regression data set with 200 samples and 500 features (dimensions)
import numpy as np
from sklearn.datasets import make_regression
reg_data, reg_target = make_regression(n_samples=200, n_features=500, n_informative=5, noise=5)

# Linear regression, Bayesian regression, Lasso regression (with feature selection), ridge regression, LARS regression
from sklearn import linear_model
regr_model = linear_model.LinearRegression()
bys_model = linear_model.BayesianRidge(compute_score=True)
lasso_model = linear_model.Lasso()
r_model = linear_model.Ridge(alpha=.5)
lars_model = linear_model.Lars(n_nonzero_coefs=10)

# Cross-validation (the original spelled out every argument, all at their default values)
lasso_cv = linear_model.LassoCV()
lasso_cv.fit(reg_data, reg_target)
lasso_cv.alpha_  # coefficient of the L1 regularization term
lasso_cv.intercept_  # intercept
new_reg_data = reg_data[:, lasso_cv.coef_ != 0]  # keep only the features whose coefficient is non-zero

# Gaussian process regression (GaussianProcessRegressor replaces the old GaussianProcess class and its theta0/thetaL/thetaU arguments)
from sklearn import gaussian_process
gp_model = gaussian_process.GaussianProcessRegressor()

# SVR and kernel ridge regression
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases
from sklearn.kernel_ridge import KernelRidge
svr_model = GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
                         param_grid={"C": [1e0,1e1,1e2,1e3], "gamma": np.logspace(-2, 2, 5)})
kr_model = GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
                        param_grid={"alpha": [1e0,0.1,1e-2,1e-3], "gamma": np.logspace(-2,2,5)})

# Train the model (model is any one of the estimators above)
model.fit(X_train, y_train)

# Print the coefficients (available on linear models) and the fit quality
print('Coefficients: \n', model.coef_)
print("Mean squared error: %.2f" % np.mean((model.predict(X_test) - y_test) ** 2))
print('Variance score: %.2f' % model.score(X_test, y_test))
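The two quality figures printed for a regression model are the mean squared error and the coefficient of determination R^2 (the value a scikit-learn regressor returns from `score`). A standard-library sketch of both formulas follows; the function names and the sample values are illustrative assumptions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of the variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, r2)
```

An R^2 of 1 means a perfect fit, 0 means the model does no better than always predicting the mean, and negative values are possible for worse models.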

Clustering algorithms (unsupervised)

K-means clustering, spectral clustering, mean shift, hierarchical clustering, DBSCAN clustering

import numpy as np

from sklearn import datasets

# Generate a clustering data set with 500 samples, 6 features (dimensions), and 5 clusters

X, Y = datasets.make_blobs(n_samples=500, n_features=6, centers=5, cluster_std=[0.4, 0.3, 0.4, 0.3, 0.4], random_state=11)

# K-means Cluster

from sklearn.cluster import KMeans

clf_model = KMeans(n_clusters=3, max_iter=300, n_init=10)

# Spectral clustering
from sklearn.cluster import SpectralClustering
clf_model = SpectralClustering()  # or SpectralClustering(n_clusters=k, gamma=gamma)

# DBSCAN Cluster

from sklearn.cluster import DBSCAN

clf_model = DBSCAN(eps = 0.3, min_samples = 10)

# Fit the model
clf_model.fit(X)
y_pred = clf_model.fit_predict(X)  # fits and returns the cluster labels in one step

# Evaluate
from sklearn import metrics
y_pred = clf_model.labels_

n_clusters_ = len(set(y_pred)) - (1 if -1 in y_pred else 0)

print("Estimated number of clusters: %d" % n_clusters_)

print("Homogeneity: %0.3f" % metrics.homogeneity_score(Y, y_pred))

print("Completeness: %0.3f" % metrics.completeness_score(Y, y_pred))

print("V-measure: %0.3f" % metrics.v_measure_score(Y, y_pred))

print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(Y, y_pred))

print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(Y, y_pred))

print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, y_pred))

print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, y_pred))

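Dimensionality reduction

Purpose: reduce the number of random variables to consider
Use cases: visualization, efficiency gains
Principal component analysis (PCA), non-negative matrix factorization (NMF), latent Dirichlet allocation topic model (LDA), feature selection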

# PCA (Principal Components Analysis)
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)  # fit on the data, keeping all components
print(pca.explained_variance_)
pca.n_components = 2
X_reduced = pca.fit_transform(X)

# NMF (Non-negative Matrix Factorization)
# n_features, n_topics and data_samples are assumed to be defined beforehand
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
tfidf_vector = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                               stop_words='english')
tfidf = tfidf_vector.fit_transform(data_samples)
# alpha is the regularization argument of older scikit-learn releases; newer ones use alpha_W / alpha_H
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)

# LDA (Latent Dirichlet Allocation)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
tf_vector = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                            stop_words='english')
tf = tf_vector.fit_transform(data_samples)
# n_components was called n_topics in old releases
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online',
                                learning_offset=50, random_state=0).fit(tf)

# Print the topics of a fitted model (shown for lda; use nmf with tfidf_vector for NMF)
n_top_words = 20  # assumed number of words to show per topic
feature_names = tf_vector.get_feature_names_out()  # get_feature_names() in older releases
for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

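Model selection

Comparing, validating, and selecting among given parameters and models
Purpose: improve accuracy through parameter tuning
Grid search, cross-validation, and various metric functions for evaluating prediction error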

# Cross-validation
from sklearn import datasets, metrics
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
print("F1 (macro): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Saving and loading a model
import pickle
import joblib  # sklearn.externals.joblib in old scikit-learn releases
clf.fit(iris.data, iris.target)  # cross_val_score works on clones, so fit clf before saving it
joblib.dump(clf, 'c:/km.pkl')
clf = joblib.load('c:/km.pkl')
with open('save/clf.pickle', 'wb') as f: pickle.dump(clf, f)
with open('save/clf.pickle', 'rb') as f: clf = pickle.load(f)
print(clf.predict(iris.data[0:1]))  # test the reloaded model
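Cross-validation with cv=5 means splitting the samples into five folds and holding each one out in turn. The sketch below shows plain, unshuffled k-fold index generation with the standard library; it is an illustration of the idea rather than scikit-learn's code, and for classifiers `cross_val_score` actually uses a stratified variant that also balances the classes in each fold. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs; fold sizes differ by at most one."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so each score in the cross-validation result is measured on data the model did not see during that round of training.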

11. matplotlib

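from pylab import *  # the original also imported matplotlib.matlab, which current matplotlib no longer provides

x = linspace(0, 2 * pi, 50)  # sample data, assumed from the legend labels below
y1 = sin(x)
y2 = 2 ** x

plot(x, y1, 'r*', linewidth=2, label='f(x)=sin(x)')  # ordinary line plot
plot(x, y2, 'b-', linewidth=2, label='f(x)=2^x')
xlabel('x'); ylabel('f(x)'); title('Simple plot')
legend(loc='upper left')  # add a legend
grid(True)  # show the grid
savefig("sin.png", dpi=72)  # save the figure (dpi is the resolution)
show()  # display the figure; note: the canvas is cleared after each show()

# Math text annotations (call them before savefig()/show() to include them in that figure)
text(2,4, r'$ \alpha_i \beta_j \pi \lambda \omega $' , size=15)
text(4,4, r'$ \sin(0) = \cos (\frac {\pi} {2}) $' , size=15)
text(2,2 , r'$ \lim_{x \rightarrow y} \frac{1} {x^3} $', size=15)
text(4,2 , r'$ \sqrt[4] {x} = \sqrt {y} $', size=15)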

12. jieba https://github.com/fxsjy/jieba

import jieba
import jieba.analyse
import jieba.posseg as pseg

sentence = "我來到北京清華大學"

# Segmentation
seg_list = jieba.cut(sentence, cut_all=True)  # full mode
seg_list = jieba.cut(sentence, cut_all=False)  # accurate mode
seg_list = jieba.cut(sentence)  # accurate mode is the default
seg_list = jieba.cut_for_search(sentence)  # search engine mode

# Load a user dictionary; line format: word [frequency] [POS tag]
jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary

# Manually adjust whether a word is (or is not) split
jieba.suggest_freq(('中', '將'), True)
jieba.suggest_freq('臺中', True)

# Keyword extraction
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n','v'))
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n','v'))

# Part-of-speech tagging
words = pseg.cut(sentence)
for word, flag in words:
    print('%s %s' % (word, flag))

13. pandas

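import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = DataFrame(np.arange(16).reshape(4,4), index=list('wxyz'), columns=list('abcd'))

    a   b   c   d
w   0   1   2   3
x   4   5   6   7
y   8   9  10  11
z  12  13  14  15

data.iloc[1:4, 0:3]  # rows 2 to 4, columns 1 to 3
data.loc['x':'z'].iloc[:, 0:2]  # rows 'x' to 'z' by label, first two columns by position
data.iat[1, 0]  # single value at row 2, column 1 (iat takes scalar integer positions only)
data.iloc[1:4][['a', 'b', 'c']]  # replaces data.ix, which has been removed from pandas
data.loc[['x', 'y', 'z'], 'a':'c']
data.iloc[0]  # first row (formerly data.irow(0))
data.iloc[-1:]  # last row
data.iloc[:, 0]  # first column (formerly data.icol(0))
data.iloc[:, -1]  # last column
data.loc[data.a > 5, 'd']  # rows where column 'a' > 5 ('y', 'z'), values of the 4th column
data.head()  # first rows of data, five by default; use data.head(10) for the first ten
data.tail()  # last rows of data, five by default; use data.tail(10) for the last ten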

14. nltk

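- nps_chat # instant-messaging chat session corpus
- brown # Brown Corpus
  [1] The first million-word electronic corpus of English
  [2] Created at Brown University in 1961; contains text from 500 different sources
  [3] Categorized by genre: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
- reuters # Reuters Corpus
  [1] Contains 10,788 news documents totaling 1.3 million words
  [2] The documents are classified into 90 topics and split into two sets, "training" and "test"
  [3] News stories often cover several topics, so categories often overlap
  [4] You can look up the topics covered by one or more documents, or the documents included in one or more categories
- inaugural # Inaugural Address Corpus
  [1] 55 texts, each one a presidential address, with a time dimension
- udhr # annotated text corpus: the Universal Declaration of Human Rights
  [1] Covers many languages, e.g. ['Chickasaw', 'English','German_Deutsch','Greenlandic_Inuktikut','Hungarian_Magyar', 'Ibibio_Efik']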

15. PIL

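import random
import zbar
from PIL import Image, ImageDraw, ImageFont, ImageFilter

# Decode a QR code
scanner = zbar.ImageScanner()  # create the image scanner
scanner.parse_config('enable')  # configure the scanner
img = Image.open(filename).convert('L')  # open a QR code image as grayscale (file mode defaults to "r")
qrCode = zbar.Image(img.width, img.height, 'Y800', img.tobytes())  # wrap the raw bytes for scanning
scanner.scan(qrCode)
data = ''.join(s.data for s in qrCode)  # concatenate the decoded symbols
print(data)  # print the decoded result

# Draw text on the image and add a noise point
img = img.convert('RGB')  # the color fills below need an RGB image
draw = ImageDraw.Draw(img)
fontsize = min(img.size) // 30
font = ImageFont.truetype(fontfile, fontsize)  # fontfile: path to a TrueType font, assumed to be defined
draw.text((0, img.height - fontsize), data, font=font, fill=(255,0,0))
draw.point((random.randint(0, img.width - 1), random.randint(0, img.height - 1)), fill=(0,0,255))

# Scale proportionally to fit p_width x p_height (assumed to be defined), then blur and save
rate = max(img.width / p_width, img.height / p_height)
if rate != 0:
    img.thumbnail((int(img.size[0] / rate), int(img.size[1] / rate)))  # note: the size is passed as a single tuple
img = img.filter(ImageFilter.BLUR)
img.show(); img.save(filename, 'jpeg')  # or 'png'
img.close()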
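The proportional scaling step is plain arithmetic and can be checked without any image library. The following sketch computes the target size for a bounding box; the function name `fit_within` and the sample dimensions are assumptions made for illustration.

```python
def fit_within(width, height, max_width, max_height):
    """Return the largest (width, height) that fits the box while keeping the aspect ratio."""
    rate = max(width / max_width, height / max_height)
    if rate <= 1:
        return width, height  # already fits; do not enlarge
    return max(1, int(width / rate)), max(1, int(height / rate))

large = fit_within(4000, 3000, 800, 800)
small = fit_within(300, 200, 800, 800)
print(large, small)
```

Taking the larger of the two ratios guarantees that both dimensions end up inside the box, not just one of them.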

16. goose

17. xgboost

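- Advantages
  1. Regularization, which reduces overfitting
  2. Parallel processing; a Hadoop implementation is also supported
  3. High flexibility: custom optimization objectives and evaluation metrics are allowed
  4. Missing-value handling
  5. Pruning
  6. Built-in cross-validation, used to find the optimal number of boosting iterations
  7. Training can continue from an existing model
- Parameters
  1. General parameters: overall control
    - booster: the model used at each iteration; gbtree (default): tree-based model, gblinear: linear model
    - silent: 0 (default); 1: silent mode, prints nothing
    - nthread: defaults to the maximum number of available threads
  2. Booster parameters: control the booster (tree/regression) at each step
    - eta [default 0.3]: learning rate; shrinking the weights at each step makes the model more robust. Typical values: 0.01-0.2.
    - min_child_weight [default 1]: minimum sum of sample weights required in a child
    - max_depth [default 6]: maximum depth of a tree; typical values: 3-10
    - max_leaf_nodes: maximum number of nodes or leaves
    - gamma [default 0]: minimum loss reduction required to split a node
    - max_delta_step [default 0]: maximum step allowed for each tree's weight change
    - subsample [default 1]: fraction of the samples drawn at random for each tree; typical values: 0.5-1
    - colsample_bytree [default 1]: fraction of the columns drawn at random for each tree (each column is a feature); typical values: 0.5-1
    - colsample_bylevel [default 1]: fraction of the columns sampled for each split at each level of a tree
    - lambda [default 1]: L2 regularization term on the weights, similar to ridge regression
    - alpha [default 0]: L1 regularization term on the weights, similar to Lasso regression; useful with very high-dimensional data, where it makes the algorithm faster
    - scale_pos_weight [default 1]: with imbalanced classes, setting this to a positive value helps the algorithm converge faster
  3. Learning task parameters: control the training objective
    - objective [default reg:linear, named reg:squarederror in newer releases]: the loss function to be minimized.
      - binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class)
      - multi:softmax: multiclass classification with softmax; returns the predicted class (not a probability); also requires num_class (the number of classes)
      - multi:softprob: same parameters as multi:softmax, but returns the probability of each sample belonging to each class
    - eval_metric [default depends on objective]: the metric applied to validation data.
      - rmse: root mean squared error; the default for regression
      - error: binary classification error rate (threshold 0.5); the default for classification
      - mae: mean absolute error
      - logloss: negative log-likelihood
      - merror: multiclass error rate
      - mlogloss: multiclass logloss
      - auc: area under the curve
    - seed [default 0]: random seed; setting it makes results on random data reproducible, and it can also be used when tuning parameters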
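Two of the evaluation metrics named in the parameter list, logloss and error, are easy to compute by hand from predicted probabilities. The sketch below uses only the standard library and is not XGBoost's own code; the function names, the clipping constant `eps`, and the sample values are assumptions for illustration.

```python
import math

def logloss(y_true, y_prob, eps=1e-15):
    """Negative log-likelihood averaged over the samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def error_rate(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != bool(y))
    return wrong / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
ll = logloss(y_true, y_prob)
err = error_rate(y_true, y_prob)
print(round(ll, 4), err)
```

logloss penalizes confident wrong probabilities, while error only looks at which side of the threshold each prediction falls on, so the two can rank models differently.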

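import numpy as np
import scipy.sparse
import xgboost as xgb

# Load a libsvm text file or an XGBoost binary buffer file
dtrain = xgb.DMatrix('train.svm.txt')
deval = xgb.DMatrix('test.svm.buffer')

# Load a 2-D NumPy array, mark the missing values in the DMatrix, and set sample weights
data = np.random.rand(5,10)  # 5 training samples, 10 features
label = np.random.randint(2, size=5)
w = np.random.rand(5)  # one weight per sample
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

# Convert scipy.sparse data to a DMatrix (dat, row, col are assumed 1-D arrays of values and indices)
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)

# Save the DMatrix in XGBoost's binary format, which loads faster next time
dtrain.save_binary("train.buffer")

# Parameter settings
params = {
    'booster': 'gbtree',  # gbtree (default): tree-based model, gblinear: linear model
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'scale_pos_weight': 1,  # with imbalanced classes, a positive value helps the algorithm converge faster
    'max_depth': 6,  # usually in [3,10]; maximum depth of a tree
    'min_child_weight': 1,  # default 1; minimum sum of sample weights in a child
    'gamma': 0.15,  # usually in [0.1,0.2]; minimum loss reduction required to split a node, default 0
    'subsample': 0.9,  # usually in [0.5,1]; fraction of samples drawn for each tree
    'colsample_bytree': 0.9,  # usually in [0.5,0.9]; fraction of columns (features) drawn for each tree, default 1
    'lambda': 0.1,  # L2 regularization term on the weights
    'alpha': 0.2,  # L1 regularization term on the weights
    'eta': 0.15  # learning rate, default 0.3, usually in [0.01,0.2]
}
plst = list(params.items())

# Tune parameters with cross-validation
res = xgb.cv(params, dtrain)

# Validation sets used to monitor performance; the last entry drives early stopping
evallist = [(dtrain,'train'), (deval,'eval')]

# Train the model
num_round = 10
bst = xgb.train(plst, dtrain, num_round, evals=evallist, early_stopping_rounds=5)

# Save the model
bst.save_model('model.bin')
bst.dump_model('dump.raw.txt')
bst.dump_model('dump.raw.txt','featmap.txt')

# Load the model
bst = xgb.Booster({'nthread':4})
bst.load_model("model.bin")  # loads in place and returns None

# Predict
data = np.random.rand(7,10)  # 7 test samples, 10 features
dtest = xgb.DMatrix(data, missing=-999.0)
ypred = bst.predict(dtest)
# Limit prediction to the best iteration found by early stopping (ntree_limit=... in older releases)
ypred = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))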
