(1) Standardization (Mean Removal and Variance Scaling)
Standardized data has zero mean and unit variance in each feature (column).
The scale function provides a convenient way to apply this transformation, as follows:
- >>> from sklearn import preprocessing
- >>> X=[[1.,-1.,2.],
- [2.,0.,0.],
- [0.,1.,-1.]]
- >>> X_scaled = preprocessing.scale(X)
- >>> X_scaled
- array([[ 0. , -1.22474487, 1.33630621],
- [ 1.22474487, 0. , -0.26726124],
- [-1.22474487, 1.22474487, -1.06904497]])
- >>> X_scaled.mean(axis=0)
- array([ 0., 0., 0.])
- >>> X_scaled.std(axis=0)
- array([ 1., 1., 1.])
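To make clear what scale computes, here is a minimal sketch that reproduces it by hand with NumPy. It assumes the default settings (column-wise, population standard deviation with ddof=0); the names X_manual and same are only illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the mean, then divide by the standard deviation
X_manual = (X - mean) / std

# Compare against the library result
X_scaled = preprocessing.scale(X)
same = np.allclose(X_manual, X_scaled)
print(same)
```

Note that this sketch would divide by zero for a constant column; the data here has no such column.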
Likewise, the same result can be obtained with the Scaler utility class provided by the preprocessing module (StandardScaler in versions after 0.15):
- >>> scaler = preprocessing.StandardScaler().fit(X)
- >>> scaler
- StandardScaler(copy=True, with_mean=True, with_std=True)
- >>> scaler.mean_
- array([ 1. , 0. , 0.33333333])
- >>> scaler.scale_  # this attribute was named std_ in older scikit-learn releases
- array([ 0.81649658, 0.81649658, 1.24721913])
- >>> scaler.transform(X)
- array([[ 0. , -1.22474487, 1.33630621],
- [ 1.22474487, 0. , -0.26726124],
- [-1.22474487, 1.22474487, -1.06904497]])
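The practical advantage of the class over the scale function is that the fitted mean and standard deviation can be reused on data that was not seen during fit. The sketch below illustrates this; X_new is a made-up sample, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

# Learn the per-column statistics from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the same shift and scale to the new sample
X_new_scaled = scaler.transform(X_new)

# Equivalent manual computation using the training statistics
expected = (X_new - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_new_scaled, expected))
```

The new sample is scaled with the training statistics, so its own values do not have to end up with zero mean or unit variance.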
(2) Normalization
Rescales each sample (row) of the dataset to unit norm, so that every value in it falls within [-1, 1].
- >>> X = [[ 1., -1., 2.],
- [ 2., 0., 0.],
- [ 0., 1., -1.]]
- >>> X_normalized = preprocessing.normalize(X, norm='l2')
- >>> X_normalized
- array([[ 0.40824829, -0.40824829, 0.81649658],
- [ 1. , 0. , 0. ],
- [ 0. , 0.70710678, -0.70710678]])
- >>> normalizer = preprocessing.Normalizer().fit(X)
- >>> normalizer
- Normalizer(copy=True, norm='l2')
- >>> normalizer.transform(X)
- array([[ 0.40824829, -0.40824829, 0.81649658],
- [ 1. , 0. , 0. ],
- [ 0. , 0.70710678, -0.70710678]])
- >>> normalizer.transform([[-1., 1., 0.]])
- array([[-0.70710678, 0.70710678, 0. ]])
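As a rough check of what norm='l2' means, the following sketch divides each row by its Euclidean length and compares the result with normalize. It assumes no row is all zeros; row_norms and X_manual are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) length of each row, kept as a column for broadcasting
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each row by its own length
X_manual = X / row_norms

X_normalized = preprocessing.normalize(X, norm='l2')
print(np.allclose(X_manual, X_normalized))
```

Unlike standardization, this works row by row and needs no statistics from other samples, which is why Normalizer can transform a new sample without any meaningful fit step.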
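(3) Binarization
Converts numerical data into binary (0/1) values according to a configurable threshold: values greater than the threshold become 1, all others become 0.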
- >>> X = [[ 1., -1., 2.],
- [ 2., 0., 0.],
- [ 0., 1., -1.]]
- >>> binarizer = preprocessing.Binarizer().fit(X)
- >>> binarizer
- Binarizer(copy=True, threshold=0.0)
- >>> binarizer.transform(X)
- array([[ 1., 0., 1.],
- [ 1., 0., 0.],
- [ 0., 1., 0.]])
- >>> binarizer = preprocessing.Binarizer(threshold=1.1)
- >>> binarizer.transform(X)
- array([[ 0., 0., 1.],
- [ 1., 0., 0.],
- [ 0., 0., 0.]])
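The thresholding rule can be sketched as a plain NumPy comparison. The example below assumes the same data and a threshold of 1.1; X_manual and X_bin are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Strictly greater than the threshold maps to 1, everything else to 0
X_manual = (X > threshold).astype(float)

X_bin = preprocessing.Binarizer(threshold=threshold).fit_transform(X)
print(np.array_equal(X_manual, X_bin))
```

The comparison is strict, so a value exactly equal to the threshold is mapped to 0.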
(4) Label preprocessing
4.1) Label binarization
LabelBinarizer is typically used to build a label indicator matrix from a list of multi-class labels.
- >>> lb = preprocessing.LabelBinarizer()
- >>> lb.fit([1, 2, 6, 4, 2])
- LabelBinarizer(neg_label=0, pos_label=1)
- >>> lb.classes_
- array([1, 2, 4, 6])
- >>> lb.transform([1, 6])
- array([[1, 0, 0, 0],
- [0, 0, 0, 1]])
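Each row of the indicator matrix has a 1 in the column of the matching class. The sketch below rebuilds that matrix with a NumPy comparison against classes_ and then maps it back to labels; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer().fit([1, 2, 6, 4, 2])
labels = np.array([1, 6])

Y = lb.transform(labels)

# Compare every label against every known class (sorted in classes_)
manual = (labels[:, None] == lb.classes_).astype(int)

# The indicator matrix can be mapped back to the original labels
recovered = lb.inverse_transform(Y)
print(np.array_equal(Y, manual), recovered.tolist())
```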
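In the example above each instance has exactly one label. In older scikit-learn releases LabelBinarizer also accepted multiple labels per instance (newer releases provide MultiLabelBinarizer for that case):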
- >>> lb.fit_transform([(1, 2), (3,)])
- array([[1, 1, 0],
- [0, 0, 1]])
- >>> lb.classes_
- array([1, 2, 3])
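For readers on a recent scikit-learn release, a minimal sketch of the same multi-label example using MultiLabelBinarizer might look like this. It is offered as an equivalent for newer versions rather than as the method used in the original post.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# Each instance is a tuple holding one or more labels
Y = mlb.fit_transform([(1, 2), (3,)])
classes = [int(c) for c in mlb.classes_]

# Map the indicator rows back to tuples of labels
recovered = mlb.inverse_transform(Y)
print(Y.tolist(), classes, recovered)
```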
4.2) Label encoding
LabelEncoder maps each distinct label to an integer between 0 and n_classes-1:
- >>> from sklearn import preprocessing
- >>> le = preprocessing.LabelEncoder()
- >>> le.fit([1, 2, 2, 6])
- LabelEncoder()
- >>> le.classes_
- array([1, 2, 6])
- >>> le.transform([1, 1, 2, 6])
- array([0, 0, 1, 2])
- >>> le.inverse_transform([0, 0, 1, 2])
- array([1, 1, 2, 6])
It can also be used to convert non-numerical labels into numerical labels:
- >>> le = preprocessing.LabelEncoder()
- >>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
- LabelEncoder()
- >>> list(le.classes_)
- ['amsterdam', 'paris', 'tokyo']
- >>> le.transform(["tokyo", "tokyo", "paris"])
- array([2, 2, 1])
- >>> list(le.inverse_transform([2, 2, 1]))
- ['tokyo', 'tokyo', 'paris']
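One point worth keeping in mind is that the encoder only knows the labels seen during fit. The sketch below shows a round trip and what happens with an unseen label; "london" is a made-up label, and the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["paris", "paris", "tokyo", "amsterdam"])

# Round trip: labels -> integer codes -> labels
cities = ["tokyo", "tokyo", "paris"]
codes = le.transform(cities)
round_trip = list(le.inverse_transform(codes))

# A label that was not present during fit is rejected
try:
    le.transform(["london"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False

print(codes.tolist(), round_trip, unseen_accepted)
```

The integer codes follow the sorted order of classes_, so they carry no meaning beyond identity and should not be read as an ordering of the categories.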