Feature scaling

Purpose

Because raw data values vary over widely different ranges, the objective functions of some machine learning algorithms will not work properly without normalization. For example, most classifiers compute the distance between two points as a Euclidean distance. If one feature has a broad range of values, the distance will be dominated by that particular feature. Therefore the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason to apply feature scaling is that gradient descent converges much faster with feature scaling than without it.
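To make this concrete, here is a minimal sketch of the two most common scaling schemes, min-max rescaling and z-score standardization, in NumPy; the function names and the toy data are illustrative assumptions, not from the original text:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) to the [0, 1] range.

    Assumes no column is constant (otherwise the denominator is zero).
    """
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Shift each feature to zero mean and unit variance (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales, e.g. income vs. age: without
# scaling, Euclidean distances would be dominated by the first column.
X = np.array([[50000.0, 25.0],
              [82000.0, 31.0],
              [61000.0, 47.0]])
print(min_max_scale(X))  # both columns now lie in [0, 1]
print(standardize(X))    # both columns now zero-mean, unit-variance
```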
In statistics and applications of statistics, normalization can have a range of meanings.[1] In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment. In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution. A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.

In another usage in statistics, normalization refers to the creation of shifted and scaled versions of statistics, where the intention is that these normalized values allow the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences, as in an anomaly time series. Some types of normalization involve only a rescaling, to arrive at values relative to some size variable. In terms of levels of measurement, such ratios only make sense for ratio measurements (where ratios of measurements are meaningful), not interval measurements (where only distances are meaningful, but not ratios).

In theoretical statistics, parametric normalization can often lead to pivotal quantities – functions whose sampling distribution does not depend on the parameters – and to ancillary statistics – pivotal quantities that can be computed from observations, without knowing parameters.
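As a concrete illustration of the quantile normalization mentioned above, here is a minimal sketch assuming the common formulation in which the value at each rank is replaced by the mean of the values at that rank across all columns; the function name and the toy data are illustrative:

```python
import numpy as np

def quantile_normalize(X):
    """Force every column of X to share the same empirical distribution.

    Each column is ranked, and the entry at rank k in every column is
    replaced by the mean of the rank-k values across all columns.
    Ties are broken by row order (argsort is stable).
    """
    ranks = X.argsort(axis=0).argsort(axis=0)        # rank of each entry within its column
    mean_sorted = np.sort(X, axis=0).mean(axis=1)    # shared reference distribution
    return mean_sorted[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))  # all columns now have identical sorted values
```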
AdaGrad divides the learning rate, per parameter, by the square root of the sum of that parameter's squared historical gradients, so each parameter gets its own learning rate.

1. Simply put, after a global learning rate is set, on each pass the global learning rate is divided, per parameter, by the square root of the accumulated sum of squared historical gradients, giving every parameter a different effective learning rate.
2. The effect: larger progress is made along flatter directions of parameter space (because the direction is flat, its accumulated sum of squared gradients is small, so the learning rate there is reduced by less).
3. The drawback: the learning rate shrinks too early and too aggressively.
4. It works well on some models. (A sketch of the update follows this list.)
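Here is a minimal NumPy sketch of the per-parameter update described above; the function name, the toy quadratic objective, and the hyperparameter values are illustrative assumptions, not from the original text:

```python
import numpy as np

def adagrad_update(param, grad, hist_sq, lr=0.01, eps=1e-8):
    """One AdaGrad step: effective rate = lr / sqrt(sum of squared grads).

    hist_sq accumulates the element-wise sum of squared historical
    gradients; eps guards against division by zero on the first steps.
    """
    hist_sq += grad ** 2
    param -= lr * grad / (np.sqrt(hist_sq) + eps)
    return param, hist_sq

# Toy usage: minimize f(w) = w0^2 + 10*w1^2, whose gradient is [2*w0, 20*w1].
# The two coordinates have very different curvatures, so their effective
# learning rates diverge as the squared gradients accumulate.
w = np.array([1.0, 1.0])
h = np.zeros_like(w)
for _ in range(100):
    g = np.array([2 * w[0], 20 * w[1]])
    w, h = adagrad_update(w, g, h, lr=0.5)
print(w)  # both coordinates shrink toward 0 despite the different scales
```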
Karpathy ran a comparison of these methods on MNIST. His conclusions: adagrad is more stable than sgd and momentum, in the sense that it needs little hyperparameter tuning, while carefully tuned sgd and the momentum-family methods beat adagrad somewhat in both convergence speed and precision. With finely tuned parameters, Nesterov generally beats momentum, which beats plain sgd. adagrad, on the other hand, needs almost no tuning, and its performance is more consistently stable than that of the other methods.