You can see Machine Learning (3) Least-squares classification 1 for more details.
Generalized linear model:
$$y_k(\mathbf{x}) = g(\mathbf{w}_k^T \mathbf{x})$$
g is the activation function:
$$g(a) = \frac{1}{1+e^{-a}} \;\Rightarrow\; \text{logistic regression}$$
$$g_k(\mathbf{x}) = \frac{e^{x_k}}{\sum_{j=1}^{K} e^{x_j}} \;\Rightarrow\; \text{softmax regression}$$
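As an illustration (my own sketch, not part of the original notes; the function names are my own), the two activation functions could be written as:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax g_k(a) = exp(a_k) / sum_j exp(a_j)."""
    a = a - np.max(a)        # subtracting the max does not change the result, avoids overflow
    e = np.exp(a)
    return e / np.sum(e)

# y_k(x) = g(w_k^T x) for a 2-dimensional input and K = 3 classes (toy numbers)
x = np.array([1.0, 2.0])
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.3],
              [-0.4,  0.8]])   # one row w_k per class
print(sigmoid(W @ x))          # independent sigmoid outputs, each in (0, 1)
print(softmax(W @ x))          # softmax outputs are non-negative and sum to 1
```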
1.2 Nonlinear basis functions
If g is monotonic (which is typically the case), the resulting decision boundaries are still linear in the input space $\mathbf{x}$. Thus we transform the vector $\mathbf{x}$ with $D$ nonlinear basis functions $\phi_i(\mathbf{x})$:
$$y_k(\mathbf{x}_n) = g\bigl(\mathbf{w}_k^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr) = g\!\left(\sum_{i=0}^{D} w_{ki}\,\phi_i(\mathbf{x}_n)\right), \qquad \phi_0(\mathbf{x}_n) = 1$$
Advantages: this allows non-linear decision boundaries. By choosing the right $\phi_i(\mathbf{x})$, every continuous function can (in principle) be approximated with arbitrary accuracy.
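As an illustration of $y_k(\mathbf{x}_n) = g(\mathbf{w}_k^T\boldsymbol{\phi}(\mathbf{x}_n))$, here is a small sketch with a hypothetical polynomial basis; the basis choice and the weights are only example assumptions, not from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def poly_basis(x, D=3):
    """phi(x) = (1, x, x^2, ..., x^D); phi_0(x) = 1 plays the role of the bias."""
    return np.array([x**i for i in range(D + 1)])

def y_k(x, w_k, D=3):
    """y_k(x) = g(w_k^T phi(x)) for a scalar input x."""
    return sigmoid(w_k @ poly_basis(x, D))

w_k = np.array([0.2, -1.0, 0.5, 0.1])   # D + 1 = 4 weights, w_k0 multiplies phi_0 = 1
print(y_k(1.5, w_k))                    # output lies in (0, 1)
```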
1.3 Why use a generalized linear model instead of a linear model
It can be used to limit the effect of outliers. In the linear model, $y_k(\mathbf{x}_n)$ can grow arbitrarily large for some $\mathbf{x}_n$. As a result, "too correct" points (far away from the decision boundary but classified correctly) have a strong influence on the decision boundary. ⇒ ML(3) Least-squares classification 3. By choosing a suitable nonlinear activation function (e.g. the sigmoid function in 2.1), we can limit this influence, because the output of the sigmoid function lies in $[0,1]$.
The choice of a sigmoid leads to a nice probabilistic interpretation. ⇒ 2.1
However, least-squares minimization in general no longer leads to a closed-form analytical solution, so we need gradient descent to update the weights. ⇒ 3
2. How to obtain a generalized linear model
2.1 Logistic Sigmoid Activation Function
A nice probabilistic interpretation: consider two classes. By Bayes' theorem,

$$P(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,P(C_1)}{p(\mathbf{x}|C_1)\,P(C_1) + p(\mathbf{x}|C_2)\,P(C_2)},$$

so with

$$a = -\ln\frac{p(\mathbf{x}|C_2)\,P(C_2)}{p(\mathbf{x}|C_1)\,P(C_1)}$$

we obtain

$$P(C_1|\mathbf{x}) = g(a) = \frac{1}{1+e^{-a}}$$
Thus the function $g(a)$ gives the model a nice probabilistic interpretation.
Logistic function: $g(a)$ is the logistic function, also written $\sigma(a)$. Here are some properties of this function:
$$\sigma(-a) = 1 - \sigma(a)$$
$$\frac{d\sigma}{da} = \sigma(1-\sigma)$$
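A quick numerical check of these two properties (my own sketch, not part of the original notes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
s = sigmoid(a)
# sigma(-a) = 1 - sigma(a)
print(np.isclose(sigmoid(-a), 1.0 - s))
# dsigma/da = sigma(1 - sigma), checked against a central finite difference
eps = 1e-6
numeric_derivative = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(np.isclose(numeric_derivative, s * (1.0 - s)))
```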
Logistic regression: in the following, we will consider models of the form:
$$p(C_1|\boldsymbol{\phi}) = y_1(\boldsymbol{\phi}) = \sigma(a) = \sigma\bigl(\mathbf{w}_1^T\boldsymbol{\phi}(\mathbf{x})\bigr)$$
$$p(C_2|\boldsymbol{\phi}) = 1 - p(C_1|\boldsymbol{\phi})$$
This model is called logistic regression.
Why use logistic regression: because it requires fewer parameters. Assume we have an $M$-dimensional feature space $\boldsymbol{\phi}$. If, instead of the logistic regression model, we represent $p(\mathbf{x}|C_1)$ and $p(\mathbf{x}|C_2)$ by Gaussians, we need:
the number of means: $2M$ ($M$ for $p(\mathbf{x}|C_1)$, $M$ for $p(\mathbf{x}|C_2)$)
the number of covariance parameters: $\frac{M(M+1)}{2}$ (assuming $p(\mathbf{x}|C_1)$ and $p(\mathbf{x}|C_2)$ share the same covariance matrix)
the prior probability: $1$ (since $p(C_1) = 1 - p(C_2)$)
Thus in total, with the Gaussian representation we need to estimate $\frac{M(M+5)}{2} + 1$ parameters. But in logistic regression, we only need to estimate the $M$ parameters of $p(C_1|\boldsymbol{\phi})$, because $p(C_2|\boldsymbol{\phi}) = 1 - p(C_1|\boldsymbol{\phi})$.
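A quick arithmetic check of this count (my own sketch, with an arbitrary choice of M):

```python
# Hypothetical sanity check of the parameter counts above (M chosen arbitrarily).
M = 5
gaussian_params = 2 * M + M * (M + 1) // 2 + 1   # means + shared covariance + prior
logistic_params = M                              # weight vector of p(C1|phi)
print(gaussian_params)   # 26, i.e. M*(M+5)/2 + 1
print(logistic_params)   # 5
```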
3. Gradient descent

Gradient descent is iterative minimization. Step 1: start with an initial guess for the parameter values $w_{kj}^{(0)}$. Step 2: move towards a (local) minimum by following the gradient:
$$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{kj}^{(\tau)}} \qquad (3.1)$$
Formula (3.1) corresponds to a 1st-order Taylor expansion. If you are interested in why gradient descent corresponds to a 1st-order Taylor expansion, read on.
Using a 1st-order Taylor expansion of $E(W)$ around $W^{(\tau)}$:
$$E\bigl(W^{(\tau)} - \eta\Delta\bigr) \approx E\bigl(W^{(\tau)}\bigr) + \Delta^T(-\eta\Delta) = E\bigl(W^{(\tau)}\bigr) - \eta\,\Delta^T\Delta < E\bigl(W^{(\tau)}\bigr)$$
since $\eta > 0$ and $\Delta = \frac{\partial E(W^{(\tau)})}{\partial W^{(\tau)}}$. Thus updating $W$ in this way decreases $E(W)$ and leads towards a (local) minimum of $E(W)$.
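As a minimal sketch of this iterative scheme (the quadratic toy error function and the helper name gradient_descent are my own, not from the notes):

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=100):
    """Iterate w <- w - eta * dE/dw (formula 3.1), starting from the initial guess w0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)
    return w

# toy error function E(w) = ||w - c||^2, whose minimum is at c
c = np.array([2.0, -1.0])
grad_E = lambda w: 2.0 * (w - c)
print(gradient_descent(grad_E, w0=[0.0, 0.0]))   # converges towards c = [2, -1]
```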
3.2 Batch learning
Process the full data set at once to compute the gradient (a code sketch contrasting this with the stochastic variant follows after 3.3):
$$E(W) = \sum_{i=1}^{N} E_i(W), \qquad w_{ji}^{(\tau+1)} = w_{ji}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{ji}^{(\tau)}}$$
3.3 Stochastic learning/Sequential Updating
Choose a single training sample $\mathbf{x}_n$ to obtain $E_n(W)$:
$$w_{ji}^{(\tau+1)} = w_{ji}^{(\tau)} - \eta\,\frac{\partial E_n(W)}{\partial w_{ji}^{(\tau)}}$$
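To make the difference between the two schemes concrete, here is a small sketch (my own; the per-sample least-squares gradient, the toy data, and names such as grad_En are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))            # 20 samples, 3 basis functions per sample
t = Phi @ np.array([1.0, -2.0, 0.5])      # toy targets generated from a known weight vector

def grad_En(w, n):
    """Per-sample least-squares gradient (y_n - t_n) * phi_n for sample n."""
    return (Phi[n] @ w - t[n]) * Phi[n]

eta, w = 0.05, np.zeros(3)

# 3.2 batch learning: one update from the gradient summed over the full data set
w_batch = w - eta * sum(grad_En(w, n) for n in range(len(t)))

# 3.3 stochastic/sequential learning: update after every single training sample
w_sgd = w.copy()
for n in range(len(t)):
    w_sgd = w_sgd - eta * grad_En(w_sgd, n)

print(w_batch, w_sgd)
```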
3.4 Delta rule /LMS rule
The Delta/LMS rule is based on the least-squares error. Error function (least-squares error) of the linear model, together with its gradient:

$$E_n(W) = \frac{1}{2}\sum_{k=1}^{K}\Bigl(t_{kn} - \sum_{j=1}^{M} w_{kj}\,\phi_j(\mathbf{x}_n)\Bigr)^2, \qquad \frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = \bigl(y_{\hat{k}}(\mathbf{x}_n) - t_{\hat{k}n}\bigr)\,\phi_j(\mathbf{x}_n) \qquad (3.4.1)$$
Case with a differentiable, non-linear activation function:
$$E_n(W) = \frac{1}{2}\sum_{k=1}^{K}\left(t_{kn} - g\Bigl(\sum_{j=1}^{M} w_{kj}\,\phi_j(\mathbf{x}_n)\Bigr)\right)^2$$
$$\frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = g'\Bigl(\sum_{j'=1}^{M} w_{\hat{k}j'}\,\phi_{j'}(\mathbf{x}_n)\Bigr)\,\bigl(y_{\hat{k}}(\mathbf{x}_n) - t_{\hat{k}n}\bigr)\,\phi_j(\mathbf{x}_n) \qquad (3.4.2)$$
Both formula 3.4.1 and formula 3.4.2 are known as the Delta/LMS rule.
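A possible implementation of one Delta/LMS update with a sigmoid activation (my own sketch; the function name delta_rule_step and the toy numbers are assumptions, not from the notes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_step(W, phi_n, t_n, eta=0.1):
    """One Delta/LMS update (formula 3.4.2) with a sigmoid activation g.
    W     : (K, M) weight matrix, one row per output k
    phi_n : (M,)   basis-function vector phi(x_n)
    t_n   : (K,)   target vector for sample n
    """
    a = W @ phi_n                     # pre-activations sum_j w_kj phi_j(x_n)
    y = sigmoid(a)                    # y_k(x_n) = g(a_k)
    g_prime = y * (1.0 - y)           # g'(a) = sigma(a)(1 - sigma(a))
    grad = (g_prime * (y - t_n))[:, None] * phi_n[None, :]   # dE_n/dw_kj
    return W - eta * grad

W = np.zeros((2, 3))
phi_n = np.array([1.0, 0.5, -0.3])    # phi_0 = 1 plays the role of the bias
t_n = np.array([1.0, 0.0])
print(delta_rule_step(W, phi_n, t_n))
```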
3.5 Logistic regression
3.5.1 Gradient Descent (1st order)
Let's consider a data set $(\boldsymbol{\phi}_n, t_n)$ with $n = 1,\dots,N$, where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$, $t_n \in \{0,1\}$, and $\mathbf{t} = (t_1, t_2, \dots, t_N)^T$.
With yn=p(C1|ϕn), we can write the likelihood as
$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,(1-y_n)^{1-t_n}$$
Since $y_n$ depends only on the single parameter vector $\mathbf{w}$, and the second output is simply $y_{C_2}(\mathbf{x}_n) = 1 - y_n$, $\mathbf{w}$ is a vector here and we do not need the capital letter $W$.
Define the error function as the negative log-likelihood:
$$E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\bigl\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr\} \qquad (3.5.1)$$
Formula 3.5.1 is the so-called cross-entropy error function.
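A minimal sketch of formula 3.5.1 together with plain gradient descent on it (my own code; the toy data are invented, and the gradient $\sum_n (y_n - t_n)\,\boldsymbol{\phi}_n$ used below is the standard gradient of the cross-entropy error rather than something derived in these notes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]  (formula 3.5.1)."""
    y = np.clip(sigmoid(Phi @ w), 1e-12, 1 - 1e-12)   # clip to avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_cross_entropy(w, Phi, t):
    """Gradient of 3.5.1: sum_n (y_n - t_n) * phi_n."""
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

# toy data: Phi holds one row phi_n per sample, t holds the 0/1 labels
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
t = (Phi @ np.array([1.5, -1.0, 0.5]) > 0).astype(float)

w, eta = np.zeros(3), 0.1
for _ in range(200):                  # plain gradient descent on E(w)
    w = w - eta * grad_cross_entropy(w, Phi, t)
print(w, cross_entropy(w, Phi, t))
```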