Thanks for Stanford University's online course of Loss Functions and Optimisation and its accompanying web demo, now I finally have a concrete simple example of several Loss Function implementations, which I can delve into the maths.
The Architecture
In this architecture, W ∈ 2×3, x ∈ 1×2, b ∈ 1×3, s ∈ 1×3.
s = f(x, W) = Wx + b
The value of W is initialised to:
The initial value of b is set to:
Calculating the Scores
There are 9 sets of input values for X, hence, Xi, i=[0..8]. Given the dataset, the s is calculated:
x0 | x1 | y | s0 = w0,0*x0 + w1,0*x1 + b0 | s1 = w0,1*x0 + w1,1*x1 + b1 | s2 = w0,2*x0 + w1,2*x1 + b2 |
0.5 | 0.4 | 0 | 1*0.5 + 2*0.4 + 0 = 1.3 | 2*0.5 + (-4)*0.4 + 0.5 = -0.1 | 3*0.5 + (-1)*0.4 + (-0.5) = 0.6 |
0.8 | 0.3 | 0 | 1.4 | 0.9 | 1.6 |
0.3 | 0.8 | 0 | 1.9 | -2.1 | -0.4 |
-0.4 | 0.3 | 1 | 0.2 | -1.5 | -2 |
-0.3 | 0.7 | 1 | 1.1 | -2.9 | -2.1 |
-0.7 | 0.2 | 1 | -0.3 | -1.7 | -2.8 |
0.7 | -0.4 | 2 | -0.1 | 3.5 | 2 |
0.5 | -0.6 | 2 | -0.7 | 3.9 | 1.6 |
-0.4 | -0.5 | 2 | -1.4 | 1.7 | -1.2 |
Multiclass SVM Loss Functions
Weston Watkins 1999
where i=[0, number of input samples), j=[0, number of output classes)
since sj = Wjxi + bj and syi= Wyixi + byi
we have
where n is the number of times that syi appeared in Li (since Li is a sum of many terms).
n should be in the range of [0, number of output classes - 1).
y | s0 | s1 | s2 | Li | Note |
0 | 1.3 | -0.1 | 0.6 | max(0, s1-s0+1) + max(0, s2-s0+1) = max(0, -0.1-1.3+1) + max(0.6-1.3+1)= 0+0.3= 0.3 | Syi=S0 |
0 | 1.4 | 0.9 | 1.6 | max(0, 0.9-1.4+1) + max(1.6-1.4+1) = 0.5 + 1.2 = 1.7 |
|
0 | 1.9 | -2.1 | -0.4 | max(0, -2.1-1.9+1) + max(0, -0.4-1.9+1) = max(0, -3) + max(0, -1.3) = 0+0 = 0 |
|
1 | 0.2 | -1.5 | -2 | max(0, s0-s1+1) + max(0, s2-s1+1) = max(0, 0.2+1.5+1) + max(0, -2+1.5+1) = 2.7+0.5 = 3.2 | Syi=S1 |
1 | 1.1 | -2.9 | -2.1 | max(0, 1.1+2.9+1) + max(-2.1+2.9+1) = 5+1.8 = 6.8 |
|
1 | -0.3 | -1.7 | -2.8 | max(0, -0.3+1.7+1) + max(-2.8+1.7+1) = 2.4+0 = 2.4 |
|
2 | -0.1 | 3.5 | 2 | max(0, s0-s2+1) + max(0, s1-s2+1) = max(0, -0.1-2+1) + max(0, 3.5-2+1) = 0 + 2.5 = 2.5 | Syi=S2 |
2 | -0.7 | 3.9 | 1.6 | max(-0.7-1.6+1) + max(0, 3.9-1.6+1) = 0 + 3.3 = 3.3 |
|
2 | -1.4 | 1.7 | -1.2 | max(0, -1.4+1.2+1)+max(0, 1.7+1.2+1) = 0.8 + 3.9 = 4.7 |
|
| | | | L = 1/N ⋅∑Li = 2.766666667 |
|
One vs. All
The partial derivatives are the same as the Weston Watkins 1999 formula, with n=1 because there is only one Syi in the formula.
y | s0 | s1 | s2 | Li | Note |
0 | 1.3 | -0.1 | 0.6 | max(0, 1-s0) + max(0, 1+s1) + max(0, 1+s2) = max(0, 1-1.3) + max(0, 1-0.1) + max(0, 1+0.6) = 0 + 0.9 + 1.6 = 2.5 | Syi=S0 |
0 | 1.4 | 0.9 | 1.6 | max(0, 1-1.4) + max(0, 1+0.9) + max(0, 1+1.6) = 0 + 1.9 + 2.6 = 4.5 |
|
0 | 1.9 | -2.1 | -0.4 | 0.6 |
|
1 | 0.2 | -1.5 | -2 | max(0, 1+s0) + max(0, 1-s1) + max(0, 1+s2) = max(0, 1+0.2) + max(0, 1+1.5) + max(0, 1-2) = 1.2 + 2.5 + 0 = 3.7 | Syi=S1 |
1 | 1.1 | -2.9 | -2.1 | max(0, 1+1.1) + max(0, 1+2.9) + max(0, 1-2.1) = 2.1 + 3.9 + 0 = 6 |
|
1 | -0.3 | -1.7 | -2.8 | 3.4 |
|
2 | -0.1 | 3.5 | 2 | max(0, 1+s0) + max(0, 1+s1) + max(0, 1-s2) = max(0, 1-0.1) + max(0, 1+3.5) + max(0, 1-2) = 0.9 + 4.5 + 0 = 5.4 | Syi=S2 |
2 | -0.7 | 3.9 | 1.6 | 5.2 |
|
2 | -1.4 | 1.7 | -1.2 | 4.9 |
|
| | | | L = 1/N ⋅∑Li = 4.022222222 | |
Structured SVM
Li= max(0, max( sj ) – syi+ 1) , where j ≠ yi
The partial derivatives are the same as the One vs. All formula.
y | s0 | s1 | s2 | Li | Note |
0 | 1.3 | -0.1 | 0.6 | max(0, max(s1, s2)-s0+1) = max(0, max(-0.1, 0.6)-1.3+1) = max(0, 0.6-1.3+1) = 0.3 | Syi=S0 |
0 | 1.4 | 0.9 | 1.6 | 1.2 |
|
0 | 1.9 | -2.1 | -0.4 | max(0, max(-2.1, -0.4)-1.9+1) = max(0, -0.4-1.9+1) = 0 |
|
1 | 0.2 | -1.5 | -2 | max(0, max(s0, s2)-s1+1) = max(0, max(0.2, -2)+1.5+1) = max(0, 0.2+1.5+1) = 2.7 | Syi=S1 |
1 | 1.1 | -2.9 | -2.1 | 5 |
|
1 | -0.3 | -1.7 | -2.8 | 2.4 |
|
2 | -0.1 | 3.5 | 2 | 2.5 | Syi=S2 |
2 | -0.7 | 3.9 | 1.6 | 3.3 |
|
2 | -1.4 | 1.7 | -1.2 | 3.9 |
|
| | | | L = 1/N ⋅∑Li = 2.366666667 |
|
Softmax
Finding partial derivatives using chain rule:
solve for each term:
∂v/∂Wj = ∂(Wj⋅x+bj)/∂Wj = x
∂u/∂v = ∂(ev)/∂v = ev = esj
∂Li/∂u = ∂[–ln( esyi / ∑ju )]/∂u = ∂[–ln( esyi ) + ln(∑ju )]/∂u
= ∂[–syi + ln(∑ju )]/∂u = –0 + ∂[ln(∑ju )]/∂u
= 1 / ∑ju = 1 / ∑jesj
(chain rule steps omitted showing ∂[ln(A+B+C)]/∂A = 1/(A+B+C) )
therefore,
Solve for ∂Li/∂bj = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂bj
the only new term is ∂v/∂bj = ∂(Wj⋅x+bj)/∂bj = 1
therefore,
Solve for ∂Li/∂Wyi
let ∂Li/∂Wyi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂Wyi where u=esyi , v=syi = Wyi⋅xi+byi
solve for each term:
∂v/∂Wyi = ∂(Wyi⋅xi+byi)/∂Wyi = xi
∂u/∂v = ∂(ev)/∂v = ev = esyi
∂Li/∂u = ∂[–ln( u / ∑ju )]/∂u = ∂[–ln( u ) + ln(∑ju )]/∂u
= ∂[–ln( u )]/∂u + ∂[ln(∑ju )]/∂u
= –1/u + 1 / ∑ju = –1/esyi + 1 / ∑jesj
therefore,
Solve for ∂Li/∂byi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂byi
the only different term here is
∂v/∂byi = ∂(Wyi⋅xi+byi)/∂byi = 1
therefore,
y | s0 | s1 | s2 | Li (using ln) | Li (log10) | Note |
0 | 1.3 | -0.1 | 0.6 | -ln( exp(s0) / (exp(s0)+exp(s1)+exp(s2) )= 0.3 | 2.5 | Syi=S0 |
0 | 1.4 | 0.9 | 1.6 | 1.7 | 4.5 |
|
0 | 1.9 | -2.1 | -0.4 | 0 | 0.6 |
|
1 | 0.2 | -1.5 | -2 | -ln( exp(s1) / (exp(s0)+exp(s1)+exp(s2) )= 3.2 | 3.7 | Syi=S1 |
1 | 1.1 | -2.9 | -2.1 | 6.8 | 6 |
|
1 | -0.3 | -1.7 | -2.8 | 2.4 | 3.4 |
|
2 | -0.1 | 3.5 | 2 | -ln( exp(s2) / (exp(s0)+exp(s1)+exp(s2) )= 2.5 | 5.4 | Syi=S2 |
2 | -0.7 | 3.9 | 1.6 | 3.3 | 5.2 |
|
2 | -1.4 | 1.7 | -1.2 | 4.7 | 4.9 |
|
| | | | L = 1/N ⋅∑Li = 1.836640394 | 0.7976427883 |
|
No comments:
Post a Comment