Saturday 1 June 2024

Study Notes on a Simple Neural Network

Thanks to Stanford University's online lecture on Loss Functions and Optimisation and its accompanying web demo, I finally have a concrete, simple example of several loss function implementations whose maths I can delve into.

The Architecture


In this architecture, W ∈ ℝ^(2×3), x ∈ ℝ^(1×2), b ∈ ℝ^(1×3), s ∈ ℝ^(1×3).

s = f(x, W) = xW + b (with x a row vector, matching the shapes above)

The value of W is initialised to:

W = [ 1  2  3 ]
    [ 2 -4 -1 ]

The initial value of b is set to:

b = [ 0  0.5  -0.5 ]

Calculating the Scores

There are 9 input samples xi, i = 0..8. Given the dataset, the scores s are calculated as follows:
x0   | x1   | y | s0 = w0,0*x0 + w1,0*x1 + b0 | s1 = w0,1*x0 + w1,1*x1 + b1    | s2 = w0,2*x0 + w1,2*x1 + b2
0.5  | 0.4  | 0 | 1*0.5 + 2*0.4 + 0 = 1.3     | 2*0.5 + (-4)*0.4 + 0.5 = -0.1  | 3*0.5 + (-1)*0.4 + (-0.5) = 0.6
0.8  | 0.3  | 0 | 1.4                         | 0.9                            | 1.6
0.3  | 0.8  | 0 | 1.9                         | -2.1                           | -0.4
-0.4 | 0.3  | 1 | 0.2                         | -1.5                           | -2
-0.3 | 0.7  | 1 | 1.1                         | -2.9                           | -2.1
-0.7 | 0.2  | 1 | -0.3                        | -1.7                           | -2.8
0.7  | -0.4 | 2 | -0.1                        | 3.5                            | 2
0.5  | -0.6 | 2 | -0.7                        | 3.9                            | 1.6
-0.4 | -0.5 | 2 | -1.4                        | 1.7                            | -1.2
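To make these numbers easy to reproduce, here is a minimal NumPy sketch (the variable names X, y, W, b, S are my own, not from the course) that computes the whole score table in one matrix product:

```python
import numpy as np

# The 9 training samples (x0, x1), their labels y, and the initial parameters.
X = np.array([[ 0.5,  0.4], [ 0.8,  0.3], [ 0.3,  0.8],
              [-0.4,  0.3], [-0.3,  0.7], [-0.7,  0.2],
              [ 0.7, -0.4], [ 0.5, -0.6], [-0.4, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
W = np.array([[1.0,  2.0,  3.0],
              [2.0, -4.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])

# Each row of S holds one sample's scores (s0, s1, s2): s = xW + b.
S = X @ W + b
print(S)
# [[ 1.3 -0.1  0.6]
#  [ 1.4  0.9  1.6]
#  ...
#  [-1.4  1.7 -1.2]]
```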

Multiclass SVM Loss Functions

Weston Watkins 1999

Li = Σ(j≠yi) max(0, sj - syi + 1)
where i ∈ [0, number of input samples) and j ∈ [0, number of output classes).

Since sj = Wj*xi + bj and syi = Wyi*xi + byi, for every margin-violating term (i.e. whenever sj - syi + 1 > 0) we have

∂Li/∂sj = 1,   ∂Li/∂Wj = xi   and   ∂Li/∂bj = 1
∂Li/∂syi = -n,   ∂Li/∂Wyi = -n*xi   and   ∂Li/∂byi = -n

where n is the number of margin-violating terms, i.e. the number of times syi appears in Li (since Li is a sum of many terms).
n lies in the range [0, number of output classes - 1].
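As a sanity check on these derivatives, here is a small sketch of my own (not from the course) that compares the analytic gradient with a centred numerical difference on the first sample:

```python
import numpy as np

W = np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])
x, yi = np.array([0.5, 0.4]), 0           # sample x0 with label y = 0

def ww_loss(W, b, x, yi):
    s = x @ W + b
    margins = np.maximum(0, s - s[yi] + 1)
    margins[yi] = 0                       # the j == yi term is excluded
    return margins.sum()

# Analytic gradient from the formulas above.
s = x @ W + b
active = (s - s[yi] + 1) > 0              # margin-violating classes
active[yi] = False
n = active.sum()                          # number of violating terms
dW = np.outer(x, active.astype(float))    # dLi/dWj = xi for violating j
dW[:, yi] = -n * x                        # dLi/dWyi = -n * xi

# Numerical gradient for comparison.
eps = 1e-6
num_dW = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[r, c] += eps
        Wm[r, c] -= eps
        num_dW[r, c] = (ww_loss(Wp, b, x, yi) - ww_loss(Wm, b, x, yi)) / (2 * eps)

print(np.allclose(dW, num_dW))            # True (away from the hinge kinks)
```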

y | s0   | s1   | s2   | Li                                                                                              | Note
0 | 1.3  | -0.1 | 0.6  | max(0, s1-s0+1) + max(0, s2-s0+1) = max(0, -0.1-1.3+1) + max(0, 0.6-1.3+1) = 0 + 0.3 = 0.3      | syi = s0
0 | 1.4  | 0.9  | 1.6  | max(0, 0.9-1.4+1) + max(0, 1.6-1.4+1) = 0.5 + 1.2 = 1.7                                        |
0 | 1.9  | -2.1 | -0.4 | max(0, -2.1-1.9+1) + max(0, -0.4-1.9+1) = max(0, -3) + max(0, -1.3) = 0 + 0 = 0                |
1 | 0.2  | -1.5 | -2   | max(0, s0-s1+1) + max(0, s2-s1+1) = max(0, 0.2+1.5+1) + max(0, -2+1.5+1) = 2.7 + 0.5 = 3.2     | syi = s1
1 | 1.1  | -2.9 | -2.1 | max(0, 1.1+2.9+1) + max(0, -2.1+2.9+1) = 5 + 1.8 = 6.8                                         |
1 | -0.3 | -1.7 | -2.8 | max(0, -0.3+1.7+1) + max(0, -2.8+1.7+1) = 2.4 + 0 = 2.4                                        |
2 | -0.1 | 3.5  | 2    | max(0, s0-s2+1) + max(0, s1-s2+1) = max(0, -0.1-2+1) + max(0, 3.5-2+1) = 0 + 2.5 = 2.5         | syi = s2
2 | -0.7 | 3.9  | 1.6  | max(0, -0.7-1.6+1) + max(0, 3.9-1.6+1) = 0 + 3.3 = 3.3                                         |
2 | -1.4 | 1.7  | -1.2 | max(0, -1.4+1.2+1) + max(0, 1.7+1.2+1) = 0.8 + 3.9 = 4.7                                       |
L = (1/N) Σi Li = 24.9/9 = 2.766666667
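The whole table can be reproduced in vectorised form; a minimal sketch, again with my own variable names:

```python
import numpy as np

X = np.array([[0.5, 0.4], [0.8, 0.3], [0.3, 0.8], [-0.4, 0.3], [-0.3, 0.7],
              [-0.7, 0.2], [0.7, -0.4], [0.5, -0.6], [-0.4, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
S = X @ np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]]) + np.array([0.0, 0.5, -0.5])
rows = np.arange(len(y))

margins = np.maximum(0, S - S[rows, y][:, None] + 1)
margins[rows, y] = 0                      # drop the j == yi terms
Li = margins.sum(axis=1)
print(Li)                                 # [0.3 1.7 0.  3.2 6.8 2.4 2.5 3.3 4.7]
print(Li.mean())                          # 2.766666...
```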

One vs. All

Li = max(0, -syi + 1) + Σ(j≠yi) max(0, sj + 1)
The partial derivatives are the same as for the Weston Watkins 1999 formula, with n = 1, because syi appears in only one term.

y | s0   | s1   | s2   | Li                                                                                                           | Note
0 | 1.3  | -0.1 | 0.6  | max(0, 1-s0) + max(0, 1+s1) + max(0, 1+s2) = max(0, 1-1.3) + max(0, 1-0.1) + max(0, 1+0.6) = 0 + 0.9 + 1.6 = 2.5 | syi = s0
0 | 1.4  | 0.9  | 1.6  | max(0, 1-1.4) + max(0, 1+0.9) + max(0, 1+1.6) = 0 + 1.9 + 2.6 = 4.5                                          |
0 | 1.9  | -2.1 | -0.4 | 0.6                                                                                                          |
1 | 0.2  | -1.5 | -2   | max(0, 1+s0) + max(0, 1-s1) + max(0, 1+s2) = max(0, 1+0.2) + max(0, 1+1.5) + max(0, 1-2) = 1.2 + 2.5 + 0 = 3.7 | syi = s1
1 | 1.1  | -2.9 | -2.1 | max(0, 1+1.1) + max(0, 1+2.9) + max(0, 1-2.1) = 2.1 + 3.9 + 0 = 6                                            |
1 | -0.3 | -1.7 | -2.8 | 3.4                                                                                                          |
2 | -0.1 | 3.5  | 2    | max(0, 1+s0) + max(0, 1+s1) + max(0, 1-s2) = max(0, 1-0.1) + max(0, 1+3.5) + max(0, 1-2) = 0.9 + 4.5 + 0 = 5.4 | syi = s2
2 | -0.7 | 3.9  | 1.6  | 5.2                                                                                                          |
2 | -1.4 | 1.7  | -1.2 | 4.9                                                                                                          |
L = (1/N) Σi Li = 36.2/9 = 4.022222222
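A corresponding sketch for the One vs. All loss (same hypothetical setup as before):

```python
import numpy as np

X = np.array([[0.5, 0.4], [0.8, 0.3], [0.3, 0.8], [-0.4, 0.3], [-0.3, 0.7],
              [-0.7, 0.2], [0.7, -0.4], [0.5, -0.6], [-0.4, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
S = X @ np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]]) + np.array([0.0, 0.5, -0.5])
rows = np.arange(len(y))

# One vs. All: the correct class contributes max(0, 1 - syi),
# every other class contributes max(0, 1 + sj).
signs = np.ones_like(S)
signs[rows, y] = -1
Li = np.maximum(0, 1 + signs * S).sum(axis=1)
print(Li)                                 # [2.5 4.5 0.6 3.7 6.  3.4 5.4 5.2 4.9]
print(Li.mean())                          # 4.022222...
```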

Structured SVM

Li = max(0, max(sj) - syi + 1),   where j ≠ yi
The partial derivatives take the same form as for the One vs. All formula (n = 1), but they apply only to the single highest-scoring class j ≠ yi.

y | s0   | s1   | s2   | Li                                                                                 | Note
0 | 1.3  | -0.1 | 0.6  | max(0, max(s1, s2)-s0+1) = max(0, max(-0.1, 0.6)-1.3+1) = max(0, 0.6-1.3+1) = 0.3  | syi = s0
0 | 1.4  | 0.9  | 1.6  | 1.2                                                                                |
0 | 1.9  | -2.1 | -0.4 | max(0, max(-2.1, -0.4)-1.9+1) = max(0, -0.4-1.9+1) = 0                             |
1 | 0.2  | -1.5 | -2   | max(0, max(s0, s2)-s1+1) = max(0, max(0.2, -2)+1.5+1) = max(0, 0.2+1.5+1) = 2.7    | syi = s1
1 | 1.1  | -2.9 | -2.1 | 5                                                                                  |
1 | -0.3 | -1.7 | -2.8 | 2.4                                                                                |
2 | -0.1 | 3.5  | 2    | 2.5                                                                                | syi = s2
2 | -0.7 | 3.9  | 1.6  | 3.3                                                                                |
2 | -1.4 | 1.7  | -1.2 | 3.9                                                                                |
L = (1/N) Σi Li = 21.3/9 = 2.366666667
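And a sketch of the structured version, where only the highest-scoring wrong class enters the hinge:

```python
import numpy as np

X = np.array([[0.5, 0.4], [0.8, 0.3], [0.3, 0.8], [-0.4, 0.3], [-0.3, 0.7],
              [-0.7, 0.2], [0.7, -0.4], [0.5, -0.6], [-0.4, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
S = X @ np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]]) + np.array([0.0, 0.5, -0.5])
rows = np.arange(len(y))

masked = S.copy()
masked[rows, y] = -np.inf                 # exclude the correct class from the max
Li = np.maximum(0, masked.max(axis=1) - S[rows, y] + 1)
print(Li)                                 # [0.3 1.2 0.  2.7 5.  2.4 2.5 3.3 3.9]
print(Li.mean())                          # 2.366666...
```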

Softmax

Li = -ln( P(Y=yi | X=xi) ) = -ln( e^syi / Σj e^sj )

Writing Pj = e^sj / Σk e^sk for the softmax probability of class j, find the partial derivatives using the chain rule. For j ≠ yi:

let ∂Li/∂Wj = ∂Li/∂u × ∂u/∂v × ∂v/∂Wj,   where u = e^sj and v = sj = Wj*xi + bj

solve for each term:
∂v/∂Wj = ∂(Wj*xi + bj)/∂Wj = xi
∂u/∂v = ∂(e^v)/∂v = e^v = e^sj
∂Li/∂u = ∂[-ln( e^syi / Σj u )]/∂u = ∂[-ln(e^syi) + ln(Σj u)]/∂u
       = ∂[-syi + ln(Σj u)]/∂u = 0 + ∂[ln(Σj u)]/∂u
       = 1 / Σj u = 1 / Σj e^sj
(chain rule steps omitted, using ∂[ln(A+B+C)]/∂A = 1/(A+B+C))
therefore, ∂Li/∂Wj = e^sj × (1 / Σj e^sj) × xi = Pj*xi

Solve for ∂Li/∂bj = ∂Li/∂u × ∂u/∂v × ∂v/∂bj:
the only new term is ∂v/∂bj = ∂(Wj*xi + bj)/∂bj = 1
therefore, ∂Li/∂bj = Pj

Solve for ∂Li/∂Wyi:
let ∂Li/∂Wyi = ∂Li/∂u × ∂u/∂v × ∂v/∂Wyi,   where u = e^syi and v = syi = Wyi*xi + byi
solve for each term:
∂v/∂Wyi = ∂(Wyi*xi + byi)/∂Wyi = xi
∂u/∂v = ∂(e^v)/∂v = e^v = e^syi
∂Li/∂u = ∂[-ln( u / Σj u )]/∂u = ∂[-ln(u) + ln(Σj u)]/∂u
       = -∂[ln(u)]/∂u + ∂[ln(Σj u)]/∂u
       = -1/u + 1/Σj u = -1/e^syi + 1/Σj e^sj
therefore, ∂Li/∂Wyi = e^syi × (-1/e^syi + 1/Σj e^sj) × xi = (Pyi - 1)*xi

Solve for ∂Li/∂byi = ∂Li/∂u × ∂u/∂v × ∂v/∂byi:
the only different term here is ∂v/∂byi = ∂(Wyi*xi + byi)/∂byi = 1
therefore, ∂Li/∂byi = Pyi - 1
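Combining the results: ∂Li/∂W = outer(xi, P - 1yi) and ∂Li/∂b = P - 1yi, where 1yi is the one-hot vector for the correct class. A small numerical check of this (my own sketch, not from the course):

```python
import numpy as np

W = np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])
x, yi = np.array([0.5, 0.4]), 0

def softmax_loss(W, b, x, yi):
    s = x @ W + b
    p = np.exp(s) / np.exp(s).sum()
    return -np.log(p[yi])

# Analytic gradient: dLi/dWj = Pj*xi for j != yi, (Pyi - 1)*xi for j == yi.
s = x @ W + b
P = np.exp(s) / np.exp(s).sum()
g = P.copy()
g[yi] -= 1.0                              # g = P - 1yi
dW, db = np.outer(x, g), g

# Numerical gradients for comparison.
eps = 1e-6
num_dW, num_db = np.zeros_like(W), np.zeros_like(b)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[r, c] += eps
        Wm[r, c] -= eps
        num_dW[r, c] = (softmax_loss(Wp, b, x, yi) - softmax_loss(Wm, b, x, yi)) / (2 * eps)
for c in range(len(b)):
    bp, bm = b.copy(), b.copy()
    bp[c] += eps
    bm[c] -= eps
    num_db[c] = (softmax_loss(W, bp, x, yi) - softmax_loss(W, bm, x, yi)) / (2 * eps)

print(np.allclose(dW, num_dW), np.allclose(db, num_db))   # True True
```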

y | s0   | s1   | s2   | Li (using ln)                                             | Li (using log10) | Note
0 | 1.3  | -0.1 | 0.6  | -ln( exp(s0) / (exp(s0)+exp(s1)+exp(s2)) ) = 0.5557       | 0.2413           | syi = s0
0 | 1.4  | 0.9  | 1.6  | 1.0395                                                    | 0.4515           |
0 | 1.9  | -2.1 | -0.4 | 0.1121                                                    | 0.0487           |
1 | 0.2  | -1.5 | -2   | -ln( exp(s1) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.9573       | 0.8501           | syi = s1
1 | 1.1  | -2.9 | -2.1 | 4.0574                                                    | 1.7621           |
1 | -0.3 | -1.7 | -2.8 | 1.6842                                                    | 0.7314           |
2 | -0.1 | 3.5  | 2    | -ln( exp(s2) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.7235       | 0.7485           | syi = s2
2 | -0.7 | 3.9  | 1.6  | 2.4046                                                    | 1.0443           |
2 | -1.4 | 1.7  | -1.2 | 2.9954                                                    | 1.3009           |
L = (1/N) Σi Li = 1.836640394 (using ln), or 0.7976427883 (using log10)
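A minimal sketch reproducing the table and both totals; the log10 losses are just the natural-log losses divided by ln(10):

```python
import numpy as np

X = np.array([[0.5, 0.4], [0.8, 0.3], [0.3, 0.8], [-0.4, 0.3], [-0.3, 0.7],
              [-0.7, 0.2], [0.7, -0.4], [0.5, -0.6], [-0.4, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
S = X @ np.array([[1.0, 2.0, 3.0], [2.0, -4.0, -1.0]]) + np.array([0.0, 0.5, -0.5])
rows = np.arange(len(y))

E = np.exp(S)
P = E / E.sum(axis=1, keepdims=True)      # per-sample class probabilities
Li = -np.log(P[rows, y])
print(np.round(Li, 4))
# [0.5557 1.0395 0.1121 1.9573 4.0574 1.6842 1.7235 2.4046 2.9954]
print(Li.mean())                          # 1.836640...
print((Li / np.log(10)).mean())           # 0.797642...  (the log10 variant)
```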
