Romen's eSpace: Study Notes of a Simple Neural Network

Thanks to Stanford University's online course on Loss Functions and Optimisation and its accompanying web demo, I finally have a concrete, simple example of several loss function implementations whose maths I can delve into.

The Architecture

In this architecture, $W \in \mathbb{R}^{2 \times 3}$, $x \in \mathbb{R}^{1 \times 2}$, $b \in \mathbb{R}^{1 \times 3}$ and $s \in \mathbb{R}^{1 \times 3}$.

s = f(x, W) = xW + b

Here x is a row vector, so each score is s_j = W_j ⋅ x + b_j, where W_j is the j-th column of W.

The value of W is initialised to: ${\begin{bmatrix}1&2&3\\2&-4&-1\end{bmatrix}}$

The initial value of b is set to: ${\begin{bmatrix}0&0.5&-0.5\end{bmatrix}}$

Calculating the Scores

There are 9 input samples x_i, i = 0..8, each with two features (x0, x1) and a class label y. Given this dataset, the scores s are calculated as follows:

x0	x1	y	s0 = w0,0·x0 + w1,0·x1 + b0	s1 = w0,1·x0 + w1,1·x1 + b1	s2 = w0,2·x0 + w1,2·x1 + b2
0.5	0.4	0	1·0.5 + 2·0.4 + 0 = 1.3	2·0.5 + (-4)·0.4 + 0.5 = -0.1	3·0.5 + (-1)·0.4 + (-0.5) = 0.6
0.8	0.3	0	1.4	0.9	1.6
0.3	0.8	0	1.9	-2.1	-0.4
-0.4	0.3	1	0.2	-1.5	-2
-0.3	0.7	1	1.1	-2.9	-2.1
-0.7	0.2	1	-0.3	-1.7	-2.8
0.7	-0.4	2	-0.1	3.5	2
0.5	-0.6	2	-0.7	3.9	1.6
-0.4	-0.5	2	-1.4	1.7	-1.2
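
The score table can be reproduced with a few lines of plain Python. This is only a sketch: the values of W and b are the ones read off the worked first row of the table (1·0.5 + 2·0.4 + 0, and so on), and the names `W`, `b`, `X`, `scores` and `S` are my own, not taken from the course demo.

```python
# Weights and biases as read off the worked first row of the table
# (assumption: W is 2x3 and column j holds the weights of class j).
W = [[1.0, 2.0, 3.0],
     [2.0, -4.0, -1.0]]
b = [0.0, 0.5, -0.5]

# The nine (x0, x1) inputs from the table.
X = [(0.5, 0.4), (0.8, 0.3), (0.3, 0.8),
     (-0.4, 0.3), (-0.3, 0.7), (-0.7, 0.2),
     (0.7, -0.4), (0.5, -0.6), (-0.4, -0.5)]

def scores(x):
    """Return [s0, s1, s2] with s_j = W[0][j]*x0 + W[1][j]*x1 + b[j]."""
    return [W[0][j] * x[0] + W[1][j] * x[1] + b[j] for j in range(3)]

# Round to one decimal place to hide floating-point noise.
S = [[round(s, 1) for s in scores(x)] for x in X]
for x, s in zip(X, S):
    print(x, s)
```

Each printed row should match the s0, s1, s2 columns of the table.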

Multiclass SVM Loss Functions

Weston Watkins 1999

$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

where i ∈ [0, number of input samples) and j ∈ [0, number of output classes)

Since $s_j = W_j \cdot x_i + b_j$ and $s_{y_i} = W_{y_i} \cdot x_i + b_{y_i}$, for every $j \neq y_i$ whose term is active (that is, $s_j - s_{y_i} + 1 > 0$) we have

$\frac{\partial L_i}{\partial s_j} = 1$, $\frac{\partial L_i}{\partial W_j} = x_i$ and $\frac{\partial L_i}{\partial b_j} = 1$

and all three are 0 when the term is clipped to 0 by the max. For the correct class,

$\frac{\partial L_i}{\partial s_{y_i}} = -n$, $\frac{\partial L_i}{\partial W_{y_i}} = -n \cdot x_i$ and $\frac{\partial L_i}{\partial b_{y_i}} = -n$

where n is the number of active terms in L_i, i.e. how many times s_{y_i} appears with a non-zero contribution (L_i is a sum of several terms).
n lies in the range [0, number of output classes - 1], both ends inclusive.
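
To make the role of n concrete, here is a minimal sketch of these partial derivatives for a single sample. It assumes the usual subgradient convention that a hinge term clipped to 0 contributes nothing; the function names `ww_score_grad` and `ww_param_grads` are made up for this note.

```python
def ww_score_grad(s, y, margin=1.0):
    """Gradient of L_i = sum_{j != y} max(0, s_j - s_y + margin) w.r.t. each score."""
    grad = [0.0] * len(s)
    for j in range(len(s)):
        if j != y and s[j] - s[y] + margin > 0:
            grad[j] = 1.0    # dL_i/ds_j for an active term
            grad[y] -= 1.0   # every active term adds -1 to dL_i/ds_{y_i}
    return grad

def ww_param_grads(x, s, y):
    """Chain rule through s_j = W_j . x + b_j: dW[k][j] = x_k * dL/ds_j, db = dL/ds."""
    ds = ww_score_grad(s, y)
    dW = [[xk * g for g in ds] for xk in x]
    return dW, ds[:]

# Fourth sample of the dataset: x = (-0.4, 0.3), y = 1, scores (0.2, -1.5, -2).
# Both wrong-class terms are active, so n = 2 and dL/ds_1 = -2.
print(ww_score_grad([0.2, -1.5, -2.0], 1))
print(ww_param_grads((-0.4, 0.3), [0.2, -1.5, -2.0], 1))
```

For the third sample, with scores (1.9, -2.1, -0.4) and y = 0, both terms are clipped, so n = 0 and every derivative is 0.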

y	s0	s1	s2	L_i	Note
0	1.3	-0.1	0.6	max(0, s1-s0+1) + max(0, s2-s0+1) = max(0, -0.1-1.3+1) + max(0, 0.6-1.3+1) = 0 + 0.3 = 0.3	Syi=S0
0	1.4	0.9	1.6	max(0, 0.9-1.4+1) + max(0, 1.6-1.4+1) = 0.5 + 1.2 = 1.7
0	1.9	-2.1	-0.4	max(0, -2.1-1.9+1) + max(0, -0.4-1.9+1) = max(0, -3) + max(0, -1.3) = 0 + 0 = 0
1	0.2	-1.5	-2	max(0, s0-s1+1) + max(0, s2-s1+1) = max(0, 0.2+1.5+1) + max(0, -2+1.5+1) = 2.7 + 0.5 = 3.2	Syi=S1
1	1.1	-2.9	-2.1	max(0, 1.1+2.9+1) + max(0, -2.1+2.9+1) = 5 + 1.8 = 6.8
1	-0.3	-1.7	-2.8	max(0, -0.3+1.7+1) + max(0, -2.8+1.7+1) = 2.4 + 0 = 2.4
2	-0.1	3.5	2	max(0, s0-s2+1) + max(0, s1-s2+1) = max(0, -0.1-2+1) + max(0, 3.5-2+1) = 0 + 2.5 = 2.5	Syi=S2
2	-0.7	3.9	1.6	max(0, -0.7-1.6+1) + max(0, 3.9-1.6+1) = 0 + 3.3 = 3.3
2	-1.4	1.7	-1.2	max(0, -1.4+1.2+1)+max(0, 1.7+1.2+1) = 0.8 + 3.9 = 4.7
				L = 1/N ⋅∑L_i = 2.766666667
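
The per-sample losses and their mean can be double-checked with a short script. This is a sketch rather than the demo's own code; the scores and labels are copied from the table, and `weston_watkins_loss` is a name I picked.

```python
# Scores (s0, s1, s2) and labels y copied from the table.
S = [(1.3, -0.1, 0.6), (1.4, 0.9, 1.6), (1.9, -2.1, -0.4),
     (0.2, -1.5, -2.0), (1.1, -2.9, -2.1), (-0.3, -1.7, -2.8),
     (-0.1, 3.5, 2.0), (-0.7, 3.9, 1.6), (-1.4, 1.7, -1.2)]
Y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

def weston_watkins_loss(s, y, margin=1.0):
    """Sum of hinge terms max(0, s_j - s_y + margin) over the wrong classes j."""
    return sum(max(0.0, s[j] - s[y] + margin) for j in range(len(s)) if j != y)

ww_losses = [weston_watkins_loss(s, y) for s, y in zip(S, Y)]
ww_mean = sum(ww_losses) / len(ww_losses)
print([round(v, 1) for v in ww_losses], round(ww_mean, 9))
```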

One vs. All

$L_i = \max(0, -s_{y_i} + 1) + \sum_{j \neq y_i} \max(0, s_j + 1)$

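The partial derivatives have the same form as in the Weston Watkins 1999 formula, except that each hinge term now depends on a single score. The term for a wrong class j is active when s_j + 1 > 0, and n is at most 1 because s_{y_i} appears in only one term (n = 1 when 1 - s_{y_i} > 0, otherwise n = 0).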

y	s0	s1	s2	L_i	Note
0	1.3	-0.1	0.6	max(0, 1-s0) + max(0, 1+s1) + max(0, 1+s2) = max(0, 1-1.3) + max(0, 1-0.1) + max(0, 1+0.6) = 0 + 0.9 + 1.6 = 2.5	Syi=S0
0	1.4	0.9	1.6	max(0, 1-1.4) + max(0, 1+0.9) + max(0, 1+1.6) = 0 + 1.9 + 2.6 = 4.5
0	1.9	-2.1	-0.4	0.6
1	0.2	-1.5	-2	max(0, 1+s0) + max(0, 1-s1) + max(0, 1+s2) = max(0, 1+0.2) + max(0, 1+1.5) + max(0, 1-2) = 1.2 + 2.5 + 0 = 3.7	Syi=S1
1	1.1	-2.9	-2.1	max(0, 1+1.1) + max(0, 1+2.9) + max(0, 1-2.1) = 2.1 + 3.9 + 0 = 6
1	-0.3	-1.7	-2.8	3.4
2	-0.1	3.5	2	max(0, 1+s0) + max(0, 1+s1) + max(0, 1-s2) = max(0, 1-0.1) + max(0, 1+3.5) + max(0, 1-2) = 0.9 + 4.5 + 0 = 5.4	Syi=S2
2	-0.7	3.9	1.6	5.2
2	-1.4	1.7	-1.2	4.9
				L = 1/N ⋅∑L_i = 4.022222222
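
The same kind of check works for the One vs. All loss. Again a sketch with scores and labels copied from the table; `one_vs_all_loss` is a hypothetical helper name.

```python
# Scores (s0, s1, s2) and labels y copied from the table.
S = [(1.3, -0.1, 0.6), (1.4, 0.9, 1.6), (1.9, -2.1, -0.4),
     (0.2, -1.5, -2.0), (1.1, -2.9, -2.1), (-0.3, -1.7, -2.8),
     (-0.1, 3.5, 2.0), (-0.7, 3.9, 1.6), (-1.4, 1.7, -1.2)]
Y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

def one_vs_all_loss(s, y, margin=1.0):
    """max(0, margin - s_y) for the correct class plus max(0, margin + s_j) for each wrong class."""
    correct = max(0.0, margin - s[y])
    wrong = sum(max(0.0, margin + s[j]) for j in range(len(s)) if j != y)
    return correct + wrong

ova_losses = [one_vs_all_loss(s, y) for s, y in zip(S, Y)]
ova_mean = sum(ova_losses) / len(ova_losses)
print([round(v, 1) for v in ova_losses], round(ova_mean, 9))
```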

Structured SVM

$L_i = \max(0, \max_{j \neq y_i}(s_j) - s_{y_i} + 1)$

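The partial derivatives have the same form as in the One vs. All formula, but only the highest-scoring wrong class j and the correct class y_i receive a non-zero derivative (+1 and -1 on their scores respectively), and only when the margin max(s_j) - s_{y_i} + 1 is positive.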

y	s0	s1	s2	L_i	Note
0	1.3	-0.1	0.6	max(0, max(s1, s2)-s0+1) = max(0, max(-0.1, 0.6)-1.3+1) = max(0, 0.6-1.3+1) = 0.3	Syi=S0
0	1.4	0.9	1.6	1.2
0	1.9	-2.1	-0.4	max(0, max(-2.1, -0.4)-1.9+1) = max(0, -0.4-1.9+1) = 0
1	0.2	-1.5	-2	max(0, max(s0, s2)-s1+1) = max(0, max(0.2, -2)+1.5+1) = max(0, 0.2+1.5+1) = 2.7	Syi=S1
1	1.1	-2.9	-2.1	5
1	-0.3	-1.7	-2.8	2.4
2	-0.1	3.5	2	2.5	Syi=S2
2	-0.7	3.9	1.6	3.3
2	-1.4	1.7	-1.2	3.9
				L = 1/N ⋅∑L_i = 2.366666667
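
For the Structured SVM it helps to see which wrong class wins the inner max, since that is the only wrong class the loss depends on. A small sketch, with scores and labels copied from the table and `structured_svm_loss` as a made-up name:

```python
# Scores (s0, s1, s2) and labels y copied from the table.
S = [(1.3, -0.1, 0.6), (1.4, 0.9, 1.6), (1.9, -2.1, -0.4),
     (0.2, -1.5, -2.0), (1.1, -2.9, -2.1), (-0.3, -1.7, -2.8),
     (-0.1, 3.5, 2.0), (-0.7, 3.9, 1.6), (-1.4, 1.7, -1.2)]
Y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

def structured_svm_loss(s, y, margin=1.0):
    """Return (loss, j_star), where j_star is the highest-scoring wrong class."""
    j_star = max((j for j in range(len(s)) if j != y), key=lambda j: s[j])
    return max(0.0, s[j_star] - s[y] + margin), j_star

results = [structured_svm_loss(s, y) for s, y in zip(S, Y)]
ssvm_losses = [loss for loss, _ in results]
ssvm_mean = sum(ssvm_losses) / len(ssvm_losses)
for (loss, j_star), y in zip(results, Y):
    print(f"y={y} strongest wrong class={j_star} L_i={loss:.1f}")
print(round(ssvm_mean, 9))
```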

Softmax

$L_i = -\ln P(Y = y_i \mid X = x_i) = -\ln\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$

Finding the partial derivatives using the chain rule, first for a class j ≠ y_i:

let $\frac{\partial L_i}{\partial W_j} = \frac{\partial L_i}{\partial u} \cdot \frac{\partial u}{\partial v} \cdot \frac{\partial v}{\partial W_j}$

where $u = e^{s_j}$, $v = s_j = W_j \cdot x_i + b_j$

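solve for each term:

$\frac{\partial v}{\partial W_j} = \frac{\partial (W_j \cdot x_i + b_j)}{\partial W_j} = x_i$

$\frac{\partial u}{\partial v} = \frac{\partial e^v}{\partial v} = e^v = e^{s_j}$

$\frac{\partial L_i}{\partial u} = \frac{\partial}{\partial u}\left[-\ln\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}\right] = \frac{\partial}{\partial u}\left[-\ln e^{s_{y_i}} + \ln \sum_k e^{s_k}\right]$

$= \frac{\partial}{\partial u}\left[-s_{y_i} + \ln \sum_k e^{s_k}\right] = -0 + \frac{\partial}{\partial u} \ln \sum_k e^{s_k}$

$= \frac{1}{\sum_k e^{s_k}}$

(The sum over k runs over all classes and contains u = e^{s_j} exactly once. The chain rule steps showing ∂[ln(A+B+C)]/∂A = 1/(A+B+C) are omitted.)

therefore, multiplying the three terms and writing $P_j = e^{s_j} / \sum_k e^{s_k}$,

$\frac{\partial L_i}{\partial W_j} = \frac{1}{\sum_k e^{s_k}} \cdot e^{s_j} \cdot x_i = P_j \cdot x_i$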

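Solve for $\frac{\partial L_i}{\partial b_j} = \frac{\partial L_i}{\partial u} \cdot \frac{\partial u}{\partial v} \cdot \frac{\partial v}{\partial b_j}$

the only new term is $\frac{\partial v}{\partial b_j} = \frac{\partial (W_j \cdot x_i + b_j)}{\partial b_j} = 1$

therefore,

$\frac{\partial L_i}{\partial b_j} = P_j$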

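Solve for $\frac{\partial L_i}{\partial W_{y_i}}$

let $\frac{\partial L_i}{\partial W_{y_i}} = \frac{\partial L_i}{\partial u} \cdot \frac{\partial u}{\partial v} \cdot \frac{\partial v}{\partial W_{y_i}}$ where $u = e^{s_{y_i}}$, $v = s_{y_i} = W_{y_i} \cdot x_i + b_{y_i}$

solve for each term:

$\frac{\partial v}{\partial W_{y_i}} = \frac{\partial (W_{y_i} \cdot x_i + b_{y_i})}{\partial W_{y_i}} = x_i$

$\frac{\partial u}{\partial v} = \frac{\partial e^v}{\partial v} = e^v = e^{s_{y_i}}$

$\frac{\partial L_i}{\partial u} = \frac{\partial}{\partial u}\left[-\ln\frac{u}{\sum_k e^{s_k}}\right] = \frac{\partial}{\partial u}\left[-\ln u + \ln \sum_k e^{s_k}\right]$

$= \frac{\partial}{\partial u}\left[-\ln u\right] + \frac{\partial}{\partial u}\left[\ln \sum_k e^{s_k}\right]$

$= -\frac{1}{u} + \frac{1}{\sum_k e^{s_k}} = -\frac{1}{e^{s_{y_i}}} + \frac{1}{\sum_k e^{s_k}}$

therefore, multiplying the three terms,

$\frac{\partial L_i}{\partial W_{y_i}} = \left(-\frac{1}{e^{s_{y_i}}} + \frac{1}{\sum_k e^{s_k}}\right) \cdot e^{s_{y_i}} \cdot x_i = (P_{y_i} - 1) \cdot x_i$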

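Solve for $\frac{\partial L_i}{\partial b_{y_i}} = \frac{\partial L_i}{\partial u} \cdot \frac{\partial u}{\partial v} \cdot \frac{\partial v}{\partial b_{y_i}}$

the only different term here is

$\frac{\partial v}{\partial b_{y_i}} = \frac{\partial (W_{y_i} \cdot x_i + b_{y_i})}{\partial b_{y_i}} = 1$

therefore,

$\frac{\partial L_i}{\partial b_{y_i}} = P_{y_i} - 1$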
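
A quick way to gain confidence in these results is to compare them with a numerical derivative. The sketch below checks the bias gradients (P_j for j ≠ y_i and P_{y_i} - 1 for the correct class) against central finite differences on the scores, which works because ∂s_j/∂b_j = 1. The function names and the step size h are my own choices.

```python
import math

def softmax(s):
    """Class probabilities P_j; the max is subtracted for numerical stability."""
    m = max(s)
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]

def softmax_loss(s, y):
    return -math.log(softmax(s)[y])

def analytic_grad(s, y):
    """dL_i/ds_j = P_j for j != y and P_y - 1 for j == y."""
    p = softmax(s)
    return [p[j] - (1.0 if j == y else 0.0) for j in range(len(s))]

def numeric_grad(s, y, h=1e-6):
    """Central finite differences on each score."""
    grad = []
    for j in range(len(s)):
        up, down = list(s), list(s)
        up[j] += h
        down[j] -= h
        grad.append((softmax_loss(up, y) - softmax_loss(down, y)) / (2 * h))
    return grad

s, y = [1.3, -0.1, 0.6], 0   # first sample of the dataset
ga, gn = analytic_grad(s, y), numeric_grad(s, y)
max_diff = max(abs(a - n) for a, n in zip(ga, gn))
print(ga, gn, max_diff)
```

The two gradients should agree to several decimal places, and the analytic gradient sums to zero because the probabilities sum to 1.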

y	s0	s1	s2	L_i (using ln)	L_i (log10)	Note
0	1.3	-0.1	0.6	-ln( exp(s0) / (exp(s0)+exp(s1)+exp(s2)) ) = 0.556	0.241	Syi=S0
0	1.4	0.9	1.6	1.040	0.451
0	1.9	-2.1	-0.4	0.112	0.049
1	0.2	-1.5	-2	-ln( exp(s1) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.957	0.850	Syi=S1
1	1.1	-2.9	-2.1	4.057	1.762
1	-0.3	-1.7	-2.8	1.684	0.731
2	-0.1	3.5	2	-ln( exp(s2) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.724	0.749	Syi=S2
2	-0.7	3.9	1.6	2.405	1.044
2	-1.4	1.7	-1.2	2.995	1.301
				L = 1/N ⋅∑L_i = 1.836640394	0.7976427883
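
The softmax losses can be recomputed directly from the scores. This is a sketch with scores and labels copied from the table; `softmax_loss` is a hypothetical helper that uses the log-sum-exp form of the formula, and the log10 figures are simply the natural-log losses divided by ln(10).

```python
import math

# Scores (s0, s1, s2) and labels y copied from the table.
S = [(1.3, -0.1, 0.6), (1.4, 0.9, 1.6), (1.9, -2.1, -0.4),
     (0.2, -1.5, -2.0), (1.1, -2.9, -2.1), (-0.3, -1.7, -2.8),
     (-0.1, 3.5, 2.0), (-0.7, 3.9, 1.6), (-1.4, 1.7, -1.2)]
Y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

def softmax_loss(s, y):
    """-ln(exp(s_y) / sum_j exp(s_j)), written as log-sum-exp minus s_y."""
    m = max(s)  # subtract the max before exponentiating for stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in s))
    return log_sum - s[y]

ln_losses = [softmax_loss(s, y) for s, y in zip(S, Y)]
log10_losses = [v / math.log(10) for v in ln_losses]
ln_mean = sum(ln_losses) / len(ln_losses)
log10_mean = sum(log10_losses) / len(log10_losses)

for y, a, c in zip(Y, ln_losses, log10_losses):
    print(f"y={y} ln: {a:.3f} log10: {c:.3f}")
print(ln_mean, log10_mean)
```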

Saturday, 1 June 2024
