Saturday 29 June 2024

First Art Sale

 It's been a fruitful month. I joined the local art society this year and participated in the Half-Price Sale for the first time. It was on the weekend of June 1st and 2nd. Despite the terrible weather on Saturday, we had a pretty good turnout overall. I managed to sell one of the four paintings that I entered in the sale. They are all landscapes from our Tasmania holiday 2 years ago.


However, the most fun I had this month was learning neural networks by following the cs231n course. I am up to the last 2 parts of assignment 2. I hope to finish it this weekend!

Saturday 22 June 2024

Calculating Gradient of Batch Normalisation

Part of cs231n assignment 2 is to calculate the gradients of the Batch Normalisation layer. Here are the equations defining BN:

X = [ x1
      x2
      ...
      xN ]    with dimension (N, D)

μ = (1/N) Σk=1..N xk    with dimension (D,)
v = (1/N) Σk=1..N (xk − μ)²    with dimension (D,)
σ = sqrt(v + ε)    with dimension (D,)
yi = (xi − μ) / σ    where yi is of dimension (D,) and Y (or x_hat) is (N, D)
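
To make the shapes concrete, here is a minimal numpy sketch of the forward pass as defined above (my own illustration rather than the assignment's batchnorm_forward; the name batchnorm_forward_sketch and the cache layout are mine, chosen to match what the backward code further down unpacks):

import numpy as np

def batchnorm_forward_sketch(x, gamma, beta, eps=1e-5):
    # x: (N, D); gamma, beta: (D,)
    mu = x.mean(axis=0)                     # (D,)  batch mean
    var = ((x - mu) ** 2).mean(axis=0)      # (D,)  biased variance v
    std = np.sqrt(var + eps)                # (D,)  sigma
    x_hat = (x - mu) / std                  # (N, D) normalised Y
    out = gamma * x_hat + beta              # (N, D) scale and shift
    cache = (x, mu, var, std, gamma, x_hat, eps)
    return out, cache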
The basic partial derivatives of the above equations are as follows. They are the building blocks for finding the final ∂L/∂x.
L is the loss function and ∂L/∂Y = γ × dout, of dimension (N, D).
∂μ/∂xi = 1/N    of dimension (D,)
∂v/∂μ = (1/N) Σk −2(xk − μ) = −(2/N) Σk (xk − μ)    of dimension (D,)
This turns out to be 0, because the sum of the xk and the sum of the N copies of μ are the same.
However, ∂v/∂xi = (2/N)(xi − μ)
∂σ/∂v = 0.5 × 1/sqrt(v + ε) = 1/(2σ)
∂yi/∂μ = −1/σ
∂yi/∂σ = −(xi − μ)/σ²

Thanks to this post I understood the process using a computational graph. Each node below is listed with its forward definition (read top-down) and, underneath it, the backward-pass gradient that flows through it (read bottom-up):

(1) := x, shape (N, D)
    backward: dx = d(3) + d(2), the sum of the gradients arriving from node (3) and from the mean node (2)
(2) := mean = (1/N) Σi xi, shape (D,)
    backward: d(2) = d(3→2) * 1/N * np.ones((N, D)), i.e. multiply by ∂μ/∂x; this branch carries ∂L/∂μ * ∂μ/∂x and ∂L/∂v * ∂v/∂μ * ∂μ/∂x
(3) := (1) − (2), shape (N, D)
    backward to (1): d(4) + d(8→3), i.e. ∂L/∂v * ∂v/∂x plus the −∂L/∂μ terms flowing straight back to x
    backward to (2): (−1) * (d(4) + d(8→3)).sum(axis=0) = Σ(∂L/∂μ + ∂L/∂μ2)
(4) := (3) ** 2
    backward: d(4) = d(5) * 2 * (3), i.e. multiply by ∂v/∂x (its detour through node (2) gives the ∂L/∂μ2 term)
(5) := var = (1/N) Σi (4)i, shape (D,)
    backward: d(5) = d(6) * 1/N * np.ones((N, D))
(6) := std = sqrt((5) + ε), shape (D,)
    backward: d(6) = d(7) * 0.5 * 1/std, i.e. multiply by ∂σ/∂v, giving ∂L/∂v
(7) := 1/(6)
    backward: d(7) = d(8→7) * (−1/((6)**2)), i.e. multiply by ∂Y/∂σ, giving ∂L/∂σ
(8) := (3) * (7), shape (N, D) (this is x_hat)
    backward to (3): d(8→3) = d(10) * (7)
    backward to (7): d(8→7) = (d(10) * (3)).sum(axis=0)
(9) := γ, shape (D,)
    backward: dγ = (dout * (8)).sum(axis=0)
(10) := (8) * (9), shape (N, D)
    backward: d(10) = dout * γ = ∂L/∂Y
(11) := β, shape (D,)
    backward: dβ = dout.sum(axis=0)
(12) := (10) + (11) = out, shape (N, D), which feeds the Loss
    backward: dout
In python:
def batchnorm_backward(dout, cache):
    x, sample_mean, sample_var, sample_std, gamma, x_hat, eps = cache
    N, D = dout.shape

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    # using the computational graph in https://romenlaw.blogspot.com/2024/06/calculating-gradient-using-computation.html
    step3 = x - sample_mean                  # node (3) = x - mean

    d10 = dout * gamma                       # dL/dY
    d8_3 = d10 * (1 / sample_std)            # gradient into node (3) via node (8)
    d8_7 = (d10 * step3).sum(axis=0)         # gradient into node (7)
    d7 = -d8_7 / (sample_var + eps)          # through 1/(6): multiply by -1/(6)**2
    d6 = d7 * 0.5 / sample_std               # through sqrt: multiply by 0.5/std
    d5 = d6 / N * np.ones(shape=(N, D))      # through the mean over N
    d4 = d5 * 2 * step3                      # through the square
    d3_1 = d4 + d8_3                         # (N,D) gradient into node (3)
    d3_2 = -1 * (d4 + d8_3).sum(axis=0)      # (D,)  gradient into node (2)
    d2 = d3_2 / N * np.ones(shape=(N, D))    # (N,D) through the mean
    dx = d2 + d3_1

    return dx, dgamma, dbeta

Intuitively, it's like following the 3 paths from Y back to X along the backward-pass arrows. Now let's do it the analytical way using the chain rule.
From page 4 of the original paper https://arxiv.org/pdf/1502.03167 we have the formulae for the derivatives. However, since the equations used in assignment 2 differ from the paper's (especially in how the variance and mean are used to calculate x_hat, i.e. y), we can rewrite the derivatives using the notation of assignment 2:
 
∂L/∂xi = ( ∂L/∂μ * ∂μ/∂xi + ∂L/∂v * ∂v/∂μ * ∂μ/∂xi ) + ∂L/∂v * ∂v/∂xi − ∂L/∂yi * ∂yi/∂μ
     (the last term is the direct path, using ∂yi/∂xi = 1/σ = −∂yi/∂μ)
     = Σk ∂L/∂yk * ∂yk/∂μ * ∂μ/∂xi + Σk ∂L/∂yk * ∂yk/∂σ * ∂σ/∂v * ∂v/∂μ * ∂μ/∂xi + Σk ∂L/∂yk * ∂yk/∂σ * ∂σ/∂v * ∂v/∂xi − ∂L/∂yi * ∂yi/∂μ
     = [ Σk doutk*γ * (−1/σ) * (1/N)
         + Σk doutk*γ * (−(xk−μ)/σ²) * (1/(2σ)) * (−(2/N) Σk (xk−μ)) * (1/N) ]
       + Σk doutk*γ * (−(xk−μ)/σ²) * (1/(2σ)) * (2/N)(xi−μ)
       + douti*γ * (1/σ)
     = −(1/N) Σk [ doutk*γ/σ − (1/N)( Σk doutk*γ*yk/σ ) * yk ]        ← in the python code below, this is d_mu_x
       − (1/N)( Σk doutk*γ*yk/σ ) * yi        ← in the python code below, this is dL_v_x
       + douti*γ/σ        ← in the python code below, this enters as −dL_mu
In python code (modified slightly from here for readability):
def batchnorm_backward_alt(dout, cache):
    x, mean, var, std, gamma, x_hat, eps = cache
    S = lambda a: a.sum(axis=0)             # helper: sum over the batch axis

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    N = dout.shape[0]                       # dout has dimension (N, D)
    dx = dout * gamma / (N * std)           # temporary scale value dout*gamma/(N*sigma)

    dL_v_x = -S(dx * x_hat) * x_hat         # the variance path
    dL_mu = -N * dx                         # (the negative of) the direct path

    dL_mu2 = -dL_v_x                        # the dv/dmu contribution (numerically 0)
    d_mu_x = S(-dx + dL_mu2)                # the mean path
    # d_mu_x = -S(dx)                       # the standard answer drops dL_mu2

    dx = dL_v_x - dL_mu + d_mu_x
    return dx, dgamma, dbeta
The dx difference between this and the above method is about 1e-10. Curiously, the standard answer ignores the dL_mu2 term yet yields a better result of about 5e-13. I wondered why. Two months later, after watching Andrej Karpathy's lecture on backprop, I realised that ∂v/∂μ is actually 0.
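
For anyone wanting to reproduce these error comparisons outside the course notebook, a rough central-difference check of my own (not the assignment's gradient-check utility) can be run against the batchnorm_forward_sketch and batchnorm_backward functions above; num_grad and the toy shapes are made up for illustration:

import numpy as np

def num_grad(f, x, dout, h=1e-5):
    # central-difference gradient of sum(f(x) * dout) with respect to x
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()
        x[ix] = old - h
        neg = f(x).copy()
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
N, D = 4, 5
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward_sketch(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
dx_num = num_grad(lambda a: batchnorm_forward_sketch(a, gamma, beta)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # expect a very small number, comparable to the errors above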

Sunday 16 June 2024

Summary of a Fully Connected Neural Network

 I usually spend my weekends on painting. For the last couple of weeks, however, I have been learning Deep Learning by following the cs231n course. Now that I have just finished Assignment 1, the two main things I have learned are the theory/maths taught in the course and how to use numpy to implement it. Here is my summary of what I have learned, using the two fully connected-layer neural network.

The architecture (Forward pass should be read from bottom up; Back propagation is top down):


Layers, with the forward pass and backward pass for each:

Output layer
number of nodes (classes): C
scores: (C,)
Loss function: Softmax(f(x)): Li = −ln( e^syi / Σj e^sj ) = −syi + ln Σj e^sj
L = (1/N) Σi Li + R(W)
Regularisation: R(W) = ½ λ Σk Σl Wk,l²
Forward:
# x is the output of the previous layer
N=x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)          

loss = -np.log(P[range(N), y]).sum() / N  

loss += 0.5 * self.reg * (np.sum(self.params['W2']**2)
+ np.sum(self.params['W1']**2) )
Gradients:
∂Li/∂sj = Pj
∂Li/∂syi = Pyi − 1
# x is the scores
# P = exp(scores) / scores_exp_sum, dimension is (N,C)
# grad x_j = Pj
# grad x_yi = Pyi-1
N=x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True)) # numerically stable exponents
P /= P.sum(axis=1, keepdims=True)            # row-wise probabilities (softmax)

P[range(N), y] -= 1
dx = P / N
Fully Connected Layer #2

W2: (H, C)
b2: (C,)
f(x) = x·W2 + b2
Forward:
# X is the output of the previous layer
scores = X.dot(W) + b
Gradients FC2
∂R/∂W = λW

Tip: use dimension analysis! Note that you do not need to remember the expressions for dW and dX because they are easy to re-derive based on dimensions.
# dout is the gradient passed in from the Output layer
# i.e. the dx from above
dx = dout.dot(w.T).reshape(x.shape)

dw = x.reshape(x.shape[0], np.prod(x.shape[1:])).T.dot(dout)
dw += self.reg * w

db = np.sum(dout, axis=0)
Fully Connected Layer #1
number of nodes: H
W1: (D, H)
b1: (H,)
f(x) = x·W1 + b1, followed by the activation ReLU(f(x))
Forward:
out = input.dot(w) + b
out = np.maximum(0, out)
Gradients FC1
ReLU backward:
# dout is the gradient from the FC2 layer above (i.e. the dx from above)
# only let the gradient through where the forward input was positive
dx = dout * (x > 0)
f(x) backward: same as the FC2 layer above
Input
input data dimension: D
number of input data/rows: N
X: (N, D)
The input images are (32, 32, 3), and each is reshaped into a vector of 32 x 32 x 3 = 3072 values, i.e. D = 3072.
# reshape x into (N, D)
input = x.reshape(x.shape[0], np.prod(x.shape[1:]))
# or, more simply
input = x.reshape(x.shape[0], -1)
Here is a good summary of the different optimisation algorithms: https://www.youtube.com/watch?v=spbBQshdhL4

Some of my learnings from doing assignment 1:

method | pre-process | best accuracy
KNN | reshape 32x32x3 into 3072 | 28% with K=10
1-layer SVM | reshape 32x32x3 into 3072, zero-center each image (by subtracting the mean of the training set), append a bias column (initialised to 1) to each image | training: 37%, validation: 38%, with lr=1e-7, reg=5e4
1-layer Softmax | same as SVM above | training: 33%, validation: 34%, with lr=1e-7, reg=2.5e4
2-layer | reshape 32x32x3 into 3072 | validation: 53.8%, test: 52.7%, with lr=1e-3, reg=0.5, epochs=20, H size=100
1-layer SVM on features | extract 2 features (HOG, color histogram) for each image, zero-center the feature values, normalise the feature values, add a bias dimension | test: 41.4%
2-layer on features | same as above | test: 60.3%, with lr=1.209071e-01, epochs=10, H=274, reg=0.000001

K-Nearest Neighbour (KNN)

The idea behind this approach is to compute the L2 distance between each test image and every training image (summing the squared pixel differences). There is no real training involved; the distance calculation happens at test time.
Performance-wise on Colab with CPU only: two for loops took 43s; one loop took 51s (with sqrt()) or 38s (without sqrt()); using Numpy's broadcasting feature took less than 1s.
Two loops:
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # this takes 43s to run with sqrt, 35s without
                dists[i, j] = np.sum((self.X_train[j] - X[i]) ** 2)
        return dists
Vectorisation approach:
        # using (I1-I2)^2 = I1^2 + I2^2 - 2*I1*I2
        # this takes about 1s
        dists = np.sum(self.X_train ** 2, axis=1) \
            + np.sum(X ** 2, axis=1)[:, np.newaxis] \
            - 2 * np.dot(X, self.X_train.T)

        return dists

The output dists stores num_test rows of distances; each row contains num_train columns, and dists[i, j] is the L2 distance between the ith test image and the jth training image.

dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
(500, 5000)

Using KNN to predict an image's classification is basically finding the indices of the K shortest distances in that image's row of dists, then finding the most frequent y-label among those K elements:
    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # DONE:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            indices=np.argsort(dists[i])[:k]
            closest_y=self.y_train[indices]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # DONE:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            values, counts = np.unique(closest_y, return_counts=True)
            most_frequent_value = values[counts.argmax()]
            y_pred[i]=most_frequent_value

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred

Testing with various K values:
From the chart, the value K=10 seems to yield the highest accuracy. The testing result using this K value is about 28%.
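
For the record, a comparison like that chart can be produced with a loop of this kind (a simplified hold-out sketch rather than the assignment's 5-fold cross-validation; X_val, y_val and the candidate k values are placeholders of mine, and classifier is the trained k-NN object used above):

import numpy as np

# X_val / y_val are a held-out split of the training data
dists_val = classifier.compute_distances_two_loops(X_val)

accuracies = {}
for k in [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]:
    y_val_pred = classifier.predict_labels(dists_val, k=k)
    accuracies[k] = np.mean(y_val_pred == y_val)
    print('k = %d, accuracy = %.3f' % (k, accuracies[k]))

best_k = max(accuracies, key=accuracies.get)   # K=10 in my case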

Linear Classifier SVM

Forward pass and backward propagation (the SGD training loop):
idx = np.random.choice(num_train, size=batch_size, replace=False)
X_batch = X[idx]
y_batch = y[idx]

# evaluate loss and gradient
loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)
# perform parameter update
self.W -= learning_rate * grad
The loss and gradient calculation:
def svm_loss_vectorized(W, X, y, reg):
    ####################
    # calculate loss
    ####################
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    num_train = X.shape[0]
    scores = X.dot(W)
    # extract all Syi into an N x 1 column
    scores_yi = scores[np.arange(num_train), y][:, np.newaxis]
    margins = np.maximum(0, scores - scores_yi + 1)
    # set all yi elements to 0
    margins[np.arange(num_train), y] = 0

    loss = np.mean(np.sum(margins, axis=1))
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    ####################
    # calculate gradient
    ####################
    mask = np.zeros(margins.shape)
    # for positions where margins>0, the gradient at Sj is X[i]
    mask[margins > 0] = 1

    # for Yi positions, it's -nX[i], where n is the number of classes whose
    # margin is positive, i.e. the number of times Syi appeared in the margins
    row_sum = np.sum(mask, axis=1)
    mask[np.arange(num_train), y] = -row_sum

    dW += np.dot(X.T, mask)
    dW /= num_train

    # Regularize (matching the reg * sum(W*W) term in the loss above)
    dW += 2 * reg * W

    return loss, dW
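
A cheap sanity check of dW is to perturb a single weight and compare against a central difference of the loss (an ad-hoc check of my own with made-up toy shapes, not the assignment's gradient-check helper):

import numpy as np

np.random.seed(1)
X_toy = np.random.randn(20, 6)          # 20 samples, 6 features
y_toy = np.random.randint(0, 3, 20)     # 3 classes
W_toy = np.random.randn(6, 3) * 0.01
reg = 0.1

loss, dW = svm_loss_vectorized(W_toy, X_toy, y_toy, reg)

h = 1e-5
i, j = 2, 1                             # pick one weight to check
Wp = W_toy.copy(); Wp[i, j] += h
Wm = W_toy.copy(); Wm[i, j] -= h
num = (svm_loss_vectorized(Wp, X_toy, y_toy, reg)[0]
       - svm_loss_vectorized(Wm, X_toy, y_toy, reg)[0]) / (2 * h)
print(dW[i, j], num)                    # should agree closely (hinge-loss kinks aside)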
visualising the learned weights:

Linear Classifier Softmax

The only difference here is the loss and gradient calculation:
def softmax_loss_vectorized(W, X, y, reg):

    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)


    # *****calculate scores*****
    num_train=X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    # scores is N x C matrix
    scores -= np.max(scores, axis=1)[:, np.newaxis]
    # scores_y and scores_exp_sum are 1-dimensional, with N elements
    scores_y = scores[np.arange(num_train),y]
    scores_exp_sum = np.sum(np.exp(scores), axis=1)

    # *****calculate loss*****
    losses = np.log(scores_exp_sum) - scores_y
    loss=np.sum(losses) / num_train
    loss += reg*np.sum(W**2)

    # *****calculate gradient*****
    # P = exp(scores) / scores_exp_sum, dimension is (N,C)
    # grad Wj = Pj * xi
    # grad Wyi = (Pyi-1) * xi
    P=np.exp(scores) / scores_exp_sum[:, np.newaxis]
    P[np.arange(num_train), y] -= 1
    dW += X.T.dot(P)
    dW /= num_train
    dW += reg * 2 * W

    return loss, dW

visualising the weights:


2-Layer Neural Network

Visualising the output of bad hyperparameters: slow learning rate, low accuracy, indistinct (grainy, noisy) features

Visualising the output of better hyperparameters (though the accuracy chart suggests overfitting):

Features

'Manually' extract 2 features for each image: Histogram of Oriented Gradients (HOG) and color histogram. Use these features as input for the networks.
The best accuracy results show that using features is more effective than the raw images alone.
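
For context, the feature extraction step looks roughly like this (a sketch from memory that assumes the hog_feature, color_histogram_hsv and extract_features helpers shipped with the assignment's cs231n.features module; X_train, X_val and the bin count are placeholders):

import numpy as np
from cs231n.features import hog_feature, color_histogram_hsv, extract_features

num_color_bins = 10   # illustrative hyperparameter
feature_fns = [hog_feature,
               lambda img: color_histogram_hsv(img, nbin=num_color_bins)]

# each row becomes [HOG features, color histogram] for one image
X_train_feats = extract_features(X_train, feature_fns, verbose=True)
X_val_feats = extract_features(X_val, feature_fns)

# zero-center, normalise, then add a bias dimension
mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)
X_train_feats -= mean_feat
X_val_feats -= mean_feat
std_feat = np.std(X_train_feats, axis=0, keepdims=True)
X_train_feats /= std_feat
X_val_feats /= std_feat
X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])
X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])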
Some interesting visuals:







I am having so much fun following this course, I am going to explore more of the AI related courses.


Saturday 1 June 2024

Study Notes of a Simple Neural Network

 Thanks to Stanford University's online course on Loss Functions and Optimisation and its accompanying web demo, I finally have a concrete, simple example of several loss function implementations whose maths I can delve into.

The Architecture


In this architecture, W ∈ ℝ^(2×3), x ∈ ℝ^(1×2), b ∈ ℝ^(1×3), s ∈ ℝ^(1×3).

s = f(x, W) = xW + b

The value of W is initialised to:
W = [  1   2   3
       2  -4  -1 ]

The initial value of b is set to:
b = [ 0   0.5   -0.5 ]

Calculating the Scores

There are 9 sets of input values for X, hence Xi, i = [0..8]. Given the dataset, the scores s are calculated as:

x0 | x1 | y | s0 = w0,0*x0 + w1,0*x1 + b0 | s1 = w0,1*x0 + w1,1*x1 + b1 | s2 = w0,2*x0 + w1,2*x1 + b2
0.5 | 0.4 | 0 | 1*0.5 + 2*0.4 + 0 = 1.3 | 2*0.5 + (-4)*0.4 + 0.5 = -0.1 | 3*0.5 + (-1)*0.4 + (-0.5) = 0.6
0.8 | 0.3 | 0 | 1.4 | 0.9 | 1.6
0.3 | 0.8 | 0 | 1.9 | -2.1 | -0.4
-0.4 | 0.3 | 1 | 0.2 | -1.5 | -2
-0.3 | 0.7 | 1 | 1.1 | -2.9 | -2.1
-0.7 | 0.2 | 1 | -0.3 | -1.7 | -2.8
0.7 | -0.4 | 2 | -0.1 | 3.5 | 2
0.5 | -0.6 | 2 | -0.7 | 3.9 | 1.6
-0.4 | -0.5 | 2 | -1.4 | 1.7 | -1.2
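
The same scores can be reproduced in a few lines of numpy (my own quick check of the table above, using the row-vector convention s = xW + b):

import numpy as np

W = np.array([[1.0,  2.0,  3.0],
              [2.0, -4.0, -1.0]])              # (2, 3)
b = np.array([0.0, 0.5, -0.5])                 # (3,)
X = np.array([[ 0.5,  0.4], [ 0.8,  0.3], [ 0.3,  0.8],
              [-0.4,  0.3], [-0.3,  0.7], [-0.7,  0.2],
              [ 0.7, -0.4], [ 0.5, -0.6], [-0.4, -0.5]])   # (9, 2)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

S = X.dot(W) + b                               # (9, 3) scores, matching the table
print(S[0])                                    # approximately [ 1.3 -0.1  0.6]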

Multiclass SVM Loss Functions

Weston Watkins 1999

Li = Σ(j≠yi) max(0, sj − syi + 1)
where i = [0, number of input samples), j = [0, number of output classes)
since sj = Wj·xi + bj and syi = Wyi·xi + byi,
we have (for the terms where the margin is positive):
∂Li/∂sj = 1,   ∂Li/∂Wj = xi   and   ∂Li/∂bj = 1
∂Li/∂syi = −n,   ∂Li/∂Wyi = −n·xi   and   ∂Li/∂byi = −n
where n is the number of times syi appears in Li with a positive margin (since Li is a sum of many terms).
n is in the range [0, number of output classes − 1].

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, s1-s0+1) + max(0, s2-s0+1) = max(0, -0.1-1.3+1) + max(0, 0.6-1.3+1) = 0 + 0.3 = 0.3 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | max(0, 0.9-1.4+1) + max(0, 1.6-1.4+1) = 0.5 + 1.2 = 1.7 |
0 | 1.9 | -2.1 | -0.4 | max(0, -2.1-1.9+1) + max(0, -0.4-1.9+1) = max(0, -3) + max(0, -1.3) = 0 + 0 = 0 |
1 | 0.2 | -1.5 | -2 | max(0, s0-s1+1) + max(0, s2-s1+1) = max(0, 0.2+1.5+1) + max(0, -2+1.5+1) = 2.7 + 0.5 = 3.2 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | max(0, 1.1+2.9+1) + max(0, -2.1+2.9+1) = 5 + 1.8 = 6.8 |
1 | -0.3 | -1.7 | -2.8 | max(0, -0.3+1.7+1) + max(0, -2.8+1.7+1) = 2.4 + 0 = 2.4 |
2 | -0.1 | 3.5 | 2 | max(0, s0-s2+1) + max(0, s1-s2+1) = max(0, -0.1-2+1) + max(0, 3.5-2+1) = 0 + 2.5 = 2.5 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | max(0, -0.7-1.6+1) + max(0, 3.9-1.6+1) = 0 + 3.3 = 3.3 |
2 | -1.4 | 1.7 | -1.2 | max(0, -1.4+1.2+1) + max(0, 1.7+1.2+1) = 0.8 + 3.9 = 4.7 |

L = (1/N) Σ Li = 2.766666667
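
Reusing S and y from the numpy snippet under Calculating the Scores, here is a quick vectorised check of the table above (my own sketch):

margins = np.maximum(0, S - S[np.arange(len(y)), y][:, None] + 1)
margins[np.arange(len(y)), y] = 0      # drop the j == yi terms
L_i = margins.sum(axis=1)              # per-sample losses: 0.3, 1.7, 0, ...
print(L_i, L_i.mean())                 # the mean is 2.7666...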

One vs. All

Li = max(0, −syi + 1) + Σ(j≠yi) max(0, sj + 1)
The partial derivatives are the same as the Weston Watkins 1999 formula, with n=1 because there is only one Syi in the formula.

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, 1-s0) + max(0, 1+s1) + max(0, 1+s2) = max(0, 1-1.3) + max(0, 1-0.1) + max(0, 1+0.6) = 0 + 0.9 + 1.6 = 2.5 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | max(0, 1-1.4) + max(0, 1+0.9) + max(0, 1+1.6) = 0 + 1.9 + 2.6 = 4.5 |
0 | 1.9 | -2.1 | -0.4 | 0.6 |
1 | 0.2 | -1.5 | -2 | max(0, 1+s0) + max(0, 1-s1) + max(0, 1+s2) = max(0, 1+0.2) + max(0, 1+1.5) + max(0, 1-2) = 1.2 + 2.5 + 0 = 3.7 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | max(0, 1+1.1) + max(0, 1+2.9) + max(0, 1-2.1) = 2.1 + 3.9 + 0 = 6 |
1 | -0.3 | -1.7 | -2.8 | 3.4 |
2 | -0.1 | 3.5 | 2 | max(0, 1+s0) + max(0, 1+s1) + max(0, 1-s2) = max(0, 1-0.1) + max(0, 1+3.5) + max(0, 1-2) = 0.9 + 4.5 + 0 = 5.4 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 5.2 |
2 | -1.4 | 1.7 | -1.2 | 4.9 |

L = (1/N) Σ Li = 4.022222222

Structured SVM

Li = max(0, max(j≠yi)(sj) − syi + 1)
The partial derivatives are the same as the One vs. All formula.

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, max(s1, s2)-s0+1) = max(0, max(-0.1, 0.6)-1.3+1) = max(0, 0.6-1.3+1) = 0.3 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | 1.2 |
0 | 1.9 | -2.1 | -0.4 | max(0, max(-2.1, -0.4)-1.9+1) = max(0, -0.4-1.9+1) = 0 |
1 | 0.2 | -1.5 | -2 | max(0, max(s0, s2)-s1+1) = max(0, max(0.2, -2)+1.5+1) = max(0, 0.2+1.5+1) = 2.7 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | 5 |
1 | -0.3 | -1.7 | -2.8 | 2.4 |
2 | -0.1 | 3.5 | 2 | 2.5 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 3.3 |
2 | -1.4 | 1.7 | -1.2 | 3.9 |

L = (1/N) Σ Li = 2.366666667

Softmax

Li = −ln( P(Y=yi | X=xi) ) = −ln( e^syi / Σj e^sj )
Finding the partial derivatives using the chain rule:
let ∂Li/∂Wj = ∂Li/∂u × ∂u/∂v × ∂v/∂Wj, where u = e^sj and v = sj = Wj·xi + bj

solve for each term:
∂v/∂Wj = ∂(Wj·xi + bj)/∂Wj = xi
∂u/∂v = ∂(e^v)/∂v = e^v = e^sj
∂Li/∂u = ∂[−ln( e^syi / Σj u )]/∂u = ∂[−ln( e^syi ) + ln( Σj u )]/∂u
       = ∂[−syi + ln( Σj u )]/∂u = 0 + ∂[ln( Σj u )]/∂u
       = 1 / Σj u = 1 / Σj e^sj
(chain rule steps omitted, using ∂[ln(A+B+C)]/∂A = 1/(A+B+C))
therefore ∂Li/∂Wj = (1 / Σj e^sj) × e^sj × xi = Pj·xi

Solve for ∂Li/∂bj = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂bj
the only new term is ∂v/∂bj = ∂(Wj·xi + bj)/∂bj = 1
therefore ∂Li/∂bj = Pj

Solve for ∂Li/∂Wyi
let ∂Li/∂Wyi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂Wyi, where u = e^syi and v = syi = Wyi·xi + byi
solve for each term:
∂v/∂Wyi = ∂(Wyi·xi + byi)/∂Wyi = xi
∂u/∂v = ∂(e^v)/∂v = e^syi
∂Li/∂u = ∂[−ln( u / Σj u )]/∂u = ∂[−ln( u ) + ln( Σj u )]/∂u
       = −∂[ln( u )]/∂u + ∂[ln( Σj u )]/∂u
       = −1/u + 1 / Σj u = −1/e^syi + 1 / Σj e^sj
therefore ∂Li/∂Wyi = (−1/e^syi + 1/Σj e^sj) × e^syi × xi = (Pyi − 1)·xi

Solve for ∂Li/∂byi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂byi
the only different term here is
∂v/∂byi = ∂(Wyi·xi + byi)/∂byi = 1
therefore ∂Li/∂byi = Pyi − 1

y | s0 | s1 | s2 | Li (using ln) | Li (using log10) | Note
0 | 1.3 | -0.1 | 0.6 | -ln( exp(s0) / (exp(s0)+exp(s1)+exp(s2)) ) = 0.56 | 0.24 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | 1.04 | 0.45 |
0 | 1.9 | -2.1 | -0.4 | 0.11 | 0.05 |
1 | 0.2 | -1.5 | -2 | -ln( exp(s1) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.96 | 0.85 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | 4.06 | 1.76 |
1 | -0.3 | -1.7 | -2.8 | 1.68 | 0.73 |
2 | -0.1 | 3.5 | 2 | -ln( exp(s2) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.72 | 0.75 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 2.40 | 1.04 |
2 | -1.4 | 1.7 | -1.2 | 3.00 | 1.30 |
(per-row values rounded to 2 decimal places)

L = (1/N) Σ Li = 1.836640394 using ln, or 0.7976427883 using log10
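
The softmax column can be checked the same way, again reusing S and y from the numpy snippet under Calculating the Scores (my own sketch; values match the table after rounding):

P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)                  # row-wise probabilities
L_i = -np.log(P[np.arange(len(y)), y])             # per-sample losses using ln
print(L_i.round(2))                                # 0.56, 1.04, 0.11, ...
print(L_i.mean(), (L_i / np.log(10)).mean())       # 1.8366...  0.7976...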