Saturday 29 June 2024

First Art Sale

 It's been a fruitful month. I joined the local art society this year and participated in the Half-Price Sale for the first time. It was on the weekend of June 1st and 2nd. Despite the terrible weather on Saturday, we had a pretty good turnout overall. I managed to sell one of the four paintings that I entered in the sale. They are all landscapes from our Tasmania holiday 2 years ago.


However, the most fun I had this month was learning neural networks by following the cs231n course. I am up to the last 2 parts of assignment 2. I hope to finish it this weekend!

Saturday 22 June 2024

Calculating Gradient of Batch Normalisation

Part of cs231n assignment 2 is to calculate the gradients of the Batch Normalisation layer. Here are the equations defining BN:

X = [ x1
      x2
      ...
      xN ]    with dimension (N, D)

μ = (1/N) Σk=1..N xk    with dimension (D,)
v = (1/N) Σk=1..N (xk − μ)²    with dimension (D,)
σ = sqrt(v + ε)    with dimension (D,)
yi = (xi − μ) / σ    where yi is of dimension (D,) and Y (or x_hat) is (N, D)
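
To make the shapes concrete, here is a minimal numpy sketch of the forward pass as defined above (my own illustration rather than the assignment's batchnorm_forward; the name batchnorm_forward_sketch and the cache layout are mine, chosen to match what the backward code further down unpacks):

import numpy as np

def batchnorm_forward_sketch(x, gamma, beta, eps=1e-5):
    # x: (N, D); gamma, beta: (D,)
    mu = x.mean(axis=0)                     # (D,)  batch mean
    var = ((x - mu) ** 2).mean(axis=0)      # (D,)  biased variance v
    std = np.sqrt(var + eps)                # (D,)  sigma
    x_hat = (x - mu) / std                  # (N, D) normalised Y
    out = gamma * x_hat + beta              # (N, D) scale and shift
    cache = (x, mu, var, std, gamma, x_hat, eps)
    return out, cache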
The basic partial derivatives of the above equations are as follows. They are the building blocks for finding the final ∂L/∂x.
L is the loss function and ∂L/∂Y = γ × dout, of dimension (N, D).
∂μ/∂xi = 1/N    of dimension (D,)
∂v/∂μ = (1/N) Σk −2(xk − μ) = −(2/N) Σk (xk − μ)    of dimension (D,)
This turns out to be 0, because the sum of the xk and the sum of the N copies of μ are the same.
However, ∂v/∂xi = (2/N)(xi − μ)
∂σ/∂v = 0.5 × 1/sqrt(v + ε) = 1/(2σ)
∂yi/∂μ = −1/σ
∂yi/∂σ = −(xi − μ)/σ²

Thanks to this post I understood the process using a computational graph. Each node below is listed with its forward definition (read top-down) and, underneath it, the backward-pass gradient that flows through it (read bottom-up):

(1) := x, shape (N, D)
    backward: dx = d(3) + d(2), the sum of the gradients arriving from node (3) and from the mean node (2)
(2) := mean = (1/N) Σi xi, shape (D,)
    backward: d(2) = d(3→2) * 1/N * np.ones((N, D)), i.e. multiply by ∂μ/∂x; this branch carries ∂L/∂μ * ∂μ/∂x and ∂L/∂v * ∂v/∂μ * ∂μ/∂x
(3) := (1) − (2), shape (N, D)
    backward to (1): d(4) + d(8→3), i.e. ∂L/∂v * ∂v/∂x plus the −∂L/∂μ terms flowing straight back to x
    backward to (2): (−1) * (d(4) + d(8→3)).sum(axis=0) = Σ(∂L/∂μ + ∂L/∂μ2)
(4) := (3) ** 2
    backward: d(4) = d(5) * 2 * (3), i.e. multiply by ∂v/∂x (its detour through node (2) gives the ∂L/∂μ2 term)
(5) := var = (1/N) Σi (4)i, shape (D,)
    backward: d(5) = d(6) * 1/N * np.ones((N, D))
(6) := std = sqrt((5) + ε), shape (D,)
    backward: d(6) = d(7) * 0.5 * 1/std, i.e. multiply by ∂σ/∂v, giving ∂L/∂v
(7) := 1/(6)
    backward: d(7) = d(8→7) * (−1/((6)**2)), i.e. multiply by ∂Y/∂σ, giving ∂L/∂σ
(8) := (3) * (7), shape (N, D) (this is x_hat)
    backward to (3): d(8→3) = d(10) * (7)
    backward to (7): d(8→7) = (d(10) * (3)).sum(axis=0)
(9) := γ, shape (D,)
    backward: dγ = (dout * (8)).sum(axis=0)
(10) := (8) * (9), shape (N, D)
    backward: d(10) = dout * γ = ∂L/∂Y
(11) := β, shape (D,)
    backward: dβ = dout.sum(axis=0)
(12) := (10) + (11) = out, shape (N, D), which feeds the Loss
    backward: dout
In python:
def batchnorm_backward(dout, cache):
    x, sample_mean, sample_var, sample_std, gamma, x_hat, eps = cache
    N, D = dout.shape

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    # using the computational graph in https://romenlaw.blogspot.com/2024/06/calculating-gradient-using-computation.html
    step3 = x - sample_mean                  # node (3) = x - mean

    d10 = dout * gamma                       # dL/dY
    d8_3 = d10 * (1 / sample_std)            # gradient into node (3) via node (8)
    d8_7 = (d10 * step3).sum(axis=0)         # gradient into node (7)
    d7 = -d8_7 / (sample_var + eps)          # through 1/(6): multiply by -1/(6)**2
    d6 = d7 * 0.5 / sample_std               # through sqrt: multiply by 0.5/std
    d5 = d6 / N * np.ones(shape=(N, D))      # through the mean over N
    d4 = d5 * 2 * step3                      # through the square
    d3_1 = d4 + d8_3                         # (N,D) gradient into node (3)
    d3_2 = -1 * (d4 + d8_3).sum(axis=0)      # (D,)  gradient into node (2)
    d2 = d3_2 / N * np.ones(shape=(N, D))    # (N,D) through the mean
    dx = d2 + d3_1

    return dx, dgamma, dbeta

Intuitively, it's like following the 3 paths from Y back to X along the backward-pass arrows. Now let's do it the analytical way using the chain rule.
From page 4 of the original paper https://arxiv.org/pdf/1502.03167 we have the formulae for the derivatives. However, since the equations used in assignment 2 differ from the paper's (especially in how the variance and mean are used to calculate x_hat, i.e. y), we can rewrite the derivatives using the notation of assignment 2:
 
∂L/∂xi = ( ∂L/∂μ * ∂μ/∂xi + ∂L/∂v * ∂v/∂μ * ∂μ/∂xi ) + ∂L/∂v * ∂v/∂xi − ∂L/∂yi * ∂yi/∂μ
     (the last term is the direct path, using ∂yi/∂xi = 1/σ = −∂yi/∂μ)
     = Σk ∂L/∂yk * ∂yk/∂μ * ∂μ/∂xi + Σk ∂L/∂yk * ∂yk/∂σ * ∂σ/∂v * ∂v/∂μ * ∂μ/∂xi + Σk ∂L/∂yk * ∂yk/∂σ * ∂σ/∂v * ∂v/∂xi − ∂L/∂yi * ∂yi/∂μ
     = [ Σk doutk*γ * (−1/σ) * (1/N)
         + Σk doutk*γ * (−(xk−μ)/σ²) * (1/(2σ)) * (−(2/N) Σk (xk−μ)) * (1/N) ]
       + Σk doutk*γ * (−(xk−μ)/σ²) * (1/(2σ)) * (2/N)(xi−μ)
       + douti*γ * (1/σ)
     = −(1/N) Σk [ doutk*γ/σ − (1/N)( Σk doutk*γ*yk/σ ) * yk ]        ← in the python code below, this is d_mu_x
       − (1/N)( Σk doutk*γ*yk/σ ) * yi        ← in the python code below, this is dL_v_x
       + douti*γ/σ        ← in the python code below, this enters as −dL_mu
In python code (modified slightly from here for readability):
def batchnorm_backward_alt(dout, cache):
    x, mean, var, std, gamma, x_hat, eps = cache
    S = lambda a: a.sum(axis=0)             # helper: sum over the batch axis

    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    N = dout.shape[0]                       # dout has dimension (N, D)
    dx = dout * gamma / (N * std)           # temporary scale value dout*gamma/(N*sigma)

    dL_v_x = -S(dx * x_hat) * x_hat         # the variance path
    dL_mu = -N * dx                         # (the negative of) the direct path

    dL_mu2 = -dL_v_x                        # the dv/dmu contribution (numerically 0)
    d_mu_x = S(-dx + dL_mu2)                # the mean path
    # d_mu_x = -S(dx)                       # the standard answer drops dL_mu2

    dx = dL_v_x - dL_mu + d_mu_x
    return dx, dgamma, dbeta
The dx difference between this and the above method is about 1e-10. Curiously, the standard answer ignores the dL_mu2 term yet yields a better result of about 5e-13. I wondered why. Two months later, after watching Andrej Karpathy's lecture on backprop, I realised that ∂v/∂μ is actually 0.
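
For anyone wanting to reproduce these error comparisons outside the course notebook, a rough central-difference check of my own (not the assignment's gradient-check utility) can be run against the batchnorm_forward_sketch and batchnorm_backward functions above; num_grad and the toy shapes are made up for illustration:

import numpy as np

def num_grad(f, x, dout, h=1e-5):
    # central-difference gradient of sum(f(x) * dout) with respect to x
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()
        x[ix] = old - h
        neg = f(x).copy()
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
N, D = 4, 5
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward_sketch(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
dx_num = num_grad(lambda a: batchnorm_forward_sketch(a, gamma, beta)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # expect a very small number, comparable to the errors above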

Sunday 16 June 2024

Summary of a Fully Connected Neural Network

 I usually spend my weekends on painting. For the last couple of weeks, however, I have been learning Deep Learning by following the cs231n course. Now that I have just finished Assignment 1, the two main things I have learned are the theory/maths taught in the course and how to use numpy to implement it. Here is my summary of what I have learned, using the two fully connected-layer neural network.

The architecture (Forward pass should be read from bottom up; Back propagation is top down):


Layers, with the forward pass and backward pass for each:

Output layer
number of nodes (classes): C
scores: (C,)
Loss function: Softmax(f(x)): Li = −ln( e^syi / Σj e^sj ) = −syi + ln Σj e^sj
L = (1/N) Σi Li + R(W)
Regularisation: R(W) = ½ λ Σk Σl Wk,l²
Forward:
# x is the output of the previous layer
N=x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)          

loss = -np.log(P[range(N), y]).sum() / N  

loss += 0.5 * self.reg * (np.sum(self.params['W2']**2)
+ np.sum(self.params['W1']**2) )
Gradients:
∂Li/∂sj = Pj
∂Li/∂syi = Pyi − 1
# x is the scores
# P = exp(scores) / scores_exp_sum, dimension is (N,C)
# grad x_j = Pj
# grad x_yi = Pyi-1
N=x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True)) # numerically stable exponents
P /= P.sum(axis=1, keepdims=True)            # row-wise probabilities (softmax)

P[range(N), y] -= 1
dx = P / N
Fully Connected Layer #2

W2: (H, C)
b2: (C,)
f(x) = x·W2 + b2
Forward:
# X is the output of the previous layer
scores = X.dot(W) + b
Gradients FC2
∂R/∂W = λW

Tip: use dimension analysis! Note that you do not need to remember the expressions for dW and dX because they are easy to re-derive based on dimensions.
# dout is the gradient passed in from the Output layer
# i.e. the dx from above
dx = dout.dot(w.T).reshape(x.shape)

dw = x.reshape(x.shape[0], np.prod(x.shape[1:])).T.dot(dout)
dw += self.reg * w

db = np.sum(dout, axis=0)
Fully Connected Layer #1
number of nodes: H
W1: (D, H)
b1: (H,)
f(x) = x·W1 + b1, followed by the activation ReLU(f(x))
Forward:
out = input.dot(w) + b
out = np.maximum(0, out)
Gradients FC1
ReLU backward:
# dout is the gradient from the FC2 layer above (i.e. the dx from above)
# only let the gradient through where the forward input was positive
dx = dout * (x > 0)
f(x) backward: same as the FC2 layer above
Input
input data dimension: D
number of input data/rows: N
X: (N, D)
The input images are (32, 32, 3), and each is reshaped into a vector of 32 x 32 x 3 = 3072 values, i.e. D = 3072.
# reshape x into (N, D)
input = x.reshape(x.shape[0], np.prod(x.shape[1:]))
# or, more simply
input = x.reshape(x.shape[0], -1)
Here is a good summary of the different optimisation algorithms: https://www.youtube.com/watch?v=spbBQshdhL4

Some of my learnings from doing assignment 1:

method | pre-process | best accuracy
KNN | reshape 32x32x3 into 3072 | 28% with K=10
1-layer SVM | reshape 32x32x3 into 3072, zero-center each image (by subtracting the mean of the training set), append a bias column (initialised to 1) to each image | training: 37%, validation: 38%, with lr=1e-7, reg=5e4
1-layer Softmax | same as SVM above | training: 33%, validation: 34%, with lr=1e-7, reg=2.5e4
2-layer | reshape 32x32x3 into 3072 | validation: 53.8%, test: 52.7%, with lr=1e-3, reg=0.5, epochs=20, H size=100
1-layer SVM on features | extract 2 features (HOG, color histogram) for each image, zero-center the feature values, normalise the feature values, add a bias dimension | test: 41.4%
2-layer on features | same as above | test: 60.3%, with lr=1.209071e-01, epochs=10, H=274, reg=0.000001

K-Nearest Neighbour (KNN)

The idea behind this approach is to compute the L2 distance between each test image and every training image (summing the squared pixel differences). There is no real training involved; the distance calculation happens at test time.
Performance-wise on Colab with CPU only: two for loops took 43s; one loop took 51s (with sqrt()) or 38s (without sqrt()); using Numpy's broadcasting feature took less than 1s.
Two loops:
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # this takes 43s to run with sqrt, 35s without
                dists[i, j] = np.sum((self.X_train[j] - X[i]) ** 2)
        return dists
Vectorisation approach:
        # using (I1-I2)^2 = I1^2 + I2^2 - 2*I1*I2
        # this takes about 1s
        dists = np.sum(self.X_train ** 2, axis=1) \
            + np.sum(X ** 2, axis=1)[:, np.newaxis] \
            - 2 * np.dot(X, self.X_train.T)

        return dists

The output dists stores num_test rows of distances; each row contains num_train columns, and dists[i, j] is the L2 distance between the ith test image and the jth training image.

dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
(500, 5000)

Using KNN to predict an image's classification is basically finding the indices of the K shortest distances in that image's row of dists, then finding the most frequent y-label among those K elements:
    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # DONE:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            indices=np.argsort(dists[i])[:k]
            closest_y=self.y_train[indices]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # DONE:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            values, counts = np.unique(closest_y, return_counts=True)
            most_frequent_value = values[counts.argmax()]
            y_pred[i]=most_frequent_value

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred

Testing with various K values:
From the chart, the value K=10 seems to yield the highest accuracy. The testing result using this K value is about 28%.
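
For the record, a comparison like that chart can be produced with a loop of this kind (a simplified hold-out sketch rather than the assignment's 5-fold cross-validation; X_val, y_val and the candidate k values are placeholders of mine, and classifier is the trained k-NN object used above):

import numpy as np

# X_val / y_val are a held-out split of the training data
dists_val = classifier.compute_distances_two_loops(X_val)

accuracies = {}
for k in [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]:
    y_val_pred = classifier.predict_labels(dists_val, k=k)
    accuracies[k] = np.mean(y_val_pred == y_val)
    print('k = %d, accuracy = %.3f' % (k, accuracies[k]))

best_k = max(accuracies, key=accuracies.get)   # K=10 in my case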

Linear Classifier SVM

Forward pass and backward propagation (the SGD training loop):
idx = np.random.choice(num_train, size=batch_size, replace=False)
X_batch = X[idx]
y_batch = y[idx]

# evaluate loss and gradient
loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)
# perform parameter update
self.W -= learning_rate * grad
The loss and gradient calculation:
def svm_loss_vectorized(W, X, y, reg):
    ####################
    # calculate loss
    ####################
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    num_train = X.shape[0]
    scores = X.dot(W)
    # extract all Syi into an N x 1 column
    scores_yi = scores[np.arange(num_train), y][:, np.newaxis]
    margins = np.maximum(0, scores - scores_yi + 1)
    # set all yi elements to 0
    margins[np.arange(num_train), y] = 0

    loss = np.mean(np.sum(margins, axis=1))
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    ####################
    # calculate gradient
    ####################
    mask = np.zeros(margins.shape)
    # for positions where margins>0, the gradient at Sj is X[i]
    mask[margins > 0] = 1

    # for Yi positions, it's -nX[i], where n is the number of classes whose
    # margin is positive, i.e. the number of times Syi appeared in the margins
    row_sum = np.sum(mask, axis=1)
    mask[np.arange(num_train), y] = -row_sum

    dW += np.dot(X.T, mask)
    dW /= num_train

    # Regularize (matching the reg * sum(W*W) term in the loss above)
    dW += 2 * reg * W

    return loss, dW
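
A cheap sanity check of dW is to perturb a single weight and compare against a central difference of the loss (an ad-hoc check of my own with made-up toy shapes, not the assignment's gradient-check helper):

import numpy as np

np.random.seed(1)
X_toy = np.random.randn(20, 6)          # 20 samples, 6 features
y_toy = np.random.randint(0, 3, 20)     # 3 classes
W_toy = np.random.randn(6, 3) * 0.01
reg = 0.1

loss, dW = svm_loss_vectorized(W_toy, X_toy, y_toy, reg)

h = 1e-5
i, j = 2, 1                             # pick one weight to check
Wp = W_toy.copy(); Wp[i, j] += h
Wm = W_toy.copy(); Wm[i, j] -= h
num = (svm_loss_vectorized(Wp, X_toy, y_toy, reg)[0]
       - svm_loss_vectorized(Wm, X_toy, y_toy, reg)[0]) / (2 * h)
print(dW[i, j], num)                    # should agree closely (hinge-loss kinks aside)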
visualising the learned weights:

Linear Classifier Softmax

The only difference here is the loss and gradient calculation:
def softmax_loss_vectorized(W, X, y, reg):

    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)


    # *****calculate scores*****
    num_train=X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    # scores is N x C matrix
    scores -= np.max(scores, axis=1)[:, np.newaxis]
    # scores_y and scores_exp_sum are 1-dimensional, with N elements
    scores_y = scores[np.arange(num_train),y]
    scores_exp_sum = np.sum(np.exp(scores), axis=1)

    # *****calculate loss*****
    losses = np.log(scores_exp_sum) - scores_y
    loss=np.sum(losses) / num_train
    loss += reg*np.sum(W**2)

    # *****calculate gradient*****
    # P = exp(scores) / scores_exp_sum, dimension is (N,C)
    # grad Wj = Pj * xi
    # grad Wyi = (Pyi-1) * xi
    P=np.exp(scores) / scores_exp_sum[:, np.newaxis]
    P[np.arange(num_train), y] -= 1
    dW += X.T.dot(P)
    dW /= num_train
    dW += reg * 2 * W

    return loss, dW

visualising the weights:


2-Layer Neural Network

Visualising the output of bad hyperparameters: slow learning rate, low accuracy, indistinct (grainy, noisy) features

Visualising the output of better hyperparameters (though the accuracy chart suggests overfitting):

Features

'Manually' extract 2 features for each image: Histogram of Oriented Gradients (HOG) and color histogram. Use these features as input for the networks.
The best accuracy results show that using features is more effective than the raw images alone.
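
For context, the feature extraction step looks roughly like this (a sketch from memory that assumes the hog_feature, color_histogram_hsv and extract_features helpers shipped with the assignment's cs231n.features module; X_train, X_val and the bin count are placeholders):

import numpy as np
from cs231n.features import hog_feature, color_histogram_hsv, extract_features

num_color_bins = 10   # illustrative hyperparameter
feature_fns = [hog_feature,
               lambda img: color_histogram_hsv(img, nbin=num_color_bins)]

# each row becomes [HOG features, color histogram] for one image
X_train_feats = extract_features(X_train, feature_fns, verbose=True)
X_val_feats = extract_features(X_val, feature_fns)

# zero-center, normalise, then add a bias dimension
mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)
X_train_feats -= mean_feat
X_val_feats -= mean_feat
std_feat = np.std(X_train_feats, axis=0, keepdims=True)
X_train_feats /= std_feat
X_val_feats /= std_feat
X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])
X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])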
Some interesting visuals:







I am having so much fun following this course, I am going to explore more of the AI related courses.


Saturday 1 June 2024

Study Notes of a Simple Neural Network

 Thanks to Stanford University's online course on Loss Functions and Optimisation and its accompanying web demo, I finally have a concrete, simple example of several loss function implementations whose maths I can delve into.

The Architecture


In this architecture, W ∈ ℝ^(2×3), x ∈ ℝ^(1×2), b ∈ ℝ^(1×3), s ∈ ℝ^(1×3).

s = f(x, W) = xW + b

The value of W is initialised to:
W = [  1   2   3
       2  -4  -1 ]

The initial value of b is set to:
b = [ 0   0.5   -0.5 ]

Calculating the Scores

There are 9 sets of input values for X, hence Xi, i = [0..8]. Given the dataset, the scores s are calculated as:

x0 | x1 | y | s0 = w0,0*x0 + w1,0*x1 + b0 | s1 = w0,1*x0 + w1,1*x1 + b1 | s2 = w0,2*x0 + w1,2*x1 + b2
0.5 | 0.4 | 0 | 1*0.5 + 2*0.4 + 0 = 1.3 | 2*0.5 + (-4)*0.4 + 0.5 = -0.1 | 3*0.5 + (-1)*0.4 + (-0.5) = 0.6
0.8 | 0.3 | 0 | 1.4 | 0.9 | 1.6
0.3 | 0.8 | 0 | 1.9 | -2.1 | -0.4
-0.4 | 0.3 | 1 | 0.2 | -1.5 | -2
-0.3 | 0.7 | 1 | 1.1 | -2.9 | -2.1
-0.7 | 0.2 | 1 | -0.3 | -1.7 | -2.8
0.7 | -0.4 | 2 | -0.1 | 3.5 | 2
0.5 | -0.6 | 2 | -0.7 | 3.9 | 1.6
-0.4 | -0.5 | 2 | -1.4 | 1.7 | -1.2
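
The same scores can be reproduced in a few lines of numpy (my own quick check of the table above, using the row-vector convention s = xW + b):

import numpy as np

W = np.array([[1.0,  2.0,  3.0],
              [2.0, -4.0, -1.0]])              # (2, 3)
b = np.array([0.0, 0.5, -0.5])                 # (3,)
X = np.array([[ 0.5,  0.4], [ 0.8,  0.3], [ 0.3,  0.8],
              [-0.4,  0.3], [-0.3,  0.7], [-0.7,  0.2],
              [ 0.7, -0.4], [ 0.5, -0.6], [-0.4, -0.5]])   # (9, 2)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

S = X.dot(W) + b                               # (9, 3) scores, matching the table
print(S[0])                                    # approximately [ 1.3 -0.1  0.6]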

Multiclass SVM Loss Functions

Weston Watkins 1999

Li = Σ(j≠yi) max(0, sj − syi + 1)
where i = [0, number of input samples), j = [0, number of output classes)
since sj = Wj·xi + bj and syi = Wyi·xi + byi,
we have (for the terms where the margin is positive):
∂Li/∂sj = 1,   ∂Li/∂Wj = xi   and   ∂Li/∂bj = 1
∂Li/∂syi = −n,   ∂Li/∂Wyi = −n·xi   and   ∂Li/∂byi = −n
where n is the number of times syi appears in Li with a positive margin (since Li is a sum of many terms).
n is in the range [0, number of output classes − 1].

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, s1-s0+1) + max(0, s2-s0+1) = max(0, -0.1-1.3+1) + max(0, 0.6-1.3+1) = 0 + 0.3 = 0.3 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | max(0, 0.9-1.4+1) + max(0, 1.6-1.4+1) = 0.5 + 1.2 = 1.7 |
0 | 1.9 | -2.1 | -0.4 | max(0, -2.1-1.9+1) + max(0, -0.4-1.9+1) = max(0, -3) + max(0, -1.3) = 0 + 0 = 0 |
1 | 0.2 | -1.5 | -2 | max(0, s0-s1+1) + max(0, s2-s1+1) = max(0, 0.2+1.5+1) + max(0, -2+1.5+1) = 2.7 + 0.5 = 3.2 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | max(0, 1.1+2.9+1) + max(0, -2.1+2.9+1) = 5 + 1.8 = 6.8 |
1 | -0.3 | -1.7 | -2.8 | max(0, -0.3+1.7+1) + max(0, -2.8+1.7+1) = 2.4 + 0 = 2.4 |
2 | -0.1 | 3.5 | 2 | max(0, s0-s2+1) + max(0, s1-s2+1) = max(0, -0.1-2+1) + max(0, 3.5-2+1) = 0 + 2.5 = 2.5 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | max(0, -0.7-1.6+1) + max(0, 3.9-1.6+1) = 0 + 3.3 = 3.3 |
2 | -1.4 | 1.7 | -1.2 | max(0, -1.4+1.2+1) + max(0, 1.7+1.2+1) = 0.8 + 3.9 = 4.7 |

L = (1/N) Σ Li = 2.766666667
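
Reusing S and y from the numpy snippet under Calculating the Scores, here is a quick vectorised check of the table above (my own sketch):

margins = np.maximum(0, S - S[np.arange(len(y)), y][:, None] + 1)
margins[np.arange(len(y)), y] = 0      # drop the j == yi terms
L_i = margins.sum(axis=1)              # per-sample losses: 0.3, 1.7, 0, ...
print(L_i, L_i.mean())                 # the mean is 2.7666...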

One vs. All

Li = max(0, −syi + 1) + Σ(j≠yi) max(0, sj + 1)
The partial derivatives are the same as the Weston Watkins 1999 formula, with n=1 because there is only one Syi in the formula.

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, 1-s0) + max(0, 1+s1) + max(0, 1+s2) = max(0, 1-1.3) + max(0, 1-0.1) + max(0, 1+0.6) = 0 + 0.9 + 1.6 = 2.5 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | max(0, 1-1.4) + max(0, 1+0.9) + max(0, 1+1.6) = 0 + 1.9 + 2.6 = 4.5 |
0 | 1.9 | -2.1 | -0.4 | 0.6 |
1 | 0.2 | -1.5 | -2 | max(0, 1+s0) + max(0, 1-s1) + max(0, 1+s2) = max(0, 1+0.2) + max(0, 1+1.5) + max(0, 1-2) = 1.2 + 2.5 + 0 = 3.7 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | max(0, 1+1.1) + max(0, 1+2.9) + max(0, 1-2.1) = 2.1 + 3.9 + 0 = 6 |
1 | -0.3 | -1.7 | -2.8 | 3.4 |
2 | -0.1 | 3.5 | 2 | max(0, 1+s0) + max(0, 1+s1) + max(0, 1-s2) = max(0, 1-0.1) + max(0, 1+3.5) + max(0, 1-2) = 0.9 + 4.5 + 0 = 5.4 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 5.2 |
2 | -1.4 | 1.7 | -1.2 | 4.9 |

L = (1/N) Σ Li = 4.022222222

Structured SVM

Li = max(0, max(j≠yi)(sj) − syi + 1)
The partial derivatives are the same as the One vs. All formula.

y | s0 | s1 | s2 | Li | Note
0 | 1.3 | -0.1 | 0.6 | max(0, max(s1, s2)-s0+1) = max(0, max(-0.1, 0.6)-1.3+1) = max(0, 0.6-1.3+1) = 0.3 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | 1.2 |
0 | 1.9 | -2.1 | -0.4 | max(0, max(-2.1, -0.4)-1.9+1) = max(0, -0.4-1.9+1) = 0 |
1 | 0.2 | -1.5 | -2 | max(0, max(s0, s2)-s1+1) = max(0, max(0.2, -2)+1.5+1) = max(0, 0.2+1.5+1) = 2.7 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | 5 |
1 | -0.3 | -1.7 | -2.8 | 2.4 |
2 | -0.1 | 3.5 | 2 | 2.5 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 3.3 |
2 | -1.4 | 1.7 | -1.2 | 3.9 |

L = (1/N) Σ Li = 2.366666667

Softmax

Li = −ln( P(Y=yi | X=xi) ) = −ln( e^syi / Σj e^sj )
Finding the partial derivatives using the chain rule:
let ∂Li/∂Wj = ∂Li/∂u × ∂u/∂v × ∂v/∂Wj, where u = e^sj and v = sj = Wj·xi + bj

solve for each term:
∂v/∂Wj = ∂(Wj·xi + bj)/∂Wj = xi
∂u/∂v = ∂(e^v)/∂v = e^v = e^sj
∂Li/∂u = ∂[−ln( e^syi / Σj u )]/∂u = ∂[−ln( e^syi ) + ln( Σj u )]/∂u
       = ∂[−syi + ln( Σj u )]/∂u = 0 + ∂[ln( Σj u )]/∂u
       = 1 / Σj u = 1 / Σj e^sj
(chain rule steps omitted, using ∂[ln(A+B+C)]/∂A = 1/(A+B+C))
therefore ∂Li/∂Wj = (1 / Σj e^sj) × e^sj × xi = Pj·xi

Solve for ∂Li/∂bj = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂bj
the only new term is ∂v/∂bj = ∂(Wj·xi + bj)/∂bj = 1
therefore ∂Li/∂bj = Pj

Solve for ∂Li/∂Wyi
let ∂Li/∂Wyi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂Wyi, where u = e^syi and v = syi = Wyi·xi + byi
solve for each term:
∂v/∂Wyi = ∂(Wyi·xi + byi)/∂Wyi = xi
∂u/∂v = ∂(e^v)/∂v = e^syi
∂Li/∂u = ∂[−ln( u / Σj u )]/∂u = ∂[−ln( u ) + ln( Σj u )]/∂u
       = −∂[ln( u )]/∂u + ∂[ln( Σj u )]/∂u
       = −1/u + 1 / Σj u = −1/e^syi + 1 / Σj e^sj
therefore ∂Li/∂Wyi = (−1/e^syi + 1/Σj e^sj) × e^syi × xi = (Pyi − 1)·xi

Solve for ∂Li/∂byi = ∂Li/∂u ⋅ ∂u/∂v ⋅ ∂v/∂byi
the only different term here is
∂v/∂byi = ∂(Wyi·xi + byi)/∂byi = 1
therefore ∂Li/∂byi = Pyi − 1

y | s0 | s1 | s2 | Li (using ln) | Li (using log10) | Note
0 | 1.3 | -0.1 | 0.6 | -ln( exp(s0) / (exp(s0)+exp(s1)+exp(s2)) ) = 0.56 | 0.24 | Syi=S0
0 | 1.4 | 0.9 | 1.6 | 1.04 | 0.45 |
0 | 1.9 | -2.1 | -0.4 | 0.11 | 0.05 |
1 | 0.2 | -1.5 | -2 | -ln( exp(s1) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.96 | 0.85 | Syi=S1
1 | 1.1 | -2.9 | -2.1 | 4.06 | 1.76 |
1 | -0.3 | -1.7 | -2.8 | 1.68 | 0.73 |
2 | -0.1 | 3.5 | 2 | -ln( exp(s2) / (exp(s0)+exp(s1)+exp(s2)) ) = 1.72 | 0.75 | Syi=S2
2 | -0.7 | 3.9 | 1.6 | 2.40 | 1.04 |
2 | -1.4 | 1.7 | -1.2 | 3.00 | 1.30 |
(per-row values rounded to 2 decimal places)

L = (1/N) Σ Li = 1.836640394 using ln, or 0.7976427883 using log10
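
The softmax column can be checked the same way, again reusing S and y from the numpy snippet under Calculating the Scores (my own sketch; values match the table after rounding):

P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)                  # row-wise probabilities
L_i = -np.log(P[np.arange(len(y)), y])             # per-sample losses using ln
print(L_i.round(2))                                # 0.56, 1.04, 0.11, ...
print(L_i.mean(), (L_i / np.log(10)).mean())       # 1.8366...  0.7976...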