Sunday, 16 June 2024

Summary of a Fully Connected Neural Network

I usually spend my weekends painting. For the last couple of weeks, however, I have been learning Deep Learning by following the cs231n course. Having just finished Assignment 1, the two main things I have learned are the theory/maths taught in the course and how to implement them with numpy. Here is my summary of what I have learned, organised around the two-layer fully connected neural network.

The architecture (Forward pass should be read from bottom up; Back propagation is top down):


Layers, each with its forward pass and its backward (gradient) computation:
Output
number of nodes (classes): C
scores: (C,)
Loss function (softmax): $L_i = -\ln\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right) = -s_{y_i} + \ln\sum_j e^{s_j}$
Total loss: $L = \frac{1}{N}\sum_{i=1}^{N} L_i + R(W)$
Regularisation: $R(W) = \frac{1}{2}\lambda \sum_k \sum_l W_{k,l}^2$
# x is the output of the previous layer (the scores), shape (N, C)
N = x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stable exponents
P /= P.sum(axis=1, keepdims=True)             # row-wise probabilities (softmax)

loss = -np.log(P[range(N), y]).sum() / N      # mean cross-entropy loss

loss += 0.5 * self.reg * (np.sum(self.params['W2'] ** 2)
                          + np.sum(self.params['W1'] ** 2))  # L2 regularisation
Gradients:
$\frac{\partial L_i}{\partial s_j} = P_j$ (for $j \ne y_i$)

$\frac{\partial L_i}{\partial s_{y_i}} = P_{y_i} - 1$
# x is the scores
# P = exp(scores) / scores_exp_sum, shape is (N, C)
# grad x_j = Pj
# grad x_yi = Pyi-1
N=x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True)) # numerically stable exponents
P /= P.sum(axis=1, keepdims=True)            # row-wise probabilities (softmax)

P[range(N), y] -= 1
dx = P / N
Fully Connected Layer #2

W2: (H, C)
b2: (C,)
f(x) = W2x + b2
# X is the output of the previous layer, shape (N, H)
scores = X.dot(W) + b
Gradients FC2
$\frac{\partial R}{\partial W} = \lambda W$

Tip: use dimension analysis! You do not need to remember the expressions for dW and dX because they are easy to re-derive from the shapes alone (see the shape-check sketch after the code below).
# dout is the gradient passed in from the Output layer
# i.e. the dx from above, shape (N, C)
dx = dout.dot(w.T).reshape(x.shape)                            # (N, C) x (C, H) -> (N, H)

dw = x.reshape(x.shape[0], np.prod(x.shape[1:])).T.dot(dout)   # (H, N) x (N, C) -> (H, C)
dw += self.reg * w                                             # gradient of the L2 regularisation term

db = np.sum(dout, axis=0)                                      # (C,)
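To make the dimension-analysis tip concrete, here is a minimal shape check with made-up sizes (N examples, H hidden units, C classes; this is just an illustration, not code from the assignment) showing that the shapes alone dictate which products are possible:
import numpy as np

# hypothetical sizes, only for checking shapes
N, H, C = 4, 5, 3
x    = np.random.randn(N, H)   # input to the affine layer
w    = np.random.randn(H, C)   # weights
dout = np.random.randn(N, C)   # upstream gradient

dx = dout.dot(w.T)             # (N, C) x (C, H) -> (N, H), same shape as x
dw = x.T.dot(dout)             # (H, N) x (N, C) -> (H, C), same shape as w
db = dout.sum(axis=0)          # (C,), same shape as b

assert dx.shape == x.shape and dw.shape == w.shape and db.shape == (C,)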
Fully Connected Layer #1
number of nodes: H
W1: (D, H)
b1: (H,)
Activation: ReLU(f(x))
out = np.maximum(0, out)
 f(x) = W1x + b1
out = input.dot(w) + b
Gradients FC1
ReLU backward:
# dout is the gradient from the layer above (FC2), i.e. the dx from above
# the gradient passes through only where the forward input x was positive
dx = dout * (x > 0)
f(x) backward: same as FC2 layer above
Input
input data dimension: D
number of input data/rows: N
X: (N, D)
The input images are (32, 32, 3), which are reshaped into rows of 32 x 32 x 3 = 3072 values,
i.e. D = 3072
# reshape x into (N,D)
input = x.reshape(x.shape[0], np.prod(x.shape[1:]))
# or better
input = x.reshape(x.shape[0], -1)
Here is a good summary of the different optimisation algorithms: https://www.youtube.com/watch?v=spbBQshdhL4
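As a rough illustration of what the video covers (a minimal sketch, not the course's solver code), vanilla SGD steps directly along the negative gradient, while momentum keeps a running velocity:
import numpy as np

# hypothetical parameter and gradient, just to show the two update rules
w  = np.zeros(10)
dw = np.random.randn(10)
learning_rate = 1e-3

# vanilla SGD: step directly along the negative gradient
w -= learning_rate * dw

# SGD with momentum: accumulate a velocity and step along it
v = np.zeros_like(w)
mu = 0.9                          # momentum coefficient
v = mu * v - learning_rate * dw
w += v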

Some of my learnings from doing assignment 1:

Method: KNN
Pre-process: reshape 32x32x3 into 3072
Best accuracy: 28% with K=10

Method: 1-layer SVM
Pre-process: reshape 32x32x3 into 3072; zero-center each image (by subtracting the mean of the training set); append a bias (initialised to 1) as an extra column for each image
Best accuracy: training 37%, validation 38%, with lr=1e-7, reg=5e4

Method: 1-layer Softmax
Pre-process: same as the SVM above
Best accuracy: training 33%, validation 34%, with lr=1e-7, reg=2.5e4

Method: 2-layer network
Pre-process: reshape 32x32x3 into 3072
Best accuracy: validation 53.8%, test 52.7%, with lr=1e-3, reg=0.5, epochs=20, hidden size H=100

Method: 1-layer SVM on features
Pre-process: extract 2 features (HOG, color histogram) for each image; zero-center the feature values; normalise the feature values; add a bias dimension
Best accuracy: test 41.4%

Method: 2-layer network on features
Pre-process: same as above
Best accuracy: test 60.3%, with lr=1.209071e-01, epochs=10, H=274, reg=0.000001

K-Nearest Neighbour (KNN)

The idea behind this approach is to compute the L2 distance between each test image and every training image by summing the squared pixel differences, then classify each test image by a majority vote over its K nearest training images. There is no training involved; the distance calculation happens at test time.
Performance-wise on Colab with CPU only, using two for-loops took 43s, one loop took 51s (using sqrt()) or 38s (without sqrt()), and using NumPy's broadcasting took less than 1s.
Two loops:
num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
    for j in range(num_train):
        # this takes 43s to run with sqrt, 35s without
        dists[i, j] = np.sum((self.X_train[j] - X[i]) ** 2)
return dists
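For reference, the one-loop version mentioned above looks roughly like this (a sketch written as a standalone function rather than the classifier method, so the training matrix is passed in explicitly):
import numpy as np

def compute_distances_one_loop(X, X_train):
    # one loop: broadcast each test image against the whole training matrix at once
    num_test = X.shape[0]
    num_train = X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # (num_train, D) - (D,) broadcasts row-wise; summing over pixels gives (num_train,)
        dists[i, :] = np.sum((X_train - X[i]) ** 2, axis=1)
        # np.sqrt(...) would give the true L2 distance, but the ranking is unchanged
    return dists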
Vectorisation approach:
# using (I1-I2)^2 = I1^2 + I2^2 - 2*I1*I2
# this takes 1s
dists = np.sum(self.X_train ** 2, axis=1) \
    + (np.sum(X ** 2, axis=1))[:, np.newaxis] \
    - 2 * np.dot(X, self.X_train.T)

return dists

The output dists stores num_test rows of distances; each row contains num_train columns, and dists[i, j] is the L2 distance between the i-th test image and the j-th training image.

dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
(500, 5000)

Using KNN to predict an image's classification basically means finding the indices of the K shortest distances in its row of dists, then picking the most frequent y-label among those K elements:
    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # DONE:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            indices=np.argsort(dists[i])[:k]
            closest_y=self.y_train[indices]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # DONE:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            values, counts = np.unique(closest_y, return_counts=True)
            most_frequent_value = values[counts.argmax()]
            y_pred[i]=most_frequent_value

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred

Testing with various K values:
From the chart, the value K=10 seems to yield the highest accuracy. The testing result using this K value is about 28%.
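The K search can be done with cross-validation, roughly like this (a sketch assuming the assignment's KNearestNeighbor class with its train() and predict(X, k=...) methods, and X_train/y_train holding the training data):
import numpy as np
from cs231n.classifiers import KNearestNeighbor

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

# split the training data into folds
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}
for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # hold out fold i for validation, train on the remaining folds
        X_tr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
        y_tr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
        classifier = KNearestNeighbor()
        classifier.train(X_tr, y_tr)
        y_pred = classifier.predict(X_train_folds[i], k=k)
        accuracies.append(np.mean(y_pred == y_train_folds[i]))
    k_to_accuracies[k] = accuracies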

Linear Classifier SVM

The training step (sample a minibatch, do the forward pass to get the loss and gradient, then update the weights):
# sample a random minibatch of training examples
idx = np.random.choice(num_train, size=batch_size, replace=False)
X_batch = X[idx]
y_batch = y[idx]

# evaluate loss and gradient
loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)
# perform parameter update
self.W -= learning_rate * grad
The loss and gradient calculation:
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    ####################
    # calculate loss
    ####################
    num_train = X.shape[0]
    scores = X.dot(W)
    # extract all S_yi into an (N, 1) column
    scores_yi = scores[np.arange(num_train), y][:, np.newaxis]
    margins = np.maximum(0, scores - scores_yi + 1)
    # set all yi elements to 0
    margins[np.arange(num_train), y] = 0

    loss = np.mean(np.sum(margins, axis=1))
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    ####################
    # calculate gradient
    ####################
    mask = np.zeros(margins.shape)
    # for positions where margins > 0, the gradient at Sj is X[i]
    mask[margins > 0] = 1

    # for the yi positions it's -n*X[i], where n is the number of positive
    # margins in that row, i.e. the row sum of the mask
    row_sum = np.sum(mask, axis=1)
    mask[np.arange(num_train), y] = -row_sum

    dW += np.dot(X.T, mask)
    dW /= num_train

    # Regularize (gradient of reg * sum(W*W))
    dW += 2 * reg * W

    return loss, dW
Visualising the learned weights:
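Roughly how those weight images can be produced (a sketch, assuming the trained SVM weights W have shape (3073, 10) with the appended bias as the last row and 10 classes): strip the bias, reshape each class's column back into a 32x32x3 image, and rescale it into the 0..255 range for display.
import numpy as np
import matplotlib.pyplot as plt

# W is assumed to be the trained weight matrix of shape (3073, 10)
w = W[:-1, :]                      # drop the bias row -> (3072, 10)
w = w.reshape(32, 32, 3, 10)
w_min, w_max = w.min(), w.max()
for i in range(10):
    plt.subplot(2, 5, i + 1)
    # rescale the weights into the displayable 0..255 range
    wimg = 255.0 * (w[:, :, :, i] - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title('class %d' % i)
plt.show()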

Linear Classifier Softmax

The only difference here is the loss and gradient calculation:
def softmax_loss_vectorized(W, X, y, reg):

    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)


    # *****calculate scores*****
    num_train=X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    # scores is N x C matrix
    scores -= np.max(scores, axis=1)[:, np.newaxis]
    # scores_y and scores_exp_sum are 1-dimensional with N elements
    scores_y = scores[np.arange(num_train),y]
    scores_exp_sum = np.sum(np.exp(scores), axis=1)

    # *****calculate loss*****
    losses = np.log(scores_exp_sum) - scores_y
    loss=np.sum(losses) / num_train
    loss += reg*np.sum(W**2)

    # *****calculate gradient*****
    # P = exp(scores) / scores_exp_sum, shape is (N, C)
    # grad Wj = Pj * xi
    # grad Wyi = (Pyi-1) * xi
    P=np.exp(scores) / scores_exp_sum[:, np.newaxis]
    P[np.arange(num_train), y] -= 1
    dW += X.T.dot(P)
    dW /= num_train
    dW += reg * 2 * W

    return loss, dW

Visualising the weights:


2-Layer Neural Network

Visualising the output of bad hyperparameters: slow learning rate, low accuracy, indistinct features (grainy, noisy)

Visualising the output of better hyperparameters (although the accuracy chart suggests overfitting):
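The better hyperparameters came from searching over learning rate, regularisation strength and hidden size. A minimal sketch of such a random search (train_and_validate is a hypothetical helper standing in for whatever trains the two-layer net and returns its validation accuracy):
import numpy as np

best_val_acc, best_params = -1.0, None
for _ in range(20):
    # sample hyperparameters on a log scale
    lr = 10 ** np.random.uniform(-4, -1)
    reg = 10 ** np.random.uniform(-6, 0)
    hidden_size = np.random.randint(50, 300)

    # train_and_validate is a hypothetical helper: it trains a two-layer net with
    # these hyperparameters and returns the validation accuracy
    val_acc = train_and_validate(lr=lr, reg=reg, hidden_size=hidden_size)
    if val_acc > best_val_acc:
        best_val_acc, best_params = val_acc, (lr, reg, hidden_size)

print(best_val_acc, best_params)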

Features

'Manually' extract 2 features for each image, a Histogram of Oriented Gradients (HOG) and a color histogram, and use these features as input to the classifiers (see the sketch below).
The best-accuracy results show that using these features is more effective than using the raw pixels alone.
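A sketch of how those features are extracted, assuming the assignment's cs231n.features helpers (hog_feature, color_histogram_hsv, extract_features) and X_train holding the raw training images:
import numpy as np
from cs231n.features import hog_feature, color_histogram_hsv, extract_features

num_color_bins = 10  # number of bins in the color histogram
feature_fns = [hog_feature,
               lambda img: color_histogram_hsv(img, nbin=num_color_bins)]

# each image becomes one row of concatenated HOG + color-histogram values
X_train_feats = extract_features(X_train, feature_fns, verbose=True)

# zero-center and normalise the feature values, then add a bias dimension
X_train_feats -= np.mean(X_train_feats, axis=0, keepdims=True)
X_train_feats /= np.std(X_train_feats, axis=0, keepdims=True)
X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])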
Some interesting visuals:







I am having so much fun following this course that I am going to explore more AI-related courses.

