I usually spend my weekends painting. For the last couple of weeks, however, I have been learning Deep Learning by following the cs231n course. Now that I have just finished Assignment 1, the two main things I have learned are the theory/maths taught in the course and how to use numpy to implement it. Here is my summary of what I have learned, using the two-layer fully connected neural network as the running example.
The architecture, layer by layer (the forward pass should be read from the bottom up; backpropagation is top down):

Output layer: number of nodes (classes) C, scores shape (C,)

Forward, loss function Softmax(f(x)):
# x is the output of the previous layer (the scores)
N = x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
loss = -np.log(P[range(N), y]).sum() / N
loss += 0.5 * self.reg * (np.sum(self.params['W2']**2) + np.sum(self.params['W1']**2))

Backward, gradients:
# x is the scores
# P = exp(scores) / scores_exp_sum, dimension is (N, C)
# grad x_j = P_j
# grad x_yi = P_yi - 1
N = x.shape[0]
P = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stable exponents
P /= P.sum(axis=1, keepdims=True)             # row-wise probabilities (softmax)
P[range(N), y] -= 1
dx = P / N

Fully connected layer #2: W2 shape (H, C), b2 shape (C,)

Forward, f(x) = xW2 + b2:
# x is the output of the previous layer
scores = x.dot(w) + b

Backward, gradients for FC2. Tip: use dimension analysis! You do not need to memorise the expressions for dw and dx because they are easy to re-derive from the shapes alone:
# dout is the gradient passed in from the output layer, i.e. the dx from above
dx = dout.dot(w.T).reshape(x.shape)
dw = x.reshape(x.shape[0], np.prod(x.shape[1:])).T.dot(dout)
dw += self.reg * w   # gradient of the 0.5 * reg * W^2 regularisation term
db = np.sum(dout, axis=0)

Fully connected layer #1: number of nodes H, W1 shape (D, H), b1 shape (H,)

Forward, f(x) = xW1 + b1 followed by the ReLU activation:
out = input.dot(w) + b    # affine
out = np.maximum(0, out)  # ReLU

Backward, gradients for FC1 (ReLU backward):
# dout is the gradient from the layer above (FC2), i.e. the dx from above
# the ReLU passes the gradient through only where its input was positive
dx = dout * (x > 0)

Input layer: input dimension D, number of input rows N, X shape (N, D)

The input images are (32, 32, 3), which are reshaped into 32 x 32 x 3 = 3072, i.e. D = 3072:
# reshape x into (N, D)
input = x.reshape(x.shape[0], np.prod(x.shape[1:]))
# or, more simply
input = x.reshape(x.shape[0], -1)
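Putting the rows above together, here is a minimal sketch of how the whole forward pass composes. This is my own illustration, not the exact assignment code; the shapes follow the table above:
import numpy as np

def two_layer_loss(X, y, W1, b1, W2, b2, reg):
    # FC1 + ReLU: (N, D) -> (N, H)
    h = np.maximum(0, X.dot(W1) + b1)
    # FC2: (N, H) -> (N, C)
    scores = h.dot(W2) + b2
    # softmax loss, numerically stabilised as above
    N = X.shape[0]
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.log(P[range(N), y]).sum() / N
    loss += 0.5 * reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return loss, scores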
Here is a good summary of the different optimisation algorithms: https://www.youtube.com/watch?v=spbBQshdhL4
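As a quick reference (my own notes, not the video's code), the two simplest update rules covered there look like this:
# vanilla SGD
w -= learning_rate * dw

# SGD with momentum (v starts as zeros; mu is the momentum coefficient, e.g. 0.9)
v = mu * v - learning_rate * dw
w += v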
A summary of my results from doing Assignment 1:
method | pre-processing | best accuracy |
---|---|---|
KNN | reshape 32x32x3 into 3072 | 28% with K=10 |
1-layer SVM | reshape 32x32x3 into 3072; zero-center each image (subtract the mean of the training set); append a bias column (initialised to 1) to each image | training: 37%, validation: 38% with lr=1e-7, reg=5e4 |
1-layer Softmax | same as SVM above | training: 33%, validation: 34% with lr=1e-7, reg=2.5e4 |
2-layer | reshape 32x32x3 into 3072 | validation: 53.8%, test: 52.7% with lr=1e-3, reg=0.5, epochs=20, H=100 |
1-layer SVM on features | extract 2 features (HOG, color histogram) for each image; zero-center the feature values; normalise the feature values; add a bias dimension | test: 41.4% |
2-layer on features | same as above | test: 60.3% with lr=1.209071e-01, epochs=10, H=274, reg=1e-6 |
K-Nearest Neighbour (KNN)
The idea behind this approach is to compute the L2 distance between each test image and every training image, i.e. the sum of squared pixel differences (optionally followed by a square root). There is no training involved; all the distance calculation happens at test time.
Performance-wise on Colab with CPU only: using two for loops took 43s; one loop took 51s (with sqrt()) or 38s (without sqrt()); using numpy's broadcasting took less than 1s. A sketch of the one-loop version is included after the two-loop code below.
Two loops:
num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
    for j in range(num_train):
        # this takes 43s to run with sqrt, 35s without
        dists[i, j] = np.sum((self.X_train[j] - X[i]) ** 2)
return dists
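For comparison, a one-loop version (my sketch, reusing the same X and self.X_train as above) replaces the inner loop with broadcasting; dropping np.sqrt gives the faster 38s variant:
for i in range(num_test):
    # broadcast X[i] against every training row at once: (num_train, D) - (D,)
    dists[i, :] = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
return dists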
Vectorisation approach:
# using (I1 - I2)^2 = I1^2 + I2^2 - 2 * I1 * I2
# this takes about 1s
dists = np.sum(self.X_train ** 2, axis=1) \
        + np.sum(X ** 2, axis=1)[:, np.newaxis] \
        - 2 * np.dot(X, self.X_train.T)
return dists
The output dists stores num_test rows of distances with num_train columns each, so dists[i, j] is the (squared) L2 distance between the ith test image and the jth training image. The square root can be omitted because it does not change the ranking of neighbours.
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
(500, 5000)
Using KNN to predict an image's class then boils down to finding the indices of the K shortest distances in that image's row of dists, and taking the most frequent y-label among those K neighbours:
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # Use the distance matrix to find the k nearest neighbors of the ith
        # test point, and use self.y_train to look up their labels.
        indices = np.argsort(dists[i])[:k]
        closest_y = self.y_train[indices]
        # Find the most common label among the k neighbors. np.unique returns
        # the values sorted and argmax picks the first maximum, so ties are
        # broken by choosing the smaller label.
        values, counts = np.unique(closest_y, return_counts=True)
        y_pred[i] = values[counts.argmax()]
    return y_pred
Testing with various K values:
From the chart, K=10 seems to yield the highest validation accuracy. The test accuracy using this K is about 28%.
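The comparison was done with k-fold cross-validation. Roughly, it looks like the sketch below; the fold-splitting variable names and the k_choices values are mine, and classifier.train is assumed to behave like the assignment's KNN classifier (it just memorises the data):
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]   # example values to try
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
for k in k_choices:
    accuracies = []
    for fold in range(num_folds):
        # hold out one fold for validation, train on the rest
        X_val = X_train_folds[fold]
        y_val = y_train_folds[fold]
        X_tr = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        y_tr = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
        classifier.train(X_tr, y_tr)
        dists = classifier.compute_distances_two_loops(X_val)  # or the vectorised version
        y_val_pred = classifier.predict_labels(dists, k=k)
        accuracies.append(np.mean(y_val_pred == y_val))
    k_to_accuracies[k] = accuracies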
Linear Classifier SVM
Forward pass (sample a mini-batch, evaluate loss and gradient):
idx = np.random.choice(num_train, size=batch_size, replace=False)
X_batch = X[idx]
y_batch = y[idx]
# evaluate loss and gradient
loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)

Backward propagation (parameter update):
# perform parameter update
self.W -= learning_rate * grad
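Once trained, prediction for the linear classifier is just the class with the highest score. A small sketch, assuming the learned self.W:
scores = X.dot(self.W)              # (N, C)
y_pred = np.argmax(scores, axis=1)  # highest-scoring class per row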
def svm_loss_vectorized(W, X, y, reg):
    ####################
    # calculate loss
    ####################
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]
    scores = X.dot(W)
    # extract all S_yi into an (N, 1) column
    scores_yi = scores[np.arange(num_train), y][:, np.newaxis]
    margins = np.maximum(0, scores - scores_yi + 1)
    # set all yi elements to 0
    margins[np.arange(num_train), y] = 0
    loss = np.mean(np.sum(margins, axis=1))
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)

    ####################
    # calculate gradient
    ####################
    mask = np.zeros(margins.shape)
    # for positions where margins > 0, the gradient w.r.t. Sj is X[i]
    mask[margins > 0] = 1
    # for the yi positions it is -n * X[i], where n is the number of classes
    # with a positive margin, i.e. the row sum of the mask
    row_sum = np.sum(mask, axis=1)
    mask[np.arange(num_train), y] = -row_sum
    dW += np.dot(X.T, mask)
    dW /= num_train
    # Regularize: gradient of the reg * sum(W * W) term
    dW += 2 * reg * W
    return loss, dW
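A quick sanity check (my own addition, not from the post above): with near-zero weights every score is roughly 0, so each of the 9 incorrect classes contributes a margin of max(0, 0 - 0 + 1) = 1 and the loss should come out close to 9 when reg is 0. X_dev and y_dev here stand for any small subset of the bias-appended training data:
W = np.random.randn(3073, 10) * 0.0001   # 3072 pixels + 1 bias column, 10 classes
loss, grad = svm_loss_vectorized(W, X_dev, y_dev, 0.0)
print(loss)  # expect a value close to 9.0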
Linear Classifier Softmax
The only difference from the SVM is the loss and gradient calculation: the softmax loss for one example is -log(e^(s_yi) / sum_j e^(s_j)) = log(sum_j e^(s_j)) - s_yi, which is what the vectorised code below computes row by row:
def softmax_loss_vectorized(W, X, y, reg):
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    # *****calculate scores*****
    num_train = X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    # scores is an N x C matrix; shift each row for numerical stability
    scores -= np.max(scores, axis=1)[:, np.newaxis]
    # scores_y and scores_exp_sum are 1-dimensional with N elements
    scores_y = scores[np.arange(num_train), y]
    scores_exp_sum = np.sum(np.exp(scores), axis=1)

    # *****calculate loss*****
    losses = np.log(scores_exp_sum) - scores_y
    loss = np.sum(losses) / num_train
    loss += reg * np.sum(W ** 2)

    # *****calculate gradient*****
    # P = exp(scores) / scores_exp_sum, dimension is (N, C)
    # grad W_j  = P_j * x_i
    # grad W_yi = (P_yi - 1) * x_i
    P = np.exp(scores) / scores_exp_sum[:, np.newaxis]
    P[np.arange(num_train), y] -= 1
    dW += X.T.dot(P)
    dW /= num_train
    dW += reg * 2 * W
    return loss, dW
Visualising the learned weights:
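For reference, the weight images are produced by stripping the bias row, reshaping each class's weight vector back to 32x32x3 and rescaling it to 0..255. A rough sketch of this (best_svm is a hypothetical name for the trained classifier, and matplotlib is assumed):
import matplotlib.pyplot as plt

w = best_svm.W[:-1, :]            # drop the bias row -> (3072, 10)
w = w.reshape(32, 32, 3, 10)
classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    # rescale this class's weights into the 0..255 range so they display as an image
    wimg = 255.0 * (w[:, :, :, i] - w[:, :, :, i].min()) / (w[:, :, :, i].max() - w[:, :, :, i].min())
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])
plt.show()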
2-Layer Neural Network
Visualising the output of bad hyperparameters: slow learning rate, low accuracy, no distinct features (grainy, noisy).
Visualising the output of better hyperparameters (although the accuracy chart suggests overfitting).
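One common way to find such values is a random search over the learning rate, regularisation strength and hidden size. A rough sketch of that idea (my own illustration; TwoLayerNet and its train/predict signatures are assumed from the assignment and may differ):
best_val = -1
best_net = None
for _ in range(20):
    lr = 10 ** np.random.uniform(-4, 0)    # sample the learning rate on a log scale
    reg = 10 ** np.random.uniform(-6, 0)   # sample the regularisation strength on a log scale
    hidden = np.random.randint(50, 300)    # hidden layer size H
    net = TwoLayerNet(3072, hidden, 10)    # input D=3072, C=10 classes
    net.train(X_train, y_train, X_val, y_val,
              learning_rate=lr, reg=reg, num_iters=1000)
    val_acc = np.mean(net.predict(X_val) == y_val)
    if val_acc > best_val:
        best_val, best_net = val_acc, net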
Features
'Manually' extract two features for each image: a Histogram of Oriented Gradients (HOG) and a color histogram, then use these features as the input to the classifiers.
The best-accuracy results show that using these features is more effective than using the raw pixels alone.
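Roughly how the feature matrix is built (a sketch: extract_features, hog_feature and color_histogram_hsv are helper functions that ship with the assignment, so the import path and exact signatures may differ):
from cs231n.features import extract_features, hog_feature, color_histogram_hsv

num_color_bins = 10   # bins for the color histogram
feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]
X_train_feats = extract_features(X_train, feature_fns)
# zero-center and normalise each feature column, then add a bias dimension
X_train_feats -= np.mean(X_train_feats, axis=0, keepdims=True)
X_train_feats /= np.std(X_train_feats, axis=0, keepdims=True)
X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])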
Some interesting visuals:
I am having so much fun following this course that I am going to explore more of the AI-related courses.