Romen's eSpace: 2024

Sunday, 13 October 2024

AI Generated SRT Update

Two years ago, I wrote some js and awk scripts to modify the timestamps in subtitle (.srt) files. Today I am facing a similar problem, but this time the time difference is over a minute, so the previous scripts need to be updated. I am too lazy to rewrite those scripts, and AI has advanced a lot in the last 2 years. So I asked Copilot to do it for me, and here is the code it generated:

import re
import datetime

def shift_time(timestamp, delta):
    time_format = "%H:%M:%S,%f"
    time_obj = datetime.datetime.strptime(timestamp, time_format)
    new_time = time_obj + delta
    return new_time.strftime(time_format)[:-3]

def modify_srt(file_path, minutes, seconds, direction):
    delta = datetime.timedelta(minutes=minutes, seconds=seconds)
    if direction == 'backward':
        delta = -delta

    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    time_pattern = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})')
    with open(file_path, 'w', encoding='utf-8') as file:
        for line in lines:
            match = time_pattern.match(line)
            if match:
                start_time = shift_time(match.group(1), delta)
                end_time = shift_time(match.group(2), delta)
                line = f"{start_time} --> {end_time}\n"
            file.write(line)

if __name__ == "__main__":
    file_path = input("Enter the path to your .srt file: ")
    minutes = int(input("Enter the number of minutes to shift: "))
    seconds = int(input("Enter the number of seconds to shift: "))
    direction = input("Enter 'forward' to move times forward or 'backward' to move times backward: ").strip().lower()
    modify_srt(file_path, minutes, seconds, direction)
    print("Subtitle times have been modified successfully.")

To run it:

PS D:\projects\srt_update> python srt_update.py
Enter the path to your .srt file: input.srt
Enter the number of minutes to shift: 0
Enter the number of seconds to shift: 20
Enter 'forward' to move times forward or 'backward' to move times backward: forward
Subtitle times have been modified successfully.
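One edge case worth noting: as far as I can tell, because the generated shift_time goes through datetime.strptime, a backward shift larger than the timestamp wraps around to 23:59:xx of the previous day instead of stopping at zero. Below is a minimal sketch of an alternative that does the arithmetic on timedelta values and clamps at 00:00:00,000. The function name shift_timestamp and the clamping behaviour are my own assumptions, not part of the Copilot output.

```python
import datetime
import re

# HH:MM:SS,mmm as used in .srt files
TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(timestamp, delta):
    """Shift an SRT timestamp by a timedelta, clamping at zero (hypothetical helper)."""
    h, m, s, ms = (int(part) for part in TIME_RE.fullmatch(timestamp).groups())
    t = datetime.timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + delta
    t = max(t, datetime.timedelta(0))                 # never go below 00:00:00,000
    total_ms = t // datetime.timedelta(milliseconds=1)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(shift_timestamp("00:59:50,250", datetime.timedelta(minutes=1, seconds=20)))
print(shift_timestamp("00:00:10,500", datetime.timedelta(seconds=-20)))
```

Working with timedelta also avoids the 24-hour limit of the %H format, although that hardly matters for subtitles.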

Wednesday, 18 September 2024

TikTokenizer used by GPT

Following Karpathy's tokenisation lecture, here is a fun conversation with MS Copilot:

Sent by you: how many 'l' are there in .DefaultCellStyle? No toolsies, just do it

Sent by Copilot:
The string .DefaultCellStyle contains 3 instances of the letter ‘l’.

Would you like help with anything else? 😊

Sent by you: how many 'l' are there in .DefaultCellStyle?

Sent by Copilot:
The string .DefaultCellStyle contains 3 instances of the letter ‘l’.

Would you like help with anything else? 😊

Sent by you: count the number of 'l' in .DefaultCellStyle in three steps: step 1, separate each letter in the word with speces; step 2, reverse the order of the separated letters; step 3, highlight each 'l' in the reversed word, then count how many letters are hightlighted

Sent by Copilot:
Sure! Let’s go through the steps:

Step 1: Separate each letter in the word with spaces
.DefaultCellStyle becomes:

. D e f a u l t C e l l S t y l e

Step 2: Reverse the order of the separated letters
Reversing the order, we get:

e l y t S l l e C t l u a f e D .

Step 3: Highlight each ‘l’ in the reversed word, then count how many letters are highlighted
Highlighting each ‘l’, we get:

e **l** y t S **l** **l** e C t **l** u a f e D .

Now, let’s count the highlighted ‘l’ characters:

There are 4 instances of the letter ‘l’.
Would you like to try another example or need help with something else? 😊

Sent by you: based on the above, how many 'l' are there in the work .DefaultCellStyle

Sent by Copilot:
Based on the steps we followed, the string .DefaultCellStyle contains 4 instances of the letter ‘l’.

Would you like to try another example or need help with something else? 😊

Sent by you: how many 'l' are there in the work .DefaultCellStyle

Sent by Copilot:
The string .DefaultCellStyle contains 3 instances of the letter ‘l’.

Would you like help with anything else? 😊
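For the record, a character-level count in plain Python (no tokeniser involved) agrees with the step-by-step answer rather than the one-shot answer. This is only a sanity check of the string itself; it says nothing about how Copilot tokenises it.

```python
word = ".DefaultCellStyle"

# Count the letter directly, character by character.
count = word.count("l")
positions = [i for i, ch in enumerate(word) if ch == "l"]

print(count)       # 4
print(positions)   # [6, 10, 11, 15]
```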

Saturday, 3 August 2024

Using Colab with GitHub Files

Colab supports Jupyter notebooks from GitHub through OAuth out of the box. The notebook can also be pushed to GitHub using the Colab File -> Save a copy in GitHub menu.

However, I also have .py files that I created and that are imported by the notebook. To push these files to GitHub, a GitHub Personal Access Token (PAT) needs to be created. The instructions are available here.

Once the token is created, put the following code in Colab:

GITHUB_ACCESS_TOKEN='put PAT here'
!git clone https://$GITHUB_ACCESS_TOKEN:x-oauth-basic@github.com/romenlaw/NaiveNeuralNetwork
%cd NaiveNeuralNetwork

Then the git push can be executed:

!git config --global user.email "my github user email"
!git config --global user.name "my github user name"

!git add NaiveValue.py
!git status

!git commit -m "commit from colab"

!git push origin main

This way, I can overcome the workplace firewall constraints and fully utilise Colab and other online IDEs including Kaggle.com.

For a development environment, it is crucial to enable auto-reload:

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The only drawback of this approach is that the Colab virtual machine can be lost and reallocated, in which case the files in Colab are wiped out. So make sure to push to GitHub every now and then.

❗There is actually another problem with this approach: the files saved using the git command and the Colab notebook saved using the Colab menu can create conflicts because they are treated as separate sessions by GitHub. Therefore, it's better to save the individual files throughout the session and only save the notebook at the end of it to avoid such conflicts.

Saturday, 20 July 2024

Image Captioning with Transformer

The architecture:

CaptioningTransformer(
  (visual_projection): Linear(in_features=512, out_features=256, bias=True)
  (embedding): Embedding(1004, 256, padding_idx=0)
  (positional_encoding): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): TransformerDecoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (self_attn): MultiHeadAttention(
          (key): Linear(in_features=256, out_features=256, bias=True)
          (query): Linear(in_features=256, out_features=256, bias=True)
          (value): Linear(in_features=256, out_features=256, bias=True)
          (proj): Linear(in_features=256, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.1, inplace=False)
        )
        (multihead_attn): MultiHeadAttention(
          (key): Linear(in_features=256, out_features=256, bias=True)
          (query): Linear(in_features=256, out_features=256, bias=True)
          (value): Linear(in_features=256, out_features=256, bias=True)
          (proj): Linear(in_features=256, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.1, inplace=False)
        )
        (linear1): Linear(in_features=256, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
        (activation): ReLU()
      )
    )
  )
  (output): Linear(in_features=256, out_features=1004, bias=True)
)
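As a rough size check, the number of learnable parameters can be worked out from the shapes printed above. This is a back-of-the-envelope sketch; it assumes that PositionalEncoding and the Dropout/ReLU modules hold no learnable parameters, which the printout does not confirm. The helper names linear and layer_norm are mine.

```python
def linear(n_in, n_out):
    """Weights plus bias of a Linear layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale plus shift of a LayerNorm with elementwise_affine=True."""
    return 2 * n

attention = 4 * linear(256, 256)                  # key, query, value, proj
decoder_layer = (
    2 * attention                                 # self_attn + multihead_attn
    + linear(256, 2048) + linear(2048, 256)       # feed-forward block
    + 3 * layer_norm(256)                         # norm1..norm3
)
total = (
    linear(512, 256)                              # visual_projection
    + 1004 * 256                                  # embedding table
    + 2 * decoder_layer                           # two decoder layers
    + linear(256, 1004)                           # output
)
print(decoder_layer, total)                       # 1578752 3803884
```

So under these assumptions the model has about 3.8 million parameters, most of them in the two decoder layers.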

Sunday, 14 July 2024

Image Captioning with RNN and LSTM

The image captioning RNN architecture in assignment 3:

Forward and backward passes through the above network:

        # (1) CNN features (N, D) -> hidden state projection layer
        h0, affine_cache = affine_forward(features, W_proj, b_proj) # (N, H)

        # (2) word embedding, captions_in (N, T), output (N, T, W)
        # captions_in contains indices of words in the dictionary
        embedded_words, word_eb_cache = word_embedding_forward(captions_in, W_embed)

        # (3) vanilla RNN or LSTM, h (N, T, H)
        if self.cell_type == 'rnn':
          h, rnn_cache = rnn_forward(embedded_words, h0,  Wx, Wh, b)
        else:
          h, rnn_cache = lstm_forward(embedded_words, h0,  Wx, Wh, b)

        # (4) temporal affine, output (N, T, V) which is the score of each word in
        # vocab (range(V)). Index with highest value is predicted word.
        x, temp_cache = temporal_affine_forward(h, W_vocab, b_vocab)

        # (5) softmax
        loss, dout = temporal_softmax_loss(x, captions_out, mask)

        # find grads by calling backwards functions
        dh, dW_vocab, db_vocab = temporal_affine_backward(dout, temp_cache)
        if self.cell_type=='rnn':
          dx, dh0, dWx, dWh, db = rnn_backward(dh, rnn_cache)
        else:
          dx, dh0, dWx, dWh, db = lstm_backward(dh, rnn_cache)
        dW_embed = word_embedding_backward(dx, word_eb_cache)
        d_feature, dW_proj, db_proj = affine_backward(dh0, affine_cache)
        
        grads = {
          "W_proj": dW_proj,
          "b_proj": db_proj,
          "W_embed": dW_embed,
          "Wx": dWx,
          "Wh": dWh,
          "b": db,
          "W_vocab": dW_vocab,
          "b_vocab": db_vocab
        }

When using this network for sampling, captions_in is populated with the <SOS> token only.
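To make the sampling case concrete, here is a minimal sketch of greedy decoding with a vanilla RNN cell. It is not the assignment code: the function greedy_sample, the parameter dictionary and the random weights are illustrative assumptions. Every caption starts from <SOS>, and the word predicted at each step is fed back in as the next input.

```python
import numpy as np

def greedy_sample(features, params, start_token, max_length=5):
    """Greedy caption sampling with a vanilla RNN (illustrative names only)."""
    N = features.shape[0]
    h = features @ params["W_proj"] + params["b_proj"]    # (N, H) initial hidden state
    word = np.full(N, start_token)                        # (N,) every caption starts with <SOS>
    captions = np.zeros((N, max_length), dtype=int)
    for t in range(max_length):
        x = params["W_embed"][word]                       # (N, W) embedding lookup
        h = np.tanh(x @ params["Wx"] + h @ params["Wh"] + params["b"])  # one RNN step
        scores = h @ params["W_vocab"] + params["b_vocab"]              # (N, V)
        word = scores.argmax(axis=1)                      # pick the highest-scoring word
        captions[:, t] = word
    return captions

# Tiny random model, only to show the shapes involved.
rng = np.random.default_rng(0)
N, D, H, W, V = 2, 6, 5, 4, 7
params = {
    "W_proj": rng.normal(size=(D, H)), "b_proj": np.zeros(H),
    "W_embed": rng.normal(size=(V, W)),
    "Wx": rng.normal(size=(W, H)), "Wh": rng.normal(size=(H, H)), "b": np.zeros(H),
    "W_vocab": rng.normal(size=(H, V)), "b_vocab": np.zeros(V),
}
captions = greedy_sample(rng.normal(size=(N, D)), params, start_token=1)
print(captions.shape)   # (2, 5)
```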

Saturday, 29 June 2024

First Art Sale

It's been a fruitful month. I joined the local art society this year and participated in the Half-Price Sale for the first time. It was on the weekend of June 1st and 2nd. Despite the terrible weather on Saturday, we had a pretty good turnout overall. I managed to sell one of the four paintings that I entered in the sale. They are all landscapes from our Tasmania holiday 2 years ago.

However, the biggest fun I had this month was learning neural networks following the cs231n course. I am up to the last 2 parts of assignment 2. I hope I will finish it this weekend!

Saturday, 22 June 2024

Calculating Gradient of Batch Normalisation

Part of cs231n assignment 2 is to calculate the gradients of the Batch Normalisation (BN) layer. Here are the equations for the BN forward pass:

X = [\begin{matrix} x_{1} \\ x_{2} \\ ... \\ x_{N} \end{matrix}] with dimension (N, D)

μ = \frac{1}{N} \sum_{k =1}^{N} x_{k} with dimension (D,)

v = \frac{1}{N} \sum_{k =1}^{N} (x_{k} - μ)^{2} with dimension (D,)

σ = \sqrt{v + ε} with dimension (D,)

y_{i} = \frac{x_{i} - μ}{σ} where y_{i} has dimension (D,) and Y (or x_hat) is (N, D)

The basic partial derivatives of the above equations are as follows. They are the building blocks to find the final ∂L/∂x.
L is the loss function and

\frac{\partial L}{\partial Y} = γ \times dout, of dimension (N, D)

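\frac{\partial μ}{\partial x_{i}} = \frac{1}{N}, because each x_{i} contributes equally to the mean. The gradient flowing back through μ is therefore the batch sum of the upstream gradient scaled by \frac{1}{N}, of dimension (D,).

\frac{\partial v}{\partial μ} = - \frac{1}{N} \sum_{k =1}^{N} (2 x_{k} - 2 μ) = - \frac{2}{N} \sum_{k =1}^{N} (x_{k} - μ) of dimension (D,) .

This turns out to be 0 because \sum_{k =1}^{N} x_{k} = N μ, i.e. the sum of the x_{k} and the sum of N copies of μ are the same.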

However,

\frac{\partial v}{\partial x_{i}} = \frac{2}{N} (x_{i} - μ)

\frac{\partial σ}{\partial v} = 0.5 \times \frac{1}{\sqrt{v + ε}} = \frac{1}{2 σ}

\frac{\partial y_{i}}{\partial μ} = - \frac{1}{σ}

\frac{\partial y_{i}}{\partial σ} = - \frac{x_{i} - μ}{σ^{2}}

Thanks to this post, I understand the process using the computational graph. The following table shows the computational graph: top-down is the forward pass in black; bottom-up is the backward pass in red.

Node	Forward pass (read top-down)	Backward pass (read bottom-up): gradient the node passes back
(1)	x, shape (N,D)	dx = d(3) + d(2) * 1/N * np.ones((N,D)), where 1/N * np.ones((N,D)) = ∂μ/∂x
(2)	mean = $\frac{1}{N} \sum_{i =1}^{N} x_{i}$	d(2) = (-1) * (d(4) + d(8)).sum(axis=0)
(3)	(1) - (2)	d(3) = d(4) + d(8), the sum of the gradients coming back from (4) and (8)
(4)	(3)**2	d(4) = d(5) * 2 * (3), using ∂v/∂x
(5)	var = $\frac{1}{N} \sum_{i =1}^{N} {(4)}_{i}$	d(5) = d(6) * 1/N * np.ones((N,D))
(6)	std = sqrt((5) + ε)	d(6) = d(7) * 0.5 * 1/std, using ∂σ/∂v; this is ∂L/∂v
(7)	1 / (6)	d(7) = [d(10) * (3)].sum(axis=0) * [-1 / (6)**2], using ∂Y/∂σ; this is ∂L/∂σ
(8)	(3) * (7)	d(8) = d(10) * (7)
(9)	γ, shape (D,)	dγ = (dout * (8)).sum(axis=0)
(10)	(8) * (9)	d(10) = dout * γ = ∂L/∂Y
(11)	β, shape (D,)	dβ = dout.sum(axis=0)
(12)	(10) + (11) = out, shape (N,D), which feeds the Loss	dout

In Python:

    x, sample_mean, sample_var, sample_std, gamma, x_hat, eps = cache
    N, D=dout.shape
    
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    # using computational graph in https://romenlaw.blogspot.com/2024/06/calculating-gradient-using-computation.html
    step3 = x-sample_mean

    d10 = dout * gamma
    d8_3 = d10 * (1/sample_std)          
    d8_7 = (d10 * step3).sum(axis=0)     
    d7 = - d8_7 / (sample_var + eps)     
    d6 = d7 * 0.5 / sample_std           
    d5 = d6 / N * np.ones(shape=(N,D))   
    d4 = d5 * 2 * step3                  
    d3_1 = d4 + d8_3                     # (N,D)
    d3_2 = -1 * (d4 + d8_3).sum(axis=0)  # (D,)
    d2 = d3_2 / N * np.ones(shape=(N,D)) # (N,D)
    dx = d2 + d3_1

Intuitively, it's like following the 3 paths from Y to X directed by the red arrows. Now let's do it the analytical way using the chain rule.

From page 4 of the original paper https://arxiv.org/pdf/1502.03167 we have the formulae for the derivatives. However, since the equations used in assignment 2 are different from the paper (especially how the variance and mean are used in calculating x_hat or y), we can rewrite the derivatives using the notation of assignment 2:

\frac{\partial L}{\partial x_{i}} = (\frac{\partial L}{\partial μ} \frac{\partial μ}{\partial x_{i}} + \frac{\partial L}{\partial v} \frac{\partial v}{\partial μ} \frac{\partial μ}{\partial x_{i}}) + \frac{\partial L}{\partial v} \frac{\partial v}{\partial x_{i}} - \frac{\partial L}{\partial y_{i}} \frac{\partial y_{i}}{\partial μ}

The last term is the direct path through y_{i}, written this way because \frac{\partial y_{i}}{\partial x_{i}} = \frac{1}{σ} = - \frac{\partial y_{i}}{\partial μ}. Every sample in the batch depends on μ and v, so the paths through μ and v are sums over the batch index k:

= (\sum_{k =1}^{N} \frac{\partial L}{\partial y_{k}} \frac{\partial y_{k}}{\partial μ}) \frac{\partial μ}{\partial x_{i}} + (\sum_{k =1}^{N} \frac{\partial L}{\partial y_{k}} \frac{\partial y_{k}}{\partial σ}) \frac{\partial σ}{\partial v} \frac{\partial v}{\partial μ} \frac{\partial μ}{\partial x_{i}} + (\sum_{k =1}^{N} \frac{\partial L}{\partial y_{k}} \frac{\partial y_{k}}{\partial σ}) \frac{\partial σ}{\partial v} \frac{\partial v}{\partial x_{i}} - \frac{\partial L}{\partial y_{i}} \frac{\partial y_{i}}{\partial μ}

= [(\sum_{k =1}^{N} dout_{k} γ (- \frac{1}{σ})) \frac{1}{N}

- (\sum_{k =1}^{N} dout_{k} γ (- \frac{x_{k} - μ}{σ^{2}})) \frac{1}{2 σ} \frac{2}{N} \sum_{j =1}^{N} (x_{j} - μ) \frac{1}{N}]

+ (\sum_{k =1}^{N} dout_{k} γ (- \frac{x_{k} - μ}{σ^{2}})) \frac{1}{2 σ} \frac{2}{N} (x_{i} - μ)

- dout_{i} γ (- \frac{1}{σ})

(the minus sign in front of the second line comes from \frac{\partial v}{\partial μ} = - \frac{2}{N} \sum_{j =1}^{N} (x_{j} - μ))

= \frac{1}{N} \sum_{j =1}^{N} [- \frac{dout_{j} γ}{σ} + \frac{1}{N} (\sum_{k =1}^{N} \frac{dout_{k} γ y_{k}}{σ}) y_{j}] ← in the Python code below, this is d_mu_x

- \frac{1}{N} (\sum_{k =1}^{N} \frac{dout_{k} γ y_{k}}{σ}) y_{i} ← in the Python code below, this is dL_v_x

+ \frac{dout_{i} γ}{σ} ← in the Python code below, this is -dL_mu (the code subtracts dL_mu)

In Python code (modified slightly from here for readability):

    x, mean, var, std, gamma, x_hat, eps = cache
    S = lambda x: x.sum(axis=0)                     # helper function
    
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    N = dout.shape[0]  # dout dimension (N,D)
    dx = dout * gamma / (N * std)          # temporarily initialize scale value
    
    dL_v_x = -S(dx*x_hat)*x_hat
    dL_mu = - N*dx

    dL_mu2 = -dL_v_x
    d_mu_x = S(-dx + dL_mu2)  #*np.ones(x.shape)
    #d_mu_x = -S(dx)

    dx = dL_v_x - dL_mu + d_mu_x

The dx difference between this and the above method is about 1e-10. Curiously, the standard answer ignores the dL_mu2 term but yields a better result of 5e-13. I wondered why. Two months later, after watching Andrej Karpathy's lecture on backprop, I realised that ∂v/∂μ is actually 0, which is why the dL_mu2 term can be dropped.
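With ∂v/∂μ = 0, the three paths collapse into one compact expression: dx = (g - mean(g) - x_hat * mean(g * x_hat)) / σ, where g = dout * γ and the means are taken over the batch. The sketch below checks that expression against a centred numerical gradient. It assumes the forward pass out = γ * x_hat + β with x_hat = (x - μ) / sqrt(v + ε); the function names are illustrative and not taken from the assignment.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - mu) / std
    return gamma * x_hat + beta, (x_hat, std, gamma)

def batchnorm_backward(dout, cache):
    x_hat, std, gamma = cache
    g = dout * gamma                                   # dL/dY, shape (N, D)
    dx = (g - g.mean(axis=0) - x_hat * (g * x_hat).mean(axis=0)) / std
    dgamma = (dout * x_hat).sum(axis=0)
    dbeta = dout.sum(axis=0)
    return dx, dgamma, dbeta

def numerical_dx(x, gamma, beta, dout, h=1e-5):
    """Centred finite differences of sum(out * dout) with respect to x."""
    dx = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        plus = (batchnorm_forward(x, gamma, beta)[0] * dout).sum()
        x[idx] = old - h
        minus = (batchnorm_forward(x, gamma, beta)[0] * dout).sum()
        x[idx] = old
        dx[idx] = (plus - minus) / (2 * h)
    return dx

rng = np.random.default_rng(0)
x = 5 * rng.normal(size=(4, 3)) + 12
gamma, beta = rng.normal(size=3), rng.normal(size=3)
dout = rng.normal(size=(4, 3))

out, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
dx_num = numerical_dx(x, gamma, beta, dout)
print(np.abs(dx - dx_num).max())   # should be tiny
```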

Sunday, 16 June 2024

Summary of a Fully Connected Neural Network

I usually spend my weekends painting. For the last couple of weeks, however, I have been learning Deep Learning following the cs231n course. Now that I have just finished Assignment 1, the two main things I have learned are the theory/maths taught in the course and how to use numpy to implement them. Here is my summary of what I have learned, using a neural network with two fully connected layers.

The architecture (Forward pass should be read from bottom up; Back propagation is top down):

Output layer
Number of nodes (classes): C; scores: (C,)

Forward. Loss function: Softmax(f(x))
$L_{i} = -ln (\frac{e^{s_{y_{i}}}}{\sum_{j}^{} e^{s_{j}}}) = - s_{y_{i}} + ln \sum_{j}^{} e^{s_{j}}$
$L = \frac{1}{N} \sum_{i =1}^{N} L_{i} + R (W)$
Regularisation: $R (W) = \frac{1}{2} λ \sum_{k}^{} \sum_{l}^{} {W_{k,l}}^{2}$

    # x is the output of the previous layer
    N = x.shape[0]
    P = np.exp(x - x.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.log(P[range(N), y]).sum() / N
    loss += 0.5 * self.reg * (np.sum(self.params['W2']**2) + np.sum(self.params['W1']**2))

Backward. Gradients:
$\frac{\partial L_{i}}{\partial s_{j}} = P_{j}$ for j ≠ y_i, and $\frac{\partial L_{i}}{\partial s_{y_{i}}} = P_{y_{i}} - 1$

    # x is the scores
    # P = exp(scores) / scores_exp_sum, dimension is (N,C)
    # grad x_j = Pj
    # grad x_yi = Pyi-1
    N = x.shape[0]
    P = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stable exponents
    P /= P.sum(axis=1, keepdims=True)             # row-wise probabilities (softmax)
    P[range(N), y] -= 1
    dx = P / N

Fully Connected Layer #2
W2: (H, C); b2: (C,)

Forward. f(x) = W₂x + b₂

    # X is the output of the previous layer
    scores = X.dot(W) + b

Backward. Gradients FC2: $\frac{\partial R}{\partial W} = λ W$
Tip: use dimension analysis! Note that you do not need to remember the expressions for `dW` and `dX` because they are easy to re-derive based on dimensions.

    # dout is the gradient passed in from the Output layer
    # i.e. the dx from above
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], np.prod(x.shape[1:])).T.dot(dout)
    dw += self.reg * w   # gradient of the regularisation term
    db = np.sum(dout, axis=0)

Fully Connected Layer #1
Number of nodes: H; W1: (D, H); b1: (H,)

Forward. Activation: ReLU(f(x))

    out = np.maximum(0, out)

f(x) = W₁x + b₁

    out = input.dot(w) + b

Backward. Gradients FC1, ReLU backward:

    # dout is gradient from above layer FC2
    # i.e. the dx from above
    dx = dout * (x > 0)   # pass the gradient only where the input was positive

f(x) backward: same as FC2 layer above

Input
Input data dimension: D; number of input data/rows: N; X: (N, D)

The input images are (32, 32, 3), which is reshaped into 32 x 32 x 3 = 3072, i.e. D = 3072

    # reshape x into (N,D)
    input = x.reshape(x.shape[0], np.prod(x.shape[1:]))
    # or better
    input = x.reshape(x.shape[0], -1)

Here is a good summary of the different optimisation algorithms: https://www.youtube.com/watch?v=spbBQshdhL4

Some of my learnings from doing assignment 1:

method	pre-process	best accuracy
KNN	reshaping 32x32x3 into 3072	28% with K=10
1-layer SVM	reshaping 32x32x3 into 3072, zero center each image (by subtracting mean of training set), append bias (initialised to 1) as extra column for each image	training: 37% validation: 38% with lr=1e-7 reg=5e4
1-layer Softmax	same as SVM above	training: 33% validation: 34% with lr=1e-7 reg=2.5e4
2-layer	reshaping 32x32x3 into 3072	validation: 53.8% test: 52.7% with lr=1e-3 reg=0.5 epochs=20 H size=100
1-layer SVM on features	extract 2 features (HOG, color histogram) for each image, zero-center the feature values, normalise the feature values, add bias dimension	SVM test = 41.4%
2-layer on features	same as above	test = 60.3% with lr=1.209071e-01 epochs=10 H=274 reg=0.000001

K-Nearest Neighbour (KNN)

The idea behind this approach is to compute the L2 distance between each test image and every training image: the squared pixel differences are summed up over the whole image. There is no training involved; the distance calculation happens at test time.

Performance-wise, on Colab with CPU only, using two for loops took 43s, one loop took 51s (using sqrt()) or 38s (without sqrt()), and using NumPy's broadcasting feature took less than 1s.

Two loops:

        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # this takes 43s to run with sqrt, 35s without.
                dists[i,j]=np.sum((self.X_train[j]-X[i])**2)
        return dists

Vectorisation approach:

        # using (I1-I2)^2 = I1^2+I2^2-2*I1*I2
        # this takes 1s
        dists = np.sum(self.X_train ** 2, axis=1) \
          + (np.sum(X ** 2, axis=1))[:, np.newaxis] \
          -2 * np.dot(X, self.X_train.T)

        return dists
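To convince myself that the expansion (I1-I2)^2 = I1^2 + I2^2 - 2*I1*I2 gives the same numbers as the loops, here is a small self-contained check on random data. The standalone function names are mine, not from the assignment, and like the snippets above it compares squared distances (no sqrt).

```python
import numpy as np

def dists_two_loops(X_test, X_train):
    dists = np.zeros((X_test.shape[0], X_train.shape[0]))
    for i in range(X_test.shape[0]):
        for j in range(X_train.shape[0]):
            dists[i, j] = np.sum((X_train[j] - X_test[i]) ** 2)
    return dists

def dists_vectorised(X_test, X_train):
    test_sq = np.sum(X_test ** 2, axis=1)[:, np.newaxis]   # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)                # (num_train,)
    cross = X_test @ X_train.T                             # (num_test, num_train)
    return test_sq + train_sq - 2 * cross                  # broadcast to (num_test, num_train)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 12))
X_test = rng.normal(size=(3, 12))

slow = dists_two_loops(X_test, X_train)
fast = dists_vectorised(X_test, X_train)
print(fast.shape, np.abs(slow - fast).max())
```

The column vector test_sq and the row vector train_sq broadcast against each other, which is what removes both loops.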

The output dists has num_test rows; each row contains num_train columns, where dists[i, j] is the L2 distance between the ith test image and the jth training image.

dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

(500, 5000)

Using KNN to predict an image's class is basically finding the indices of the K shortest distances in its row of dists, then finding the most frequent y-label among those K elements:

    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # DONE:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            indices=np.argsort(dists[i])[:k]
            closest_y=self.y_train[indices]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # DONE:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            values, counts = np.unique(closest_y, return_counts=True)
            most_frequent_value = values[counts.argmax()]
            y_pred[i]=most_frequent_value

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred
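The tie-breaking rule ("choose the smaller label") falls out of np.unique and argmax without extra code: np.unique returns the labels sorted in ascending order, and argmax returns the first maximum. A tiny example with made-up labels:

```python
import numpy as np

closest_y = np.array([3, 1, 3, 1, 2])   # labels 1 and 3 both appear twice

values, counts = np.unique(closest_y, return_counts=True)
predicted = values[counts.argmax()]

print(values, counts, predicted)        # [1 2 3] [2 1 2] 1
```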

Testing with various K values:

From the chart, the value K=10 seems to yield the highest accuracy. The testing result using this K value is about 28%.

Linear Classifier SVM

Forward pass:

    idx = np.random.choice(num_train, size=batch_size, replace=False)
    X_batch = X[idx]
    y_batch = y[idx]
    # evaluate loss and gradient
    loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)

Backward propagation:

    # perform parameter update
    self.W -= learning_rate * grad

The loss and gradient calculation:

def svm_loss_vectorized(W, X, y, reg):
    ####################
    # calculate loss
    ####################
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    num_train=X.shape[0]
    scores=X.dot(W)
    # extract all Syi into an Nx1 matrix (a column)
    scores_yi=scores[np.arange(num_train) , y][: , np.newaxis]
    margins = np.maximum(0, scores - scores_yi + 1)  
    # set all yi elements to 0
    margins[np.arange(num_train),y] = 0
    
    loss = np.mean(np.sum(margins, axis=1))
    # Add regularization to the loss.
    loss += reg * np.sum(W * W)
 
    ####################
    # calculate gradient
    ####################
    mask = np.zeros(margins.shape)    
    # for positions where margins>0, the gradient at Sj is X[i]
    mask[margins > 0] = 1
    
    # for Yi positions, it's -nX[i], where n is number of times Syi appeared in
    # margins, which is the sum of all appearances of Sj
    row_sum = np.sum(mask, axis=1)
    mask[np.arange(num_train), y] = -row_sum.T

    dW += np.dot(X.T, mask)
    dW /= num_train

    # Regularize: the loss adds reg * sum(W * W), so its gradient is 2 * reg * W
    dW += 2 * reg * W

    return loss, dW
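A tiny worked example helps to see what margins and mask look like. The scores and labels below are made up for illustration (2 samples, 3 classes), and regularisation is left out.

```python
import numpy as np

scores = np.array([[3.0, 1.0, 2.0],
                   [1.0, 4.0, 0.0]])
y = np.array([2, 0])                       # correct classes
rows = np.arange(len(y))

# hinge margins: max(0, s_j - s_yi + 1), with the correct class zeroed out
margins = np.maximum(0, scores - scores[rows, y][:, np.newaxis] + 1)
margins[rows, y] = 0
loss = margins.sum(axis=1).mean()

# gradient with respect to the scores: +1 where a margin is violated,
# minus the number of violations at the correct class
mask = (margins > 0).astype(float)
mask[rows, y] = -mask.sum(axis=1)

print(margins)   # [[2. 0. 0.] [0. 4. 0.]]
print(loss)      # 3.0
print(mask)      # [[ 1.  0. -1.] [-1.  1.  0.]]
```

Multiplying X.T by this mask and dividing by the number of samples gives the data part of dW, as in the function above.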

Visualising the learned weights:

Linear Classifier Softmax

The only difference here is the loss and gradient calculation:

def softmax_loss_vectorized(W, X, y, reg):

    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)


    # *****calculate scores*****
    num_train=X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    # scores is N x C matrix
    scores -= np.max(scores, axis=1)[:, np.newaxis]
    # scores_y and scores_exp_sum are 1-dimensional with N elements
    scores_y = scores[np.arange(num_train),y]
    scores_exp_sum = np.sum(np.exp(scores), axis=1)

    # *****calculate loss*****
    losses = np.log(scores_exp_sum) - scores_y
    loss=np.sum(losses) / num_train
    loss += reg*np.sum(W**2)

    # *****calculate gradient*****
    # P=exp(scores) / scores_exp_sum, dimension is (N,C)
    # grad Wj = Pj * xi
    # grad Wyi = (Pyi-1) * xi
    P=np.exp(scores) / scores_exp_sum[:, np.newaxis]
    P[np.arange(num_train), y] -= 1 
    dW += X.T.dot(P)
    dW /= num_train
    dW += reg * 2 * W

    return loss, dW
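A quick way to check an analytic gradient like this one is to compare it with a centred numerical gradient on a small random problem. The sketch below re-implements the same loss in a compact form so that it is self-contained; the helper numerical_gradient and the test sizes are my own choices, not part of the assignment.

```python
import numpy as np

def softmax_loss(W, X, y, reg):
    N = X.shape[0]
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)                  # row-wise probabilities
    loss = -np.log(P[np.arange(N), y]).mean() + reg * np.sum(W ** 2)
    P[np.arange(N), y] -= 1
    dW = X.T @ P / N + 2 * reg * W
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Centred finite differences of the scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
W = 0.1 * rng.normal(size=(4, 3))

loss, dW = softmax_loss(W, X, y, reg=0.1)
dW_num = numerical_gradient(lambda w: softmax_loss(w, X, y, 0.1)[0], W)
max_error = np.abs(dW - dW_num).max()
print(loss, max_error)
```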

Visualising the weights:

2-Layer Neural Network

Visualising output of bad hyper parameters: slow learning rate, low accuracy, not distinct features (grainy, noisy)

Visualising output of better hyper parameters: (but the accuracy chart suggests overfitting)

Features

'Manually' extract 2 features for each image: Histogram of Oriented Gradients (HOG) and color histogram. Use these features as input for the networks.

The best accuracy results show that using features is more effective than the raw images alone.

Some interesting visuals: