Saturday 20 July 2024

Image Captioning with Transformer

 The architecture:



CaptioningTransformer(
  (visual_projection): Linear(in_features=512, out_features=256, bias=True)
  (embedding): Embedding(1004, 256, padding_idx=0)
  (positional_encoding): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): TransformerDecoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (self_attn): MultiHeadAttention(
          (key): Linear(in_features=256, out_features=256, bias=True)
          (query): Linear(in_features=256, out_features=256, bias=True)
          (value): Linear(in_features=256, out_features=256, bias=True)
          (proj): Linear(in_features=256, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.1, inplace=False)
        )
        (multihead_attn): MultiHeadAttention(
          (key): Linear(in_features=256, out_features=256, bias=True)
          (query): Linear(in_features=256, out_features=256, bias=True)
          (value): Linear(in_features=256, out_features=256, bias=True)
          (proj): Linear(in_features=256, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.1, inplace=False)
        )
        (linear1): Linear(in_features=256, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
        (activation): ReLU()
      )
    )
  )
  (output): Linear(in_features=256, out_features=1004, bias=True)
)
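
As a quick sanity check on the printout above, the learnable parameter count can be tallied by hand from the layer shapes. This is a minimal sketch, assuming the positional encoding table is a non-learned buffer (the printout lists only a Dropout inside it) and that each LayerNorm carries a weight and a bias; the helper names linear_params and layer_norm_params are made up for illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(dim):
    """elementwise_affine=True means one weight and one bias per feature."""
    return 2 * dim


# Sizes read off the printed module summary.
d_model, d_ff, d_visual, vocab = 256, 2048, 512, 1004

# key, query, value and proj are each a 256 -> 256 Linear.
attention = 4 * linear_params(d_model, d_model)

decoder_layer = (
    2 * attention                      # self_attn + multihead_attn
    + linear_params(d_model, d_ff)     # linear1
    + linear_params(d_ff, d_model)     # linear2
    + 3 * layer_norm_params(d_model)   # norm1..norm3
)

total = (
    linear_params(d_visual, d_model)   # visual_projection
    + vocab * d_model                  # embedding table
    + 2 * decoder_layer                # two decoder layers
    + linear_params(d_model, vocab)    # output
)

print(f"per decoder layer: {decoder_layer:,}")
print(f"total: {total:,}")
```

Under these assumptions each decoder layer has 1,578,752 parameters and the whole model 3,803,884, so the two decoder layers account for most of the model.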






Sunday 14 July 2024

Image Captioning with RNN and LSTM

 The image captioning RNN architecture in assignment 3:

Forward and backward passes through the above network:
        # (1) CNN features (N, D) -> hidden state projection layer
        h0, affine_cache = affine_forward(features, W_proj, b_proj) # (N, H)

        # (2) word embedding, captions_in (N, T), output (N, T, W)
        # captions_in contains indices of words in the dictionary
        embedded_words, word_eb_cache = word_embedding_forward(captions_in, W_embed)

        # (3) vanilla RNN or LSTM, h (N, T, H)
        if self.cell_type == 'rnn':
          h, rnn_cache = rnn_forward(embedded_words, h0,  Wx, Wh, b)
        else:
          h, rnn_cache = lstm_forward(embedded_words, h0,  Wx, Wh, b)

        # (4) temporal affine, output (N, T, V) which is the score of each word in
        # vocab (range(V)). Index with highest value is predicted word.
        x, temp_cache = temporal_affine_forward(h, W_vocab, b_vocab)

        # (5) softmax loss; dout is the gradient of the loss w.r.t. the scores x
        loss, dout = temporal_softmax_loss(x, captions_out, mask)

        # find grads by calling backwards functions
        dh, dW_vocab, db_vocab = temporal_affine_backward(dout, temp_cache)
        if self.cell_type=='rnn':
          dx, dh0, dWx, dWh, db = rnn_backward(dh, rnn_cache)
        else:
          dx, dh0, dWx, dWh, db = lstm_backward(dh, rnn_cache)
        dW_embed = word_embedding_backward(dx, word_eb_cache)
        d_feature, dW_proj, db_proj = affine_backward(dh0, affine_cache)
       
        grads = {
          "W_proj": dW_proj,
          "b_proj": db_proj,
          "W_embed": dW_embed,
          "Wx": dWx,
          "Wh": dWh,
          "b": db,
          "W_vocab": dW_vocab,
          "b_vocab": db_vocab
        }
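
Step (5) in the forward pass above is the least obvious one, because padded positions in captions_out must not contribute to the loss. The following is a rough NumPy sketch of a masked temporal softmax loss, not the assignment's own code: the function name is hypothetical, and it assumes the loss is summed over the unmasked time steps and averaged over the N captions.

```python
import numpy as np


def masked_temporal_softmax_loss(x, y, mask):
    """Cross-entropy over time steps, ignoring masked-out positions.

    x:    (N, T, V) scores for every word in the vocabulary
    y:    (N, T) indices of the ground-truth words
    mask: (N, T) boolean, True where the position counts toward the loss
    Returns the scalar loss and its gradient with respect to x.
    """
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T).astype(x.dtype)

    # Numerically stable softmax over the vocabulary axis.
    shifted = x_flat - x_flat.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    rows = np.arange(N * T)
    loss = -np.sum(mask_flat * np.log(probs[rows, y_flat])) / N

    # Gradient of softmax + cross-entropy, zeroed at masked positions.
    dx_flat = probs.copy()
    dx_flat[rows, y_flat] -= 1.0
    dx_flat *= mask_flat[:, None] / N
    return loss, dx_flat.reshape(N, T, V)
```

The returned gradient has the same (N, T, V) shape as the scores, which is what a temporal affine backward pass would expect as its upstream gradient.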

When this network is used for sampling, captions_in is populated with only the <SOS> token; each predicted word is then fed back in as the input for the next time step.
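
A minimal sketch of such a sampling loop, assuming a vanilla RNN cell of the form tanh(x Wx + h Wh + b) and greedy (argmax) word selection. The function name greedy_sample and its arguments are illustrative and not taken from the assignment; the params keys reuse the names from the grads dictionary above. The LSTM case would also need to carry a cell state and is left out.

```python
import numpy as np


def greedy_sample(features, params, start_idx, max_length=30):
    """Generate captions one word at a time, starting from the <SOS> index.

    features: (N, D) CNN features
    params:   dict with W_proj, b_proj, W_embed, Wx, Wh, b, W_vocab, b_vocab
    Returns an (N, max_length) integer array of predicted word indices.
    """
    N = features.shape[0]

    # (1) Project the image features to the initial hidden state, (N, H).
    h = features @ params["W_proj"] + params["b_proj"]

    # The first input word of every caption is <SOS>.
    word = np.full(N, start_idx, dtype=np.int64)
    captions = np.zeros((N, max_length), dtype=np.int64)

    for t in range(max_length):
        # (2) Embed the previous word, (N, W).
        x = params["W_embed"][word]
        # (3) One vanilla RNN step, (N, H).
        h = np.tanh(x @ params["Wx"] + h @ params["Wh"] + params["b"])
        # (4) Scores over the vocabulary, (N, V).
        scores = h @ params["W_vocab"] + params["b_vocab"]
        # Pick the highest-scoring word and feed it back in at the next step.
        word = scores.argmax(axis=1)
        captions[:, t] = word

    return captions
```

This sketch always generates max_length words; in practice a caption would typically be cut at the first end-of-sentence token when it is decoded back to text.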