Transformers Hichem Felouat hichemfel@nii.ac.jp
Hichem Felouat - hichemfel@nii.ac.jp - 2024 2 Contents 1. Natural Language Processing (NLP) 2. Self-Attention 3. Transformer 4. Vision Transformer (ViT) 5. Large Language Models 6. Vision Language Models
Hichem Felouat - hichemfel@nii.ac.jp - 2024 3 Natural Language Processing NLP • Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. • Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 4 Natural Language Processing NLP
Hichem Felouat - hichemfel@nii.ac.jp - 2024 5 Natural Language Processing NLP
Hichem Felouat - hichemfel@nii.ac.jp - 2024 6 Generic NLP Pipeline
Hichem Felouat - hichemfel@nii.ac.jp - 2024 7 Texts to Sequence/Matrix • In natural language processing (NLP), texts can be represented as a sequence or a matrix, depending on the task and the model type. texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"] • The total number of documents: 4 • The number of distinct words (tokenization): 8 • word_index: {'i': 1, 'love': 2, 'algeria': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'ai': 8} • texts_to_sequences: input ["Algeria love AI"] → [[3, 2, 8]] • sequences_to_texts: input [[3, 4, 7, 2, 8, 1, 3]] → ['algeria machine intelligence love ai i algeria']
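The conversions above can be reproduced with the Keras Tokenizer; the slides do not name a library, so TensorFlow/Keras is an assumption here, but its output matches the word_index shown:

```python
# A minimal sketch using the Keras Tokenizer (TensorFlow/Keras assumed, not stated on the slide)
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)             # builds the vocabulary from the 4 documents
print(tokenizer.word_index)               # {'i': 1, 'love': 2, 'algeria': 3, ..., 'ai': 8}

print(tokenizer.texts_to_sequences(["Algeria love AI"]))        # [[3, 2, 8]]
print(tokenizer.sequences_to_texts([[3, 4, 7, 2, 8, 1, 3]]))
# ['algeria machine intelligence love ai i algeria']
```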
Hichem Felouat - hichemfel@nii.ac.jp - 2024 8 Texts to Sequence/Matrix • binary: Whether or not each word is present in the document. This is the default. • count: The count of each word in the document. • freq: The frequency of each word as a ratio of words within each document. • tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document. texts = [ "blue car and blue window", "black crow in the window", "i see my reflection in the window" ]
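A short sketch of these four modes, again assuming the Keras Tokenizer and its texts_to_matrix method:

```python
# Document-term matrices with the Keras Tokenizer (texts_to_matrix modes)
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["blue car and blue window",
         "black crow in the window",
         "i see my reflection in the window"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

for mode in ("binary", "count", "freq", "tfidf"):
    matrix = tokenizer.texts_to_matrix(texts, mode=mode)
    print(mode, matrix.shape)   # one row per document, one column per vocabulary index
```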
Hichem Felouat - hichemfel@nii.ac.jp - 2024 9 Texts to Sequence/Matrix
Hichem Felouat - hichemfel@nii.ac.jp - 2024 10 Sequence Padding • Sequence padding is the process of adding zeroes or other filler tokens to sequences of variable length so that all sequences have the same length. • Many machine learning models require fixed-length inputs, and variable-length sequences can't be fed directly into these models. sequences = [ [1, 2, 3, 4], [1, 2, 3], [1] ] maxlen= 4 result: [[1 2 3 4] [0 1 2 3] [0 0 0 1]]
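A minimal sketch of the example above, assuming Keras pad_sequences, whose default 'pre' padding produces exactly the result shown:

```python
# Sequence padding with Keras pad_sequences (default padding='pre')
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
padded = pad_sequences(sequences, maxlen=4)
print(padded)
# [[1 2 3 4]
#  [0 1 2 3]
#  [0 0 0 1]]
```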
Hichem Felouat - hichemfel@nii.ac.jp - 2024 11 Deep Learning-Based NLP
Hichem Felouat - hichemfel@nii.ac.jp - 2024 12 • Word embedding is a technique used in NLP to represent words as numerical vectors in a high-dimensional space. • Word embedding aims to capture the meaning and context of words in a way that is useful for downstream NLP tasks, such as text classification, sentiment analysis, and machine translation. • There are several popular algorithms for creating word embeddings, such as Word2Vec, GloVe, and fastText. Word Embedding
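Besides pretrained algorithms such as Word2Vec, GloVe, and fastText, embeddings can be learned end-to-end for a task. Below is a hedged sketch with a Keras Embedding layer; the vocabulary size, vector size, sequence length, and sentiment-style output are illustrative assumptions, not values from the slides:

```python
# Learning task-specific word embeddings with a Keras Embedding layer (illustrative setup)
import tensorflow as tf

vocab_size = 10000   # number of distinct tokens kept by the tokenizer
embed_dim = 128      # size of each word vector
maxlen = 50          # padded sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(maxlen,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # maps token ids to dense vectors
    tf.keras.layers.GlobalAveragePooling1D(),           # averages the word vectors per document
    tf.keras.layers.Dense(1, activation="sigmoid"),     # e.g. binary text classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```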
Hichem Felouat - hichemfel@nii.ac.jp - 2024 13 Word Embedding
Hichem Felouat - hichemfel@nii.ac.jp - 2024 14 Recurrent Neural Network(RNN) • The simplest possible RNN is composed of one neuron receiving inputs, producing an output, and sending that output back to itself (figure, left). • We can represent this tiny network against the time axis, as shown in the figure (right). This is called unrolling the network through time.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 15 Recurrent Neural Network(RNN) • You can easily create a layer of recurrent neurons. At each time step t, every neuron receives both the input vector x(t) and the output vector from the previous time step y(t–1). A recurrent neuron (left) unrolled through time (right)
Hichem Felouat - hichemfel@nii.ac.jp - 2024 16 Recurrent Neural Network(RNN) • Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder–Decoder (bottom right) networks.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 17 Recurrent Neural Network(RNN) Deep RNN (left) unrolled through time (right)
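For concreteness, a minimal sketch of a deep (stacked) RNN in Keras; the layer sizes and univariate input are illustrative assumptions. The first recurrent layer returns full sequences so the next recurrent layer can consume them:

```python
# A deep (stacked) RNN for seq-to-vector regression (illustrative sizes)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),                        # variable-length sequences of scalars
    tf.keras.layers.SimpleRNN(32, return_sequences=True),   # passes the whole output sequence upward
    tf.keras.layers.SimpleRNN(32),                          # returns only the last output (seq-to-vector)
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
model.summary()
```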
Hichem Felouat - hichemfel@nii.ac.jp - 2024 18 Long Short-Term Memory (LSTM) • As data traverses an RNN, some information is lost at each time step. After a while, the RNN’s state contains virtually no trace of the first inputs.
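A sketch of the same stacked architecture with LSTM cells, whose gated cell state lets the network carry information across many time steps instead of overwriting it; sizes are illustrative:

```python
# Replacing SimpleRNN with LSTM layers to preserve longer-term state (illustrative sizes)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
```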
Hichem Felouat - hichemfel@nii.ac.jp - 2024 19 Gated Recurrent Unit (GRU)
Hichem Felouat - hichemfel@nii.ac.jp - 2024 20 Bidirectional RNNs For example: in Neural Machine Translation, it is often preferable to look ahead at the next words before encoding a given word. • Consider the phrases "the queen of the United Kingdom", "the queen of hearts", and "the queen bee": to properly encode the word “queen”, you need to look ahead.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 21 Bidirectional RNNs • To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then simply concatenate their outputs at each time step.
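A short sketch of this idea with the Keras Bidirectional wrapper; the layer size and input width are illustrative assumptions:

```python
# A bidirectional recurrent layer: one GRU reads left to right, a second copy reads
# right to left, and their per-step outputs are concatenated.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),   # sequences of 8-dimensional word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(16, return_sequences=True)),
])
model.summary()   # each time step now has 32 features: 16 forward + 16 backward
```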
Hichem Felouat - hichemfel@nii.ac.jp - 2024 22 Self-Attention
Hichem Felouat - hichemfel@nii.ac.jp - 2024 23 Self-Attention • The following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired" • What does "it" in this sentence refer to? • Is it referring to the street or to the animal? It’s a simple question to a human but not as simple to an algorithm. • When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
Hichem Felouat - hichemfel@nii.ac.jp - 2024 24 Self-Attention As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes part of its representation into the encoding of "it".
Hichem Felouat - hichemfel@nii.ac.jp - 2024 25 Self-Attention in Detail Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence. Weights
Hichem Felouat - hichemfel@nii.ac.jp - 2024 26 Self-Attention in Detail dot product
Hichem Felouat - hichemfel@nii.ac.jp - 2024 27 Matrix Calculation of Self-Attention Every row in the X matrix corresponds to a word in the input sentence.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 28 The Attention Mechanism from Scratch
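A minimal NumPy sketch of scaled dot-product self-attention, following the query/key/value projections described above; the dimensions and random weights are illustrative, not the values from the figures:

```python
# Scaled dot-product self-attention from scratch in NumPy (illustrative sizes)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                       # embedding size; query/key/value size
X = rng.normal(size=(3, d_model))         # 3 words, one embedding per row

W_Q = rng.normal(size=(d_model, d_k))     # learned projection matrices (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # queries, keys, values for every word
scores = Q @ K.T / np.sqrt(d_k)           # dot products, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)        # attention weights: each row sums to 1
Z = weights @ V                           # each output row is a weighted mix of all value vectors
print(Z.shape)                            # (3, 4)
```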
Hichem Felouat - hichemfel@nii.ac.jp - 2024 29 Matrix Calculation of Self-Attention
Hichem Felouat - hichemfel@nii.ac.jp - 2024 30 Multi-Headed Attention Multi-Headed Attention improves the performance of the attention layer in two ways: • It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. • It gives the attention layer multiple representation subspaces.
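As a sketch of the second point, the multiple representation subspaces correspond to independent Q/K/V projections per head; the head outputs are concatenated and projected back with an output matrix W_O. All sizes below are illustrative assumptions:

```python
# Multi-headed attention: run h independent heads, concatenate, project with W_O
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads               # each head works in its own smaller subspace
X = rng.normal(size=(n_words, d_model))

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = rng.normal(size=(d_model, d_model))
Z = np.concatenate(heads, axis=-1) @ W_O  # back to one d_model-sized vector per word
print(Z.shape)                            # (3, 8)
```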
Hichem Felouat - hichemfel@nii.ac.jp - 2024 31 Multi-Headed Attention
Hichem Felouat - hichemfel@nii.ac.jp - 2024 32 Multi-Headed Attention As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired". If we add all the attention heads to the picture, however, things can be harder to interpret.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 33 Transformer Positional Encoding: The transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
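One common concrete choice, used in "Attention Is All You Need", is the fixed sinusoidal encoding. The sketch below builds it in NumPy; the sequence length and model width are illustrative:

```python
# Sinusoidal positional encoding:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)   # (50, 128) — added element-wise to the input embeddings
```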
Hichem Felouat - hichemfel@nii.ac.jp - 2024 34 Transformer
Hichem Felouat - hichemfel@nii.ac.jp - 2024 35 Transformer
Hichem Felouat - hichemfel@nii.ac.jp - 2024 36 Transformer Attention Is All You Need https://arxiv.org/abs/1706.03762
Hichem Felouat - hichemfel@nii.ac.jp - 2024 37 The Annotated Transformer a line-by-line implementation http://nlp.seas.harvard.edu/annotated-transformer/ Transformer
Hichem Felouat - hichemfel@nii.ac.jp - 2024 38 Transformer
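To make the encoder side concrete, here is a hedged sketch of a single encoder block built from standard Keras layers: multi-head self-attention and a position-wise feed-forward network, each followed by a residual connection and layer normalization. All hyperparameters are illustrative assumptions:

```python
# One Transformer encoder block with Keras layers (illustrative hyperparameters)
import tensorflow as tf

def encoder_block(d_model=128, num_heads=4, dff=512, dropout=0.1):
    inputs = tf.keras.Input(shape=(None, d_model))
    # multi-head self-attention + residual connection + layer norm
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization()(inputs + tf.keras.layers.Dropout(dropout)(attn))
    # position-wise feed-forward network + residual connection + layer norm
    ffn = tf.keras.layers.Dense(dff, activation="relu")(x)
    ffn = tf.keras.layers.Dense(d_model)(ffn)
    outputs = tf.keras.layers.LayerNormalization()(x + tf.keras.layers.Dropout(dropout)(ffn))
    return tf.keras.Model(inputs, outputs)

block = encoder_block()
print(block(tf.random.uniform((2, 10, 128))).shape)   # (2, 10, 128)
```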
Hichem Felouat - hichemfel@nii.ac.jp - 2024 39 Vision Transformer (ViT)
Hichem Felouat - hichemfel@nii.ac.jp - 2024 40 Vision Transformers (ViTs) vs CNNs Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained from scratch on ImageNet.
Hichem Felouat - hichemfel@nii.ac.jp - 2024 41 The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased towards recognizing textures rather than shapes. Below is an excellent example of such a case: [1]: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/1811.12231 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 42 • Neuroscience studies (The importance of shape in early lexical learning [1]) showed that object shape is the single most important cue for human object recognition. • By studying the visual pathway of humans regarding image recognition, researchers identified that the perception of object shape is invariant to most perturbations. So, as far as we know, shape is the most reliable cue. • Intuitively, the object shape remains relatively stable, while other cues can be easily distorted by all sorts of noise [2]. 1: https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7 2: https://arxiv.org/abs/1811.12231 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 43 Accuracies and example stimuli for five different experiments without cue conflict. Source: https://arxiv.org/abs/1811.12231 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 44 • The texture is not sufficient for determining whether the zebra is rotated. Thus, predicting rotation requires modeling shape, to some extent. • The object's shape can be invariant to rotations. Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 45 Self-attention captures long-range dependencies and contextual information in the input data. The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand. Raw images (Left) and attention maps of ViT-S/16 trained with SAM (Right) and without SAM (Middle). https://arxiv.org/abs/2106.01548 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 46 The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images. 1: Emerging Properties in Self-Supervised Vision Transformers (DINO) https://arxiv.org/abs/2104.14294 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 47 ViTs and ResNets process their inputs very differently. https://arxiv.org/abs/2103.14586 • The adversarial perturbations computed for a ViT and a ResNet model. • The adversarial perturbations are qualitatively very different even though both models may perform similarly in image recognition. Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 48 • The transformer can attend to all the tokens (image patches) at each block by design. The originally proposed ViT model [1] already demonstrated that some heads attend to far-away pixels even in the early layers, while the mean attention distance grows with depth, so heads in later layers attend mostly to global context. How heads of different layers attend to their surrounding pixels [1]. [1]: https://arxiv.org/abs/2010.11929 Vision Transformers (ViTs) vs CNNs
Hichem Felouat - hichemfel@nii.ac.jp - 2024 49 How the Vision Transformer Works: 1. Split an image into patches 2. Flatten the patches 3. Produce lower-dimensional linear embeddings from the flattened patches 4. Add positional embeddings 5. Feed the sequence as an input to a standard transformer encoder 6. Pretrain the model with image labels (fully supervised on a huge dataset) 7. Finetune on the downstream dataset for image classification Vision Transformers (ViTs)
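A minimal sketch of steps 1-4 (split into patches, flatten, linearly embed, add positional embeddings) in TensorFlow; the image size, patch size, and embedding width follow common ViT-Base settings but are assumptions here, not taken from the slides:

```python
# ViT-style patch embedding: patchify, flatten, project, add positions (illustrative sizes)
import tensorflow as tf

batch, image_size, patch_size, d_model = 2, 224, 16, 768
num_patches = (image_size // patch_size) ** 2                    # 14 x 14 = 196 patches

images = tf.random.uniform((batch, image_size, image_size, 3))   # dummy batch

# 1-2. split each image into non-overlapping patches and flatten them
patches = tf.image.extract_patches(
    images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID")
patches = tf.reshape(patches, (batch, num_patches, patch_size * patch_size * 3))

# 3. project each flattened patch to a d_model-dimensional embedding
patch_embed = tf.keras.layers.Dense(d_model)(patches)

# 4. add learnable positional embeddings (a full ViT would also prepend a [CLS] token)
pos_embed = tf.Variable(tf.random.normal((1, num_patches, d_model)))
tokens = patch_embed + pos_embed
print(tokens.shape)   # (2, 196, 768) — the sequence fed to a standard transformer encoder (step 5)
```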
Hichem Felouat - hichemfel@nii.ac.jp - 2024 50 Vision Transformers (ViTs) https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Vision_Transformer_(ViT)_for_Image_Classification_(cifar10_dataset).ipynb
Hichem Felouat - hichemfel@nii.ac.jp - 2024 51 Vision Transformers (ViTs) Global Context Vision Transformer (GC ViT): https://github.com/NVlabs/GCViT
Hichem Felouat - hichemfel@nii.ac.jp - 2024 52 Large Language Models A Survey of Large Language Models: https://arxiv.org/abs/2303.18223
Hichem Felouat - hichemfel@nii.ac.jp - 2024 53 Vision Language Models The architecture of MiniGPT-4 https://minigpt-4.github.io
Hichem Felouat - hichemfel@nii.ac.jp - 2024 54 Vision Language Models https://github.com/Vision-CAIR/MiniGPT-4
Hichem Felouat - hichemfel@nii.ac.jp - 2024 55 Thank You For Attending Q&A Hichem Felouat …
