6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 3: Convolutional Neural Networks Prof. Manolis Kellis http://mit6874.github.io Slides credit: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
1a. What do you see, and how? Can we teach machines to see?
What do you see?
How do you see? How can we help computers see?
What computers ‘see’: Images as Numbers What the computer "sees" Levin Image Processing & Computer Vision An image is just a matrix of numbers in [0,255], e.g., 1080x1080x3 for an RGB image. Question: is this Lincoln? Washington? Jefferson? Obama? How can the computer answer this question? What you see Input Image Input Image + values Pixel intensity values (“pix-el”=picture-element) What you both see Can I just do classification on the 1,166,400-long image vector directly? No. Instead: exploit image spatial structure. Learn patches. Build them up
1b. Classical machine vision roots in study of human/animal brains
Inspiration: human/animal visual cortex • Layers of neurons: pixels, edges, shapes, primitives, scenes • E.g. Layer 4 responds to bands w/ given slant, contrasting edges
Primitives: Neurons & action potentials •Chemical accumulation across dendritic connections •Pre-synaptic axon → post-synaptic dendrite → neuronal cell body •Each neuron receives multiple signals from its many dendrites •When threshold crossed, it fires •Its axon then sends outgoing signal to downstream neurons •Weak stimuli ignored •Sufficiently strong stimuli cross the activation threshold •Non-linearity within each neuronal level •Neurons connected into circuits (neural networks): emergent properties, learning, memory •Simple primitives arranged in simple, repetitive, and extremely large networks •86 billion neurons, each connects to 10k neurons, ~1 quadrillion (10^15) connections
Abstraction layers: edges, bars, dir., shapes, objects, scenes LGN: Small dots V1: Orientation, disparity, some color V4: Color, basic shapes, 2D/3D, curvature VTC: Complex features and objects (VTC: ventral temporal cortex) •Abstraction layers ↔ visual cortex layers •Complex concepts from simple parts, hierarchy •Primitives of visual concepts encoded in neuronal connections in early cortical layers
• Massive recent expansion of the human brain has re-used a relatively simple but general learning architecture General “learning machine”, reused widely • Hearing, taste, smell, sight, touch all re-use similar learning architecture Motor Cortex Visual Cortex • Interchangeable circuitry • Auditory cortex learns to ‘see’ if sent visual signals • Tasks of injured areas shift to uninjured areas • Not fully-general learning, but well-adapted to our world • Humans co-opted this circuitry for many new applications • Modern tasks accessible to any Homo sapiens (<70k years) • ML primitives not too different from animals: more to come? human chimp Hardware expansion
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
2a. Spatial structure for image recognition
Using Spatial Structure Idea: connect patches of input to neurons in hidden layer. Neuron connected to region of input. Only “sees” these values. Input: 2D image. Array of pixel values
Using Spatial Structure Connect patch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
Feature Extraction with Convolution - Filter of size 4x4: 16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This “patchy” operation is convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
Fully Connected Neural Network Fully Connected: • Each neuron in hidden layer connected to all neurons in input layer • No spatial information • Many, many parameters Input: • 2D image • Vector of pixel values Key idea: Use spatial structure in input to inform architecture of the network
High Level Feature Detection Let’s identify key features in each image category Wheels, License Plate, Headlights Door, Windows, Steps Nose, Eyes, Mouth
Fully Connected Neural Network
2b. Convolutions and filters
Convolution operation is element-wise multiply and add Filter / Kernel
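A minimal NumPy sketch of this element-wise multiply-and-add (the toy image and sharpen kernel below are illustrative, not from the slides):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; at each position, multiply element-wise and sum.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)        # toy 5x5 "image"
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
print(conv2d_valid(image, sharpen).shape)               # (3, 3)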
Producing Feature Maps Original Sharpen Edge Detect “Strong” Edge Detect
A simple pattern: Edges How can we detect edges with a kernel? [Figure: input image, edge-detection filter, output] (Goodfellow 2016)
Simple Kernels / Filters
X or X? Image is represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed. Rohrer How do CNNs work?
There are three approaches to edge cases in convolution
(Goodfellow 2016) Zero Padding Controls Output Size • Full convolution: zero pad input so an output is produced wherever the kernel overlaps at least one input value (expands output) • Valid-only convolution: output only where the entire kernel is contained in the input (shrinks output) • Same convolution: zero pad input so output is the same size as the input dimensions x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') • TF convolution operator takes stride and zero fill option as parameters • Stride is distance between kernel applications in each dimension • Padding can be SAME or VALID
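A small check of SAME vs VALID padding (hypothetical shapes, analogous to the slide's x and W):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])     # [batch, height, width, channels]
W = tf.random.normal([5, 5, 1, 8])       # 5x5 kernel, 1 input channel, 8 filters
same  = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')   # zero-padded
valid = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')  # kernel fully inside input
print(same.shape)    # (1, 28, 28, 8)  -> same spatial size as input
print(valid.shape)   # (1, 24, 24, 8)  -> (28 - 5)/1 + 1 = 24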
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
3a. Learning Visual Features de novo
Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them). Low level features (edges, dark spots) → Mid level features (eyes, ears, nose) → High level features (facial structure). Lee+ ICML 2009
Key idea: re-use parameters Convolution shares parameters Example 3x3 convolution on a 5x5 image
Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
LeNet-5 [LeCun et al., 1998] (This slide is taken from Andrew Ng)
Architecture: 32×32×1 → conv 5×5, s=1 → 28×28×6 → avg pool f=2, s=2 → 14×14×6 → conv 5×5, s=1 → 10×10×16 → avg pool f=2, s=2 → 5×5×16 → FC 120 → FC 84 → ŷ (10 outputs)
Reminder: Output size = (N+2P-F)/stride + 1
LeNet-5 • Only 60K parameters • As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑ (spatial dimensions shrink, number of channels grows) • General structure: conv->pool->conv->pool->FC->FC->output • Different filters look at different channels • Sigmoid and Tanh nonlinearity [LeCun et al., 1998]
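A Keras sketch of this LeNet-5 layout (tanh activations and average pooling roughly follow the original paper; details such as the sparse S2→C3 connectivity are omitted):

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='tanh',
                           input_shape=(32, 32, 1)),              # 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),     # 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='tanh'), # 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),     # 5x5x16
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
lenet5.summary()   # roughly 60K parameters, in line with the slide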
Backpropagation of convolution Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
3b. Convolutional Neural Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs. Conv Layer 1: low level features (edges, dark spots) → Conv Layer 2: mid level features (eyes, ears, nose) → Conv Layer 3: high level features (facial structure). Lee+ ICML 2009
CNNs for Classification 1. Convolution: Apply filters to generate feature maps. 2. Non-linearity: Often ReLU. 3. Pooling: Downsampling operation on each feature map. Train model with image data. Learn weights of filters in convolutional layers. tf.keras.layers.Conv2D tf.keras.activations.* tf.keras.layers.MaxPool2D
Example – Six convolutional layers
Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias tf.keras.layers.Conv2D
Convolutional Layers: Local Connectivity For a neuron in hidden layer: • Take inputs from patch • Compute weighted sum • Apply bias 4x4 filter: matrix of weights wij for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function tf.keras.layers.Conv2D
CNNs: Spatial Arrangement of Output Volume depth width height Layer Dimensions: h × w × d, where h and w are spatial dimensions and d (depth) = number of filters Stride: Filter step size Receptive Field: Locations in input image that a node is path connected to tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
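A quick check of how filter size and stride set the output volume, using the formula Output size = (N+2P-F)/stride + 1 from the LeNet-5 slide (N, F, S below are illustrative):

import tensorflow as tf

N, F, S = 32, 5, 1                       # input size, filter size, stride
x = tf.random.normal([1, N, N, 3])
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=F, strides=S, padding='valid')
print(conv(x).shape)                     # (1, 28, 28, 8)
print((N + 2*0 - F) // S + 1)            # 28, i.e. (N+2P-F)/stride + 1 with P=0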
Introducing Non-Linearity Rectified Linear Unit (ReLU) - Apply after every convolution operation (i.e., after convolutional layers) - ReLU: pixel-by-pixel operation that replaces all negative values by zero. - Non-linear operation tf.keras.layers.ReLU Karn Intuitive CNNs
Pooling Max pooling, average pooling 1) Reduced dimensionality 2) Spatial invariance tf.keras.layers.MaxPool2D( pool_size=(2,2), strides=2 )
The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') x = tf.nn.bias_add(x, b) x = tf.nn.relu(x) f(x) = max(0, x) When will we backpropagate through this? Once it “dies” what happens to it?
Pooling reduces dimensionality by giving up spatial location • max pooling reports the maximum output within a defined neighborhood • Padding can be SAME or VALID x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME') Tensor layout: [batch, height, width, channels]
Dilated Convolution
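A minimal sketch of a dilated (atrous) convolution in Keras: a 3x3 kernel with dilation_rate=2 covers a 5x5 receptive field with no extra parameters (shapes below are illustrative):

import tensorflow as tf

x = tf.random.normal([1, 32, 32, 16])
dilated = tf.keras.layers.Conv2D(filters=16, kernel_size=3,
                                 dilation_rate=2, padding='same')
print(dilated(x).shape)   # (1, 32, 32, 16)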
CNNs for Classification: Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class
Putting it all together

import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
    ])
    return model
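A minimal usage sketch for the model above (MNIST chosen purely as an illustrative dataset; this training snippet is not from the slides):

model = generate_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel dimension, scale to [0, 1]
model.fit(x_train, y_train, batch_size=64, epochs=1)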
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
4a. Real-world feature invariance is hard
How can computers recognize objects?
How can computers recognize objects? Challenge: • Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. • How can we overcome this challenge? Answer: • Learn a ton of features (millions) from the bottom up • Learn the convolutional filters, rather than pre-computing them
Detect features to classify Li/Johnson/Yeung CS231n Feature invariance to perturbation is hard
Next-generation models explode # of parameters
LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet • ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 • Facilitated by GPUs, highly optimized convolution implementation and large datasets (ImageNet) • One of the largest CNNs to date • Has 60 million parameters compared to the 60k parameters of LeNet-5 [Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners • The annual “Olympics” of computer vision. • Teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. • 2012 marked the first year where a CNN was used to achieve a top 5 test error rate of 15.3%. • The next best entry achieved an error of 26.2%.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet [Krizhevsky et al., 2012] Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 • Input: 227x227x3 images (224x224 before padding) • First layer: 96 11x11 filters applied at stride 4 • Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96] • Number of parameters in this layer? (11*11*3)*96 = 35K Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet [Krizhevsky et al., 2012]
AlexNet [Krizhevsky et al., 2012] • Input: 227x227x3 images (224x224 before padding) • After CONV1: 55x55x96 • Second layer: 3x3 filters applied at stride 2 • Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96] • Number of parameters in this layer? 0! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
AlexNet [Krizhevsky et al., 2012] (This slide is taken from Andrew Ng)
Convolutional pipeline: 227×227×3 → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96 → conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256 → …
AlexNet [Krizhevsky et al., 2012] (This slide is taken from Andrew Ng)
Fully-connected head: … → FC 4096 → FC 4096 → Softmax 1000
AlexNet [Krizhevsky et al., 2012] Details/Retrospectives: • first use of ReLU • used Norm layers (not common anymore) • heavy data augmentation • dropout 0.5 • batch size 128 • 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet [Krizhevsky et al., 2012] • Trained on GTX 580 GPU with only 3 GB of memory. • Network spread across 2 GPUs, half the neurons (feature maps) on each GPU. • CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. • CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet AlexNet was the coming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition. [Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet • Very Deep Convolutional Networks For Large Scale Image Recognition - Karen Simonyan and Andrew Zisserman; 2015 • The runner-up at the ILSVRC 2014 competition • Significantly deeper than AlexNet • 140 million parameters [Simonyan and Zisserman, 2014]
VGGNet • Smaller filters Only 3x3 CONV filters, stride 1, pad 1 and 2x2 MAX POOL , stride 2 • Deeper network AlexNet: 8 layers VGGNet: 16 - 19 layers • ZFNet: 11.7% top 5 error in ILSVRC’13 • VGGNet: 7.3% top 5 error in ILSVRC’14 Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014] Input 3x3 conv, 64 3x3 conv, 64 Pool 1/2 3x3 conv, 128 3x3 conv, 128 Pool 1/2 3x3 conv, 256 3x3 conv, 256 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 FC 4096 FC 4096 FC 1000 Softmax
VGGNet [Simonyan and Zisserman, 2014] • Why use smaller filters? (3x3 conv) Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer. • What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7 But deeper, more non-linearities And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
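A quick sanity check of that parameter comparison (C = 256 chosen only for illustration; bias terms ignored):

C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers
one_7x7   = 7 * 7 * C * C         # one 7x7 conv layer
print(three_3x3, one_7x7)         # 1,769,472 vs 3,211,264 parameters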
VGGNet [Simonyan and Zisserman, 2014] VGG16: TOTAL memory: 24M * 4 bytes ~= 96MB / image TOTAL params: 138M parameters Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Input 3x3 conv, 64 3x3 conv, 64 Pool 3x3 conv, 128 3x3 conv, 128 Pool 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool FC 4096 FC 4096 FC 1000 Softmax
[Simonyan and Zisserman, 2014] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. VGG16 layer-by-layer memory and parameters:
Input: memory 224*224*3=150K, params 0
3x3 conv, 64: memory 224*224*64=3.2M, params (3*3*3)*64 = 1,728
3x3 conv, 64: memory 224*224*64=3.2M, params (3*3*64)*64 = 36,864
Pool: memory 112*112*64=800K, params 0
3x3 conv, 128: memory 112*112*128=1.6M, params (3*3*64)*128 = 73,728
3x3 conv, 128: memory 112*112*128=1.6M, params (3*3*128)*128 = 147,456
Pool: memory 56*56*128=400K, params 0
3x3 conv, 256: memory 56*56*256=800K, params (3*3*128)*256 = 294,912
3x3 conv, 256: memory 56*56*256=800K, params (3*3*256)*256 = 589,824
3x3 conv, 256: memory 56*56*256=800K, params (3*3*256)*256 = 589,824
Pool: memory 28*28*256=200K, params 0
3x3 conv, 512: memory 28*28*512=400K, params (3*3*256)*512 = 1,179,648
3x3 conv, 512: memory 28*28*512=400K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 28*28*512=400K, params (3*3*512)*512 = 2,359,296
Pool: memory 14*14*512=100K, params 0
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
Pool: memory 7*7*512=25K, params 0
FC 4096: memory 4096, params 7*7*512*4096 = 102,760,448
FC 4096: memory 4096, params 4096*4096 = 16,777,216
FC 1000: memory 1000, params 4096*1000 = 4,096,000
VGGNet [Simonyan and Zisserman, 2014] Details/Retrospectives : • ILSVRC’14 2nd in classification, 1st in localization • Similar training procedure as AlexNet • No Local Response Normalisation (LRN) • Use VGG16 or VGG19 (VGG19 only slightly better, more memory) • Use ensembles for best results • FC7 features generalize well to other tasks • Trained on 4 Nvidia Titan Black GPUs for two to three weeks. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet VGG Net reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work. Keep it deep. Keep it simple. [Simonyan and Zisserman, 2014]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
GoogleNet • Going Deeper with Convolutions - Christian Szegedy et al.; 2015 • ILSVRC 2014 competition winner • Also significantly deeper than AlexNet • 12x fewer parameters than AlexNet • Focused on computational efficiency [Szegedy et al., 2014]
GoogleNet • 22 layers • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure • No FC layers • Only 5 million parameters! • ILSVRC’14 classification winner (6.7% top 5 error) [Szegedy et al., 2014]
GoogleNet “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014] Filter concatenation Previous layer 1x1 convolution 3x3 convolution 5x5 convolution 1x1 convolution 1x1 convolution 1x1 convolution 3x3 max pooling
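A minimal Keras sketch of such an inception module (the filter counts and input shape below are illustrative, not GoogLeNet's exact configuration):

import tensorflow as tf

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    conv = tf.keras.layers.Conv2D
    b1 = conv(f1, 1, padding='same', activation='relu')(x)            # 1x1 branch
    b2 = conv(f3_reduce, 1, padding='same', activation='relu')(x)     # 1x1 reduce
    b2 = conv(f3, 3, padding='same', activation='relu')(b2)           # then 3x3
    b3 = conv(f5_reduce, 1, padding='same', activation='relu')(x)     # 1x1 reduce
    b3 = conv(f5, 5, padding='same', activation='relu')(b3)           # then 5x5
    b4 = tf.keras.layers.MaxPool2D(3, strides=1, padding='same')(x)   # 3x3 max pool
    b4 = conv(pool_proj, 1, padding='same', activation='relu')(b4)    # 1x1 projection
    return tf.keras.layers.Concatenate()([b1, b2, b3, b4])            # filter concatenation

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 28, 28, 256)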
GoogleNet Details/Retrospectives : • Deeper networks, with computational efficiency • 22 layers • Efficient “Inception” module • No FC layers • 12x less params than AlexNet • ILSVRC’14 classification winner (6.7% top 5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogleNet Introduced the idea that CNN layers didn’t always have to be stacked up sequentially. Coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency. [Szegedy et al., 2014]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 • Extremely deep network – 152 layers • Deeper neural networks are more difficult to train. • Deep networks suffer from vanishing and exploding gradients. • Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. [He et al., 2015]
ResNet • ILSVRC’15 classification winner (3.57% top 5 error; humans generally hover around a 5-10% error rate) Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet • What happens when we continue stacking deeper layers on a convolutional neural network? • 56-layer model performs worse on both training and test error -> The deeper model performs worse (not caused by overfitting)! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet • Hypothesis: The problem is an optimization problem. Very deep networks are harder to optimize. • Solution: Use network layers to fit residual mapping instead of directly trying to fit a desired underlying mapping. • We will use skip connections allowing us to take the activation from one layer and feed it into another layer, much deeper into the network. • Use layers to fit residual F(x) = H(x) – x instead of H(x) directly Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet Residual Block Input x goes through conv-relu-conv series and gives us F(x). That result is then added to the original input x. Let’s call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So, instead of just computing that transformation (straight from x to F(x)), we’re computing the term that we have to add, F(x), to the input, x. [He et al., 2015]
ResNet [He et al., 2015] Shortcut / skip connection
Main path (Linear → ReLU → Linear → ReLU):
z[l+1] = W[l+1] a[l] + b[l+1],  a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2],  a[l+2] = g(z[l+2])
With the skip connection, a[l] is added before the second nonlinearity:
a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])
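A minimal Keras sketch of one residual block as described above (conv → ReLU → conv for F(x), then the skip connection; filter count and input shape are illustrative):

import tensorflow as tf

def residual_block(x, filters):
    # F(x): the residual branch (conv -> ReLU -> conv)
    f = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    f = tf.keras.layers.Conv2D(filters, 3, padding='same')(f)
    # H(x) = g(F(x) + x): add the skip connection, then apply the nonlinearity
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([f, x]))

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)
print(tf.keras.Model(inputs, outputs).output_shape)   # (None, 56, 56, 64)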
ResNet Full ResNet architecture: • Stack residual blocks • Every residual block has two 3x3 conv layers • Periodically, double # of filters and downsample spatially using stride 2 (in each dimension) • Additional conv layer at the beginning • No FC layers at the end (only FC 1000 to output classes) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet • Total depths of 34, 50, 101, or 152 layers for ImageNet • For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet Experimental Results: • Able to train very deep networks without degrading • Deeper networks now achieve lower training errors as expected [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet The best CNN architecture that we currently have, and a great innovation for the idea of residual learning. Even better than human performance! [He et al., 2015]
Accuracy comparison The best CNN architecture that we currently have and is a great innovation for the idea of residual learning. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power consumption The best CNN architecture that we currently have and is a great innovation for the idea of residual learning. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
Countless applications
An Architecture for Many Applications Detection Semantic segmentation End-to-end robotic control
Semantic Segmentation: Fully Convolutional Networks FCN: Fully Convolutional Network. Network designed with all convolutional layers, with downsampling and upsampling operations tf.keras.layers.Conv2DTranspose Long+ CVPR 2015
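A minimal sketch of the downsample/upsample idea with Conv2DTranspose (input size and class count are illustrative; this is not the FCN architecture of Long+ CVPR 2015):

import tensorflow as tf

num_classes = 3
fcn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, strides=2, padding='same', activation='relu',
                           input_shape=(128, 128, 3)),                           # 64x64x16
    tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'), # 32x32x32
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding='same',
                                    activation='relu'),                          # 64x64x16
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding='same',
                                    activation='relu'),                          # 128x128x16
    tf.keras.layers.Conv2D(num_classes, 1, activation='softmax'),                # per-pixel class probabilities
])
print(fcn.output_shape)   # (None, 128, 128, 3)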
Facial Detection & Recognition
Self-Driving Cars Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception Raw Perception I (ex. camera) Coarse Maps M (ex. GPS) Possible Control Commands Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation Entire model trained end-to-end without any human labelling or annotations Amini+ ICRA 2019
Automatic Colorization of Black and White Images
Optimizing Images Post Processing Feature Optimization (Illumination) Post Processing Feature Optimization (Color Curves and Details) Post Processing Feature Optimization (Color Tone: Warmness)
Up-scaling low-resolution images
Medicine, Biology, Healthcare Gulshan+ JAMA 2016.
Breast Cancer Screening: breast cancer case missed by radiologist readers but detected by AI (panels: AI vs. MD readers). CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms
Semantic Segmentation: Biomedical Image Analysis. Brain tumors (Dong+ MIUA 2017); malaria infection (Soleimany+ arXiv 2019). Panels: Original, Ground Truth, Segmentation, Uncertainty
DeepBind [Alipanahi et al., 2015]
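A DeepBind-style sketch (not the published architecture): a 1D convolution scans one-hot-encoded DNA so that each filter acts like a learned sequence motif; the sequence length, filter count, and motif width below are illustrative assumptions:

import tensorflow as tf

seq_len = 101                       # bases per input sequence (A, C, G, T one-hot)
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, kernel_size=11, activation='relu',
                           input_shape=(seq_len, 4)),   # 16 motif detectors, width 11
    tf.keras.layers.GlobalMaxPooling1D(),               # strongest motif match per filter
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),     # e.g. binds / does not bind
])
model.summary()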
Predicting disease mutations [Alipanahi et al., 2015]
Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations (Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top-performing CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
Deep Learning for Computer Vision: Summary Foundations • Why computer vision? • Representing images • Convolutions for feature extraction CNNs • CNN architecture • Application to classification • ImageNet Applications • Segmentation, image captioning, control • Security, medicine, robotics

CNN Algorithm

  • 1.
    6.874, 6.802, 20.390,20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 3: Convolutional Neural Networks Prof. Manolis Kellis http://mit6874.github.io Slides credit: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany
  • 2.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 3.
    1a. What doyou see, and how? Can we teach machines to see?
  • 4.
  • 5.
    How do yousee? How can we help computers see?
  • 6.
    What computers ‘see’:Images as Numbers What the computer "sees" Levin Image Processing & Computer Vision An image is just a matrix of numbers [0,255].i.e.,1080x1080x3 for an RGB image. Question: is this Lincoln?Washington? Jefferson? Obama? How can the computer answer this question? What you see Input Image Input Image + values Pixel intensity values (“pix-el”=picture-element) What you both see Can I just do classification on the 1,166400-long image vector directly? No. Instead: exploit image spatial structure. Learn patches. Build them up
  • 7.
    1b. Classical machinevision roots in study of human/animal brains
  • 8.
    Inspiration: human/animal visualcortex • Layers of neurons: pixels, edges, shapes, primitives, scenes • E.g. Layer 4 responds to bands w/ given slant, contrasting edges
  • 9.
    Primitives: Neurons &action potentials •Chemical accumulation across dendritic connections •Pre-synaptic axon  post-synaptic dendrite  neuronal cell body •Each neuron receives multiple signals from its many dendrites •When threshold crossed, it fires •Its axon then sends outgoing signal to downstream neurons •Weak stimuli ignored •Sufficiently strong cross activation threshold •Non-linearity within each neuronal level •Neurons connected into circuits (neural networks): emergent properties, learning, memory •Simple primitives arranged in simple, repetitive, and extremely large networks •86 billion neurons, each connects to 10k neurons, 1 quadrillion (1012) connections
  • 10.
    Abstraction layers: edges,bars, dir., shapes, objects, scenes LGN: Small dots V1: Orientation, disparity, some color V4: Color, basic shapes, 2D/3D, curvature VTC: Complex features and objects(VTC: ventral temporal cortex •Abstraction layers  visual cortex layers •Complex concepts from simple parts, hierarchy •Primitives of visual concepts encoded in neuronal connection in early cortical layers
  • 11.
    • Massive recentexpanse of human brain has re-used a relatively simple but general learning architecture General “learning machine”, reused widely • Hearing, taste, smell, sight, touch all re- use similar learning architecture Motor Cortex Visual Cortex • Interchangeable circuitry • Auditory cortex learns to ‘see’ if sent visual signals • Injury area tasks shift to uninjured areas • Not fully-general learning, but well-adapted to our world • Humans co-opted this circuitry to many new applications • Modern tasks accessible to any homo sapiens (<70k years) • ML primitives not too different from animals: more to come? human chimp Hardware expansion
  • 12.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 13.
    2a. Spatial structure forimage recognition
  • 14.
    Using Spatial Structure Idea:connect patches of input to neurons in hidden layer. Neuron connected to region of input. Only “sees”these values. Input: 2D image. Array of pixel values
  • 15.
    Using Spatial Structure Connectpatch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
  • 16.
    Feature Extraction withConvolution - Filter of size 4x4 :16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This“patchy” operation isconvolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 17.
    Fully Connected NeuralNetwork Fully Connected: • Each neuron in hidden layer connected to all neurons in input layer • No spatial information • Many, many parameters Input: • 2D image • Vector of pixel values Key idea: Use spatial structure in input to inform architecture of the network
  • 18.
    High Level FeatureDetection Let’s identify key features in each image category Wheels,License Plate, Headlights Door,Windows,Steps Nose,Eyes,Mouth
  • 19.
  • 20.
  • 21.
    Convolution operation iselement wise multiply and add Filter / Kernel
  • 22.
    Producing Feature Maps OriginalSharpen Edge Detect “Strong” Edge Detect
  • 23.
    A simple pattern:Edges How can we detect edges with a kernel? Input -1 -1 Filter Output (Goodfellow 2016)
  • 24.
  • 25.
    X or X? Imageis represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted,shrunk,rotated, deformed. Rohrer How do CNNs work?
  • 26.
    There are threeapproaches to edge cases in convolution
  • 27.
    (Goodfellow 2016) Zero PaddingControls Output Size • Full convolution: zero pad input so output is produced whenever an output value contains at least one input value (expands output) • Valid-only convolution: output only when entire kernel contained in input (shrinks output) • Same convolution: zero pad input so output is same size as input dimensions x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME') • TF convolution operator takes stride and zero fill option as parameters • Stride is distance between kernel applications in each dimension • Padding can be SAME or VALID
  • 29.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 30.
    3a. Learning VisualFeatures de novo
  • 31.
    Key idea: learn hierarchyof features directly from the data (rather than hand-engineering them) Low level features Mid level features High level features Lee+ ICML 2009 Eyes,ears,nose Edges,dark spots Facial structure
  • 32.
    Key idea: re-useparameters Convolution shares parameters Example 3x3 convolution on a 5x5 image
  • 33.
    Feature Extraction withConvolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 34.
    LeNet-5 • Gradient BasedLearning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 35.
    LeNet-5 ⋮ ⋮ � 𝑦𝑦 32×32×1 28×28×614×14×6 10×10×16 5×5×16 120 84 5 × 5 s = 1 f = 2 s = 2 avg pool 5 × 5 s = 1 avg pool f = 2 s = 2 . . . . . . Reminder: Output size = (N+2P-F)/stride + 1 10 conv conv FC FC [LeCun et al., 1998] This slide is taken from Andrew Ng
  • 36.
    LeNet-5 • Only 60Kparameters • As we go deeper in the network: 𝑁𝑁𝐻𝐻 ↓, 𝑁𝑁𝑊𝑊↓, 𝑁𝑁𝐶𝐶 ↑ • General structure: conv->pool->conv->pool->FC->FC->output • Different filters look at different channels • Sigmoid and Tanh nonlinearity [LeCun et al., 1998]
  • 37.
    Backpropagation of convolution Slidetaken from Forward And Backpropagation in Convolutional Neural Network. - Medium
  • 38.
  • 39.
  • 40.
    Representation Learning inDeep CNNs Mid level features Low level features High level features Edges,dark spots Conv Layer 1 Lee+ ICML 2009 Eyes,ears,nose Conv Layer 2 Facial structure Conv Layer 3
  • 41.
    CNNs for Classification 1.Convolution:Apply filters to generate feature maps. 2. Non-linearity:Often ReLU. 3. Pooling:Downsampling operation on each feature map. Trainmodel with image data. Learn weights of filters in convolutional layers. tf.keras.layers.Conv2 D tf.keras.activations. * tf.keras.layers.MaxPool2 D
  • 42.
    Example – Sixconvolutional layers
  • 43.
    Convolutional Layers: LocalConnectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias tf.keras.layers. Conv2D
  • 44.
    Convolutional Layers: LocalConnectivity For a neuron in hidden layer: • Take inputs from patch • Compute weighted sum • Apply bias 4x4 filter: matrix of weights wij for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function tf.keras.layers.Conv2D
  • 45.
    CNNs: Spatial Arrangementof Output Volume depth width height Layer Dimensions: ℎ  w d where h and w are spatial dimensions d (depth) = number of filters Stride: Filter step size Receptive Field: Locations in input image that a node is path connected to tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
  • 46.
    Introducing Non-Linearity Rectified LinearUnit (ReLU) - Apply after every convolution operation (i.e.,after convolutional layers) - ReLU:pixel-by-pixel operation that replaces all negative values by zero. - Non-linear operation tf.keras.layers.ReLU Karn Intuitive CNNs
  • 47.
    Pooling Max Pooling,average pooling 1)Reduced dimensionality 2) Spatial invariance tf.keras.layers.Max Pool2D( pool_size=(2,2), strides=2 )
  • 48.
    The REctified LinearUnit (RELU) is a common non-linear detector stage after convolution x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') x = tf.nn.bias_add(x, b) x= tf.nn.relu(x) f(x) = max(0, x) When will we backpropagate through this? Once it “dies” what happens to it?
  • 49.
    Pooling reduces dimensionalityby giving up spatial location • max pooling reports the maximum output within a defined neighborhood • Padding can be SAME or VALID x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME') Output Input Pooling Batch H W Input channel Neighborhood [batch, height, width, channels]
  • 50.
  • 51.
    91 CNNs for Classification:Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
  • 52.
    CNNs for Classification:Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class
  • 53.
    Putting it alltogether import tensorflow as tf def generate_model(): model = tf.keras.Sequential([ # first convolutional layer tf.keras.layers.Conv2D(32, filter_size=3, activation='relu’), tf.keras.layers.MaxPool2D(pool_size=2, strides=2), # second convolutional layer tf.keras.layers.Conv2D(64, filter_size=3, activation='relu’), tf.keras.layers.MaxPool2D(pool_size=2, strides=2), # fully connected classifier tf.keras.layers.Flatten(), tf.keras.layers.Dense(1024, activation='relu’), tf.keras.layers.Dense(10, activation=‘softmax’) # 10 outputs ]) return model
  • 54.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 55.
    4a. Real-world featureinvariance is hard
  • 56.
    How can computersrecognize objects?
  • 57.
    How can computersrecognize objects? Challenge: • Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. • How can we overcome this challenge? Answer: • Learn a ton of features (millions) from the bottom up • Learn the convolutional filters, rather than pre-computing them
  • 58.
  • 59.
  • 60.
    LeNet-5 • Gradient BasedLearning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 61.
    ImageNet Large ScaleVisual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 62.
    AlexNet • ImageNet Classificationwith Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 • Facilitated by GPUs, highly optimized convolution implementation and large datasets (ImageNet) • One of the largest CNNs to date • Has 60 Million parameter compared to 60k parameter of LeNet-5 [Krizhevsky et al., 2012]
  • 63.
    ImageNet Large ScaleVisual Recognition Challenge (ILSVRC) winners • The annual “Olympics” of computer vision. • Teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. • 2012 marked the first year where a CNN was used to achieve a top 5 test error rate of 15.3%. • The next best entry achieved an error of 26.2%.
  • 64.
    ImageNet Large ScaleVisual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 65.
    AlexNet [Krizhevsky et al.,2012] Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 • Input: 227x227x3 images (224x224 before padding) • First layer: 96 11x11 filters applied at stride 4 • Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96] • Number of parameters in this layer? (11*11*3)*96 = 35K Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 66.
  • 67.
    AlexNet [Krizhevsky et al.,2012] • Input: 227x227x3 images (224x224 before padding) • After CONV1: 55x55x96 • Second layer: 3x3 filters applied at stride 2 • Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96] • Number of parameters in this layer? 0! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
  • 68.
    AlexNet . . . 227×227×3 55×55 × 96 27×27 ×96 27×27 ×256 13×13 ×256 13×13 ×384 13×13 ×384 13×13 ×256 6×6 ×256 11 × 11 s = 4 P = 0 3 × 3 s = 2 max pool 5 × 5 S = 1 P = 2 3 × 3 s = 2 max pool 3 × 3 S = 1 P = 1 3 × 3 s = 1 P = 1 3 × 3 S = 1 P = 1 3 × 3 s = 2 max pool conv conv conv conv conv . . . [Krizhevsky et al., 2012] . . . This slide is taken from Andrew Ng
  • 69.
    AlexNet . . . 40964096 Softmax 1000 ⋮ ⋮ [Krizhevsky et al., 2012] FC FC This slide is taken from Andrew Ng
  • 70.
    AlexNet [Krizhevsky et al.,2012] Details/Retrospectives: • first use of ReLU • used Norm layers (not common anymore) • heavy data augmentation • dropout 0.5 • batch size 128 • 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 71.
    AlexNet [Krizhevsky et al.,2012] • Trained on GTX 580 GPU with only 3 GB of memory. • Network spread across 2 GPUs, half the neurons (feature maps) on each GPU. • CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. • CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 72.
    AlexNet AlexNet was thecoming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition. [Krizhevsky et al., 2012]
  • 73.
    ImageNet Large ScaleVisual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 74.
    ImageNet Large ScaleVisual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 75.
    VGGNet • Very DeepConvolutional Networks For Large Scale Image Recognition - Karen Simonyan and Andrew Zisserman; 2015 • The runner-up at the ILSVRC 2014 competition • Significantly deeper than AlexNet • 140 million parameters [Simonyan and Zisserman, 2014]
  • 76.
    VGGNet • Smaller filters Only3x3 CONV filters, stride 1, pad 1 and 2x2 MAX POOL , stride 2 • Deeper network AlexNet: 8 layers VGGNet: 16 - 19 layers • ZFNet: 11.7% top 5 error in ILSVRC’13 • VGGNet: 7.3% top 5 error in ILSVRC’14 Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014] Input 3x3 conv, 64 3x3 conv, 64 Pool 1/2 3x3 conv, 128 3x3 conv, 128 Pool 1/2 3x3 conv, 256 3x3 conv, 256 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool 1/2 FC 4096 FC 4096 FC 1000 Softmax
  • 77.
    VGGNet [Simonyan and Zisserman,2014] • Why use smaller filters? (3x3 conv) Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer. • What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7 But deeper, more non-linearities And fewer parameters: 3 * (32C2) vs. 72C2 for C channels per layer Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 78.
VGGNet [Simonyan and Zisserman, 2014] VGG16: TOTAL memory: ~24M activations * 4 bytes ≈ 96 MB / image. TOTAL params: 138M parameters. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Layer stack: Input - 3x3 conv, 64 - 3x3 conv, 64 - Pool - 3x3 conv, 128 - 3x3 conv, 128 - Pool - 3x3 conv, 256 - 3x3 conv, 256 - 3x3 conv, 256 - Pool - 3x3 conv, 512 - 3x3 conv, 512 - 3x3 conv, 512 - Pool - 3x3 conv, 512 - 3x3 conv, 512 - 3x3 conv, 512 - Pool - FC 4096 - FC 4096 - FC 1000 - Softmax
  • 79.
[Simonyan and Zisserman, 2014] VGG16 per-layer memory (activations) and parameters. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input: memory 224*224*3 = 150K, params 0
3x3 conv, 64: memory 224*224*64 = 3.2M, params (3*3*3)*64 = 1,728
3x3 conv, 64: memory 224*224*64 = 3.2M, params (3*3*64)*64 = 36,864
Pool: memory 112*112*64 = 800K, params 0
3x3 conv, 128: memory 112*112*128 = 1.6M, params (3*3*64)*128 = 73,728
3x3 conv, 128: memory 112*112*128 = 1.6M, params (3*3*128)*128 = 147,456
Pool: memory 56*56*128 = 400K, params 0
3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*128)*256 = 294,912
3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*256)*256 = 589,824
3x3 conv, 256: memory 56*56*256 = 800K, params (3*3*256)*256 = 589,824
Pool: memory 28*28*256 = 200K, params 0
3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*256)*512 = 1,179,648
3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 28*28*512 = 400K, params (3*3*512)*512 = 2,359,296
Pool: memory 14*14*512 = 100K, params 0
3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512 = 100K, params (3*3*512)*512 = 2,359,296
Pool: memory 7*7*512 = 25K, params 0
FC 4096: memory 4096, params 7*7*512*4096 = 102,760,448
FC 4096: memory 4096, params 4096*4096 = 16,777,216
FC 1000: memory 1000, params 4096*1000 = 4,096,000
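These counts are easy to reproduce; a small Python sanity check of a few rows above:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution (biases omitted, as in the slide's counts)."""
    return k * k * c_in * c_out

# Spot-check a few rows of the VGG16 breakdown:
assert conv_params(3, 3, 64)    == 1_728
assert conv_params(3, 64, 128)  == 73_728
assert conv_params(3, 512, 512) == 2_359_296
assert 7 * 7 * 512 * 4096       == 102_760_448  # the first FC layer dominates
```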
  • 80.
VGGNet [Simonyan and Zisserman, 2014] Details/Retrospectives: • ILSVRC’14 2nd in classification, 1st in localization • Similar training procedure to AlexNet • No Local Response Normalisation (LRN) • Use VGG16 or VGG19 (VGG19 is only slightly better and uses more memory) • Use ensembles for best results • FC7 features generalize well to other tasks • Trained on 4 Nvidia Titan Black GPUs for two to three weeks. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 81.
VGGNet VGG Net reinforced the notion that convolutional neural networks need a deep stack of layers for the hierarchical representation of visual data to work. Keep it deep. Keep it simple. [Simonyan and Zisserman, 2014]
  • 82.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 83.
GoogleNet • Going Deeper with Convolutions - Christian Szegedy et al.; 2015 • ILSVRC 2014 competition winner • Also significantly deeper than AlexNet • 12x fewer parameters than AlexNet • Focused on computational efficiency [Szegedy et al., 2014]
  • 84.
GoogleNet • 22 layers • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure • No FC layers • Only 5 million parameters! • ILSVRC’14 classification winner (6.7% top-5 error) [Szegedy et al., 2014]
  • 85.
GoogleNet “Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014] The previous layer feeds four parallel branches that are joined by filter concatenation: 1x1 convolution; 1x1 convolution → 3x3 convolution; 1x1 convolution → 5x5 convolution; 3x3 max pooling → 1x1 convolution.
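A minimal tf.keras functional-API sketch of such a module; the specific branch filter counts in the example call are an assumption, borrowed from the paper's "inception 3a" configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Four parallel branches concatenated along the channel axis;
    1x1 convs reduce depth before the larger filters."""
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)
    b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    b4 = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(b4)
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = layers.Input((28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)  # -> 28x28x256
```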
  • 86.
GoogleNet Details/Retrospectives: • Deeper networks, with computational efficiency • 22 layers • Efficient “Inception” module • No FC layers • 12x fewer params than AlexNet • ILSVRC’14 classification winner (6.7% top-5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
  • 87.
GoogleNet Introduced the idea that CNN layers don’t always have to be stacked up sequentially. With the Inception module, the authors showed that creative structuring of layers can lead to improved performance and computational efficiency. [Szegedy et al., 2014]
  • 88.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 89.
ResNet • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 • Extremely deep network - 152 layers • Deeper neural networks are more difficult to train: deep networks suffer from vanishing and exploding gradients. • Presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. [He et al., 2015]
  • 90.
ResNet • ILSVRC’15 classification winner (3.57% top-5 error; humans generally hover around a 5-10% error rate) • Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 91.
ResNet • What happens when we continue stacking deeper layers on a convolutional neural network? • A 56-layer model performs worse than a shallower one on both training and test error -> the deeper model performs worse, and this is not caused by overfitting! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 92.
ResNet • Hypothesis: the problem is an optimization problem; very deep networks are harder to optimize. • Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping. • Skip connections let us take the activation from one layer and feed it into another layer much deeper in the network. • Use layers to fit the residual F(x) = H(x) – x instead of H(x) directly. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 93.
ResNet Residual Block: input x goes through a conv-relu-conv series, giving F(x). That result is then added to the original input x; call the result H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So instead of computing the full transformation H(x) straight from x, the layers only compute the term F(x) that must be added to the input x. [He et al., 2015]
  • 94.
ResNet short cut / skip connection [He et al., 2015]. Main path: z[l+1] = W[l+1] a[l] + b[l+1]; a[l+1] = g(z[l+1]); z[l+2] = W[l+2] a[l+1] + b[l+2]; without the skip, a[l+2] = g(z[l+2]). With the skip connection from a[l]: a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l]).
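In code, the skip connection is just an element-wise add before the final nonlinearity. A hedged tf.keras sketch of a basic residual block (the batch-norm placement is a common convention, not spelled out on the slide):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block: two 3x3 convs compute F(x); the skip connection
    adds the input back, so the output is g(F(x) + x)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])        # skip connection: F(x) + x
    return layers.Activation('relu')(y)    # a[l+2] = g(z[l+2] + a[l])
```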
  • 95.
ResNet Full ResNet architecture: • Stack residual blocks • Every residual block has two 3x3 conv layers • Periodically, double the number of filters and downsample spatially using stride 2 (in each dimension) • Additional conv layer at the beginning • No FC layers at the end (only an FC 1000 layer to output classes) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 96.
ResNet • Total depths of 34, 50, 101, or 152 layers for ImageNet • For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
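A sketch of that bottleneck block in the same tf.keras style; the 1x1-3x3-1x1 pattern and 4x channel expansion follow the ResNet paper, while batch norm is omitted here for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    """Bottleneck residual block (ResNet-50+): 1x1 reduce, 3x3 conv, 1x1 expand;
    the shortcut gets a 1x1 projection when the output shape changes."""
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(4 * filters, 1)(y)
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride)(x)
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)
```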
  • 97.
ResNet Experimental Results: • Able to train very deep networks without degradation • Deeper networks now achieve lower training error, as expected [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 98.
ResNet was the best CNN architecture available at the time, and a great innovation through the idea of residual learning. It even beat human performance on ILSVRC classification! [He et al., 2015]
  • 99.
Accuracy comparison across CNN architectures. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 100.
Forward pass time and power consumption across CNN architectures. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 101.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 102.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 103.
  • 104.
An Architecture for Many Applications: detection, semantic segmentation, end-to-end robotic control
  • 105.
Semantic Segmentation: Fully Convolutional Networks. FCN: Fully Convolutional Network - a network designed with all convolutional layers, with downsampling and upsampling operations (e.g., tf.keras.layers.Conv2DTranspose). Long+ CVPR 2015
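A minimal illustration of the downsample/upsample idea with tf.keras.layers.Conv2DTranspose; the input size and class count below are hypothetical, and the real FCN of Long+ 2015 additionally fuses skip connections from earlier layers:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 3  # hypothetical number of segmentation classes
fcn = models.Sequential([
    layers.Conv2D(32, 3, strides=2, padding='same', activation='relu',
                  input_shape=(128, 128, 3)),                             # -> 64x64x32
    layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),   # -> 32x32x64
    layers.Conv2DTranspose(32, 3, strides=2, padding='same',
                           activation='relu'),                            # -> 64x64x32
    layers.Conv2DTranspose(num_classes, 3, strides=2, padding='same'),    # -> 128x128xC
    layers.Softmax(axis=-1),  # per-pixel class probabilities
])
```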
  • 106.
Facial Detection & Recognition
  • 107.
  • 108.
Self-Driving Cars: Navigation from Visual Perception. Raw perception I (e.g., camera) and coarse maps M (e.g., GPS) → possible control commands. Amini+ ICRA 2019
  • 109.
End-to-End Framework for Autonomous Navigation. Entire model trained end-to-end without any human labelling or annotations. Amini+ ICRA 2019
  • 110.
Automatic Colorization of Black and White Images
  • 111.
Optimizing Images: post-processing feature optimization for illumination, for color curves and details, and for color tone (warmness)
  • 112.
  • 113.
  • 114.
Breast Cancer Screening: a breast cancer case missed by radiologists but detected by AI (panels compare AI vs. MD readers). A CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms.
  • 115.
Semantic Segmentation: Biomedical Image Analysis. Brain tumors (Dong+ MIUA 2017); malaria infection (Soleimany+ arXiv 2019). Panels: original, ground truth, segmentation, uncertainty.
  • 116.
  • 117.
  • 118.
    Today: Convolutional NeuralNetworks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 119.
Deep Learning for Computer Vision: Summary. Foundations • Why computer vision? • Representing images • Convolutions for feature extraction. CNNs • CNN architecture • Application to classification • ImageNet. Applications • Segmentation, image captioning, control • Security, medicine, robotics