UFMFEV-30-M Core AI
Week 8: Machine Learning
Dr. Wenhao Zhang
Senior Lecturer in Machine Vision
22/03/2021
What is machine learning?
Algorithms and statistical models that can perform a
task without using explicit instructions and that
can improve their performance using training data
The history of machine learning is not in the scope of this lecture
The traditional five stages of computer vision
[Pipeline diagram: light and scene constraints feed into image acquisition, followed by pre-processing, segmentation, feature extraction, and classification/interpretation, with interaction across the stages.]
What is machine learning
[Diagram contrasting two approaches to telling images of bananas from images of oranges. Heuristics based: human experts design rules (elongated? spherical?) that route each image to bananas, oranges, or others. Learning based: a model is trained on the images, and a query image is classified ("Orange!").]
Why machine learning
• Machine learning is often considered when it is very
challenging for human experts to derive explicit
instructions.
Why machine learning
• Examples:
o Face recognition (note that this is different to face detection)
̶ What features to use?
o Email Spam and Malware Filtering
̶ A large list of rules
o Disease diagnosis
̶ Lung cancer, for example (widened
mediastinum? Reduced vascularity? …)
Types of machine learning
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
(the three above are distinguished by whether or not they are trained with human-labelled data)
• Reinforcement learning (keywords: agent, policy, reward, penalty…)
…
Supervised vs unsupervised learning
• Training data and labels are provided – supervised learning
• Only training data (but not labels) are provided – unsupervised learning
[Two scatter plots of Feature 1 (e.g. colour) against Feature 2 (e.g. shape): for unsupervised learning the samples are unlabelled; for supervised learning they are labelled Class 1 or Class 2, with an unknown query sample to classify.]
Supervised learning
• K Nearest Neighbours (KNN)
• Artificial Neural Network (Multilayer Perceptron)
Specific unsupervised learning techniques are not introduced in this lecture
KNN
o Non-parametric method used for classification and regression
o Only the classification scenario is discussed in this lecture
o A test sample is assigned to the class most common among its K
nearest neighbours, where K is a positive integer value.
[Scatter plot of Feature 1 against Feature 2: an unknown sample is assigned to Class 1 or Class 2 according to its nearest neighbours.]
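The neighbour-voting rule above can be sketched in plain Python (a minimal illustration with made-up 2-D toy data; the function name `knn_predict` is our own):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Assign `query` to the class most common among its k nearest
    training samples. `train` is a list of ((f1, f2), label) pairs."""
    # Sort training samples by Euclidean distance to the query point
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))
    # Majority vote among the k closest labels
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy data: Class 1 clusters near (1, 1), Class 2 near (5, 5)
train = [((1, 1), 1), ((1, 2), 1), ((2, 1), 1),
         ((5, 5), 2), ((5, 6), 2), ((6, 5), 2)]
print(knn_predict(train, (1.5, 1.5), k=3))  # -> 1
```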
Limitations of KNN
• Large dataset and/or high dimensionality limit efficiency
• “Curse of dimensionality”
• Highly dependent on data quality/features
o Outliers
o Different scales
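One common mitigation for the "different scales" problem is to standardise each feature before computing distances (a sketch in plain Python; real pipelines typically use a library scaler):

```python
def standardise(values):
    """Rescale one feature column to zero mean and unit variance,
    so no single feature dominates the distance calculation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant columns
    return [(v - mean) / std for v in values]

# A feature measured on a large scale (e.g. weight in grams)
print(standardise([100.0, 200.0, 300.0]))  # -> about [-1.22, 0.0, 1.22]
```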
MLP
• A type of feedforward Artificial Neural Network (ANN)
• “Vanilla” neural networks
• Consisting of an input and an output layer with one or more hidden layers
• For simplicity, there is only one hidden layer in this example
• MLP is fully connected
[Network diagram: the input image flattens into an input layer, connected by weights W_{i,j} to a hidden layer, which connects by weights W_{j,k} to an output layer.]
MLP
• An example – forward path
[Worked example diagram: six inputs x1…x6 (with values such as 4, 8, 0, 5) are combined by weights (e.g. 0.5, 1.0, 7.0, 4.0) into hidden activations a1, a2, a3; a1 evaluates to 24, and the output is 24*0.2 + a2*2.0 + a3*0.5.]
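The forward path can be written out directly as nested weighted sums. The weight values below are illustrative placeholders, not the exact figures from the slide:

```python
def forward(x, w_hidden, w_out):
    """One forward pass through a single-hidden-layer MLP
    (no activation function yet; nonlinearity is added later).
    w_hidden[j] holds the input weights of hidden unit j."""
    # Each hidden activation is a weighted sum of the inputs
    a = [sum(wi * xi for wi, xi in zip(w, x)) for w in w_hidden]
    # The output is a weighted sum of the hidden activations
    return sum(wo * ai for wo, ai in zip(w_out, a))

x = [4, 8, 0, 4, 0, 5]                        # six inputs
w_hidden = [[1.0, 0.5, 0.0, 0.0, 0.0, 0.0],   # illustrative weights only
            [0.0, 0.0, 7.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.5, 0.0, 0.0]]
w_out = [0.2, 2.0, 0.5]                       # weights into the output unit
print(forward(x, w_hidden, w_out))  # -> 2.6
```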
MLP
• What happens next?
o Update weights until the difference between the output and the label
(i.e. the ground truth) is zero (or small enough), or does not decrease.
o A loss function is used to measure this difference. Training a model is
the process of minimising this loss function (Gradient descent,
Backpropagation)
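Minimising the loss by gradient descent can be demonstrated on the smallest possible "network", a single weight with a squared-error loss (a toy sketch, not the full backpropagation algorithm):

```python
def train_single_weight(x, target, lr=0.1, steps=50):
    """Gradient descent on the loss L(w) = (w*x - target)**2.
    The gradient is dL/dw = 2 * (w*x - target) * x."""
    w = 0.0
    for _ in range(steps):
        error = w * x - target   # difference between output and label
        w -= lr * 2 * error * x  # step against the gradient
    return w

# The loss reaches (near) zero when w*x matches the target: w -> 3.0
print(train_single_weight(x=2.0, target=6.0))
```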
[Diagram: images and their labels together form the INPUT to training.]
MLP
• What happens next?
o When the weights (the learned parameters) are fixed, the model is trained
o For each query image, use these fixed weights to perform a forward
path calculation to determine the output (i.e. the prediction)
Adding nonlinearity
• Linear models may struggle to represent complex problems
• Adding nonlinearities via a nonlinear (activation) function, e.g. f(x)
[Network diagram as before: the activation function f is applied to the weighted sums, with weights W_{i,j} (input to hidden) and W_{j,k} (hidden to output).]
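A common choice of activation function is ReLU; applying it after each weighted sum is what stops stacked layers from collapsing into a single linear map. A minimal sketch:

```python
def relu(z):
    """Rectified Linear Unit: passes positive values through
    and clamps negative values to zero."""
    return max(0.0, z)

# Applied elementwise to the hidden activations
print([relu(z) for z in [-2.0, 0.0, 1.5]])  # -> [0.0, 0.0, 1.5]
```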
Question
• What does it mean when the weights represented by red lines have zero values?
[Network diagram: input x5 connects to hidden units a1, a2, a3 only through weights W_{i,j} that are all zero.]
Answer: x5 does not contribute to the results.
Convolutional neural networks (CNN)
• Convolution
[Example: the 3×3 local image data
  0   1   3
  2  200  5
  7  10   4
convolved (⊗) with a 3×3 averaging kernel of 1s (normalised) gives the modified image value 26.]
• e.g. the Sobel operator
Convolutional neural networks (CNN)
• Convolution
[Same example: local image data ⊗ kernel → features (26).]
• Learning of weights
[The same 3×3 local image data is convolved with a kernel whose weights (?) are unknown and learned during training.]
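The single convolution step from the slide (elementwise multiply and sum over a 3×3 patch) can be checked in a few lines; the 1/9 normalisation is our assumption, chosen because it reproduces the slide's output of 26:

```python
def conv_at(patch, kernel):
    """Apply a 3x3 kernel at one position: elementwise multiply
    the patch by the kernel and sum, giving one output value."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))

patch = [[0, 1, 3],
         [2, 200, 5],
         [7, 10, 4]]
mean_kernel = [[1 / 9] * 3 for _ in range(3)]  # 3x3 averaging filter (assumed)
print(round(conv_at(patch, mean_kernel)))      # -> 26
```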
Layers in a CNN
1. Convolutional layers
2. Nonlinear layers/Activation layers, e.g. ReLU
[Diagram: an a × a input passes through CONV with 4 filters, then a nonlinear activation, giving b × b feature maps. Is a ≠ b or a = b? Refer to the convolution lecture; hint: use of padding.]
Layers in a CNN
[Diagram: CONV (4 filters) and a nonlinear activation are followed by another set of filters (6), producing feature maps of depth 6. Note that the size of each filter is 4, matching the depth of the previous output (n < b).]
Feature extraction in CNN
…
1. Convolutional layers
2. Nonlinear layers/Activation layers
Layers in a CNN
3. Pooling layers: progressively reduce the spatial size of the representation to
reduce the amount of parameters and computation in the network.
Commonly used after convolutional layers.
e.g. max pooling with a 2×2 window and a stride of 2:
  1 3 2 7
  6 2 6 5   →   6 7
  1 5 7 2       8 9
  4 8 5 9
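The max-pooling example above can be reproduced with a short function (a plain-Python sketch for a 2-D single-channel input):

```python
def max_pool(img, size=2, stride=2):
    """Slide a size x size window over the image with the given
    stride and keep only the largest value in each window."""
    h, w = len(img), len(img[0])
    return [[max(img[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]

img = [[1, 3, 2, 7],
       [6, 2, 6, 5],
       [1, 5, 7, 2],
       [4, 8, 5, 9]]
print(max_pool(img))  # -> [[6, 7], [8, 9]]
```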
Layers in a CNN
3. Pooling layers
• smaller number of parameters
• does not affect depth
[Diagram: pooling shrinks the a × a × 4 feature maps to 6 × 6; flattening gives a vector of length 6² * d.]
Layers in a CNN
4. Fully connected layers (a high parameter count)
[Diagram: the pooled 6 × 6 feature maps are flattened and fed into the fully connected layers.]
Feature extraction and classification in a CNN
[Diagram: the convolutional, activation and pooling layers perform FEATURE EXTRACTION; the fully connected layers perform CLASSIFICATION.]
Feature extraction + classification
Previously with an MLP:
• Features: raw pixel values; more commonly seen: handcrafted features
• Weights updating…
Now with a CNN:
• Features: automatically extracted by filters
• End-to-end
e.g. the VGG architecture
Applications
Deep learning
Eye centre localisation
• Eye tracking
• Human-computer interaction
• Psychology studies and medical applications
• Directed advertising
Eye centre localisation
Gaze gesture recognition
Attention monitoring
Gender group classification
Age group classification
3D face reconstruction
The video shows a (2D+3D) face being automatically classified as a YOUNG MALE. Real-time eye tracking allows the user to issue gaze gestures to interact with the system. Personalised advertisements are displayed, which are also manipulated by gaze gestures.
Eye morphology
• Gradient based voting
Zhang, W., Smith, M.L., Smith, L.N. and Farooq, A., 2016. Gender and gaze gesture recognition for
human-computer interaction. Computer Vision and Image Understanding, 149, pp.32-50.
Eye saccade analysis for dementia diagnosis
[Diagram: a CNN regression model takes image data as input; the labels are eye coordinates (c_xl, c_yl, c_xr, c_yr) and the output is the predicted coordinates.]
Zhang, W. and Smith, M., 2019, July. Eye Centre Localisation with Convolutional Neural Network Based Regression. In 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC) (pp. 88-94). IEEE.
Visualising features/filters
• Not all pixels make an equal contribution
• Different filters learn different things
Handcrafted features vs. automatically extracted features

Handcrafted features:
• Can have physical meanings (e.g. face landmarks)
• Can be explicitly modelled
• Often do not require a large training set
• Intuitive to visualise and analyse
• Dimensionality is often lower compared to automatically extracted features

Automatically extracted features:
• Data driven and learning based, therefore tailored to a specific problem
• Do not require human experts to extract a given set of carefully chosen characteristics
• Generate multiple levels of representation (e.g. high-level and low-level) at the same time
• Often generated by a single model rather than following a multi-step process