Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning 101, Dec. 9th, 2016. By Jason Tsai (蔡志順) & Frank Wu (吳孟倫)
*Note: Most of the material in this presentation is taken from the book “Deep Learning” by Ian Goodfellow, Yoshua Bengio, & Aaron Courville. Other quoted sources are cited on the respective slides. Activation Function
Action Potential (動作電位) Figure from https://en.wikipedia.org/wiki/Action_potential
Commonly Used Activation Functions: Softmax, Sigmoid, tanh, ReLU
Rounding Error Underflow: It occurs when numbers near zero are rounded to zero. Overflow: It occurs when numbers with large magnitude are approximated as ∞ or -∞ .
Softmax Function softmax(x)_i = exp(x_i) / Σ_j exp(x_j). To avoid rounding error, compute softmax(z) with z = x − max_i x_i: subtracting the maximum leaves the output unchanged, prevents overflow in exp, and guarantees at least one term in the denominator equals 1, so the denominator cannot underflow to zero.
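A minimal NumPy sketch (not from the slides) of this max-subtraction trick:

```python
# Numerically stable softmax via the max-subtraction trick described above.
import numpy as np

def softmax(x):
    # Subtracting the max leaves softmax unchanged but prevents overflow in exp,
    # and guarantees at least one exponent is exp(0) = 1, so the sum cannot underflow.
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # naive exp(1000) would overflow
```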
Cross-Entropy Loss Cross-entropy: H(p, q) = −Σ_x p(x) log q(x). With a one-hot target y and a predicted distribution ŷ, this reduces to −log ŷ_y, the negative log-probability assigned to the correct class.
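A minimal sketch, assuming softmax outputs and a one-hot target, that computes the loss through log-sum-exp so the logarithm never receives an underflowed probability:

```python
# Cross-entropy of softmax(logits) against a one-hot target, via log-sum-exp.
import numpy as np

def cross_entropy(logits, target_index):
    # log softmax(x)_i = x_i - logsumexp(x); with a one-hot target the loss is
    # just the negative log-probability of the correct class.
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    return -log_probs[target_index]

print(cross_entropy(np.array([2.0, 1.0, -1.0]), target_index=0))  # ~0.349
```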
Logistic Sigmoid (Sigmoid Function) Figure from “A Tutorial on Deep Learning, Part 1: Nonlinear Classifiers and The Backpropagation Algorithm, by Quoc V. Le”
Hyperbolic Tangent (tanh) Figure from http://mathworld.wolfram.com/HyperbolicTangent.html
Rectified Linear Unit (ReLU)
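The three activation functions shown in the preceding slides have simple closed forms; a minimal NumPy sketch:

```python
# Elementwise activation functions: sigmoid, tanh, and ReLU.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes inputs to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes inputs to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)          # identity for positive inputs, 0 otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```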
Stochastic Gradient Descent (SGD)
Objective / Cost / Loss Function The function to minimize or maximize, J: objective function / criterion. When it is specifically to be minimized, L: cost function / loss function / error function. x* = arg min_x f(x)
Gradient Descent (GD)
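The slide's figures and equations are not reproduced here; a minimal sketch of the update x ← x − ε f′(x) on a one-dimensional quadratic, with an assumed fixed learning rate:

```python
# Gradient descent on f(x) = (x - 3)^2 with a fixed learning rate.
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # move against the gradient
    return x

grad = lambda x: 2.0 * (x - 3.0)       # derivative of (x - 3)^2
print(gradient_descent(grad, x0=0.0))  # converges toward the minimum at x = 3
```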
Epoch / Batch / Iteration *One Epoch: One forward pass and one backward pass of all the training examples *Batch Size: The number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need. Minibatch: Take a small number of examples at a time, ranging from 1 to a few hundred, during one iteration. *Number of Iterations: Number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass. *Quote from http://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks/31842945#31842945
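As a quick arithmetic check of these definitions (the numbers are illustrative, not from the slide):

```python
# With 1000 training examples and a batch size of 100, one epoch takes 10 iterations.
import math

num_examples, batch_size = 1000, 100
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)  # 10
```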
SGD cont'd On each iteration, sample a minibatch of m' examples, estimate the gradient as g = (1/m') Σ_{i=1}^{m'} ∇_θ L(x^(i), y^(i), θ), and apply the update θ ← θ − ε g, where ε is the learning rate.
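A minimal minibatch-SGD sketch for linear regression on synthetic data; the learning rate, batch size, and data below are illustrative assumptions:

```python
# Minibatch SGD for linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = 2.0 / len(b) * X[b].T @ (X[b] @ w - y[b])  # gradient of the mean squared error
        w -= lr * grad                                    # SGD update: w <- w - lr * g
print(w)  # close to [1.0, -2.0, 0.5]
```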
A Proof for an Equation
What is Machine Learning
Machine Learning Algorithms Algorithms that are able to learn from data. From Machine Learning (1997), page 2, by Tom M. Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Tasks in Machine Learning (Categories of Applications) Classification, classification with missing inputs, regression, transcription, machine translation, structured output, anomaly detection, synthesis and sampling, imputation of missing values, denoising, density estimation (probability mass function estimation), ... Refer to chapter 12 for details of several practical applications.
The Difficulties of Performance Measure In some cases, it is difficult to decide what should be measured, e.g. transcription. In some other cases, we know what quantity we would ideally like to measure, but measuring it is impractical, e.g. density estimation.
Supervised learning algorithms experience a dataset containing features, where each example is also associated with a label or target. Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Reinforcement learning algorithms interact with an environment and learn what to do by trial and error, maximizing a numerical reward without being told in advance which actions to take. Experience: Supervised vs. Unsupervised vs. Reinforcement
One example is a collection of features: 1 example (data point) → its features; 1 image → the values of the pixels in that image. A dataset is a collection of many examples. Experience: Dataset / Examples / Features
Examples: Linear Regression
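The slide's worked example is not reproduced here; a minimal least-squares sketch on synthetic data (the data and numbers are assumptions):

```python
# Linear regression solved with the normal equations.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + one feature
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: w = (X^T X)^(-1) X^T y minimizes the mean squared error.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # [1.0, 2.0]: intercept 1, slope 2
```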
Model Generalization
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set (the model fails to capture the training data). Overfitting occurs when the gap between the training error and the test error is too large (the model fits the training data so closely that it does not generalize).
Test Error, a.k.a. Generalization Error
Choose Relevant Learning Algorithms The No Free Lunch theorem for machine learning: averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. As the training set size increases to infinity, the training error of any fixed-capacity model must rise to at least the Bayes error. As the training set size increases, the optimal capacity increases. Choose the optimal capacity after reaching sufficient complexity to solve the task. Regularization: refer to chapter 7.
K-fold Cross-Validation Algorithm
Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^(i)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^(i) ∈ D to a scalar ∈ ℝ
Require: k, the number of folds
Split D into k mutually exclusive subsets D_i, whose union is D.
for i from 1 to k do
  f_i = A(D \ D_i)
  for z^(j) in D_i do
    e_j = L(f_i, z^(j))
  end for
end for
Return e
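A minimal Python rendering of KFoldXV; here A is assumed to be a function that trains on (X, y) and returns a predictor, and L a per-example loss (these interface details are assumptions, not from the slide):

```python
# K-fold cross-validation returning one loss value per example.
import numpy as np

def k_fold_xv(X, y, A, L, k, seed=0):
    # A: takes a training set (X, y) and returns a learned function f
    # L: per-example loss L(f, x, y) -> scalar
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # k mutually exclusive subsets
    errors = np.empty(len(X))
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # D \ D_i
        f_i = A(X[train_idx], y[train_idx])
        for j in folds[i]:
            errors[j] = L(f_i, X[j], y[j])   # e_j = L(f_i, z^(j))
    return errors  # one loss per example; its mean is the cross-validation estimate
```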
*Estimation of the relation between input and target variables. *We want to estimate a parameter. *The true value of the parameter: θ (fixed and unknown). *The estimated value of the parameter: θ̂ (a function of the data, hence a random variable). *Let {x^(1), ..., x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. *A point estimator or statistic is any function of the data: θ̂_m = g(x^(1), ..., x^(m)). Point Estimation
*Predict a variable y given an input vector x. *Assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. *Approximate f with a model or estimate f̂. *The estimator f̂ is a point estimator in function space. *Estimating a parameter w, or estimating a function f̂ mapping from x to y. *Linear regression *Polynomial regression Function Estimation
*Error = Bias + Variance *Example: *Use a gun to shoot at 10 targets; 7 targets are hit. *Error = 10 - 7 = 3 *The gun only aimed at 9 of the targets; treating the gun as the model... *Bias = 10 - 9 = 1 (the difference between the model's aim and the real targets) *Variance = 9 - 7 = 2 (the model is not stable enough to hit what it aims at) Error
*Variance is denoted as: Var(θ̂) *Standard error is denoted as: SE(θ̂) (the square root of the variance) *Standard error of the mean: SE(μ̂_m) = √(Var[(1/m) Σ_{i=1}^m x^(i)]) = σ/√m *The bias of an estimator is defined as: bias(θ̂_m) = E[θ̂_m] − θ. Variance and Standard Error
*MSE = E[(θ̂_m − θ)^2] = Bias(θ̂_m)^2 + Var(θ̂_m) Mean Squared Error Bias[f̂(x)] = E[f̂(x)] − f(x), Var[f̂(x)] = E[f̂(x)^2] − (E[f̂(x)])^2
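A small numeric illustration (with assumed numbers, not from the slides) of bias, variance, standard error, and MSE for the sample-mean estimator:

```python
# Bias, variance, and MSE of the sample mean of a Gaussian with mean 0, sigma = 2, m = 25.
import numpy as np

rng = np.random.default_rng(0)
estimates = np.array([rng.normal(0.0, 2.0, size=25).mean() for _ in range(10000)])

bias = estimates.mean() - 0.0          # E[theta_hat] - theta, ~0 (the sample mean is unbiased)
var = estimates.var()                  # ~ sigma^2 / m = 4 / 25 = 0.16
mse = np.mean((estimates - 0.0) ** 2)  # ~ bias^2 + var
print(bias, var, mse, 2.0 / np.sqrt(25))  # last value: standard error sigma / sqrt(m) = 0.4
```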
*A set of examples X = {x^(1), ..., x^(m)} drawn independently from the true but unknown data-generating distribution p_data(x). *p_model(x; θ) maps any configuration of x to a real number estimating the true probability p_data(x). *The maximum likelihood estimator for θ is defined as θ_ML = argmax_θ p_model(X; θ) = argmax_θ Π_{i=1}^m p_model(x^(i); θ) = argmax_θ Σ_{i=1}^m log p_model(x^(i); θ). Maximum Likelihood Estimation
*Example: *5 bags, each containing infinitely many cookies of two flavors, with the following flavor proportions: *Cherry 100% *Cherry 75% + Lemon 25% *Cherry 50% + Lemon 50% *Cherry 25% + Lemon 75% *Lemon 100% *Given that we drew two cookies and both are lemon, which bag is the most likely source? *0% * 0% = 0% *25% * 25% = 6.25% *50% * 50% = 25% *75% * 75% = 56.25% *100% * 100% = 100% → this bag Maximum Likelihood Estimation
*θ_MAP = argmax_θ p(θ|x) = argmax_θ p(x|θ) p(θ) *Example: *The same 5 bags of cookies, but now each bag is chosen with a prior probability: *10% chance → Cherry 100% *20% chance → Cherry 75% + Lemon 25% *40% chance → Cherry 50% + Lemon 50% *20% chance → Cherry 25% + Lemon 75% *10% chance → Lemon 100% *Given two lemon cookies, which bag is most likely a posteriori? *0% * 0% * 10% = 0% *25% * 25% * 20% = 1.25% *50% * 50% * 40% = 10% *75% * 75% * 20% = 11.25% → this bag *100% * 100% * 10% = 10% Maximum-a-Posteriori (MAP)
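A minimal sketch reproducing the numbers in the two cookie-bag examples (maximum likelihood versus MAP):

```python
# Likelihood of drawing two lemon cookies per bag, and the prior-weighted MAP version.
lemon_fraction = [0.00, 0.25, 0.50, 0.75, 1.00]
prior = [0.10, 0.20, 0.40, 0.20, 0.10]

likelihood = [p * p for p in lemon_fraction]                     # ML: 0, 0.0625, 0.25, 0.5625, 1.0
posterior_unnorm = [l * pr for l, pr in zip(likelihood, prior)]  # MAP: 0, 0.0125, 0.10, 0.1125, 0.10

print(max(range(5), key=lambda i: likelihood[i]))        # 4: the 100%-lemon bag (ML choice)
print(max(range(5), key=lambda i: posterior_unnorm[i]))  # 3: the 75%-lemon bag (MAP choice)
```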
*p(θ | x^(1), ..., x^(m)) = p(x^(1), ..., x^(m) | θ) · p(θ) / p(x^(1), ..., x^(m))
*Example: Female + age 31~40 + worker + medium income → buys a car?
Age | Gender | Occupation | Income | Buys car
< 21 | Male | Student | Low | No
21 ~ 30 | Female | Student | Low | Yes
21 ~ 30 | Male | Worker | Low | Yes
> 41 | Male | Worker | High | Yes
31 ~ 40 | Male | No Job | Low | No
31 ~ 40 | Male | Worker | Medium | No
31 ~ 40 | Female | No Job | Low | Yes
31 ~ 40 | Female | Worker | Medium | Yes
31 ~ 40 | Female | Worker | High | No
< 21 | Female | Student | Low | Yes
p(Yes) = 6/10, p(Female|Yes) = 4/6, p(Worker|Yes) = 3/6, p(Medium|Yes) = 1/6, p(31~40|Yes) = 2/6
p(Female) = 5/10, p(31~40) = 5/10, p(Worker) = 5/10, p(Medium) = 2/10
p(Yes | x^(1), ..., x^(m)) = (4/6 × 2/6 × 3/6 × 1/6) × 6/10 / (5/10 × 5/10 × 5/10 × 2/10)
*argmax_θ p(θ | x^(1), ..., x^(m))
*Example: Female + age 31~40 + worker + medium income → buys a car? (Same table as on the previous slide.)
p(Yes) = 6/10, p(Female|Yes) = 4/6, p(Worker|Yes) = 3/6, p(Medium|Yes) = 1/6, p(31~40|Yes) = 2/6
p(Yes | x^(1), ..., x^(m)) ∝ 4/6 × 2/6 × 3/6 × 1/6 × 6/10 = 0.0111
p(No) = 4/10, p(Female|No) = 1/4, p(Worker|No) = 2/4, p(Medium|No) = 1/4, p(31~40|No) = 3/4
p(No | x^(1), ..., x^(m)) ∝ 1/4 × 3/4 × 2/4 × 1/4 × 4/10 = 0.009375
Since 0.0111 > 0.009375, the prediction for this query is Yes.
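A minimal naive-Bayes sketch over the table, reproducing the two class scores for the query (Female, 31~40, Worker, Medium income):

```python
# Naive Bayes class scores for the buy-a-car table above.
rows = [  # (age, gender, occupation, income, buys_car)
    ("<21",   "Male",   "Student", "Low",    "No"),
    ("21~30", "Female", "Student", "Low",    "Yes"),
    ("21~30", "Male",   "Worker",  "Low",    "Yes"),
    (">41",   "Male",   "Worker",  "High",   "Yes"),
    ("31~40", "Male",   "No Job",  "Low",    "No"),
    ("31~40", "Male",   "Worker",  "Medium", "No"),
    ("31~40", "Female", "No Job",  "Low",    "Yes"),
    ("31~40", "Female", "Worker",  "Medium", "Yes"),
    ("31~40", "Female", "Worker",  "High",   "No"),
    ("<21",   "Female", "Student", "Low",    "Yes"),
]
query = ("31~40", "Female", "Worker", "Medium")

def score(label):
    subset = [r for r in rows if r[4] == label]
    p = len(subset) / len(rows)                        # class prior p(label)
    for i, value in enumerate(query):                  # naive conditional independence
        p *= sum(r[i] == value for r in subset) / len(subset)
    return p

print(score("Yes"), score("No"))  # ~0.0111 vs 0.009375
```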
Use maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ). Linear regression corresponds to the family p(y | x; θ) = N(y; θ^T x, I). Logistic regression: p(y = 1 | x; θ) = σ(θ^T x).
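A minimal sketch (not from the slides) of fitting logistic regression by gradient ascent on the log-likelihood; the synthetic data and step size are assumptions:

```python
# Logistic regression fit by maximizing the log-likelihood with gradient ascent,
# matching p(y=1 | x; theta) = sigmoid(theta^T x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ theta)) / len(X)  # gradient of the mean log-likelihood
    theta += 0.5 * grad                             # gradient ascent step
print(theta)  # roughly recovers [2.0, -1.0]
```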
*Find the optimal hyperplane for classification. *Driven by a linear function w^T x + b. *w^T x + b = b + Σ_{i=1}^m α_i x^T x^(i), where x^(i) is a training sample and α is a vector of coefficients. *Predicted as the positive class when w^T x + b is positive. *Predicted as the negative class when w^T x + b is negative.
*Replace x by the output of a given feature function φ(x). *Replacing dot products with kernel evaluations: f(x) = Σ_i α_i k(x, x^(i)) + b, where k(x, x^(i)) = φ(x)^T φ(x^(i)). *Commonly used kernel: *Gaussian kernel k(u, v) = N(u − v; 0, σ² I), where N(x; μ, Σ) is the standard normal density. *Its value decreases along lines in v space radiating outward from u. *Also known as the radial basis function (RBF) kernel.
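A minimal sketch of a kernel machine's decision function with a Gaussian (RBF) kernel; the support points, coefficients, and bandwidth below are illustrative assumptions:

```python
# Kernel-machine decision function f(x) = sum_i alpha_i * k(x, x_i) + b with an RBF kernel.
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    # Proportional to a Gaussian density in (u - v); decays with distance from u.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, support_points, alphas, b, sigma=1.0):
    # sign(f(x)) gives the predicted class.
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, support_points)) + b

support_points = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alphas, b = [1.0, -1.0], 0.0
print(decision_function(np.array([0.5, 0.5]), support_points, alphas, b))  # positive: closer to the +1 point
```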
Training data (ground truth):
x1 | x2 | class
2 | 1 | Yes
3 | 2 | Yes
4 | 1 | Yes
1 | 2 | No
2 | 3 | No
1 | 4 | No
Trained model using a decision tree (axis-aligned result): split on x1 ≤ 2; if x1 > 2, predict Yes; otherwise split on x2 ≤ 1, predicting Yes when x2 ≤ 1 and No when x2 > 1. [Figure: scatter plots of the ground truth and the axis-aligned decision regions.]
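A minimal check that the tree reconstructed above classifies the six training points correctly:

```python
# The axis-aligned decision tree from the slide, checked against the training table.
def predict(x1, x2):
    if x1 <= 2:
        return "Yes" if x2 <= 1 else "No"
    return "Yes"

data = [(2, 1, "Yes"), (3, 2, "Yes"), (4, 1, "Yes"),
        (1, 2, "No"), (2, 3, "No"), (1, 4, "No")]
print(all(predict(x1, x2) == label for x1, x2, label in data))  # True
```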
Unsupervised Learning Algorithms
Simpler Representation Lower-dimensional representations, sparse representations (refer to chapter 7.10), independent representations
K-means Clustering The k-means algorithm works by initializing k different centroids {µ(1), . . . , µ(k)} to different values, then alternating between two different steps until convergence. A. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid µ(i). B. In the other step, each centroid µ(i) is updated to the mean of all training examples x(j) assigned to cluster i.
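A minimal NumPy sketch of these two alternating steps; the initialization and stopping rule are simplified assumptions:

```python
# K-means: alternate between assigning points to the nearest centroid and
# moving each centroid to the mean of its assigned points.
import numpy as np

def k_means(X, k, iterations=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k data points as initial centroids
    for _ in range(iterations):
        # Step A: assign each example to the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = np.argmin(distances, axis=1)
        # Step B: move each centroid to the mean of its assigned examples.
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = X[assignments == i].mean(axis=0)
    return centroids, assignments
```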
Principal Components Analysis (PCA) Apply lossy compression to a collection of m points {x^(1), ..., x^(m)} in ℝ^n. For each point x^(i) ∈ ℝ^n we will find a corresponding code vector c^(i) ∈ ℝ^l. Encoding function: f(x) = c. Reconstructing (decoding) function: g(f(x)) = g(c) = Dc, where D ∈ ℝ^(n×l) and all columns of D are orthogonal to each other.
PCA cont'd Find the linear transformation z = x^T W such that Var[z] is diagonal, given the design matrix X with Var[x] = (1/(m−1)) X^T X and the singular value decomposition X = U Σ W^T.
PCA cont'd When we project the data x to z via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ²), which immediately implies that the individual elements of z are mutually uncorrelated.
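A minimal PCA-via-SVD sketch, assuming zero-centered synthetic data, showing that the projected representation has an (approximately) diagonal covariance:

```python
# PCA through the SVD X = U S W^T: z = X W has (approximately) diagonal covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])
X -= X.mean(axis=0)                                 # center the data

U, S, Wt = np.linalg.svd(X, full_matrices=False)    # X = U S W^T
W = Wt.T
z = X @ W                                           # decorrelated representation
print(np.round(np.cov(z, rowvar=False), 2))         # approximately diagonal (S^2 / (m - 1))
```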
Challenges Motivating Deep Learning
The Curse of Dimensionality (維度的詛咒), Local Constancy and Smoothness Regularization, Manifold Learning
Curse of Dimensionality The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases. The number of possible configurations is much larger than the available number of training examples as the number of dimensions increases.
Smoothness Prior / Local Constancy Prior The smoothness prior or local constancy prior states that the function we learn should not change very much within a small region. Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI level tasks. The core idea in deep learning is that we assume that the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy.
Manifold Learning Manifold (流形): a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space (歐基里德空間). Manifold hypothesis: assume most of ℝ^n consists of invalid inputs, and interesting inputs occur only along a collection of manifolds containing a small subset of points. → The probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. → Examples are connected in neighborhoods and may be reachable by transformations.
Data sampled in 2D space but actually concentrated like a 1-D twisted string.
Examples from a face training dataset (QMUL dataset) in which the subject's head moves along two angles of rotation. The goal is to get learning algorithms to discover and disentangle such manifold coordinates.
Visualizations of the learned data manifold for generative models with a two-dimensional latent space, learned by a variational autoencoder (VAE). Figure from “Auto-Encoding Variational Bayes (2014), by D. Kingma & M. Welling”