Why is the logistic regression model good? (and its relation with maximizing entropy)

Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^k \pi(x)_v = 1$ and $\pi(x)_v\in [0,1]$ is the probability that the object with features $x$ belongs to the $v$th class.

The standard logistic model for such a classifier uses the function $\pi(x)$ whose $v$th component is given by

$$\pi(x)_v = \frac{e^{\lambda_v\cdot x}}{\sum_{u=1}^k e^{\lambda_u\cdot x}}$$

for some weight vectors $\lambda_1,\ldots,\lambda_k\in\mathbb{R}^n$. If $k = 2$ and $\lambda_2 = 0$, then this is just the sigmoid function in $\lambda_1\cdot x$, since $\pi(x)_1 = \frac{e^{\lambda_1\cdot x}}{e^{\lambda_1\cdot x} + 1} = \frac{1}{1 + e^{-\lambda_1\cdot x}}$. If our training data consists of feature vectors $x(1),\ldots,x(m)\in\mathbb{R}^n$ and classifications $y(1),\ldots,y(m)\in\{1,\ldots,k\}$, then logistic regression would have us choose the weights $\lambda_v$ to maximize the product $$\prod_{i=1}^m\pi(x(i))_{y(i)},$$ which is the probability the model assigns to the observed classification of every item in the training data. Apparently, under mild hypotheses, the maximum is attained at a unique $\lambda = (\lambda_v)_{v = 1,\ldots,k}$.
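
For concreteness, here is a minimal NumPy sketch of this model and objective (the zero-based labels and all names here are my own conventions, not anything from the references):

```python
import numpy as np

def softmax_probs(Lam, X):
    """Row i is pi(x(i)): the softmax of the scores lambda_v . x(i)."""
    S = X @ Lam.T                         # (m, k) matrix of scores lambda_v . x(i)
    S = S - S.max(axis=1, keepdims=True)  # shift each row for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(Lam, X, y):
    """log of prod_i pi(x(i))_{y(i)}, the product logistic regression maximizes."""
    P = softmax_probs(Lam, X)
    return np.log(P[np.arange(len(y)), y]).sum()

# Toy check: m = 4 points in R^2, k = 3 classes (labels 0..k-1 here).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
y = np.array([0, 1, 2, 1])
Lam = np.zeros((3, 2))                    # zero weights give pi(x(i))_v = 1/k
print(log_likelihood(Lam, X, y))          # 4 * log(1/3) ≈ -4.394
```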

My basic question is: In what sense is this an optimal model? One answer, which I learned from these notes by John Mount, is that among the classifiers satisfying a certain "balance condition" (described below), this model is the one that maximizes entropy over the training set. That is, if the training set consists of the feature vectors $x(1),\ldots,x(m)$, then this model maximizes the quantity $$-\sum_{i=1}^m\sum_{v=1}^k \pi(x(i))_v\log \pi(x(i))_v.$$ I accept that maximizing entropy makes sense philosophically, but this shifts our attention to the balance condition, which requires $$\sum_{i=1}^m \pi(x(i))_u\, x(i)_j = \sum_{i=1}^m \delta(u,y(i))\,x(i)_j\qquad\text{for all $u,j$,}$$ where $\delta(a,b) = 1$ if $a = b$, and $0$ otherwise.
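
For what it's worth, the balance condition seems to be exactly the first-order condition of the likelihood maximization above: writing the log-likelihood as $$\ell(\lambda) = \sum_{i=1}^m \log \pi(x(i))_{y(i)} = \sum_{i=1}^m\Big(\lambda_{y(i)}\cdot x(i) - \log\sum_{u=1}^k e^{\lambda_u\cdot x(i)}\Big)$$ and differentiating with respect to the $j$th coordinate of $\lambda_u$ gives $$\frac{\partial \ell}{\partial (\lambda_u)_j} = \sum_{i=1}^m \big(\delta(u,y(i)) - \pi(x(i))_u\big)\,x(i)_j,$$ so setting the gradient to zero recovers the balance condition. But this only says that the maximum-likelihood solution happens to satisfy it; it doesn't by itself explain why the condition is a reasonable restriction a priori.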

Why is it reasonable to restrict our attention to functions which satisfy these balance conditions?