Why is the logistic regression model good? (and its relation with maximizing entropy)

Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^k \pi(x)_v = 1$ and $\pi(x)_v\in [0,1]$ is the probability that the object with features $x$ belongs to the $v$th class.

The standard logistic model for such a classifier uses the function $\pi(x)$ whose $v$th component is given by

$$\pi(x)_v = \frac{e^{\lambda_v\cdot x}}{\sum_{u=1}^k e^{\lambda_u\cdot x}}$$

for some weight vectors $\lambda_1,\ldots,\lambda_k\in\mathbb{R}^n$. If $k = 2$ and $\lambda_2 = 0$, then this is just the sigmoid function in $\lambda_1\cdot x$, since $\pi(x)_1 = \frac{e^{\lambda_1\cdot x}}{e^{\lambda_1\cdot x} + 1} = \frac{1}{1 + e^{-\lambda_1\cdot x}}$. If our training data consists of feature vectors $x(1),\ldots,x(m)\in\mathbb{R}^n$ and classifications $y(1),\ldots,y(m)\in\{1,\ldots,k\}$, then logistic regression would have us choose the weights $\lambda_v$ to maximize the product $$\prod_{i=1}^m\pi(x(i))_{y(i)},$$ which is the probability the model assigns to the observed classification of every item in the training data. Apparently, under mild hypotheses, the maximum is attained at a unique $\lambda = (\lambda_v)_{v = 1,\ldots,k}$.
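
For concreteness, here is a minimal NumPy sketch of this model and objective (the zero-based labels and all names here are my own conventions, not anything from the references):

```python
import numpy as np

def softmax_probs(Lam, X):
    """Row i is pi(x(i)): the softmax of the scores lambda_v . x(i)."""
    S = X @ Lam.T                         # (m, k) matrix of scores lambda_v . x(i)
    S = S - S.max(axis=1, keepdims=True)  # shift each row for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(Lam, X, y):
    """log of prod_i pi(x(i))_{y(i)}, the product logistic regression maximizes."""
    P = softmax_probs(Lam, X)
    return np.log(P[np.arange(len(y)), y]).sum()

# Toy check: m = 4 points in R^2, k = 3 classes (labels 0..k-1 here).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
y = np.array([0, 1, 2, 1])
Lam = np.zeros((3, 2))                    # zero weights give pi(x(i))_v = 1/k
print(log_likelihood(Lam, X, y))          # 4 * log(1/3) ≈ -4.394
```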

My basic question is: In what sense is this an optimal model? One answer, which I learned from these notes by John Mount, is that among the classifiers satisfying a certain "balance condition" (described below), this model is the one that maximizes entropy over the training set. That is, if the training set consists of the feature vectors $x(1),\ldots,x(m)$, then this model maximizes the quantity $$-\sum_{i=1}^m\sum_{v=1}^k \pi(x(i))_v\log \pi(x(i))_v.$$ I accept that maximizing entropy makes sense philosophically, but this shifts our attention to the balance condition, which requires $$\sum_{i=1}^m \pi(x(i))_u\, x(i)_j = \sum_{i=1}^m \delta(u,y(i))\,x(i)_j\qquad\text{for all $u,j$,}$$ where $\delta(a,b) = 1$ if $a = b$, and $0$ otherwise.
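
For what it's worth, the balance condition seems to be exactly the first-order condition of the likelihood maximization above: writing the log-likelihood as $$\ell(\lambda) = \sum_{i=1}^m \log \pi(x(i))_{y(i)} = \sum_{i=1}^m\Big(\lambda_{y(i)}\cdot x(i) - \log\sum_{u=1}^k e^{\lambda_u\cdot x(i)}\Big)$$ and differentiating with respect to the $j$th coordinate of $\lambda_u$ gives $$\frac{\partial \ell}{\partial (\lambda_u)_j} = \sum_{i=1}^m \big(\delta(u,y(i)) - \pi(x(i))_u\big)\,x(i)_j,$$ so setting the gradient to zero recovers the balance condition. But this only says that the maximum-likelihood solution happens to satisfy it; it doesn't by itself explain why the condition is a reasonable restriction a priori.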

Why is it reasonable to restrict our attention to functions which satisfy these balance conditions?