6

In Machine Learning, given some features $X$, we aim to predict a target variable $Y$. If we consider the squared loss function, the optimal predictor is the Bayes predictor: $\mathbb{E}[Y \mid X].$ If we know the joint distribution $P(X, Y)$, then we can directly compute $\mathbb{E}[Y \mid X]$.
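For concreteness (this is just the standard identity, nothing from the cited paper): in the discrete case, knowing the joint distribution lets us read off the Bayes predictor directly,

$$\mathbb{E}[Y \mid X = x] \;=\; \sum_{y} y \, P(Y = y \mid X = x) \;=\; \frac{\sum_{y} y \, P(X = x,\, Y = y)}{\sum_{y} P(X = x,\, Y = y)}.$$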

Thus, under squared error loss, Machine Learning can be viewed as approximating $P(X, Y)$ as accurately as possible from a given set of observations $(X_i, Y_i)$.

Now, in the hypothetical case where the observed samples are i.i.d., we can apply results from doi.org/10.1016/j.spl.2021.109088 ("On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality"). This paper suggests that, given enough samples, the empirical distribution function converges rapidly, in sup norm, to the true distribution function, even in high dimensions.
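For intuition, here is a toy univariate check I put together (my own sketch; the paper itself concerns the multivariate case and the tight constant): draw i.i.d. samples, compute the sup-norm distance between the empirical CDF and the true CDF, and compare it with the classical DKW bound $\sqrt{\ln(2/\alpha)/(2n)}$.

```python
import math
import random

random.seed(0)

n = 10_000
alpha = 0.05
samples = sorted(random.uniform(0.0, 1.0) for _ in range(n))  # true CDF: F(x) = x on [0, 1]

# Sup-norm distance between the empirical CDF and F,
# evaluated at the jump points of the empirical CDF.
sup_dist = max(
    max(abs((i + 1) / n - x), abs(i / n - x))
    for i, x in enumerate(samples)
)

# DKW: with probability >= 1 - alpha,  sup_x |F_n(x) - F(x)| <= sqrt(ln(2/alpha) / (2n)).
dkw_bound = math.sqrt(math.log(2 / alpha) / (2 * n))

print(f"observed sup-norm distance: {sup_dist:.4f}")
print(f"DKW bound at alpha={alpha}:  {dkw_bound:.4f}")
```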

If this is the case, doesn't the DKW inequality imply that, in the hypothetical i.i.d. setting, Machine Learning is trivially solved? That is, since the empirical distribution quickly approximates the true one, we should be able to compute the Bayes predictor directly, which is optimal in this scenario.

Given this, why is Machine Learning so focused on developing new models (e.g., parametric models, decision trees, XGBoost, neural networks) to estimate $\mathbb{E}[Y \mid X]$? If the above argument holds, can't we simply approximate the true distribution empirically?
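To be concrete about what I mean by "approximate the true distribution empirically": for discrete $X$ this would amount to the naive plug-in estimator, roughly the sketch below (the names and toy data are my own, purely for illustration).

```python
# Minimal sketch of the plug-in estimator for discrete X:
# estimate E[Y | X = x] by averaging the observed Y_i whose X_i equals x,
# i.e. read the conditional mean off the empirical joint distribution.
from collections import defaultdict

def plugin_conditional_mean(xs, ys):
    totals = defaultdict(lambda: [0.0, 0])   # x -> [sum of y, count]
    for x, y in zip(xs, ys):
        totals[x][0] += y
        totals[x][1] += 1
    return {x: s / n for x, (s, n) in totals.items()}

# Toy data: Y tends to be larger when X == 1.
xs = [0, 0, 1, 1, 1, 0]
ys = [1.0, 0.0, 2.0, 3.0, 2.5, 0.5]
print(plugin_conditional_mean(xs, ys))   # {0: 0.5, 1: 2.5}
```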

I suspect the reason this reasoning does not directly apply in practice is that real-world data is dynamic over time:

  • The distribution $P(X, Y)$ may change.
  • The samples are not truly i.i.d.
  • The process $Y$ may depend on its own past values (e.g., stock returns).

However, these issues are not necessarily solved by inventing new models such as neural networks. These models are simply function classes designed to approximate $\mathbb{E}[Y \mid X]$ more effectively.

Thus, my main question is: am I correct in my analysis? And if so, why is Machine Learning focused on developing new models rather than on solving these fundamental issues?

  • Did you try to figure out how many samples precisely are "enough" using this inequality for some simple machine learning problem, e.g. image classification? (Commented Feb 19 at 13:18)
  • Duplicate on Cross Validated: stats.stackexchange.com/questions/661444/… (Commented Feb 20 at 16:31)

3 Answers

25

There are many other results of this sort in the literature. For instance, there are various "no free lunch" theorems which say, e.g., that no machine learning model outperforms multinomial logistic regression on a generic classification task. It's hard to explain the proliferation of architectures and training methods from pure theory; you have to look at the empirical side.

It starts to make more sense when you consider the absurdity of the probability distributions in question. Consider the image recognition task: the input features are pixels in images and the outputs are labels like "cat" or "sunset" (let's assume we're not trying to find bounding boxes...). In the Open Images dataset the input images have around 300,000 pixels and there are 20,000 output labels, so the joint distribution lives in a space of dimension ~$10^{11}$. And that is infinitesimally small in comparison to the spaces relevant for large language models with decent context lengths, or for multimodal text / image models.

Even if you could represent such a distribution in a computer's memory in a way that allows efficient computation (you can't), the dataset has only 9 million images in it. This is a horrendously underspecified inference problem: if you simply fit a generic joint distribution to the data, the posterior distributions will be essentially useless for making predictions. To address this you need to drastically reduce the space by making some assumptions. There are many ways to do this, but two types have taken off:

  • Constraints on conditional distributions: e.g., you might assume that nearby pixels are more strongly correlated than distant pixels. This is the premise of convolution layers (a rough parameter count is sketched after this list).
  • Information-theoretic compression: Assume that the distributions can be projected onto a much lower dimensional submanifold without loss in accuracy. This is the point of encoding / decoding layers.
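To give a sense of how much the first kind of assumption shrinks the problem, here is a rough, illustrative parameter count (my own numbers, not from any particular model) comparing a fully connected layer on a modest 300×300 RGB image with a single 3×3 convolution layer:

```python
# Illustrative only: compare parameter counts for one dense layer vs. one conv layer
# on a 300x300 RGB image. The exact widths are arbitrary; the point is the scale.

inputs = 300 * 300 * 3                      # flattened image: 270,000 input values
hidden = 1024                               # arbitrary width for the dense layer

dense_params = inputs * hidden + hidden     # weight matrix + biases: ~276 million
conv_params = (3 * 3 * 3) * 64 + 64         # 64 filters of size 3x3x3, plus biases: 1,792

print(f"fully connected layer:      {dense_params:,} parameters")
print(f"3x3 conv layer, 64 filters: {conv_params:,} parameters")
```

Weight sharing plus the locality assumption cuts the free parameters by roughly five orders of magnitude before a single training image is seen; that is the sense in which the architecture encodes prior assumptions about $P(X, Y)$.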

This is to say nothing of the highly nontrivial engineering challenges. Even if you come up with a computational approach to statistical inference which is fundamentally different from running gradient descent over some parameter space, the reality is that literally billions of dollars have been spent optimizing the software and hardware implementations of gradient descent, so your approach has a high practical bar to clear. Talk to somebody with a lot of practical experience training machine learning models, and they'll tell you how much time they spent begging / coercing / tricking gradient descent into converging more quickly.

8

The assumption of independent and identically distributed (i.i.d.) training data fails in typical machine learning settings. The paper "Generalization Error Bounds for Learning under Censored Feedback" studies how the DKW bounds should be modified to account for non-i.i.d. data.

1

The question is full of assumptions about AI/ML. Stochastic models for ML are hardly justifiable, even when we have nice numeric data. Validation is a huge issue that is glossed over. Where are the proofs?

In general, uncertainty takes many other forms, and there is a massive gap between strict theory and computational oversimplifications.
