Skip to main content

Questions tagged [machine-learning]

5 votes
1 answer
146 views

Bounding the VC-dimension of a class whose coordinate-evaluations have bounded VC dimension

Fix a positive integer $d$, let $\mathcal{F}$ be a set of functions from $[0,\infty)^d$ to $\mathbb{R}$, and for each $i\in[d]:=\{1,\dots,d\}$, let $\mathcal{F}_i$ be a set of functions from $[0,\...
AB_IM's user avatar
  • 4,942
2 votes
0 answers
106 views

Optimization problem for threshold functions

Let $S_m=\{-1,1\}^m$ be the hypercube of signs. Define the set of "admissible weights" $W_m$ as the subset of $\{w\in\mathbb{R}_+^m : \|w\|^2=m\}$ with a "support property" of the ...
Veit Elser's user avatar
  • 1,173
1 vote
1 answer
120 views

Universal approximation theorem for nonnegative functions [closed]

The classical universal approximation theorem says that functions of the form $f(x)=\sum_{i=1}^N\alpha_i\sigma(w_i^Tx_i+b_i)$ can approximate any continuous function from a compact subset of $\mathbb{...
Mango's user avatar
  • 11
2 votes
0 answers
89 views

Designing a loss function in the form of a linear combination

I have functions $f : \Bbb{R}^n \to \Bbb{R}_{\geq 0}$ and $g : \Bbb{R}^n \to \Bbb{R}_{\leq 0}$ and would like to find a $\mathbf{x}_* \in \Bbb{R}^n$ where $g(\mathbf{x}_*)=0$ and $f \left( {\bf x}_* \...
KhashF's user avatar
  • 3,659
6 votes
3 answers
2k views

Does the DKW Inequality Make Machine Learning Trivial?

In Machine Learning, given some features $X$, we aim to predict a target variable $Y$. If we consider the squared loss function, the optimal predictor is the Bayes predictor: $\mathbb{E}[Y \mid X].$ ...
Thinking's user avatar
  • 177
1 vote
0 answers
148 views

Confusion about learning theory terminology

Let $\mathcal{f}:\mathcal{X}\to \mathcal{Y}$ be a measurable (target) functions from a measurable space $\mathcal{X}$ to a TVS $\mathcal{Y}$, and let $\ell:\mathcal{Y}^2\to [0,\infty)$ be a Borel ...
Sam The Sampler's user avatar
1 vote
0 answers
76 views

Algorithms for stochastic nonconvex optimization with stochastic constraints: is there something out there?

For convex problems with stochastic constraints and objective there seem to be a number of online algorithms that can solve them. In Lan's "First-order and Stochastic Optimization Methods for ...
Jim's user avatar
  • 340
3 votes
0 answers
82 views

Why minimization of squared distance matrices difference corresponds to minimization of Gram matrix difference?

I was reading the article "Some Properties of Classical Multi-Dimesional Scaling" by Mardia (1978). He mentions that the classical Multi Dimensional Scaling corresponds to the minimization ...
guest's user avatar
  • 131
2 votes
0 answers
65 views

Surjectivity of MLP functions

Suppose that $\mathcal{F}(H,L)$ is a function class of all functions implemented by a multilayer perceptron with ReLU activation, width $H$ and depth $L$. Also, there is a set of all possible value ...
Iris's user avatar
  • 73
1 vote
0 answers
86 views

How computationally efficient are kernel tricks? [closed]

"If we compare to non-kernel polynomial regression it is O(Tnp) where is p is dimension of polynomial while kernel polynomial is O(n^2d) + O(T*n^2) where d is original number of attributes, ...
Saransh Gupta's user avatar
1 vote
0 answers
103 views

Conditions on LR in Gradient Descent

In Introductory Lectures in Convex Optimization by Yurii Nesterov, Section 1.2.3 shows that gradient descent is guaranteed to converge if the step size is chosen either with a fixed step size or ...
Kyle's user avatar
  • 51
1 vote
0 answers
138 views

PageRank in directed graphs: equivalence of iterative and eigenvalue methods

Given a directed graph $ G $ with $ n $ nodes, we can represent this graph using an adjacency matrix $ A $. The stochastic matrix $ S $ can be derived from the adjacency matrix using the following ...
ABB's user avatar
  • 4,140
3 votes
0 answers
206 views

Convergence of gradient descent to critical point

Does there exist a generalization of this theorem by Yurii Nesterov in Introductory Lectures on Convex Optimization (2004) which relaxes the convexity assumption and shows that gradient descent ...
Kyle's user avatar
  • 51
11 votes
2 answers
1k views

Information theoretical interpretation of Free Energy

When exploring the concept of free energy from an information-theoretic perspective, I often come across statements like: "Free energy measures the degree of surprise an agent experiences when ...
user3072048's user avatar
2 votes
0 answers
68 views

Cluster minimizing sum of cost of clusters

Given a dataset $X,$ having $p$ features, organize the units $x_i \in X $ into fixed number of clusters $g,$ with fixed cluster size $B.$ Clustering policy: minimize the sum of a linear combination of ...
BiasedBayes's user avatar
1 vote
0 answers
45 views

Symplectic aggregation over times steps

I am trying to achieve the following: given a sequence of phase space points $\left\{z_j\right\}=\left\{\left(q_j, p_j\right)\right\}$ for $j=1, \ldots, T$. Goal: Project this sequence to a single ...
Ian's user avatar
  • 29
2 votes
0 answers
189 views

How good is approximation of distance function on the Cayley graph by "Fourier" basis coming from the irreducible representations?

Consider finite group $G$ , symmetric set of its elements $S$, construct a Cayley graph. Consider $d(g)$ - word metric or distance on the Cayley graph from identity to $g$. As any function on a group ...
Alexander Chervov's user avatar
1 vote
0 answers
78 views

From a constraint satisfaction problem (CSP) to a sudoku grid [closed]

one of the existing methods of solvin a sudoku grid is via constraints satisfaction (CSP), but can we do the inverse ie convert a CSP problem into a sudoku grid and then solve it ?
youssef Lmourabite's user avatar
2 votes
0 answers
190 views

How to choose N policemen positions to catch a drunk driver in the most effective way (on a Cayley graph of a finite group)?

Consider a Cayley graph of some big finite group. Consider random walk on such a graph - think of it as drunk driver. Fix some number $N$ which is much smaller than group size. Question 1: How to ...
Alexander Chervov's user avatar
2 votes
0 answers
221 views

Should a neural network architecture change after a pass in gradient descent?

I'm trying to understand neural networks formally a little better and it was always my understanding that the miracle that happens during backpropagation is that "performing a pass in gradient ...
Louis's user avatar
  • 81
1 vote
0 answers
85 views

Interpolation in convex hull

I'm reading a paper, Learning in High Dimension Always Amounts to Extrapolation, that provides a result I don't understand. It provides this theorem which I do understand: Theorem 1: (Bárány and ...
Christopher D'Arcy's user avatar
1 vote
1 answer
165 views

Bayes classifiers with cost of misclassification

A minimum ECM classifier disciminate the features $\underline{x}$ to belong to class $t$ ($\delta(\underline{x}) = t$) if $\forall j \ne t$: $$\sum_{k\ne t} c(t|k) f_k(\underline{x})p_k \le \sum_{k\ne ...
BiasedBayes's user avatar
2 votes
1 answer
227 views

How to lower bound the absolute value of the difference of two Kullback-Leibler divergences given the constrains below?

Given that $\min (D[p_1||p_3],D[p_2||p_4])=a$, how to find a lower bound of the difference $\left \vert D[p_1\parallel p_2]-D[p_3\parallel p_4] \right\vert$ with respect to $a$? If the condition is ...
Richard Ben's user avatar
3 votes
0 answers
176 views

Short path problem on Cayley graphs as language translation task (from "Permutlandski" to "Cayleylandski"(s) :). Reference/suggestion request

Context: Algorithms to find short paths on Cayley graphs of (finite) groups are of some interest - see below. There can be several approaches to that task. One of ideas coming to my mind - in some ...
Alexander Chervov's user avatar
4 votes
1 answer
599 views

Difference between deep neural networks and expectation maximization algorithm

Having had a short encounter with deep neural networks, it seems to boil down to the task of determining the values of a vast amount of parameters. The expectation maximization algorithm, of which I ...
Manfred Weis's user avatar
  • 13.9k
7 votes
0 answers
243 views

What are compact manifolds such that GROWTH (of spheres volumes) is well approximated by the Gaussian normal distribution?

Consider some compact Riemannian manifold $M$. Fix some point $p$. Consider a "sub-sphere of radius $r$" - i.e. set of points on distance $r$ from $p$. Consider growth function $g(r)$ to be ...
Alexander Chervov's user avatar
5 votes
0 answers
183 views

Does the permutohedron satisfy any minimal distortion property for graph metric vs Euclidean distance?

We can look on the permutohedron as a kind of "embedding" of the Cayley graph of $S_n$ to the Euclidean space. (That Cayley graph is constructed by the standard generators, i.e. ...
Alexander Chervov's user avatar
7 votes
1 answer
324 views

What Cayley graphs arise as nodes+edges from "nice" polytopes and when are these polytopes convex?

The Permutohedron is a remarkable convex polytope in $R^n$, such that its nodes are indexed by permutations and edges correspond to the Cayley graph of $S_n$ with respect to the standard generators, i....
Alexander Chervov's user avatar
2 votes
0 answers
55 views

Continuity of Kernel Mean Embeddings

Given some kernel $k: X \times X \to \mathbb{R}$ with RKHS $H_k$ we say that $k$ is characteristic on the space of signed Radon measures over $X$, denoted by $\mathcal{M}(X)$, if the kernel mean ...
Gaspar's user avatar
  • 181
1 vote
0 answers
59 views

A network to transform/predict one probability distribution to another

I have a random variable of a particular density (e.g., normal), and a known probability distribution (e.g., mixture Gaussian). I used a simple KL measure to predict/transform one another. Now I need ...
user524691's user avatar
14 votes
2 answers
3k views

Soft question: Deep learning and higher categories

Recently, I have stumbled upon certain articles and lecture videos that use category theory to explain certain aspects of machine learning or deep learning (e.g. Cats for AI and the paper An enriched ...
h3fr43nd's user avatar
  • 367
67 votes
2 answers
12k views

What mathematical problems can be attacked using DeepMind's recent mathematical breakthroughs?

I am a research mathematician at a university in the United States. My training is in pure mathematics (geometry). However, for the past couple of months, I have been supervising some computer science ...
Ryan Hendricks's user avatar
1 vote
0 answers
197 views

Locally "unshortable" paths in graphs

Setup: Consider a connected graph G, with diameter "d". Informally: Trivially (by definition of diameter), taking any path $P$ any nodes $P(i) , P(i+k)$ for $k>d$ can be connected by a ...
Alexander Chervov's user avatar
29 votes
6 answers
3k views

Open problems which might benefit from computational experiments

Question: I wonder what are the open problems , where computational experiments might be helpful? (Setting some bounds, excluding some cases, shaping some expectations ). Grant program: The context of ...
3 votes
1 answer
143 views

When does the optimal model exist in learning theory?

In the context of learning theory, we usually have: data $(x,y)\sim P(x,y)$, with $x\in\mathcal{X}\subseteq\mathbb{R}^d$ and $y\in\mathcal{Y}\subseteq\mathbb{R}^k$, a hypothesis class $\mathcal{F}\...
rick's user avatar
  • 199
9 votes
1 answer
956 views

Sufficient condition for linear separability of a boolean function on $n$ variables

This is a cross-post of two recent questions at math.stackexchange without answers: Q1 and Q2. A boolean function on an $n$-dimensional hypercube is linearly separable when the convex hulls of the ...
Fabius Wiesner's user avatar
2 votes
1 answer
190 views

expectation of the product of Gaussian kernels and their input

I was wondering if anybody knows how to solve: $$\mathbb{E}{\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\left[ (\mathbf{x}{i} - \mathbf{z})(\mathbf{x}{j} - \mathbf{z})^\top \exp\left( - (\...
patchouli's user avatar
  • 285
7 votes
2 answers
731 views

Upper bound on VC-dimension of partitioned class

Fix $n,k\in \mathbb{N}_+$. Let $\mathcal{H}$ be a set of functions from $\mathbb{R}^n$ to $\mathbb{R}$ with finite VC-dimension $d\in \mathbb{N}$. Let $\mathcal{H}_k$ denote the set of maps of the ...
Mathematical-Semi_N00b's user avatar
4 votes
0 answers
196 views

Known relations between mutual information and covering number?

This is a question about statistical learning theory. Consider a hypothesis class $\mathcal{F}$, parameterized by real vectors $w \in \mathbb{R}^p$. Suppose I have a data distribution $D \sim \mu$ and ...
Tanishq Kumar's user avatar
2 votes
1 answer
114 views

Non-linear transforms of RKHS question

I was reading the paper Norm Inequalities in Nonlinear Transforms (referenced in this question) but ran into difficulties, so I was wondering if anyone could help? I think I follow the paper until I ...
Mat's user avatar
  • 41
62 votes
10 answers
13k views

A clear map of mathematical approaches to Artificial Intelligence

I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical ...
1 vote
0 answers
127 views

Approximation of continuous function by multilayer Relu neural network

For continuous/holder function $f$ defined on a compact set K, a fix $L$ and $m_1,m_2,\dots,m_L$, can we find a multilayer Relu fully connected network g with depth $L$ and each $i$-th layer has width ...
Hao Yu's user avatar
  • 819
1 vote
2 answers
285 views

Beating the $1/\sqrt n$ rate of uniform-convergence over a linear function class

Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon,t\gt 0$. For any unit-vector $w \in ...
dohmatob's user avatar
  • 7,033
1 vote
0 answers
179 views

Matrix valued word embeddings for natural language processing

In natural language processing, an area of machine learning, one would like to represent words as objects that can easily be understood and manipulated using machine learning. A word embedding is a ...
Joseph Van Name's user avatar
3 votes
1 answer
203 views

Why is the logistic regression model good? (and its relation with maximizing entropy)

Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^...
stupid_question_bot's user avatar
10 votes
1 answer
400 views

Who introduced the term hyperparameter?

I am trying to find the earliest use of the term hyperparameter. Currently, it is used in machine learning but it must have had earlier uses in statistics or optimization theory. Even the multivolume ...
ACR's user avatar
  • 873
3 votes
0 answers
165 views

Equivalence of score function expressions in SDE-based generative modeling

I am studying the paper "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) by Yang et al. The authors use the following loss function (Equation 7 ...
Po-Hung Yeh's user avatar
9 votes
1 answer
675 views

Geometric formulation of the subject of machine learning

Question: what is the geometric interpretation of the subject of machine learning and/or deep learning? Being "forced" to have a closer look at the subject, I have the impression that it ...
Manfred Weis's user avatar
  • 13.9k
1 vote
0 answers
139 views

Problems Correction of "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning "' [closed]

Where I can find the problems correction of this book " Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning "
zdo0x0's user avatar
  • 11
3 votes
0 answers
130 views

Prove the convergence of the LASSO model in the presence of limited eigenvalues

I am researching the properties of the Lasso model $\hat \beta:= \operatorname{argmin} \{\|Y-X\beta\|_2^2/n+\lambda\|\beta\|_1\}$, specifically its convergence when the data satisfies restricted ...
GGbond's user avatar
  • 39

1
2 3 4 5