
Motivation

The goal of this work is to develop a unified geometric framework for finite probability distributions and finite random variables, using tools from differential geometry and information geometry. Once a Riemannian perspective centered on the Fisher information metric is adopted, fundamental concepts such as expectation, variance, and entropy emerge naturally as geometric quantities. The motivations include:

  • Statistical invariance: Classical statistical measures, like variance and entropy, are invariant under additive translations of random variables.
  • Intrinsic geometry: The Fisher information metric, which arises naturally from the Kullback-Leibler divergence, provides a canonical geometric structure on the probability simplex.

The main questions addressed concern the originality of this geometric duality, potential relationships to established theories, and possibilities for generalization.


Part 1: Geometry of the Probability Simplex and its Tangent Space

We consider the open probability simplex: $$ \mathring\Delta_n = \{ p \in \mathbb{R}^n \mid p_i > 0, \sum_{i=1}^n p_i = 1 \},$$

endowed naturally with the Fisher information metric: $$ g_p(x,x) = \frac{1}{2}\sum_{i=1}^n \frac{x_i^2}{p_i}, \quad x \in T_p \mathring\Delta_n = \{x \in \mathbb{R}^n \mid \textstyle\sum_i x_i = 0\}. $$

Random variables appear naturally as elements of the tangent space, identified modulo additive constants to account for statistical invariance. Thus, we have the canonical isomorphism: $$ H^n = \mathbb{R}^n / \mathbb{R} \mathbf{1}_n \cong T_p \mathring\Delta_n. $$
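
Here is a minimal NumPy sketch of this setup (the helper names `fisher` and `center` are mine, introduced only for illustration):

```python
import numpy as np

def fisher(p, x, y):
    """Fisher inner product at p: g_p(x, y) = (1/2) * sum_i x_i * y_i / p_i
    (the polarization of the quadratic form above)."""
    return 0.5 * np.sum(x * y / p)

def center(x):
    """Canonical representative of the class [x] in H^n = R^n / R·1:
    subtracting the mean lands in the tangent space {x : sum_i x_i = 0}."""
    return x - x.mean()

# A point of the open simplex and a tangent vector at it
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
x = center(rng.normal(size=5))
print(fisher(p, x, x))   # squared Fisher length of x at p
```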

From this perspective, important statistical concepts are naturally geometric:

  • Expectation as a tangent vector: The expectation with respect to a distribution $p$ corresponds uniquely to a tangent vector $e_p \in T_p \mathring\Delta_n$, explicitly given by: $$ e_{p,i} = 2p_i \left(p_i - \sum_j p_j^2\right). $$ This vector vanishes if and only if $p$ is uniform, reflecting equilibrium conditions in statistical mechanics. (A numerical sanity check appears after this list.)

  • Orthogonal decomposition: Every tangent vector $x \in T_p \mathring\Delta_n$ can be decomposed into a component parallel to $e_p$ (thermodynamically observable) and an orthogonal component representing statistical fluctuations or noise.
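
A quick numerical check of these claims. The last assertion reflects my reading of "corresponds uniquely": $e_p$ appears to be the Riesz representative of the expectation functional, i.e. $g_p(e_p, x) = \sum_i p_i x_i$ for every tangent $x$ (a sketch, not a proof):

```python
import numpy as np

def fisher(p, x, y):
    return 0.5 * np.sum(x * y / p)       # g_p(x, y) as defined above

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))            # random point of the open simplex
x = rng.normal(size=5); x -= x.mean()    # random tangent vector (coordinates sum to 0)

e_p = 2 * p * (p - np.sum(p**2))         # the expectation vector from the text

assert abs(e_p.sum()) < 1e-12            # e_p is indeed tangent
u = np.full(5, 1/5)
assert np.allclose(2 * u * (u - np.sum(u**2)), 0)    # e_p vanishes at the uniform distribution
assert np.isclose(fisher(p, e_p, x), np.sum(p * x))  # g_p(e_p, x) = E_p[x]
```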


Part 2: Spectral Duality between Distributions and Variables

A duality is established between probability distributions (points in $\mathring\Delta_n$) and random variables (elements in $H^n$) through natural spectral maps:

  • Spectral embedding of distributions into variables: Define the centered logarithmic embedding $\tilde{I}$ componentwise by: $$ \tilde{I}(p)_i = -\ln p_i + \frac{1}{n}\sum_{j=1}^n \ln p_j, $$ which quantifies the informational deviation from uniformity.

  • Inverse embedding of variables into distributions: Define the softmax embedding $S$ by: $$ S(x) = \operatorname{softmax}(-x), \quad \text{i.e.} \quad S(x)_i = \frac{e^{-x_i}}{\sum_j e^{-x_j}}, $$ which is invariant under $x \mapsto x + c\,\mathbf{1}_n$ and hence well defined on $H^n$.

These embeddings are mutually inverse maps: $$ \tilde{I} \circ S = \text{Id}_{H^n}, \quad S \circ \tilde{I} = \text{Id}_{\mathring\Delta_n}, $$ creating an explicit duality.
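
A minimal check of the inversion identities, on centered representatives of $H^n$ (function names are mine):

```python
import numpy as np

def I_tilde(p):
    """Centered logarithmic embedding: -ln p, shifted to have zero mean."""
    return -np.log(p) + np.log(p).mean()

def S(x):
    """Softmax embedding; invariant under x -> x + c*1, so well defined on H^n."""
    w = np.exp(-x)
    return w / w.sum()

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(4))
x = rng.normal(size=4); x -= x.mean()    # centered representative of a class in H^n

assert np.allclose(S(I_tilde(p)), p)     # S o I_tilde = Id on the open simplex
assert np.allclose(I_tilde(S(x)), x)     # I_tilde o S = Id on centered representatives
assert np.allclose(S(x + 3.0), S(x))     # constant shifts do not change S(x)
```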

Consequently, the Fisher information metric induces a corresponding geometry on $H^n$, where tangent vectors at a point $x \in H^n$ naturally represent distributions. This duality is symmetric:

  • Tangent vectors at distributions are identified with variables.
  • Tangent vectors at variables are identified with distributions.

Algebraic Structure Induced from Linear Structure

The vector space structure of $H^n$ induces a corresponding algebraic structure on the simplex $\mathring\Delta_n$. Specifically, translating the linear addition from $H^n$ via the spectral maps yields a convolution-like operation: $$ (p \star q)_i = \frac{p_i q_i}{\sum_j p_j q_j}, $$ which inherits a natural identity element (the uniform distribution) and an inversion operation: $$ p^{-1}_i = \frac{1/p_i}{\sum_j 1/p_j}. $$ This algebraic structure is naturally induced by the linearity in $H^n$ and highlights classical distributions such as Bernoulli, Boltzmann, and Binomial as particular cases of inversion.
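
A sketch verifying that $\star$ is addition in $H^n$ pulled back through the spectral maps, together with the identity and inversion laws (assuming $\tilde I$ and $S$ as defined above):

```python
import numpy as np

def I_tilde(p):
    return -np.log(p) + np.log(p).mean()

def S(x):
    w = np.exp(-x)
    return w / w.sum()

def star(p, q):
    """(p * q)_i = p_i q_i / sum_j p_j q_j."""
    w = p * q
    return w / w.sum()

rng = np.random.default_rng(2)
p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
u = np.full(4, 1/4)                      # uniform distribution
p_inv = (1 / p) / np.sum(1 / p)          # the inversion from the text

assert np.allclose(star(p, q), S(I_tilde(p) + I_tilde(q)))  # star = addition in H^n
assert np.allclose(star(p, u), p)                           # uniform is the identity
assert np.allclose(star(p, p_inv), u)                       # p * p^{-1} = uniform
```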


Dual Metric and its Geometric Implications

By considering random variables as elements of $H^n$ with tangent spaces identified as distributions, a dual metric $\tilde{g}$ emerges naturally: $$ \tilde{g}_x(p,q) = g_{S(x)}(\tilde{I}(p), \tilde{I}(q)). $$

This dual construction introduces an additional geometric structure on $H^n$ that mirrors the structure on $\mathring\Delta_n$. An open question is whether such dual metrics and their geometric properties have been previously studied.
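
For what it's worth, the dual pairing is straightforward to evaluate numerically by composing the pieces above (`g_dual` is my own name for it):

```python
import numpy as np

def fisher(p, x, y):
    return 0.5 * np.sum(x * y / p)

def I_tilde(p):
    return -np.log(p) + np.log(p).mean()

def S(x):
    w = np.exp(-x)
    return w / w.sum()

def g_dual(x, p, q):
    """Dual metric on H^n: the Fisher pairing at the base point S(x)
    of the log-embeddings of the 'tangent' distributions p and q."""
    return fisher(S(x), I_tilde(p), I_tilde(q))

rng = np.random.default_rng(3)
x = rng.normal(size=4); x -= x.mean()
p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))

assert np.isclose(g_dual(x, p, q), g_dual(x, q, p))   # symmetric in p and q
assert g_dual(x, p, p) > 0                            # positive unless p is uniform
```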


Open Questions and Further Connections:

  • Originality: Has the specific geometric duality between $\mathring\Delta_n$ and $H^n$, and particularly the induced metrics, been previously explored?
  • Algebraic structures: Is the algebraic convolution structure $(\mathring\Delta_n, \star)$ recognized or previously analyzed? My initial intuition is that the softmax function acts as a kind of analogue of a Fourier transform.
  • Physical interpretations: Could the tangent vector $e_p$ be explicitly connected to linear-response theory or other established physical frameworks?
  • Generalization: While I have already investigated aspects such as variance, covariance, and related geometric ideas, a broader question remains: if this framework is known, how might one naturally derive higher-order statistical structures from it?
  • Applications: The geometric duality and the metrics introduced here could, I believe, enable efficient algorithms in machine learning, particularly for dimensionality reduction, robust distribution estimation, and improved variational inference. Numerical experiments could assess the practical effectiveness of these geometric representations in statistical analysis and in both supervised and unsupervised learning settings. Are there existing approaches in these domains that resemble what I have described?

Feedback, references, and suggestions on the novelty, relevance, and potential impact of this approach are most welcome. Could numerical and experimental tests help assess the practical and computational relevance of this geometric approach?


Note: This post is a significantly revised and extended version of an earlier question I had posed, now structured more clearly and with refined mathematical statements.

  • No idea for the open questions, but I'd add some more: (1) say something about the Poisson laws in this setting, via some limits, that would be nice, and (2) say something about the distributions of complex matrices (i.e. sums of Diracs at the eigenvalues), or even better, of random matrices over finite probability spaces. Commented Apr 21 at 22:57
  • @TeoBanica Thanks! I didn't mention this, but the operadic structure on composition induces a decomposition of relative entropy and of the Fisher metric: $$g_{w\circ\left(p^{1},\cdots,p^{n}\right)}=g_{w}+\sum_{i=1}^{n}w_{i}g_{p^{i}}$$ This splits the geometry into a marginal and a conditional part. In $\Delta_3$ it is even orthogonal. With $w = u_n$ and $p^i = (\lambda/n, 1 - \lambda/n)$, perhaps a Poisson law emerges in the limit. Still speculative; any related refs welcome! Commented Apr 21 at 23:46
  • @TeoBanica Or perhaps more simply: start with a truncated version of the Poisson distribution and investigate its limiting behavior. Commented Apr 21 at 23:57

1 Answer


You've made quite a few observations in your question, so while the broad strokes of what you mention are previously known, it may well be that some of what you have found is novel. However, let me mention the things which are known.

The convolution structure on the probability simplex was studied by Soumik Pal and Ting-Kam Leonard Wong [1] (see page 5). I would also recommend reading their work [2], which goes into a great deal of depth about geometric structures on the probability simplex.

The dualistic geometry induced by the softmax function is also previously known. In the information geometry literature, the coordinates from which the probabilities are recovered via the softmax map (i.e., the log-probabilities, up to an additive constant) are referred to as "natural parameters", because they are the natural parameters of the multinomial distribution when it is written as an exponential family. There is a large body of work studying the Fisher metric for exponential families, because in natural-parameter coordinates the Fisher metric is given by the Hessian of the log-partition function.

In addition, the duality between the natural parameters and the probability simplex plays an important role: it turns out that if one writes the multinomial family as an exponential family, the weights $p_i$ are the expected values of the sufficient statistics associated to the natural parameters. As such, these are often known as "expectation coordinates" or sometimes "mixture coordinates". In this set of coordinates, the Fisher metric is again given by the Hessian of a function, this time the Legendre transform of the log-partition function. In information geometry this is called a "dually flat space", because one often specifies the natural and expectation parameters in terms of the flat connections that they induce. These two connections satisfy a conjugacy relationship with respect to the Fisher metric, so they are dual to each other. They appear in the literature under various names, such as the e- and m-connections (short for exponential and mixture) or the $\alpha = \pm 1$ connections.
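
To make the Hessian statement concrete, here is a minimal finite-difference check for the categorical family (multinomial with one trial) in the overparametrized natural coordinates $\theta$, where $p = \operatorname{softmax}(\theta)$ and the Fisher information is $\operatorname{diag}(p) - pp^\top$. This is a sketch of my own, using the standard Fisher convention (without the factor $1/2$ appearing in the question's metric):

```python
import numpy as np

def log_partition(theta):
    """psi(theta) = log sum_i exp(theta_i) for the categorical exponential family."""
    return np.log(np.sum(np.exp(theta)))

theta = np.array([0.3, -1.2, 0.7, 0.1])
p = np.exp(theta) / np.sum(np.exp(theta))    # softmax(theta)
n, h = len(theta), 1e-4

# Central-difference Hessian of the log-partition function
H = np.empty((n, n))
E = np.eye(n)
for i in range(n):
    for j in range(n):
        H[i, j] = (log_partition(theta + h*E[i] + h*E[j])
                   - log_partition(theta + h*E[i] - h*E[j])
                   - log_partition(theta - h*E[i] + h*E[j])
                   + log_partition(theta - h*E[i] - h*E[j])) / (4 * h * h)

# Fisher information of the categorical family in natural parameters
assert np.allclose(H, np.diag(p) - np.outer(p, p), atol=1e-6)
```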

The details of this should be contained in Amari's book on information geometry (or any other standard reference). However, the literature on IG can be somewhat hard to follow, so hopefully some of the keywords that I have mentioned make it easier to find the relevant material.

[1] Pal, Soumik; Wong, Ting-Kam Leonard, Multiplicative Schrödinger problem and the Dirichlet transport, Probab. Theory Relat. Fields 178, No. 1-2, 613-654 (2020). ZBL1466.49042.

[2] Pal, Soumik; Wong, Ting-Kam Leonard, Exponentially concave functions and a new information geometry, Ann. Probab. 46, No. 2, 1070-1113 (2018). ZBL1390.60064.

  • Thanks! I'll try to use these keywords to navigate the literature. That said, IG texts are often opaque to me, as I lack formal training in statistics. What I described stems from a simple problem: define the amount of information associated with an event of probability $p$, i.e. a function $I:(0,1]\to\mathbb{R}$ such that (1) $p<q \Rightarrow I(p)>I(q)$, and (2) $I(pq)=I(p)+I(q)$. A quick argument shows this forces $I(p)=-\log(p)$ (up to a constant), and from this point I let elementary Riemannian geometry and linear algebra take over. Commented Apr 22 at 14:03
  • I completely agree that the IG literature is difficult to follow, and I'm not sure a background in statistics actually helps that much. There isn't a reference I know of that really works out a few examples in detail, but Shima's Geometry of Hessian Structures might have some useful material if you can track down a copy. Commented Apr 22 at 15:14
