Part I: Smooth Manifolds with the Fisher-Rao Metric
Goal (edited: 01-Jan-23)
This blog post focuses on the Fisher-Rao metric, which gives rise to the Fisher information matrix (FIM). We will introduce the following useful concepts to ensure non-singular FIMs:
- Regularity conditions and intrinsic parameterizations of a distribution
- Dimensionality of a smooth manifold
The discussion here is informal and focuses more on intuition than on rigor.
Motivation
The goal of this blog is to introduce geometric structures associated with probability distributions. Why should we care about such geometric structures? By exploiting the structures, we can
- design efficient and simple algorithms [1]
- design robust methods that are less sensitive to re-parametrization [2]
- understand the behavior of models/algorithms using tools from differential geometry, information geometry, and invariant theory [3]
These benefits are relevant for a wide range of machine learning methods, since most of them make use of probability distributions of one kind or another.
Below, we give some common examples from the literature. A reader familiar with such examples can skip this part.
Empirical Risk Minimization (frequentist estimation):
Given $N$ input-output pairs, the least-squares loss can be viewed as a finite-sample approximation of an expectation w.r.t. a probability distribution, which is assumed to be the data-generating distribution. Here, the per-example likelihood is a normal distribution over the output whose mean is given by the model prediction, with a fixed variance. Well-known algorithms such as Fisher scoring and (empirical) natural-gradient descent [4] are commonly used methods that exploit the geometric structure of this distribution. These are examples of algorithms derived from a frequentist perspective, which can also be generalized to neural networks [4].
Variational Inference (Bayesian estimation):
Given a prior and a likelihood over a latent vector and known data, we can approximate the exact posterior by optimizing a variational objective with respect to an approximating distribution, where the objective involves the Kullback–Leibler divergence. Natural-gradient variational inference [5] is an algorithm that speeds up the inference by exploiting the geometry of the approximating distribution induced by the Fisher-Rao metric. This approach is derived from a Bayesian perspective, and it can also be generalized to neural networks [6] and Bayesian neural networks [7].
Evolution Strategies and Policy-Gradient Methods (Global optimization):
Global optimization methods often use a search distribution to find the global maximum of an objective. Samples from the search distribution are evaluated through a "fitness" function and guide the optimization towards better optima. Natural evolution strategies [8] is an algorithm that speeds up the search process by exploiting the geometry of the search distribution. In the context of reinforcement learning, the search distribution is known as the policy distribution, which generates actions, and natural evolution strategies corresponds to the natural policy gradient method [9].
In all of the examples above, the objective function is expressed in terms of an expectation w.r.t. a parameterized distribution, as summarized in the table below.

| Example | Meaning of the random variable | Distribution |
|---|---|---|
| Empirical Risk Minimization | observation | data-generating (likelihood) distribution |
| Variational Inference | latent variable | variational (approximating) distribution |
| Evolution Strategies | decision variable | search (policy) distribution |
Note:
In general, we may have to compute or estimate the inverse of the FIM. However, in many useful machine learning applications, algorithms such as [2] [4] [5] [7] [8] [9] [10] can be efficiently implemented without explicitly computing the inverse of the FIM.
We discuss this in other posts; see Part V and our ICML work.
In the rest of the post, we will mainly focus on the geometric structure of (finite-dimensional) parametric families, for example, a univariate Gaussian family.
The following figure illustrates four distributions in the Gaussian family.
Intrinsic Parameterizations
We start by discussing a special type of parameterization, which we call an intrinsic parameterization; such parameterizations are useful for obtaining non-singular FIMs.
An arbitrary parameterization may not always be appropriate for a smooth manifold [12]. Rather, the parameterization should be such that the manifold locally looks like a flat vector space, just as the curved surface of the Earth looks flat to us locally.
We will refer to such a flat vector space as a local vector-space structure.
Local vector-space structure:
It supports local vector addition, local multiplication by real scalars, and the corresponding algebraic laws (e.g., the distributive law). (See Part II for the details.)
Intrinsic parameterizations1 are those that satisfy the following two conditions:
- We require that the parameter space of the parameterization be an open set in $\mathbb{R}^K$, where $K$ is the number of entries of the parameter array. Intuitively, this ensures a local vector-space structure throughout the parameter space, which in turn ensures that a small, local perturbation at each point stays within the parameter space.
- We also require that the parameterization uniquely and smoothly represents points in the manifold. This condition ensures that arbitrary (smooth) parameter transformations still represent the same subset of points. In other words, we require that
  - there exists a bijective map between two such parameterizations whenever they represent a common subset of points in the manifold, and
  - this map and its inverse map are both smooth.

In differential geometry, such a map is known as a diffeomorphism, which gives a formal but more abstract statement of this requirement.
Intrinsic parameterizations satisfy the above two conditions, and lead to non-singular FIMs, as we will see soon.
We will now discuss a simple case of a manifold, a unit circle in $\mathbb{R}^2$.
Parameterization 1 (an intrinsic parameterization):
A (local) parametrization of the circle at the point highlighted in red uses one (scalar) parameter defined on an open interval. The manifold is (locally) "flat" since we can always find a small 1-dimensional perturbation in the 1-dimensional parameter space. Therefore, this is an intrinsic parameterization.
We can similarly define a (local) parametrization at each point of the circle. In fact, finitely many (local) parameterizations suffice to represent the whole circle.
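As a concrete illustration (one possible choice of chart, not necessarily the one drawn in the original figure), the upper arc of the circle can be parameterized by

$$ \varphi: (-1, 1) \to S^1, \qquad \varphi(t) = \left(t, \sqrt{1 - t^2}\right). $$

The parameter space $(-1, 1)$ is an open interval, $\varphi$ is smooth and bijective onto the upper arc, and its inverse $(x, y) \mapsto x$ is also smooth, so this is an intrinsic (local) parameterization. Four such charts (for the upper, lower, left, and right arcs) cover the whole circle.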
Now, we discuss invalid cases, where not all conditions are satisfied.
Parameterization 2 (a non-intrinsic parameterization due to non-smoothness):
Let's define a map that sends each angle in a half-open interval (say, $[0, 2\pi)$) to the corresponding point on the circle; this gives a (global) parametrization of the circle with one (scalar) parameter. The map is bijective and smooth. However, the parameter space is not open in $\mathbb{R}$, and the inverse map is not continuous at the point where the angle wraps around. Therefore, this parametrization is not intrinsic. In fact, there does not exist a (single) global and intrinsic parametrization that represents the whole circle.
Smoothness of the inverse map is essential when it comes to reparametrization (a.k.a. parameter transformation). The smoothness of the map and of its inverse gives us a way to generate new intrinsic parameterizations. Essentially, in such a case, the Jacobian matrix (for changing between the parameterizations) is non-singular everywhere, and we can use the chain rule and the inverse function theorem to jump between different intrinsic parameterizations. We will discuss this in Part III.
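As a brief preview (using generic symbols $\lambda$, $\tau$, and $g$ that are not taken from the original post), suppose $\tau = g(\lambda)$ is a reparametrization where $g$ and its inverse are both smooth. By the chain rule,

$$ \nabla_{\lambda} \log p\big(x; g(\lambda)\big) = J^{\top} \nabla_{\tau} \log p(x; \tau), \qquad J := \frac{\partial \tau}{\partial \lambda}, $$

and the inverse function theorem guarantees that the Jacobian $J$ is non-singular with $J^{-1} = \partial \lambda / \partial \tau$. This is what lets us move freely between intrinsic parameterizations.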
Parametrization 3 (a non-intrinsic parameterization due to non-openness):
The circle does not look like a flat space under the parametrization that uses both coordinates of a point as parameters. The number of entries in this parameter array is 2. The reason is that we cannot find a small 2-dimensional perturbation that stays in the parameter space, due to the constraint that the two coordinates must lie on the unit circle. In other words, this parameter space is not open in $\mathbb{R}^2$.
Parametrization 4 (a non-intrinsic parameterization due to non-uniqueness):
Let's consider the following non-intrinsic parametrization of the circle, whose parameter space is open. This parametrization is not intrinsic since it does not uniquely represent a point in the circle: it is easy to see that two distinct parameter values can represent the same point in the circle.
Intrinsic Parameterizations for Parametric Families
The examples in the previous section clearly show the importance of parameterization, and that it should be chosen carefully. Now, we discuss how to choose such a parameterization for a given parametric family.
Roughly speaking, a parameterization is intrinsic for a parametric family when the following regularity condition holds.
Regularity Condition:
For any point in the parameter space, the partial derivatives of the log-likelihood w.r.t. the entries of the parameter array exist, are smooth w.r.t. the parameter, and are linearly independent as functions of the random variable.
In other words, a linear combination of these partial derivatives that vanishes for all values of the random variable must have all-zero coefficients.
Note that, due to the definition of the partial derivatives, this regularity condition implicitly assumes that the parameter space is an open set.
The following examples illustrate the regularity condition.
Example 1 (regularity condition for an intrinsic parameterization):
We will write down the regularity condition at a point for an intrinsic parameterization. Consider a 1-dimensional Gaussian family with a mean parameter and a variance parameter, under an intrinsic parametrization of these two quantities. The partial derivatives of the log-likelihood are smooth w.r.t. the parameter everywhere in the parameter space. Consider the partial derivatives at a particular point: the regularity condition at this point requires that any linear combination of the partial derivatives that vanishes for all values of the random variable has all-zero coefficients, which is exactly linear independence. A formal proof can be built to show that this holds for any mean and any positive variance.
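For concreteness, here is the computation under the assumption that the parametrization is $\tau = (\mu, \sigma)$ with mean $\mu$ and standard deviation $\sigma > 0$ (the original post may use a different but equivalent choice). The partial derivatives of the log-likelihood are

$$ \partial_{\mu} \log p(x; \mu, \sigma) = \frac{x - \mu}{\sigma^{2}}, \qquad \partial_{\sigma} \log p(x; \mu, \sigma) = \frac{(x - \mu)^{2}}{\sigma^{3}} - \frac{1}{\sigma}. $$

If a linear combination $c_{1}\, \partial_{\mu} \log p + c_{2}\, \partial_{\sigma} \log p$ vanishes for every $x$, then, viewing it as a polynomial in $(x - \mu)$, all of its coefficients must be zero, which forces $c_{1} = c_{2} = 0$. The two partial derivatives are therefore linearly independent, so the regularity condition holds.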
Example 2 (regularity condition for a non-intrinsic parameterization):
By using a counterexample, we will show that the regularity condition fails for a non-intrinsic parameterization. Consider a Bernoulli family with a non-intrinsic parameter, where the density is written using the indicator function. One can verify that the partial derivatives of the log-likelihood are linearly dependent, so the regularity condition fails.
In a similar fashion, we will also see (soon) that the regularity condition is not satisfied for the non-intrinsic parameterizations used in the examples below, while the condition does hold for the intrinsic parameterization used there.
Fisher-Rao Metric
Given an intrinsic parameterization $\tau$, the Fisher-Rao metric is defined as follows,

$$ F_{ij}(\tau) := E_{p(x;\tau)}\big[ \partial_{\tau_i} \log p(x;\tau) \, \partial_{\tau_j} \log p(x;\tau) \big]. $$

We can also express the metric in a matrix form as

$$ F(\tau) := E_{p(x;\tau)}\big[ \nabla_{\tau} \log p(x;\tau) \, \big(\nabla_{\tau} \log p(x;\tau)\big)^{\top} \big], $$

where $F(\tau)$ is exactly the Fisher information matrix (FIM) evaluated at $\tau$.
In what follows, we will assume the metric to be well-defined, which makes the Fisher-Rao metric a valid Riemannian metric [13] since the corresponding FIM is positive definite everywhere in its intrinsic parameter space.
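As a worked example (assuming the univariate Gaussian family with intrinsic parameterization $\tau = (\mu, \sigma)$, $\sigma > 0$, as in Example 1 above), the FIM can be computed in closed form:

$$ F(\mu, \sigma) = E_{p(x; \mu, \sigma)} \begin{bmatrix} (\partial_{\mu} \log p)^{2} & \partial_{\mu} \log p \, \partial_{\sigma} \log p \\ \partial_{\sigma} \log p \, \partial_{\mu} \log p & (\partial_{\sigma} \log p)^{2} \end{bmatrix} = \begin{bmatrix} \frac{1}{\sigma^{2}} & 0 \\ 0 & \frac{2}{\sigma^{2}} \end{bmatrix}, $$

which is positive definite at every point of the parameter space, so the Fisher-Rao metric is a valid Riemannian metric for this family.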
Caveats of the Fisher matrix computation
There are some caveats when it comes to the Fisher matrix computation. In particular, the regularity condition should be satisfied. It is possible to define the FIM under a non-intrinsic parameterization. However, the FIM is often singular or ill-defined under a non-intrinsic parameterization, as shown below.
Bernoulli Examples
Example 1 (Ill-defined FIM):
Consider a Bernoulli family with a non-intrinsic parameter. The following computation is not correct; have you made similar mistakes? Writing down the log-likelihood, taking its derivative, and plugging it into the definition above yields an expression for the FIM under this parameterization.
This computation is not correct. Do you know why?
Example 2 (Singular FIM):
Consider a Bernoulli family with a non-intrinsic parameter. Note that a Bernoulli distribution in the family is not uniquely represented by this parametrization: it is easy to see that two distinct parameter values can represent the same Bernoulli distribution. The FIM under this parameterization is singular, as shown below. Writing down the log-likelihood and its derivative, and plugging them into the definition, we obtain the FIM under this parameterization, which is singular since its determinant is zero.
Now, we give an example to show that the FIM of a Bernoulli family can be non-singular when we use an intrinsic parameterization.
Example 3 (Non-singular FIM):
Consider a Bernoulli family with an intrinsic parameter. The FIM under this parameterization is non-singular, as shown below. Writing down the log-likelihood and its derivative, and plugging them into the definition, we obtain the FIM under this parameterization, which is non-singular everywhere in the (open) parameter space.
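To make this concrete, here is a minimal sketch (not the author's linked code) that assumes the intrinsic parameterization $p(x = 1; \pi) = \pi$ with $\pi \in (0, 1)$ and computes the (scalar) FIM exactly with JAX autodiff:

```python
# A minimal sketch (assumes the intrinsic parameterization p(x=1) = pi with
# pi in (0, 1); not the author's original code).
import jax
import jax.numpy as jnp

def log_prob(pi, x):
    # log p(x; pi) = x log(pi) + (1 - x) log(1 - pi) for x in {0, 1}
    return x * jnp.log(pi) + (1 - x) * jnp.log(1 - pi)

def bernoulli_fim(pi):
    # F(pi) = sum over x in {0, 1} of p(x; pi) * (d/dpi log p(x; pi))^2
    dlogp = jax.grad(log_prob, argnums=0)
    return sum(jnp.exp(log_prob(pi, x)) * dlogp(pi, x) ** 2 for x in (0.0, 1.0))

pi = 0.3
print(bernoulli_fim(pi))        # ~ 4.7619, matching the closed form below
print(1.0 / (pi * (1.0 - pi)))  # closed form 1 / (pi * (1 - pi))
```

Since $1 / (\pi (1 - \pi)) > 0$ for every $\pi \in (0, 1)$, the FIM is non-singular everywhere in this (open) parameter space.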
Gaussian Examples
Consider a bivariate Gaussian family with zero mean over a 2-dimensional random variable. There are many parametrizations.

- Ill-defined parametrization: the quantity used to define the density must itself be well-defined, which this parametrization does not guarantee. This parametrization leads to an ill-defined/incorrect FIM.
- Ill-defined parametrization: the density can be made arbitrarily large if the parameter matrix is not symmetric positive-definite; in other words, the integral of this probability distribution under this parametrization is not finite. This parametrization leads to an ill-defined/incorrect FIM.
- Well-defined parametrization with a non-intrinsic 2-by-2 asymmetric parameter matrix: here, we have to explicitly enforce the symmetry constraint so that the distribution is well-defined. This parametrization leads to a singular FIM w.r.t. the standard vectorization map applied to the parameter matrix.
- Well-defined parametrization with an intrinsic 3-dimensional parameter vector: given a symmetric 2-by-2 matrix, we define another vectorization map, which returns a 3-dimensional array obtained by vectorizing only the lower-triangular part of the matrix. This map is known as the half-vectorization map (see the small illustration below). Equivalently, the parameter is a symmetric parameter matrix. Note that the half-vectorization implicitly enforces the symmetry constraint, and we should compute derivatives w.r.t. the half-vectorized parameter rather than the full matrix under this parametrization. This parametrization leads to a non-singular FIM w.r.t. the half-vectorized parameter.
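As a small illustration of the two vectorization maps (in place of the expandable figure in the original post; the entry names are generic), for a symmetric $2 \times 2$ matrix

$$ W = \begin{bmatrix} w_{11} & w_{12} \\ w_{12} & w_{22} \end{bmatrix}, \qquad \mathrm{vec}(W) = \begin{bmatrix} w_{11} \\ w_{12} \\ w_{12} \\ w_{22} \end{bmatrix}, \qquad \mathrm{vech}(W) = \begin{bmatrix} w_{11} \\ w_{12} \\ w_{22} \end{bmatrix}. $$

The full vectorization keeps all four entries (so the off-diagonal entry appears twice), while the half-vectorization keeps only the three entries of the lower-triangular part; this duplication is the source of the singular FIM in the asymmetric-matrix parametrization.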
The following examples show that the symmetry constraint should be respected; it is essential when it comes to the FIM computation.
Please see this Python (JAX) code to compute FIMs in the following examples.
For simplicity, consider a fixed parameter value at which we evaluate the FIM.
Gaussian Example 1 (without the symmetry constraint):
Note that the bivariate Gaussian distribution is not well-defined in this case, since the parameter matrix is in general not symmetric. The resulting matrix is not even positive semi-definite. Recall that a proper FIM is at least positive semi-definite by definition.
Gaussian Example 2 (with the symmetry constraint):
Note that the bivariate Gaussian distribution is well-defined in this case, since the symmetry constraint is enforced. The resulting FIM is positive semi-definite.
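The following is a minimal sketch (not the author's linked code, and not a replication of Examples 1 and 2 above); it illustrates the last two parametrizations in the list above for the zero-mean bivariate Gaussian: all four entries of a (symmetrized) matrix versus the 3-dimensional half-vectorization. The covariance value, sample size, and names are illustrative assumptions.

```python
# A minimal sketch: Monte-Carlo estimate of the FIM of a zero-mean bivariate
# Gaussian under two parameterizations (values and names are illustrative).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
Sigma = jnp.array([[2.0, 0.5], [0.5, 1.0]])  # a fixed symmetric positive-definite point
samples = jax.random.multivariate_normal(key, jnp.zeros(2), Sigma, shape=(100_000,))

def log_density(cov, x):
    # log N(x; 0, cov) for a 2x2 covariance matrix
    quad = x @ jnp.linalg.solve(cov, x)
    _, logdet = jnp.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + 2.0 * jnp.log(2.0 * jnp.pi))

def fim(unpack, theta):
    # F = E[ grad(log p) grad(log p)^T ], approximated with the samples above
    grad_fn = jax.grad(lambda t, x: log_density(unpack(t), x))
    grads = jax.vmap(grad_fn, in_axes=(None, 0))(theta, samples)
    return grads.T @ grads / samples.shape[0]

# Non-intrinsic: all 4 entries of a (symmetrized) 2x2 matrix are free parameters.
unpack_vec = lambda t: 0.5 * (t.reshape(2, 2) + t.reshape(2, 2).T)
F_vec = fim(unpack_vec, Sigma.reshape(-1))             # 4x4 matrix

# Intrinsic: 3-dim half-vectorization (lower-triangular entries) of a symmetric matrix.
unpack_vech = lambda t: jnp.array([[t[0], t[1]], [t[1], t[2]]])
F_vech = fim(unpack_vech, jnp.array([2.0, 0.5, 1.0]))  # 3x3 matrix

print(jnp.linalg.eigvalsh(F_vec))   # smallest eigenvalue ~ 0: singular FIM
print(jnp.linalg.eigvalsh(F_vech))  # all eigenvalues positive: non-singular FIM
```

The singularity in the first case comes from the duplicated off-diagonal entry: the gradient w.r.t. the two off-diagonal parameters is always identical, so the estimated FIM has rank at most 3.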
Dimensionality of a manifold
We can define the dimension of a manifold by using the degrees of freedom of an intrinsic parametrization. Due to the theorem of topological invariance of dimension, any intrinsic parametrization of a manifold has the same degrees of freedom [12]. This also gives us a tool to identify non-manifold cases. We now illustrate this by examples.
| unit circle | open unit ball | closed unit ball |
|---|---|---|
| 1-dim manifold | 2-dim manifold | non-manifold; it is instead a manifold with (closed) boundary |
As shown in the previous section, a unit circle is a 1-dimensional manifold. We can similarly show that an open unit ball is a 2-dimensional manifold.
However, a closed unit ball is NOT a manifold since its interior is an open unit ball and its boundary is a unit circle. The circle and the open unit ball do not have the same dimensionality.
For statistical manifolds, consider the following examples. We will discuss more about them in Part II.
| 1-dim Gaussian with zero mean | 1-dim Gaussian (with mean and variance) |
|---|---|
| under an intrinsic parameterization | under an intrinsic parameterization |
| 1-dim statistical manifold | 2-dim statistical manifold |
References
[1] S.-I. Amari, "Natural gradient works efficiently in learning," Neural computation 10:251–276 (1998).
[2] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Tractable structured natural gradient descent using local parameterizations," International Conference on Machine Learning (ICML) (2021).
[3] T. Liang, T. Poggio, A. Rakhlin, & J. Stokes, "Fisher-Rao metric, geometry, and complexity of neural networks," The 22nd International Conference on Artificial Intelligence and Statistics (PMLR, 2019), pp. 888–896.
[4] J. Martens, "New Insights and Perspectives on the Natural Gradient Method," Journal of Machine Learning Research 21:1–76 (2020).
[5] M. Khan & W. Lin, "Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models," Artificial Intelligence and Statistics (PMLR, 2017), pp. 878–887.
[6] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Structured second-order methods via natural gradient descent," arXiv preprint arXiv:2107.10884 (2021).
[7] K. Osawa, S. Swaroop, A. Jain, R. Eschenhagen, R. E. Turner, R. Yokota, & M. E. Khan, "Practical deep learning with Bayesian principles," Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019), pp. 4287–4299.
[8] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, & J. Schmidhuber, "Natural evolution strategies," The Journal of Machine Learning Research 15:949–980 (2014).
[9] S. M. Kakade, "A natural policy gradient," Advances in neural information processing systems 14 (2001).
[10] N. Le Roux, P.-A. Manzagol, & Y. Bengio, "Topmoumoute online natural gradient algorithm," NIPS (Citeseer, 2007), pp. 849–856.
[11] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, & A. Schuler, "Ngboost: Natural gradient boosting for probabilistic prediction," International Conference on Machine Learning (PMLR, 2020), pp. 2690–2700.
[12] L. W. Tu, "An introduction to manifolds," 2nd ed. (Springer, 2011).
[13] J. M. Lee, Introduction to Riemannian manifolds (Springer, 2018).
Footnotes:
1. In differential geometry, an intrinsic parametrization is known as a coordinate chart.