Part I: Smooth Manifolds with the Fisher-Rao Metric


Goal (edited: 01-Jan-23)

This blog post focuses on the Fisher-Rao metric, which gives rise to the Fisher information matrix (FIM). We will introduce the following useful concepts to ensure non-singular FIMs:

  • Regularity conditions and intrinsic parameterizations of a distribution
  • Dimensionality of a smooth manifold

The discussion here is informal and focuses more on intuition than on rigor.

How to cite this blog post:
@misc{lin2021NGDblog01,
  title = {Introduction to Natural-gradient Descent: Part I},
  author = {Lin, Wu and Nielsen, Frank and Khan, Mohammad Emtiyaz and Schmidt, Mark},
  url = {https://informationgeometryml.github.io/year-archive/}, 
  howpublished = {\url{https://informationgeometryml.github.io/posts/2021/09/Geomopt01/}},
  year = {2021},
  note = {Accessed: 2021-09-06}
}

Motivation


The goal of this blog is to introduce geometric structures associated with probability distributions. Why should we care about such geometric structures? By exploiting these structures, we can

  • design efficient and simple algorithms [1]
  • design robust methods that are less sensitive to re-parametrization [2]
  • understand the behavior of models/algorithms using tools from differential geometry, information geometry, and invariant theory [3]

These benefits are relevant for a wide range of machine learning methods, which make use of probability distributions of various kinds.

Below, we give some common examples from the literature. A reader familiar with such examples can skip this part.

Empirical Risk Minimization (frequentist estimation):

Given $n$ input-output pairs $(x_i, y_i)$, the least-squares loss can be viewed as a finite-sample approximation of an expectation w.r.t. a probability distribution (the data-generating distribution),
$$\min_{\tau} \; \frac{1}{2n}\sum_{i=1}^{n} (y_i - x_i^T\tau)^2 = -\frac{1}{n}\sum_{i=1}^{n} \log \mathcal{N}(y_i \mid x_i^T\tau, 1) + \mathrm{constant} \;\approx\; -E_{p(x,y|\tau)}\big[\log p(x,y|\tau)\big], \tag{1}$$
where $p(x,y|\tau) = \mathcal{N}(y \mid x^T\tau, 1)\,p(x)$ is assumed to be the data-generating distribution. Here, $\mathcal{N}(y|m,v)$ denotes a normal distribution over $y$ with mean $m$ and variance $v$.
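As a quick sanity check of Eq. (1), here is a minimal JAX sketch (with a hypothetical toy dataset and helper names chosen only for illustration) showing that the least-squares loss and the average negative Gaussian log-likelihood differ by a constant, so their gradients w.r.t. $\tau$ coincide.

```python
import jax
import jax.numpy as jnp

def least_squares(tau, X, y):
    # (1/(2n)) * sum_i (y_i - x_i^T tau)^2
    return 0.5 * jnp.mean((y - X @ tau) ** 2)

def avg_neg_log_lik(tau, X, y):
    # -(1/n) * sum_i log N(y_i | x_i^T tau, 1)
    resid = y - X @ tau
    return jnp.mean(0.5 * resid ** 2 + 0.5 * jnp.log(2.0 * jnp.pi))

# hypothetical toy data
X = jax.random.normal(jax.random.PRNGKey(0), (5, 3))
y = jax.random.normal(jax.random.PRNGKey(1), (5,))
tau = jnp.zeros(3)

# The two objectives differ by the constant 0.5*log(2*pi) ...
print(least_squares(tau, X, y), avg_neg_log_lik(tau, X, y))
# ... so their gradients w.r.t. tau agree.
print(jax.grad(least_squares)(tau, X, y))
print(jax.grad(avg_neg_log_lik)(tau, X, y))
```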

Well-known algorithms such as Fisher scoring and (empirical) natural-gradient descent [4] exploit the geometric structure of $p(x,y|\tau)$. These are examples of algorithms derived from a frequentist perspective, and they can also be generalized to neural networks [4].

Variational Inference (Bayesian estimation):

Given a prior $p(z)$ and a likelihood $p(\mathcal{D}|z)$ over a latent vector $z$ and known data $\mathcal{D}$, we can approximate the exact posterior $p(z|\mathcal{D}) = \frac{p(z,\mathcal{D})}{p(\mathcal{D})}$ by optimizing a variational objective with respect to an approximating distribution $q(z|\tau)$:
$$\min_{\tau} \; \mathrm{KL}\big[q(z|\tau)\,\|\,p(z|\mathcal{D})\big] = E_{q(z|\tau)}\big[\log q(z|\tau) - \log p(z,\mathcal{D})\big] + \mathrm{constant}, \tag{2}$$
where $\mathrm{KL}[q(z)\,\|\,p(z)] := E_{q(z)}\big[\log\big(\tfrac{q(z)}{p(z)}\big)\big]$ is the Kullback–Leibler divergence.

Natural-gradient variational inference [5] is an algorithm that speeds up inference by exploiting the geometry of $q(z|\tau)$ induced by the Fisher-Rao metric. This approach is derived from a Bayesian perspective and can also be generalized to neural networks [6] and Bayesian neural networks [7].

Evolution Strategies and Policy-Gradient Methods (Global optimization):

Global optimization methods often use a search distribution, denoted by $\pi(a|\tau)$, to find the global maximum of an objective $h(a)$ by solving a problem of the following form:
$$\min_{\tau} \; E_{\pi(a|\tau)}\big[-h(a)\big]. \tag{3}$$
Samples from the search distribution are evaluated through a "fitness" function $h(a)$ and guide the optimization towards better optima.

Natural evolution strategies [8] speed up the search process by exploiting the geometry of $\pi(a|\tau)$. In the context of reinforcement learning, $\pi(a|\tau)$ is known as the policy distribution used to generate actions, and the natural evolution strategy is known as the natural policy gradient method [9].

In all of the examples above, the objective function is expressed in terms of an expectation w.r.t. a distribution parameterized by $\tau$. The geometric structure of a distribution $p(w|\tau)$ over the quantity $w$ can be exploited to improve the learning algorithms. The table below summarizes the three examples. More applications of a similar nature are discussed in [10] and [11].

| Example | Meaning of $w$ | Distribution $p(w \mid \tau)$ |
| --- | --- | --- |
| Empirical Risk Minimization | observation $(x,y)$ | $p(x,y \mid \tau)$ |
| Variational Inference | latent variable $z$ | $q(z \mid \tau)$ |
| Evolution Strategies | decision variable $a$ | $\pi(a \mid \tau)$ |

Note:

In general, we may have to compute or estimate the inverse of the FIM. However, in many useful machine learning applications, algorithms such as [2] [4] [5] [7] [8] [9] [10] can be efficiently implemented without explicitly computing the inverse of the FIM.

We discuss this in other posts. See
Part V and our ICML work.

In the rest of the post, we will mainly focus on the geometric structure of (finite-dimensional) parametric families, for example, a univariate Gaussian family. The following figure illustrates four distributions in the Gaussian family denoted by $\{\mathcal{N}(w|\mu,\sigma) \mid \mu\in\mathbb{R},\ \sigma>0\}$, where $\mathcal{N}(w|\mu,\sigma) := \frac{1}{\sqrt{2\pi\sigma}}\exp\big[-\frac{(w-\mu)^2}{2\sigma}\big]$ and parameter $\tau := (\mu,\sigma)$. We will later see that this family is a 2-dimensional manifold in the parameter space.

[Figure: four distributions in the univariate Gaussian family for different values of $(\mu,\sigma)$]

Intrinsic Parameterizations


We start by discussing a special type of parameterization, which we call an intrinsic parameterization, that is useful for obtaining non-singular FIMs. An arbitrary parameterization may not always be appropriate for a smooth manifold [12]. Rather, the parameterization should be such that the manifold locally looks like a flat vector space, just as the curved surface of the Earth looks flat to us locally. We will refer to such a flat vector space as a local vector-space structure (denoted by $E$).

Local vector-space structure:

It supports local vector additions, local real scalar products, and their algebraic laws (e.g., the distributive law). (See Part II for details.)

Intrinsic parameterizations1 are those that satisfy the following two conditions:

  • We require that the parameter space of τ, denoted by $\Omega_\tau$, be an open set in $\mathbb{R}^K$, where $K$ is the number of entries of the parameter array. Intuitively, this ensures a local vector-space structure throughout the parameter space: a small, local perturbation $E$ at each point stays within $\Omega_\tau$.
  • We also require that $E$ uniquely and smoothly represents points in the manifold. This condition ensures that arbitrary (smooth) parameter transformations still represent the same subset of points. In other words, we require that
    • there exists a bijective map between two such parameterizations whenever they represent a common subset of points in the manifold, and
    • this map and its inverse map are both smooth.

    In differential geometry, such a map is known as a diffeomorphism, which is the formal (but more abstract) version of this requirement.

Intrinsic parameterizations satisfy the above two conditions, and lead to non-singular FIMs, as we will see soon.

We will now discuss a simple example of a manifold, the unit circle in $\mathbb{R}^2$, and give an example of an intrinsic parameterization and three non-intrinsic ones, which fail for different reasons: non-smoothness, non-openness, and non-uniqueness.

Parameterization 1 (an intrinsic parameterization):

A (local) parametrization of the circle at $(0,1)$ is $\{(t, \sqrt{1-t^2}) \mid -h < t < h\}$, where $h = 0.1$. We use one (scalar) parameter in this parametrization. The manifold is (locally) "flat" since we can always find a small 1-dimensional perturbation $E$ in the 1-dimensional parameter space $\Omega_t = \{t \mid -h < t < h\}$. Therefore, this is an intrinsic parameterization.

We can similarly define a (local) parametrization at each point of the circle. In fact, we can cover the whole circle with finitely many (local) parameterizations.

Now, we discuss invalid cases, where not all conditions are satisfied.

Parameterization 2 (a non-intrinsic parameterization due to non-smoothness):

Let’s define a map $f: [0, 2\pi) \to S^1$ such that $f(\theta) = (\sin\theta, \cos\theta)$, where we use $S^1$ to denote the circle.

A (global) parametrization of the circle is $\{f(\theta) \mid \theta \in [0, 2\pi)\}$, where we use one (scalar) parameter.

This map $f$ is bijective and smooth. However, the parameter space $[0, 2\pi)$ is not open in $\mathbb{R}$, and the inverse map $f^{-1}$ is not continuous at the point $(0,1) \in S^1$. Therefore, this parametrization is not intrinsic. In fact, there does not exist a (single) global intrinsic parametrization that represents the whole circle.

Smoothness of the inverse map is essential when it comes to reparametrization (a.k.a. parameter transformation). Together with bijectivity, it gives us a way to generate new intrinsic parameterizations: the Jacobian matrix of the change of parameterization is non-singular everywhere, so we can use the chain rule and the inverse function theorem to move between different intrinsic parameterizations. We will discuss this in Part III.
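As a small illustration of such a parameter transformation (our own toy example, using the Gaussian family introduced earlier), the sketch below treats $\lambda = (\mu, \log v)$ as a reparameterization of $\tau = (\mu, v)$ and checks with `jax.jacobian` that the Jacobian of the change of parameterization is non-singular.

```python
import jax
import jax.numpy as jnp

# A smooth, invertible change of parameterization for the Gaussian family:
# tau = (mu, v)  <-->  lam = (mu, log v)
def tau_to_lam(tau):
    mu, v = tau
    return jnp.array([mu, jnp.log(v)])

def lam_to_tau(lam):
    mu, log_v = lam
    return jnp.array([mu, jnp.exp(log_v)])

tau0 = jnp.array([0.3, 2.0])
J = jax.jacobian(tau_to_lam)(tau0)        # Jacobian of the transformation at tau0
print(J)                                  # [[1, 0], [0, 1/v]]
print(jnp.linalg.det(J))                  # 0.5 != 0: non-singular
print(lam_to_tau(tau_to_lam(tau0)))       # the round trip recovers tau0
```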

Parametrization 3 (a non-intrinsic parameterization due to non-openness):

The circle does not look like a flat space under the following parametrization: $\{(x,y) \mid x^2+y^2 = 1,\ x,y\in\mathbb{R}\}$. The number of entries in this parameter array is 2.

The reason is that we cannot find a small 2-dimensional perturbation $E$ in the 2-dimensional parameter space $\Omega_\tau = \{(x,y) \mid x^2+y^2 = 1\}$ due to the constraint $x^2+y^2=1$. In other words, $\Omega_\tau$ is not open in $\mathbb{R}^2$.

Parametrization 4 (a non-intrinsic parameterization due to non-uniqueness):

Let’s consider the following non-intrinsic parametrization $\tau$ of the circle: $\big\{\big(\tfrac{x}{\sqrt{x^2+y^2}},\ \tfrac{y}{\sqrt{x^2+y^2}}\big) \,\big|\, x^2+y^2 \neq 0,\ x,y\in\mathbb{R}\big\}$, where $\tau = (x,y)$. The parameter space $\Omega_\tau$ is open in $\mathbb{R}^2$.

This parametrization is not intrinsic since it does not uniquely represent a point on the circle: $\tau_1 = (x_1, y_1)$ and $\alpha\tau_1 = (\alpha x_1, \alpha y_1)$ represent the same point on the circle for any scalar $\alpha > 0$.

Intrinsic Parameterizations for Parametric Families


The examples in the previous section clearly show the importance of parameterization, and that it should be chosen carefully. Now, we discuss how to choose such a parameterization for a given parametric family.

Roughly speaking, a parameterization $\tau$ for a family of distributions $p(w|\tau)$ is intrinsic if $\log p(w|\tau)$ is both smooth and unique w.r.t. $\tau$ in its parameter space $\Omega_\tau$. Below is the formal condition.

Regularity Condition:

For any $\tau \in \Omega_\tau$, the set of partial derivatives $\{\partial_{\tau_i} \log p(w|\tau)\}$ is smooth w.r.t. $\tau$ and is a set of linearly independent functions of $w$.

In other words, $\sum_i c_i\,\big[\partial_{\tau_i} \log p(w|\tau)\big] = 0$ holds only when every constant $c_i$ is zero, where each $c_i$ does not depend on $w$.

Note that, due to the definition of the partial derivatives, this regularity condition implicitly assumes that the parameter space $\Omega_\tau$ is an open set in $\mathbb{R}^K$, where $K$ is the number of entries in the parameter array $\tau$. In other words, the openness requirement is not explicit but hidden within the regularity condition. We will discuss this more later in this post.

The following examples illustrate the regularity condition.

Example 1 (regularity condition for an intrinsic parameterization):

We will write the regularity condition at a point for an intrinsic parameterization. Consider a 1-dimensional Gaussian family $\{\mathcal{N}(w|\mu,v) \mid \mu\in\mathbb{R},\ v>0\}$ with mean $\mu$, variance $v$, and parametrization $\tau = (\mu, v)$. The partial derivatives are
$$\partial_\mu \log \mathcal{N}(w|\mu,v) = \frac{w-\mu}{v}, \qquad \partial_v \log \mathcal{N}(w|\mu,v) = \frac{(w-\mu)^2}{2v^2} - \frac{1}{2v}.$$
It is easy to see that these partial derivatives are smooth w.r.t. $\tau = (\mu,v)$ in the parameter space $\Omega_\tau = \{(\mu,v) \mid \mu\in\mathbb{R},\ v>0\}$.

Consider the partial derivatives at the point $(\mu=0, v=1)$:
$$\partial_\mu \log \mathcal{N}(w|\mu,v)\Big|_{\mu=0,v=1} = w, \qquad \partial_v \log \mathcal{N}(w|\mu,v)\Big|_{\mu=0,v=1} = \frac{w^2-1}{2}.$$
For this point, the regularity condition reads $c_1 w + c_2 \frac{w^2-1}{2} = 0$. For this to hold for all $w$, it is necessary that $c_1 = c_2 = 0$, which implies linear independence.

A formal proof can be built to show that this holds for any μR and v>0.
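Here is a minimal numerical illustration of this check (a sketch; the helper `log_normal` and the chosen test points are ours): it computes the two partial derivatives by automatic differentiation at $(\mu=0, v=1)$, compares them with the closed forms $w$ and $\frac{w^2-1}{2}$, and stacks the score vectors at a few values of $w$ as a crude hint of linear independence.

```python
import jax
import jax.numpy as jnp

def log_normal(tau, w):
    # log N(w | mu, v) with tau = (mu, v), v being the variance
    mu, v = tau
    return -0.5 * ((w - mu) ** 2 / v + jnp.log(2.0 * jnp.pi * v))

score = jax.grad(log_normal)              # partial derivatives w.r.t. tau = (mu, v)
tau0 = jnp.array([0.0, 1.0])              # the point (mu = 0, v = 1)

ws = jnp.array([-1.5, 0.3, 2.0])          # a few test values of w
S = jnp.stack([score(tau0, w) for w in ws])
print(S)                                           # rows are [w, (w^2 - 1)/2]
print(jnp.stack([ws, (ws ** 2 - 1) / 2], axis=1))  # closed form, for comparison
print(jnp.linalg.matrix_rank(S))          # rank 2 hints at linear independence
```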

Example 2 (regularity condition for a non-intrinsic parameterization):

By using a counterexample, we will show that the regularity condition fails for a non-intrinsic parameterization. Consider a Bernoulli family $\big\{\mathbb{I}(w=0)\frac{\pi_0}{\pi_0+\pi_1} + \mathbb{I}(w=1)\frac{\pi_1}{\pi_0+\pi_1} \,\big|\, \pi_0>0,\ \pi_1>0\big\}$ with parameter $\tau = (\pi_0, \pi_1)$, where $\mathbb{I}(\cdot)$ is the indicator function. Denoting a member of this family by $\mathcal{B}(w|\pi_0,\pi_1)$, the partial derivatives are
$$\partial_{\pi_0} \log \mathcal{B}(w|\pi_0,\pi_1) = \frac{\mathbb{I}(w=0)-\mathbb{I}(w=1)}{\mathcal{B}(w|\pi_0,\pi_1)}\,\frac{\pi_1}{(\pi_0+\pi_1)^2}, \qquad \partial_{\pi_1} \log \mathcal{B}(w|\pi_0,\pi_1) = -\frac{\mathbb{I}(w=0)-\mathbb{I}(w=1)}{\mathcal{B}(w|\pi_0,\pi_1)}\,\frac{\pi_0}{(\pi_0+\pi_1)^2}.$$
Note that when $c_0 = \pi_0 \neq 0$ and $c_1 = \pi_1 \neq 0$, we have $c_0\,\frac{\pi_1}{(\pi_0+\pi_1)^2} - c_1\,\frac{\pi_0}{(\pi_0+\pi_1)^2} = 0$. Therefore, the partial derivatives are linearly dependent.

We will also see shortly that the regularity condition is not satisfied for the following parameterization either: $\{\mathbb{I}(w=0)\pi_0 + \mathbb{I}(w=1)\pi_1 \mid \pi_0>0,\ \pi_1>0,\ \pi_0+\pi_1=1\}$ with parameter $\tau = (\pi_0, \pi_1)$. The main reason is that the parameter space is not open in $\mathbb{R}^2$.

On the other hand, the condition holds for the following parameterization: $\{\mathbb{I}(w=0)\pi_0 + \mathbb{I}(w=1)(1-\pi_0) \mid 0<\pi_0<1\}$ with parameter $\tau = \pi_0$.

Fisher-Rao Metric


Given an intrinsic parameterization, the Fisher-Rao metric is defined as follows:
$$F_{ij}(\tau) := E_{p(w|\tau)}\Big[\big(\partial_{\tau_i} \log p(w|\tau)\big)\big(\partial_{\tau_j} \log p(w|\tau)\big)\Big].$$

We can also express the metric in matrix form as
$$F(\tau) := E_{p(w|\tau)}\Big[\big(\nabla_\tau \log p(w|\tau)\big)\big(\nabla_\tau \log p(w|\tau)\big)^T\Big],$$
where $K$ is the number of entries of the parameter array $\tau$ and $\nabla_\tau \log p(w|\tau) := \big[\partial_{\tau_1} \log p(w|\tau), \ldots, \partial_{\tau_K} \log p(w|\tau)\big]^T$ is a column vector. The matrix form is also known as the Fisher information matrix (FIM). Obviously, the form of the FIM depends on the choice of parameterization. In many cases, we can also compute the FIM as $F(\tau) = -E_{p(w|\tau)}\big[\nabla_\tau^2 \log p(w|\tau)\big]$. The regularity condition guarantees that the FIM is non-singular whenever the matrix exists, that is, whenever the expectation in the definition exists.
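As an illustration (a minimal sketch, with helper names chosen for the univariate Gaussian example used throughout this post), the snippet below estimates the FIM under $\tau = (\mu, v)$ by Monte Carlo using the outer-product form and compares it with the well-known closed form $\mathrm{diag}\big(\tfrac{1}{v}, \tfrac{1}{2v^2}\big)$.

```python
import jax
import jax.numpy as jnp

def log_normal(tau, w):
    mu, v = tau
    return -0.5 * ((w - mu) ** 2 / v + jnp.log(2.0 * jnp.pi * v))

score = jax.grad(log_normal)

tau0 = jnp.array([0.5, 2.0])              # mu = 0.5, v = 2.0
key = jax.random.PRNGKey(0)
w_samples = tau0[0] + jnp.sqrt(tau0[1]) * jax.random.normal(key, (100_000,))

# Monte Carlo estimate of F(tau) = E[(grad log p)(grad log p)^T]
scores = jax.vmap(lambda w: score(tau0, w))(w_samples)
F_mc = scores.T @ scores / w_samples.shape[0]
print(F_mc)                               # approximately [[1/v, 0], [0, 1/(2 v^2)]]
print(jnp.diag(jnp.array([1.0 / tau0[1], 0.5 / tau0[1] ** 2])))
```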

In what follows, we will assume the metric to be well-defined, which makes the Fisher-Rao metric a valid Riemannian metric [13] since the corresponding FIM is positive definite everywhere in its intrinsic parameter space.

Caveats of the Fisher matrix computation


There are some caveats when it comes to the Fisher matrix computation. In particular, the regularity condition should be satisfied. It is possible to define the FIM under a non-intrinsic parameterization; however, the FIM is often singular or ill-defined under such a parameterization, as shown below.

Bernoulli Examples

Example 1 (Ill-defined FIM):

Consider the Bernoulli family $\{\mathbb{I}(w=0)\pi_0 + \mathbb{I}(w=1)\pi_1 \mid \pi_0>0,\ \pi_1>0,\ \pi_0+\pi_1=1\}$ with non-intrinsic parameter $\tau = (\pi_0, \pi_1)$. The following computation is not correct. Do you make similar mistakes?

Let $p(w|\tau) = \mathbb{I}(w=0)\pi_0 + \mathbb{I}(w=1)\pi_1$, where $\tau = (\pi_0, \pi_1)$. The derivative is
$$\nabla_\tau \log p(w|\tau) = \frac{1}{p(w|\tau)}\big[\mathbb{I}(w=0),\ \mathbb{I}(w=1)\big]^T. \tag{4}$$
Thus, by Eq. (4), the FIM under this parameterization is
$$F(\tau) = E_{p(w|\tau)}\left[\frac{1}{p^2(w|\tau)}\begin{bmatrix}\mathbb{I}^2(w=0) & \mathbb{I}(w=1)\mathbb{I}(w=0)\\ \mathbb{I}(w=0)\mathbb{I}(w=1) & \mathbb{I}^2(w=1)\end{bmatrix}\right] = \begin{bmatrix}\frac{1}{\pi_0} & 0\\ 0 & \frac{1}{\pi_1}\end{bmatrix}.$$

This computation is not correct. Do you know why?

Reason:

The key reason is that the parameter space is not open in $\mathbb{R}^2$ due to the equality constraint $\pi_0+\pi_1=1$. Thus, Eq. (4) is incorrect.

By definition, a Bernoulli distribution is valid only when the constraint holds. Thus, the constraint π0+π1=1 must be satisfied when we compute the Fisher matrix since the computation involves computing the expectation w.r.t. this distribution.

Note that the gradient is defined as $\nabla_\tau \log p(w|\tau) := \big[\partial_{\pi_0} \log p(w|\tau),\ \partial_{\pi_1} \log p(w|\tau)\big]^T$.

Unfortunately, these partial derivatives do not exist. By the definition of the partial derivative $\partial_{\pi_0} \log p(w|\tau)$, we fix $\pi_1$ and allow $\pi_0$ to change. However, once $\pi_1$ is fixed, $\pi_0$ is fully determined by the equality constraint $\pi_0+\pi_1=1$ and cannot change. Therefore, $\partial_{\pi_0} \log p(w|\tau)$ is not well-defined. In other words, the above Fisher matrix computation is not correct since $\nabla_\tau \log p(w|\tau)$ does not exist.

Example 2 (Singular FIM):

Consider the Bernoulli family $\big\{\mathbb{I}(w=0)\frac{\pi_0}{\pi_0+\pi_1} + \mathbb{I}(w=1)\frac{\pi_1}{\pi_0+\pi_1} \,\big|\, \pi_0>0,\ \pi_1>0\big\}$ with non-intrinsic parameter $\tau = (\pi_0, \pi_1)$.

Note that a Bernoulli distribution in this family is not uniquely represented by this parametrization: $\tau_1 = (1,1)$ and $\tau_2 = (2,2)$ represent the same Bernoulli distribution.

The FIM under this parameterization is singular as shown below.

Let $p(w|\tau) = \mathbb{I}(w=0)\frac{\pi_0}{\pi_0+\pi_1} + \mathbb{I}(w=1)\frac{\pi_1}{\pi_0+\pi_1}$, where $\tau = (\pi_0, \pi_1)$. The derivative is
$$\nabla_\tau \log p(w|\tau) = \frac{\mathbb{I}(w=0)-\mathbb{I}(w=1)}{p(w|\tau)}\left[\frac{\pi_1}{(\pi_0+\pi_1)^2},\ -\frac{\pi_0}{(\pi_0+\pi_1)^2}\right]^T.$$
Thus, the FIM under this parameterization is
$$F(\tau) = E_{p(w|\tau)}\left[\frac{\big(\mathbb{I}(w=0)-\mathbb{I}(w=1)\big)^2}{p^2(w|\tau)}\begin{bmatrix}\frac{\pi_1^2}{(\pi_0+\pi_1)^4} & -\frac{\pi_0\pi_1}{(\pi_0+\pi_1)^4}\\ -\frac{\pi_0\pi_1}{(\pi_0+\pi_1)^4} & \frac{\pi_0^2}{(\pi_0+\pi_1)^4}\end{bmatrix}\right] = \frac{1}{(\pi_0+\pi_1)^2}\begin{bmatrix}\frac{\pi_1}{\pi_0} & -1\\ -1 & \frac{\pi_0}{\pi_1}\end{bmatrix},$$
which is singular since its determinant is zero:
$$\det\begin{bmatrix}\frac{\pi_1}{\pi_0} & -1\\ -1 & \frac{\pi_0}{\pi_1}\end{bmatrix} = 0.$$
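The singularity can also be verified numerically. Below is a minimal sketch (helper names are ours) that forms the exact FIM by summing the score outer products over $w \in \{0, 1\}$, weighted by $p(w|\tau)$, and checks that its determinant is (numerically) zero.

```python
import jax
import jax.numpy as jnp

def log_p(tau, w):
    # over-parameterized Bernoulli: p(w=0) = pi0/(pi0+pi1), p(w=1) = pi1/(pi0+pi1)
    pi0, pi1 = tau
    p0 = pi0 / (pi0 + pi1)
    return jnp.log(jnp.where(w == 0, p0, 1.0 - p0))

score = jax.grad(log_p)                   # gradient w.r.t. tau = (pi0, pi1)
tau0 = jnp.array([2.0, 3.0])

# Exact FIM: sum the score outer products over the two outcomes, weighted by p(w|tau)
F = sum(jnp.exp(log_p(tau0, w)) * jnp.outer(score(tau0, w), score(tau0, w))
        for w in (0, 1))
print(F)
print(jnp.linalg.det(F))                  # ~0: the FIM is singular under this parameterization
```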

Now, we give an example to show that the FIM of a Bernoulli family can be non-singular when we use an intrinsic parameterization.

Example 3 (Non-singular FIM):

Consider the Bernoulli family $\{\mathbb{I}(w=0)\pi + \mathbb{I}(w=1)(1-\pi) \mid 0<\pi<1\}$ with intrinsic parameter $\tau = \pi$.

The FIM under this parameterization is non-singular as shown below.

Let $p(w|\tau) = \mathbb{I}(w=0)\pi + \mathbb{I}(w=1)(1-\pi)$, where $\tau = \pi$. The derivative is
$$\nabla_\tau \log p(w|\tau) = \frac{\mathbb{I}(w=0)-\mathbb{I}(w=1)}{\mathbb{I}(w=0)\pi + \mathbb{I}(w=1)(1-\pi)}.$$

Thus, the FIM under this parameterization is
$$F(\tau) = E_{p(w|\tau)}\left[\frac{\big(\mathbb{I}(w=0)-\mathbb{I}(w=1)\big)^2}{\big(\mathbb{I}(w=0)\pi + \mathbb{I}(w=1)(1-\pi)\big)^2}\right] = \pi\cdot\frac{1^2}{\pi^2} + (1-\pi)\cdot\frac{(-1)^2}{(1-\pi)^2} = \frac{1}{\pi} + \frac{1}{1-\pi} = \frac{1}{\pi(1-\pi)} > 0.$$
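A quick numerical check of this calculation (a minimal sketch with illustrative variable names):

```python
import jax
import jax.numpy as jnp

def log_p(pi, w):
    # intrinsic parameterization: p(w=0) = pi, p(w=1) = 1 - pi
    return jnp.log(jnp.where(w == 0, pi, 1.0 - pi))

score = jax.grad(log_p)                   # derivative w.r.t. the scalar pi
pi0 = 0.3

# Exact FIM: E[(d log p / d pi)^2], summing over w in {0, 1}
F = sum(jnp.exp(log_p(pi0, w)) * score(pi0, w) ** 2 for w in (0, 1))
print(F, 1.0 / (pi0 * (1.0 - pi0)))       # both are 1/(pi(1-pi)) ≈ 4.76
```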

Gaussian Examples

Consider a bivariate Gaussian family with zero mean over random variable $w\in\mathbb{R}^2$. There are many parametrizations.

  • Ill-defined parametrization: $\big\{\exp\big(-\frac{1}{2}\big[w^T\Sigma^{-1}w + \log\det(\Sigma) + 2\log(2\pi)\big]\big) \,\big|\, \Sigma\in\mathbb{R}^{2\times 2}\big\}$, since $\log\det(\Sigma)$ must be well-defined (it is not when $\det(\Sigma)\le 0$). This parametrization leads to an ill-defined/incorrect FIM.

  • Ill-defined parametrization: $\big\{\exp\big(-\frac{1}{2}\big[w^T\Sigma^{-1}w + \log\det(\Sigma) + 2\log(2\pi)\big]\big) \,\big|\, \det(\Sigma)>0,\ \Sigma\in\mathbb{R}^{2\times 2}\big\}$, since $-w^T\Sigma^{-1}w$ can grow arbitrarily large as $\|w\|_2 \to \infty$ if $\Sigma^{-1}$ is not symmetric positive-definite. In other words, the integral of this probability distribution under this parametrization is not finite. This parametrization leads to an ill-defined/incorrect FIM.

  • Well-defined parametrization with a non-intrinsic 2-by-2 asymmetric parameter matrix $\Sigma$: $\big\{\exp\big(-\frac{1}{2}\big[w^T\big(\frac{\Sigma+\Sigma^T}{2}\big)^{-1}w + \log\det\big(\frac{\Sigma+\Sigma^T}{2}\big) + 2\log(2\pi)\big]\big) \,\big|\, \frac{\Sigma+\Sigma^T}{2} \succ 0,\ \Sigma\in\mathbb{R}^{2\times 2}\big\}$, where we have to explicitly enforce the symmetry constraint so that the distribution is well-defined. This parametrization leads to a singular FIM w.r.t. $\mathrm{vec}(\Sigma)$, where $\mathrm{vec}(\cdot)$ is the standard vectorization map.

  • Well-defined parametrization with an intrinsic 3-by-1 parameter vector $v$: $\big\{\exp\big(-\frac{1}{2}\big[w^T\big(\mathrm{vech}^{-1}(v)\big)^{-1}w + \log\det\big(\mathrm{vech}^{-1}(v)\big) + 2\log(2\pi)\big]\big) \,\big|\, \mathrm{vech}^{-1}(v) \succ 0\big\}$.
    Given a symmetric $d$-by-$d$ matrix $\Sigma$, we define another vectorization map, $\mathrm{vech}(\Sigma)$, which returns a $\frac{d(d+1)}{2}$-dimensional array obtained by vectorizing only the lower-triangular part of $\Sigma$. This map is known as the half-vectorization map.
    Equivalently, the parameter $\Sigma := \mathrm{vech}^{-1}(v)$ is a symmetric parameter matrix. Note that $\mathrm{vech}^{-1}(v)$ implicitly enforces the symmetry constraint, and we should compute derivatives w.r.t. $v$ instead of $\Sigma$ under this parametrization. This parametrization leads to a non-singular FIM w.r.t. $v = \mathrm{vech}(\Sigma)$.

    Illustration of the maps $\mathrm{vech}(\cdot)$ and $\mathrm{vech}^{-1}(\cdot)$:

    Consider the following symmetric 2-by-2 matrix $\Sigma = \begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix}$. The output of the map $\mathrm{vech}(\Sigma)$ is $v := \mathrm{vech}(\Sigma) = [2, 1, 3]^T$.

    The output of the map $\mathrm{vech}^{-1}(v)$ is $\mathrm{vech}^{-1}(v) = \begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix}$.
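For concreteness, here is one possible implementation of the two maps (a sketch; the function names `vech` and `vech_inv` are ours, and we order the lower-triangular entries row by row), reproducing the 2-by-2 example above.

```python
import jax.numpy as jnp

def vech(S):
    # half-vectorization: stack the lower-triangular entries (including the
    # diagonal) of a symmetric matrix S, row by row
    rows, cols = jnp.tril_indices(S.shape[0])
    return S[rows, cols]

def vech_inv(v, d):
    # inverse half-vectorization: rebuild the symmetric d-by-d matrix
    rows, cols = jnp.tril_indices(d)
    L = jnp.zeros((d, d)).at[rows, cols].set(v)
    return L + L.T - jnp.diag(jnp.diag(L))

Sigma = jnp.array([[2.0, 1.0], [1.0, 3.0]])
v = vech(Sigma)
print(v)                 # [2. 1. 3.]
print(vech_inv(v, 2))    # [[2. 1.] [1. 3.]]
```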

The following examples show that the symmetry constraint should be respected: it is essential when it comes to the FIM computation in the Gaussian family.

Please see the Python (JAX) code below, which computes the FIMs in the following examples.

Python (JAX) code for the Gaussian examples:
import jax
import jax.numpy as np
jax.config.update("jax_enable_x64", True)  # use double precision

def neg_log_p(param, d, is_sym):
    Sigma = np.reshape(param, (d, d))  # covariance (Sigma = vec^{-1}(param))
    if is_sym:
        Sigma = (Sigma + Sigma.T) / 2.0  # explicitly enforce the symmetry constraint
    Sigma0 = jax.lax.stop_gradient((Sigma + Sigma.T) / 2.0)  # evaluation point Sigma_0
    trace = np.trace(np.linalg.solve(Sigma, Sigma0))  # Tr(Sigma^{-1} Sigma_0)
    _, logdet = np.linalg.slogdet(Sigma)  # log det(Sigma)
    return (trace + logdet + d * np.log(2.0 * np.pi)) / 2.0

is_sym = False  # Gaussian Example 1 if is_sym == False; Gaussian Example 2 if is_sym == True
d = 2
Sigma = np.eye(d)
param = np.reshape(Sigma, (-1,))  # vec(Sigma)
print('vec(Sigma):', param)
print('Sigma:\n', np.reshape(param, (d, d)))

nlp = lambda param: neg_log_p(param, d, is_sym)
hess_f = jax.jacfwd(jax.jacrev(nlp))  # Hessian of the expected negative log-density
hess = hess_f(param)
print('FIM:\n', hess)
print('det(FIM): %f' % np.linalg.det(hess))
w, _ = np.linalg.eigh(hess)  # eigenvalues
print('eigenvalues of the FIM:', w)

For simplicity, consider $\Sigma_0 = I$, where $\mathrm{vec}(\Sigma_0) = (1,0,0,1)$ and $\mathrm{vech}(\Sigma_0) = (1,0,1)$.

Gaussian Example 1 (without the symmetry constraint): $\log p_1(w|\Sigma) = -\frac{1}{2}\big[w^T\Sigma^{-1}w + \log\det(\Sigma) + 2\log(2\pi)\big]$.

Note that the bivariate Gaussian distribution p1(w|Σ) is not well-defined since Σ is in general not symmetric.

$$
\begin{aligned}
F_1(\mathrm{vec}(\Sigma_0)) &= -E_{p_1(w|\Sigma)}\big[\nabla_{\mathrm{vec}(\Sigma)}^2 \log p_1(w|\Sigma)\big]\Big|_{\Sigma=\Sigma_0}\\
&= \tfrac{1}{2} E_{p_1(w|\Sigma)}\big[\nabla_{\mathrm{vec}(\Sigma)}^2 \big(\mathrm{Tr}(\Sigma^{-1}ww^T) + \log\det(\Sigma)\big)\big]\Big|_{\Sigma=\Sigma_0}\\
&= \tfrac{1}{2} \nabla_{\mathrm{vec}(\Sigma)}^2 E_{p_1(w|\Sigma_0)}\big[\mathrm{Tr}(\Sigma^{-1}ww^T) + \log\det(\Sigma)\big]\Big|_{\Sigma=\Sigma_0}\\
&= \tfrac{1}{2} \nabla_{\mathrm{vec}(\Sigma)}^2 \big(\mathrm{Tr}\big(\Sigma^{-1}E_{p_1(w|\Sigma_0)}[ww^T]\big) + \log\det(\Sigma)\big)\Big|_{\Sigma=\Sigma_0}\\
&= \tfrac{1}{2} \nabla_{\mathrm{vec}(\Sigma)}^2 \big(\mathrm{Tr}(\Sigma^{-1}\Sigma_0) + \log\det(\Sigma)\big)\Big|_{\Sigma=\Sigma_0}\\
&= \begin{bmatrix} 0.5 & 0 & 0 & 0\\ 0 & 0 & 0.5 & 0\\ 0 & 0.5 & 0 & 0\\ 0 & 0 & 0 & 0.5 \end{bmatrix} \quad \text{(incorrect (non-singular) FIM)},
\end{aligned}
$$
where $F_1(\mathrm{vec}(\Sigma_0))$ is not even positive semi-definite since $\det\big(F_1(\mathrm{vec}(\Sigma_0))\big) < 0$. Recall that a proper FIM is at least positive semi-definite by definition.

Gaussian Example 2 (with the symmetry constraint): $\log p_2(w|\Sigma) = -\frac{1}{2}\big[w^T\big(\frac{\Sigma+\Sigma^T}{2}\big)^{-1}w + \log\det\big(\frac{\Sigma+\Sigma^T}{2}\big) + 2\log(2\pi)\big]$.

Note that the bivariate Gaussian distribution p2(w|Σ) is well-defined since the symmetry constraint is enforced.

$$
\begin{aligned}
F_2(\mathrm{vec}(\Sigma_0)) &= -E_{p_2(w|\Sigma)}\big[\nabla_{\mathrm{vec}(\Sigma)}^2 \log p_2(w|\Sigma)\big]\Big|_{\Sigma=\Sigma_0}\\
&= \tfrac{1}{2} \nabla_{\mathrm{vec}(\Sigma)}^2 \Big(\mathrm{Tr}\big(\big(\tfrac{\Sigma+\Sigma^T}{2}\big)^{-1}\Sigma_0\big) + \log\det\big(\tfrac{\Sigma+\Sigma^T}{2}\big)\Big)\Big|_{\Sigma=\Sigma_0}\\
&= \begin{bmatrix} 0.5 & 0 & 0 & 0\\ 0 & 0.25 & 0.25 & 0\\ 0 & 0.25 & 0.25 & 0\\ 0 & 0 & 0 & 0.5 \end{bmatrix} \quad \text{(correct (singular) FIM)},
\end{aligned}
$$
where $F_2(\mathrm{vec}(\Sigma_0))$ is positive semi-definite.
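For completeness, the same Hessian-based construction can be repeated under the intrinsic vech-based parameterization from the list above. The following is a minimal, self-contained sketch (with our own helper names, mirroring the neg_log_p construction of the earlier code) which yields a 3-by-3 non-singular FIM at $\Sigma_0 = I$ w.r.t. $v = \mathrm{vech}(\Sigma)$, as claimed.

```python
import jax
import jax.numpy as jnp
jax.config.update("jax_enable_x64", True)

d = 2
rows, cols = jnp.tril_indices(d)

def vech_inv(v):
    # rebuild the symmetric matrix Sigma from v = vech(Sigma)
    L = jnp.zeros((d, d)).at[rows, cols].set(v)
    return L + L.T - jnp.diag(jnp.diag(L))

def neg_log_p_vech(v):
    Sigma = vech_inv(v)                        # symmetric by construction
    Sigma0 = jax.lax.stop_gradient(Sigma)      # evaluation point Sigma_0
    trace = jnp.trace(jnp.linalg.solve(Sigma, Sigma0))   # Tr(Sigma^{-1} Sigma_0)
    _, logdet = jnp.linalg.slogdet(Sigma)
    return (trace + logdet + d * jnp.log(2.0 * jnp.pi)) / 2.0

v0 = jnp.array([1.0, 0.0, 1.0])                # vech(I)
F = jax.jacfwd(jax.jacrev(neg_log_p_vech))(v0) # FIM w.r.t. v = vech(Sigma)
print(F)                                       # expect approximately diag(0.5, 1.0, 0.5)
print(jnp.linalg.det(F))                       # ~0.25 > 0: non-singular
```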

Dimensionality of a manifold


We can define the dimension of a manifold by the degrees of freedom of an intrinsic parametrization. Due to the theorem of topological invariance of dimension, any intrinsic parametrization of a manifold has the same degrees of freedom [12]. This also gives us a tool to identify non-manifold cases. We now illustrate this with examples.

| unit circle | open unit ball | closed unit ball |
| --- | --- | --- |
| 1-dim manifold | 2-dim manifold | not a manifold (it is a manifold with a (closed) boundary) |

(Illustrations from Wikipedia omitted.)

As we showed in a previous section, a unit circle is a 1-dimensional manifold. We can similarly show that an open unit ball is a 2-dimensional manifold.

However, a closed unit ball is NOT a manifold since its interior is an open unit ball and its boundary is a unit circle. The circle and the open unit ball do not have the same dimensionality.

For statistical manifolds, consider the following examples. We will discuss more about them in Part II.

| 1-dim Gaussian with zero mean | $d$-dim Gaussian with zero mean |
| --- | --- |
| $\{\mathcal{N}(w \mid 0, s^{-1}) \mid s > 0\}$ with precision $s$, under intrinsic parameterization $\tau = s$ | $\{\mathcal{N}(w \mid 0, S^{-1}) \mid \mathrm{vech}^{-1}(\tau) = S \succ 0\}$ with precision $S$, under intrinsic parameterization $\tau = \mathrm{vech}(S)$ |
| 1-dim statistical manifold | $\frac{d(d+1)}{2}$-dim statistical manifold |

References

[1] S.-I. Amari, "Natural gradient works efficiently in learning," Neural computation 10:251–276 (1998).

[2] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Tractable structured natural gradient descent using local parameterizations," International Conference on Machine Learning (ICML) (2021).

[3] T. Liang, T. Poggio, A. Rakhlin, & J. Stokes, "Fisher-rao metric, geometry, and complexity of neural networks," The 22nd International Conference on Artificial Intelligence and Statistics (PMLR, 2019), pp. 888–896.

[4] J. Martens, "New Insights and Perspectives on the Natural Gradient Method," Journal of Machine Learning Research 21:1–76 (2020).

[5] M. Khan & W. Lin, "Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models," Artificial Intelligence and Statistics (PMLR, 2017), pp. 878–887.

[6] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Structured second-order methods via natural gradient descent," arXiv preprint arXiv:2107.10884 (2021).

[7] K. Osawa, S. Swaroop, A. Jain, R. Eschenhagen, R. E. Turner, R. Yokota, & M. E. Khan, "Practical deep learning with Bayesian principles," Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019), pp. 4287–4299.

[8] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, & J. Schmidhuber, "Natural evolution strategies," The Journal of Machine Learning Research 15:949–980 (2014).

[9] S. M. Kakade, "A natural policy gradient," Advances in neural information processing systems 14 (2001).

[10] N. Le Roux, P.-A. Manzagol, & Y. Bengio, "Topmoumoute online natural gradient algorithm," NIPS (2007), pp. 849–856.

[11] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, & A. Schuler, "Ngboost: Natural gradient boosting for probabilistic prediction," International Conference on Machine Learning (PMLR, 2020), pp. 2690–2700.

[12] L. W. Tu, An introduction to manifolds, 2nd ed. (Springer, New York, 2011).

[13] J. M. Lee, Introduction to Riemannian manifolds (Springer, 2018).

Footnotes:

  1. In differential geometry, an intrinsic parametrization is known as a coordinate chart.