Generalization to the Natural Gradient Descent

Shaojun Dong,1 Fengyu Le,2,1 Meng Zhang,3,4 Si-Jing Tao,3,4 Chao Wang,1,* Yong-Jian Han,3,1,4,† and Guo-Ping Guo3,1,4

1Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
2AHU-IAI AI Joint Laboratory, Anhui University
3CAS Key Laboratory of Quantum Information, University of Science and Technology of China, Hefei 230026, People's Republic of China
4Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei 230026, China

*wang1329@ustc.edu.cn
†smhan@ustc.edu.cn
The optimization problem, which aims at finding the global minimum of a given cost function, is one of the central problems in science and engineering. Various numerical methods have been proposed to solve it, among which the Gradient Descent (GD) method is the most popular one due to its simplicity and efficiency. However, the GD method suffers from two main issues: local minima and slow convergence, especially near a minimum. The Natural Gradient Descent (NGD), which has proved to be one of the most powerful methods for various optimization problems in machine learning, tensor networks, variational quantum algorithms and so on, supplies an efficient way to accelerate the convergence. Here, we give a unified method to extend the NGD method to a more general situation, which keeps the fast convergence by looking for a more suitable metric through introducing a 'proper' reference Riemannian manifold. Our method generalizes the NGD, and may give more insight into optimization methods.
I. INTRODUCTION
Optimization problems play a crucial role in many fields of science, engineering and industry. Generally, a task can be evaluated by a real-valued function $L(x)$ (named the cost function), where $x$ are the parameters used to identify the solution. Therefore, the key to solving the problem optimally is searching for the parameters $x$ that minimize the function $L(x)$. It is a big challenge to find the global minimum of the cost function $L(x)$ when it is complicated.
Various numerical methods have been introduced to solve optimization problems. Among them, the gradient descent (GD) method (also known as the steepest descent method) is a widely used technique, valued for its simplicity and high efficiency. In the GD method, the parameters are updated according to the gradient, which reads
$$x_{t+1} = x_t - \eta\, \nabla_x L \tag{1}$$
where $\eta$ is the search step size at iteration $t$.
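To make the update rule (1) concrete, here is a minimal Python sketch on an assumed two-dimensional quadratic cost; the cost function, starting point, and step size are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def cost(x):
    # Assumed toy cost with minimum at (1, -2); stands in for L(x).
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad(x, eps=1e-6):
    # Central finite-difference estimate of the gradient of the cost.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (cost(x + e) - cost(x - e)) / (2 * eps)
    return g

x = np.array([5.0, 5.0])       # initial parameters x_0
eta = 0.04                     # step size eta
for t in range(200):
    x = x - eta * grad(x)      # update rule (1): x_{t+1} = x_t - eta * grad_x L
print(x, cost(x))
```

On this anisotropic cost the steep direction converges quickly while the shallow direction crawls, which illustrates the slow near-minimum convergence discussed below.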
Generally, numerical methods that approach the global minimum by iterative updates of the parameters according to local information suffer from two main problems: local minima and slow convergence. Stochastic methods have been introduced to help the algorithm escape local minima [1]. On the other hand, the convergence has been improved by more sophisticated methods such as the Conjugate Gradient (CG) method [2-4], the Adam method [1], and the Natural Gradient Descent (NGD) method [5, 6].
Let us take another perspective on gradient descent. Let $X$ be the parameter space, which forms a manifold with the flat metric $G_{ij}(x) = \delta_{ij}$. $L(x)$ is viewed as a scalar field defined on $X$. It is easy to see that $-\nabla_x L$ is the direction in which we obtain the largest decrease of the cost function for a fixed small step length, where the distance is defined by the flat metric. Following this idea, the NGD replaces the flat metric on the parameter manifold $X$ by a generally non-flat metric $G$, in order to improve the convergence. The optimization variables are updated in each iteration as
$$x_{t+1} = x_t - \eta\, G^{-1} \frac{\partial L(x)}{\partial x}. \tag{2}$$
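As a minimal sketch of the NGD update (2), the snippet below uses an assumed quadratic toy cost and an assumed metric chosen purely for illustration (for a quadratic cost, taking the metric equal to the Hessian turns the update into a Newton step).

```python
import numpy as np

# Assumed toy cost L(x) = 0.5 * (x - x_star)^T A (x - x_star), for illustration only.
A = np.array([[2.0, 0.0],
              [0.0, 20.0]])
x_star = np.array([1.0, -2.0])

def grad_L(x):
    # Analytic gradient of the toy quadratic cost.
    return A @ (x - x_star)

G = A                          # assumed (non-flat) metric G on the parameter space
x = np.array([5.0, 5.0])
eta = 1.0
for t in range(5):
    # NGD update (2): solve G dx = grad L rather than forming G^{-1} explicitly.
    x = x - eta * np.linalg.solve(G, grad_L(x))
print(x)                       # reaches x_star after a single step for this quadratic
```

In practice $G^{-1}$ is rarely formed explicitly; a linear solve, possibly with a small regularization term when $G$ is near-singular, is the usual choice.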
The most common application of NGD is the optimization of statistical models, where the Fisher information (FI) matrix [5, 6] is used as the metric. The FI matrix for a parameterized probability distribution $p(x, s)$ is defined as
$$G^{FI}_{i,j}(x) = \sum_s p(x, s)\, \frac{\partial \log p(x, s)}{\partial x_i}\, \frac{\partial \log p(x, s)}{\partial x_j} \tag{3}$$
where $x$ is the parameter and $s$ is the random variable.
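For concreteness, the following sketch evaluates Eq. (3) for a hypothetical model, a categorical distribution $p(x, s) = \mathrm{softmax}(x)_s$ over three outcomes, with the derivatives of $\log p$ estimated by finite differences; the model and parameterization are assumptions for illustration only.

```python
import numpy as np

def probs(x):
    # Hypothetical model: categorical distribution p(x, s) = softmax(x)_s.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fisher_information(x, eps=1e-6):
    # Eq. (3): G^FI_ij = sum_s p(x,s) d_i log p(x,s) d_j log p(x,s),
    # with the log-derivatives estimated by central finite differences.
    n = len(x)
    p = probs(x)
    dlogp = np.zeros((n, len(p)))          # dlogp[i, s] = d log p(x, s) / d x_i
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        dlogp[i] = (np.log(probs(x + e)) - np.log(probs(x - e))) / (2 * eps)
    return np.einsum('s,is,js->ij', p, dlogp, dlogp)

x = np.array([0.3, -0.1, 0.5])
print(fisher_information(x))
```

For this particular model the result can be checked against the closed form $\mathrm{diag}(p) - p\,p^{\top}$.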
NGD has gained more and more attention in recent years by showing outstanding convergence performance in a variety of research areas such as Variational Quantum Algorithms (VQA) [7-10] and the Variational Quantum Eigensolver (VQE) [11-14]. The NGD has also been used in solving quantum many-body problems [15], where the model can be transformed into a statistical one and the metric is called the Fubini-Study metric tensor [14, 16]. In the field of variational Monte Carlo methods with neural networks [17, 18], the NGD (also called Stochastic Reconfiguration [19-21]) method is also widely used, and the FI matrix is called the S matrix. Although the NGD has shown its efficiency in various realms, a question arises: is the FI matrix the only choice for the metric in the NGD? Could we find a better one?
Moreover, there are many optimization problems which cannot be transformed into statistical models naturally. How can we find a proper metric for NGD in such problems?
In this paper, we generalize the natural gradient descent method by proposing a systematic way to find a class of well-performing metrics for a broader class of cost functions. Firstly, we introduce a reference space that is highly relevant to the cost function, such that a good metric is easy to choose on the reference space. Then the metric on the parameter space used in NGD can be chosen as the metric induced from the good metric on the reference space. Our method is benchmarked on several examples and is shown to converge faster than the GD and CG methods. The results also show that our metrics outperform the Fisher information matrix even on some statistical models.
The rest of the paper is organized as follows. We first discuss the motivation and effectiveness of our algorithm by capturing the geometry of the optimization landscape in Sec. II. Our algorithm is explained in detail in Sec. III. Then we show some examples in Sec. IV. In the first example, in subsection IV A, our algorithm is applied to the least squares problem, which is common in the realms of deep learning and statistics. In this example, we found at least four distinct metrics for the NGD, and all of them work with high efficiency. In the second example, our algorithm is applied to a classical Heisenberg spin model. The FI matrix fails in the third example, in subsection IV C, while our method can still provide a workable metric of high quality. We give a summary in Sec. V.
II. THEORY
In this section, we explain our motivation for determining the metric on the manifold $X$ by introducing a reference manifold $Y$. We also analyze the effectiveness of our method.
Firstly, we follow the line of Ref. [5] to give a brief introduction to the natural gradient descent method. Consider a cost function $L: X \to \mathbb{R}$, where $X$ is the parameter space. For every point $x \in X$, we define a metric $G_X(x)$ at $x$. Starting from some initial parameters $x_0 \in X$, we search for $x_{\min} \in X$ that makes $L(x)$ minimal. At parameter $x_t \in X$, the strategy to decrease $L(x)$ is to perform a line search in the direction $dx_t$ (which depends on the parameter $x_t$),
$$x_{t+1} = x_t + \eta\, dx_t, \tag{4}$$
where $dx_t$ is the direction in which we obtain the maximal descent of $L(x)$ by performing an update with fixed step size $|dx| = (G_{X,ij}\, dx^i dx^j)^{1/2}$. $dx_t$ can be determined by the following formula
$$dx_t = \frac{1}{\epsilon}\, \arg\max_{dx \in TX(x_t),\, |dx| = \epsilon} \left[-L(x_t + dx)\right] \tag{5}$$
$$\phantom{dx_t} = \frac{1}{\epsilon}\, \arg\max_{dx \in TX(x_t),\, |dx| = \epsilon} \left[-dx^i\, \frac{\partial L}{\partial x^i}\right] \tag{6}$$
where $TX(x_t)$ is the tangent space at $x_t$. The constraint is shown in Fig. 1.

FIG. 1. The constraint in determining the gradient direction at $x \in X$.
The constrained optimization can be solved using the Lagrange multiplier method, and the solution $dx_t$ is given by [5]
$$dx_t = -G_X^{-1}(x_t)\, \nabla_x L \tag{7}$$
up to a positive normalizing factor.
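For completeness, here is a short sketch of how Eq. (7) follows from the constrained problem (5)-(6); the derivation is standard and is only spelled out here for convenience:
$$\mathcal{L}_\lambda(dx) = -dx^i\, \frac{\partial L}{\partial x^i} - \lambda\left(G_{X,ij}\, dx^i dx^j - \epsilon^2\right),$$
$$\frac{\partial \mathcal{L}_\lambda}{\partial (dx^k)} = -\frac{\partial L}{\partial x^k} - 2\lambda\, G_{X,kj}\, dx^j = 0 \;\Longrightarrow\; dx = -\frac{1}{2\lambda}\, G_X^{-1}\, \nabla_x L,$$
with $\lambda > 0$ at the constrained maximum, so the prefactor $1/(2\lambda)$ is the positive normalizing factor mentioned below Eq. (7).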
However, $dx_t$ only gives a direction in which $L(x)$ decreases fast locally. There is no guarantee that the direction $dx_t$ obtained this way is optimal on the global scale. It is clear that at each point $x \in X$, the globally best evolution direction is $\overrightarrow{x\, x_{\min}}$, where $x_{\min}$ is the global minimum. This can be used to judge the global effectiveness of a direction (and of the metric). We define a good metric $G_X$ for the cost function $L(x)$ as one that satisfies
$$dx = -G_X^{-1}\, \nabla_x L(x) \parallel \overrightarrow{x\, x_{\min}}. \tag{8}$$
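As a simple sanity check of criterion (8), consider an assumed quadratic toy cost $L(x) = \frac{1}{2}(x - x_{\min})^{\top} A\, (x - x_{\min})$ (not an example from the paper): taking $G_X = A$ makes the natural gradient direction exactly parallel to $\overrightarrow{x\, x_{\min}}$, while the plain gradient generally is not.

```python
import numpy as np

A = np.array([[2.0, 0.3],
              [0.3, 20.0]])            # Hessian of an assumed quadratic toy cost
x_min = np.array([1.0, -2.0])
x = np.array([5.0, 5.0])

grad_L = A @ (x - x_min)               # gradient of the toy cost at x
to_min = x_min - x                     # the globally best direction, x -> x_min

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(-grad_L, to_min))                       # plain gradient: generally misaligned
print(cosine(-np.linalg.solve(A, grad_L), to_min))   # natural gradient with G_X = A: 1.0
```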
However, the natural gradient descent method has the issue that a good metric $G_X$ is often hard to find due to the complexity of $L(x)$. At present, metrics can only be written down for some special types of problems, such as statistical model optimization and quantum eigensolvers, based on case-specific studies.
In this work, we propose a systematic method to provide metrics of good quality for a class of cost functions: those that can be re-written as $L(x) = \bar{L}(f(x))$, where $f: X \to Y$ is a map from the parameter manifold $X$ to a reference manifold $Y$ (which is also a Riemannian manifold), such that the following conditions are satisfied:

1. $\bar{L}(y)$ is a relatively simple function, such as a polynomial or a function whose well-performing metric is known. Therefore we can find a good metric $G_Y$ on the manifold $Y$ for the cost function $\bar{L}(y)$.

2. $f$ is a relatively surjective map, especially when the cost function $L(x)$ is small.

3. Let $x_{\min} \in X$ be the minimum of $L$, and $y_{\min} \in Y$ be the minimum of $\bar{L}$. Then $f(x_{\min})$ is close to $y_{\min}$.
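The sketch below is a minimal numerical illustration of this construction under assumptions: a hypothetical map $f$ from $X$ to a reference space $Y = \mathbb{R}^2$ carrying a known metric $G_Y$, the Jacobian of $f$ obtained by finite differences, and the standard pullback $G_X = J^{\top} G_Y J$ used as the induced metric in the NGD update (2). The map, metric, and cost are placeholders, not the paper's examples.

```python
import numpy as np

def f(x):
    # Hypothetical map from parameter space X (R^2) to reference space Y (R^2).
    return np.array([np.tanh(x[0] + 0.5 * x[1]), x[0] * x[1]])

def L_bar(y):
    # Simple cost on the reference space, so that L(x) = L_bar(f(x)).
    y_target = np.array([0.3, -0.4])
    return 0.5 * np.sum((y - y_target) ** 2)

G_Y = np.eye(2)                        # assumed good metric on Y (Euclidean here)

def jacobian(x, eps=1e-6):
    # Finite-difference Jacobian J[a, i] = d f_a / d x_i.
    n, m = len(x), len(f(x))
    J = np.zeros((m, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def grad_L(x, eps=1e-6):
    # Finite-difference gradient of the composed cost L(x) = L_bar(f(x)).
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (L_bar(f(x + e)) - L_bar(f(x - e))) / (2 * eps)
    return g

x = np.array([0.8, -0.6])
eta = 0.5
for t in range(100):
    J = jacobian(x)
    G_X = J.T @ G_Y @ J + 1e-8 * np.eye(len(x))    # induced (pullback) metric on X
    x = x - eta * np.linalg.solve(G_X, grad_L(x))  # NGD step (2) with the induced metric
print(x, L_bar(f(x)))
```

The small ridge term is only there to keep the induced metric invertible when the Jacobian loses rank; it is not part of the construction itself.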