Generalization to the Natural Gradient Descent

Shaojun Dong,1 Fengyu Le,2,1 Meng Zhang,3,4 Si-Jing Tao,3,4 Chao Wang,1,* Yong-Jian Han,3,1,4,† and Guo-Ping Guo3,1,4

1Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
2AHU-IAI AI Joint Laboratory, Anhui University
3CAS Key Laboratory of Quantum Information, University of Science and Technology of China, Hefei 230026, People's Republic of China
4Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei 230026, China

*wang1329@ustc.edu.cn
†smhan@ustc.edu.cn
The optimization problem, which aims at finding the global minimum of a given cost function, is one of the central problems in science and engineering. Various numerical methods have been proposed to solve it, among which the Gradient Descent (GD) method is the most popular one due to its simplicity and efficiency. However, the GD method suffers from two main issues: local minima and slow convergence, especially near a minimum. The Natural Gradient Descent (NGD), which has proved to be one of the most powerful methods for various optimization problems in machine learning, tensor networks, variational quantum algorithms and so on, supplies an efficient way to accelerate the convergence. Here, we give a unified method to extend the NGD method to a more general situation, which keeps the fast convergence by looking for a more suitable metric through introducing a 'proper' reference Riemannian manifold. Our method generalizes the NGD, and may give more insight into optimization methods.
I. INTRODUCTION
Optimization problems play a crucial role in many fields of science, engineering and industry. Generally, a task can be evaluated by a real-valued function $L(x)$ (named the cost function), where $x$ are the parameters used to identify the solution. Therefore, the key to solving the problem optimally is searching for the parameters $x$ that minimize the function $L(x)$. It is a big challenge to find the global minimum of the cost function $L(x)$ when it is complicated.
Various numerical methods have been introduced to solve optimization problems. Among them, the gradient descent (GD) method (also known as the steepest descent method) is a widely used technique, valued for its simplicity and high efficiency. In the GD method, the parameters are updated according to the gradient, which reads
$$x_{t+1} = x_t - \eta\, \nabla_x L \tag{1}$$
where $\eta$ is the search step size at iteration $t$.
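To make the update rule (1) concrete, here is a minimal Python sketch on an assumed two-dimensional quadratic cost; the cost function, starting point, and step size are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def cost(x):
    # Assumed toy cost with minimum at (1, -2); stands in for L(x).
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad(x, eps=1e-6):
    # Central finite-difference estimate of the gradient of the cost.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (cost(x + e) - cost(x - e)) / (2 * eps)
    return g

x = np.array([5.0, 5.0])       # initial parameters x_0
eta = 0.04                     # step size eta
for t in range(200):
    x = x - eta * grad(x)      # update rule (1): x_{t+1} = x_t - eta * grad_x L
print(x, cost(x))
```

On this anisotropic cost the steep direction converges quickly while the shallow direction crawls, which illustrates the slow near-minimum convergence discussed below.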
Generally, numerical methods that approach the global minimum by iterative updates of the parameters according to local information suffer from two main problems: local minima and slow convergence. Stochastic methods have been introduced to help the algorithm escape local minima [1]. On the other hand, the convergence has been improved by more sophisticated methods such as the Conjugate Gradient (CG) method [2-4], the Adam method [1], and the Natural Gradient Descent (NGD) method [5, 6].
Let us take another perspective on gradient descent. Let $X$ be the parameter space, which forms a manifold with the flat metric $G_{ij}(x) = \delta_{ij}$. $L(x)$ is viewed as a scalar field defined on $X$. It is easy to see that $-\nabla_x L$ is the direction in which we obtain the largest decrease of the cost function for a fixed small step length, where the distance is defined by the flat metric. Following this idea, the NGD replaces the flat metric on the parameter manifold $X$ by a generally non-flat metric $G$, in order to improve the convergence. The optimization variables are updated in each iteration as
$$x_{t+1} = x_t - \eta\, G^{-1} \frac{\partial L(x)}{\partial x}. \tag{2}$$
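As a minimal sketch of the NGD update (2), the snippet below uses an assumed quadratic toy cost and an assumed metric chosen purely for illustration (for a quadratic cost, taking the metric equal to the Hessian turns the update into a Newton step).

```python
import numpy as np

# Assumed toy cost L(x) = 0.5 * (x - x_star)^T A (x - x_star), for illustration only.
A = np.array([[2.0, 0.0],
              [0.0, 20.0]])
x_star = np.array([1.0, -2.0])

def grad_L(x):
    # Analytic gradient of the toy quadratic cost.
    return A @ (x - x_star)

G = A                          # assumed (non-flat) metric G on the parameter space
x = np.array([5.0, 5.0])
eta = 1.0
for t in range(5):
    # NGD update (2): solve G dx = grad L rather than forming G^{-1} explicitly.
    x = x - eta * np.linalg.solve(G, grad_L(x))
print(x)                       # reaches x_star after a single step for this quadratic
```

In practice $G^{-1}$ is rarely formed explicitly; a linear solve, possibly with a small regularization term when $G$ is near-singular, is the usual choice.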
The most common application of NGD is the optimization of statistical models, where the Fisher information (FI) matrix [5, 6] is used as the metric. The FI matrix for a parameterized probability distribution $p(x, s)$ is defined as
$$G^{FI}_{i,j}(x) = \sum_s p(x, s)\, \frac{\partial \log p(x, s)}{\partial x_i}\, \frac{\partial \log p(x, s)}{\partial x_j} \tag{3}$$
where $x$ is the parameter and $s$ is the random variable.
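For concreteness, the following sketch evaluates Eq. (3) for a hypothetical model, a categorical distribution $p(x, s) = \mathrm{softmax}(x)_s$ over three outcomes, with the derivatives of $\log p$ estimated by finite differences; the model and parameterization are assumptions for illustration only.

```python
import numpy as np

def probs(x):
    # Hypothetical model: categorical distribution p(x, s) = softmax(x)_s.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fisher_information(x, eps=1e-6):
    # Eq. (3): G^FI_ij = sum_s p(x,s) d_i log p(x,s) d_j log p(x,s),
    # with the log-derivatives estimated by central finite differences.
    n = len(x)
    p = probs(x)
    dlogp = np.zeros((n, len(p)))          # dlogp[i, s] = d log p(x, s) / d x_i
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        dlogp[i] = (np.log(probs(x + e)) - np.log(probs(x - e))) / (2 * eps)
    return np.einsum('s,is,js->ij', p, dlogp, dlogp)

x = np.array([0.3, -0.1, 0.5])
print(fisher_information(x))
```

For this particular model the result can be checked against the closed form $\mathrm{diag}(p) - p\,p^{\top}$.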
NGD has gained more and more attention in recent years by showing outstanding convergence performance in a variety of research areas such as Variational Quantum Algorithms (VQA) [7-10] and the Variational Quantum Eigensolver (VQE) [11-14]. The NGD has also been used in solving quantum many-body problems [15], where the model can be transformed into a statistical one and the metric is called the Fubini-Study metric tensor [14, 16]. In the field of variational Monte Carlo methods with neural networks [17, 18], the NGD (also called Stochastic Reconfiguration [19-21]) method is also widely used, and the FI matrix is called the S matrix. Although the NGD has shown its efficiency in various realms, a question arises: is the FI matrix the only choice for the metric in the NGD? Could we find a better one?
Moreover, there are many optimization problems which cannot be transformed into statistical models naturally. How can we find a proper metric for NGD in such problems?
In this paper, we generalize the natural gradient descent method by proposing a systematic way to find a class of well-performing metrics for a broader class of cost functions. Firstly, we introduce a reference space that is highly relevant to the cost function, such that a good metric is easy to choose on the reference space. Then the metric on the parameter space used in NGD can be chosen as the metric induced from the good metric on the reference space. Our method is benchmarked on several examples and is shown to converge faster than the GD and CG methods. The results also show that our metrics outperform the Fisher information matrix even on some statistical models.
The rest of the paper is organized as follows. We first discuss the motivation and effectiveness of our algorithm by capturing the geometry of the optimization landscape in Sec. II. Our algorithm is explained in detail in Sec. III. Then we show some examples in Sec. IV. In the first example, in subsection IV A, our algorithm is applied to the least squares problem, which is common in the realms of deep learning and statistics. In this example, we found at least four distinct metrics for the NGD, and all of them work with high efficiency. In the second example, our algorithm is applied to a classical Heisenberg spin model. The FI matrix fails in the third example, in subsection IV C, while our method can still provide a workable metric of high quality. We give a summary in Sec. V.
II. THEORY
In this section, we explain our motivation for determining the metric on the manifold $X$ by introducing a reference manifold $Y$. We also analyze the effectiveness of our method.
Firstly, we follow the line of Ref. [5] to give a brief introduction to the natural gradient descent method. Consider a cost function $L: X \to \mathbb{R}$, where $X$ is the parameter space. For every point $x \in X$, we define a metric $G_X(x)$ at $x$. Starting from some initial parameters $x_0 \in X$, we search for $x_{\min} \in X$ that makes $L(x)$ minimal. At parameter $x_t \in X$, the strategy to decrease $L(x)$ is to perform a line search in the direction $dx_t$ (which depends on the parameter $x_t$),
$$x_{t+1} = x_t + \eta\, dx_t, \tag{4}$$
where $dx_t$ is the direction in which we obtain the maximal descent of $L(x)$ by performing an update with fixed step size $|dx| = (G_{X,ij}\, dx^i dx^j)^{1/2}$. $dx_t$ can be determined by the following formula
$$dx_t = \frac{1}{\epsilon}\, \arg\max_{dx \in TX(x_t),\, |dx| = \epsilon} \left[-L(x_t + dx)\right] \tag{5}$$
$$\phantom{dx_t} = \frac{1}{\epsilon}\, \arg\max_{dx \in TX(x_t),\, |dx| = \epsilon} \left[-dx^i\, \frac{\partial L}{\partial x^i}\right] \tag{6}$$
where $TX(x_t)$ is the tangent space at $x_t$. The constraint is shown in Fig. 1.

FIG. 1. The constraint in determining the gradient direction at $x \in X$.
The constrained optimization can be solved using the Lagrange multiplier method, and the solution $dx_t$ is given by [5]
$$dx_t = -G_X^{-1}(x_t)\, \nabla_x L \tag{7}$$
up to a positive normalizing factor.
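For completeness, here is a short sketch of how Eq. (7) follows from the constrained problem (5)-(6); the derivation is standard and is only spelled out here for convenience:
$$\mathcal{L}_\lambda(dx) = -dx^i\, \frac{\partial L}{\partial x^i} - \lambda\left(G_{X,ij}\, dx^i dx^j - \epsilon^2\right),$$
$$\frac{\partial \mathcal{L}_\lambda}{\partial (dx^k)} = -\frac{\partial L}{\partial x^k} - 2\lambda\, G_{X,kj}\, dx^j = 0 \;\Longrightarrow\; dx = -\frac{1}{2\lambda}\, G_X^{-1}\, \nabla_x L,$$
with $\lambda > 0$ at the constrained maximum, so the prefactor $1/(2\lambda)$ is the positive normalizing factor mentioned below Eq. (7).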
However, $dx_t$ only gives a direction in which $L(x)$ decreases fast locally. There is no guarantee that the direction $dx_t$ obtained this way is optimal on the global scale. It is clear that at each point $x \in X$, the globally best evolution direction is $\overrightarrow{x\, x_{\min}}$, where $x_{\min}$ is the global minimum. This can be used to judge the global effectiveness of a direction (and of the metric). We define a good metric $G_X$ for the cost function $L(x)$ as one that satisfies
$$dx = -G_X^{-1}\, \nabla_x L(x) \parallel \overrightarrow{x\, x_{\min}}. \tag{8}$$
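As a simple sanity check of criterion (8), consider an assumed quadratic toy cost $L(x) = \frac{1}{2}(x - x_{\min})^{\top} A\, (x - x_{\min})$ (not an example from the paper): taking $G_X = A$ makes the natural gradient direction exactly parallel to $\overrightarrow{x\, x_{\min}}$, while the plain gradient generally is not.

```python
import numpy as np

A = np.array([[2.0, 0.3],
              [0.3, 20.0]])            # Hessian of an assumed quadratic toy cost
x_min = np.array([1.0, -2.0])
x = np.array([5.0, 5.0])

grad_L = A @ (x - x_min)               # gradient of the toy cost at x
to_min = x_min - x                     # the globally best direction, x -> x_min

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(-grad_L, to_min))                       # plain gradient: generally misaligned
print(cosine(-np.linalg.solve(A, grad_L), to_min))   # natural gradient with G_X = A: 1.0
```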
However, the natural gradient descent method has the issue that a good metric $G_X$ is often hard to find due to the complexity of $L(x)$. At present, metrics can only be written down for some special types of problems, such as statistical model optimization and quantum eigensolvers, based on case-specific studies.
In this work, we propose a systematic method to provide metrics of good quality for a class of cost functions: those that can be re-written as $L(x) = \bar{L}(f(x))$, where $f: X \to Y$ is a map from the parameter manifold $X$ to a reference manifold $Y$ (which is also a Riemannian manifold), such that the following conditions are satisfied:

1. $\bar{L}(y)$ is a relatively simple function, such as a polynomial or a function whose well-performing metric is known. Therefore we can find a good metric $G_Y$ on the manifold $Y$ for the cost function $\bar{L}(y)$.

2. $f$ is a relatively surjective map, especially when the cost function $L(x)$ is small.

3. Let $x_{\min} \in X$ be the minimum of $L$, and $y_{\min} \in Y$ be the minimum of $\bar{L}$. Then $f(x_{\min})$ is close to $y_{\min}$.
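The sketch below is a minimal numerical illustration of this construction under assumptions: a hypothetical map $f$ from $X$ to a reference space $Y = \mathbb{R}^2$ carrying a known metric $G_Y$, the Jacobian of $f$ obtained by finite differences, and the standard pullback $G_X = J^{\top} G_Y J$ used as the induced metric in the NGD update (2). The map, metric, and cost are placeholders, not the paper's examples.

```python
import numpy as np

def f(x):
    # Hypothetical map from parameter space X (R^2) to reference space Y (R^2).
    return np.array([np.tanh(x[0] + 0.5 * x[1]), x[0] * x[1]])

def L_bar(y):
    # Simple cost on the reference space, so that L(x) = L_bar(f(x)).
    y_target = np.array([0.3, -0.4])
    return 0.5 * np.sum((y - y_target) ** 2)

G_Y = np.eye(2)                        # assumed good metric on Y (Euclidean here)

def jacobian(x, eps=1e-6):
    # Finite-difference Jacobian J[a, i] = d f_a / d x_i.
    n, m = len(x), len(f(x))
    J = np.zeros((m, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def grad_L(x, eps=1e-6):
    # Finite-difference gradient of the composed cost L(x) = L_bar(f(x)).
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (L_bar(f(x + e)) - L_bar(f(x - e))) / (2 * eps)
    return g

x = np.array([0.8, -0.6])
eta = 0.5
for t in range(100):
    J = jacobian(x)
    G_X = J.T @ G_Y @ J + 1e-8 * np.eye(len(x))    # induced (pullback) metric on X
    x = x - eta * np.linalg.solve(G_X, grad_L(x))  # NGD step (2) with the induced metric
print(x, L_bar(f(x)))
```

The small ridge term is only there to keep the induced metric invertible when the Jacobian loses rank; it is not part of the construction itself.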