SDA: Simple Discrete Augmentation for Contrastive Sentence
Representation Learning
Dongsheng Zhu1,3,∗, Zhenyu Mao2,∗, Jinghui Lu2,∗, Rui Zhao2, Fei Tan2,†
1Baidu Inc.  2SenseTime Research  3Fudan University
dszhu20@fudan.edu.cn
{maozhenyu, lujinghui1, zhaorui, tanfei}@sensetime.com
Abstract
Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As
an essential element, data augmentation protocols, however, have not been well explored. The pioneering work
SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), surprisingly dominates
discrete augmentations such as cropping, word deletion, and synonym replacement, as reported. To understand the
underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data
augmentation methods: balance of semantic consistency and expression diversity. We then develop three simple yet
effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation.
They act as minimal noises at lexical level to produce diverse forms of sentences. Furthermore, standard negation
is capitalized on to generate negative samples for alleviating feature suppression involved in contrastive learning. We
experimented extensively with semantic textual similarity on diverse datasets. The results support the superiority of
the proposed methods consistently. Our key code is available at https://github.com/Zhudongsheng75/SDA
Keywords: Neural language representation models; Parsing, Grammar, Syntax, Treebank; Semantics;
Semi-supervised, weakly-supervised and unsupervised learning
1. Introduction
Current state-of-the-art methods utilize contrastive
learning algorithms to learn unsupervised sentence
representations (Gao et al.,2021;Yan et al.,2021).
They learn to bring similar sentences closer in the
latent space while pushing away dissimilar ones
(Hjelm et al.,2018). In the paradigm, multiple
data augmentation methods have been proposed
from different perspectives to curate different variants¹ (Oord et al., 2018; Zhu et al., 2020). The
variants and the corresponding original samples
are deemed positive pairs in the learning proce-
dure (Chen et al.,2020). Previous studies have
shown that the quality of variants largely shapes
the learning of reasonable representations (Chen
et al.,2020;Hassani and Khasahmadi,2020).
Conventional methods directly employ opera-
tions such as cropping, word deletion, and syn-
onym replacement in natural sentences (Wei and
Zou,2019;Wu et al.,2020;Meng et al.,2021). In
addition, recent studies resort to network architec-
tures for manipulating embedding vectors, such as
dropout, feature cutoff, and token shuffling (Gao
et al.,2021;Yan et al.,2021). It enables more
subtle variants of training samples in the continu-
ous latent space in a controllable way and thereby
renders better representations, which is usually evidenced by more appealing performance in typical downstream NLP tasks (e.g., semantic textual similarity) (Zhang et al., 2021; Chen et al., 2021a).

∗Equal contribution. †Corresponding author.
¹It refers to a sentence sample in this work.

Figure 1: Normalized representation visualization of different augmentation methods and the way they should be optimized.
If two sentences with large semantic gaps are
paired positively, the representation alignment is
likely to deteriorate (Wang and Isola,2020). Dis-
crete augmentation methods usually have less com-
petitive results, because they can hardly keep the
sentence semantically consistent when applied ran-
domly at lexical level. As shown in Figure 1, sen-
tence semantics become incomplete after random
masking and word deletion (Wu et al.,2020) so that
optimizing their embeddings in the direction of the
original sentence (aka anchor) leads to misunder-
standing.
To learn better representations, augmentation
methods are supposed to generate samples that
are not only diverse in representation (Tian et al.,
2020) but also similar in semantics. There is, how-
ever, a trade-off between semantic consistency and
expression diversity. Bigger differences between
vanilla samples and augmented ones also convey
less faithful semantics. Therefore, we hypothesize
that a good augmentation method in contrastive
learning should have desiderata to balance them.
Continuous methods can control semantic
changes since they utilize designed network struc-
tures to process redundant features (Huang et al.,
2021). SimCSE (Gao et al.,2021) utilizes dropout
(Srivastava et al.,2014) to obtain different embed-
dings of the same sentence to construct positive
pairs. But such continuous methods lack inter-
pretability to inspire further exploration of sentence
augmentation. To better prove our hypotheses and
find the promising direction of lexical data augmen-
tation methods, we propose three Simple Discrete
Augmentation (SDA) methods to satisfy the desider-
ata to different extents: Punctuation Insertion (PI),
Modal Verbs (MV), and Double Negation (DN).
Their impact on expression diversity increases in that order, while semantic consistency with the original sentence tends to diminish gradually. In linguistics,
punctuation usually represents pause or tone (e.g.,
comma, exclamations) which has no specific mean-
ing itself. Modal verbs act as supplements to the predicate verb of the sentence, indicating attitudes such as permission, request, and so on, which helps to reduce uncertainty in semantics.
DN helps two negatives cancel each other out and
thereby produces a strong affirmation, whereas an improperly augmented sentence risks being logically confusing.
Although the proposed augmentation methods
keep the semantic meaning by carefully adding mi-
nor noises, the generated sentences are still literally
similar to the original sentence. Recent research
(Robinson et al.,2021) has pointed out the feature
suppression problem of contrastive learning. This
phenomenon could result in shortcut solutions that
the model only learns the textual rather than se-
mantic similarity. A focus on hard examples has
been proven effective to change the scope of the
captured features (Robinson et al.,2021). Thus,
we further utilize standard negation to construct
text contradicting all or part of the meaning of the
original sentence as hard negative samples. By
doing so, the model is encouraged to learn to dif-
ferentiate sentences bearing similar lexical items
yet reversed meanings.
To summarize, the contributions of this work are as follows:
• We propose SDA methods (including standard negation) for contrastive sentence representation learning, which leverage discrete sentence modifications to enhance the performance of representation learning (Section 4).
• Comprehensive experimental results demonstrate that SDA achieves significantly better performance, advancing the state of the art from 78.49 to 79.60 (Section 5).
• Extensive ablations and in-depth analysis are conducted to investigate the underlying rationale and clarify the hyper-parameter choices (Section 6).
2. Related Works
2.1. Sentence Representation Learning
BERT (Devlin et al.,2019) has steered the trajec-
tory of sentence representation towards the tech-
nical orientation of Pre-trained Language Models
(PLM). A multitude of endeavors (Tan et al.,2020,
2021;Li et al.,2020;Su et al.,2021;Lu et al.,
2023a,b) has been dedicated to substantial im-
provements based on this paradigm, leading to
significant advancements in diverse domains. No-
tably, there is a pronounced practical demand for
sentence-level text representations (Conneau et al.,
2017;Williams et al.,2018). Consequently, learn-
ing unsupervised sentence representations based
on PLM has become a focal point in recent years
(Reimers and Gurevych,2019;Zhang et al.,2020).
Current state-of-the-art methods utilize contrastive
learning to learn sentence embeddings (Kim et al.,
2021;Yan et al.,2021;Gao et al.,2021), which
in experimental results, can even rival supervised
methods. However, to advance unsupervised con-
trastive learning methods further, data augmenta-
tion emerges as a pivotal component.
2.2. Data Augmentation in Contrastive
Learning
Early research on contrastive sentence represen-
tation learning (Zhang et al.,2020) didn’t utilize
explicit augmentation methods to generate positive
pairs. Later, methods (Giorgi et al.,2021;Wu et al.,
2020,2021) which use text augmentation methods,
such as word deletion, span deletion, reordering,
synonym substitution, and word repetition, to gen-
erate different views for each sentence achieve
better results. Compared to augmentation meth-
ods applied on text, several studies (Janson et al.,
2021;Yan et al.,2021;Gao et al.,2021;Wang
et al.,2022a) utilize neural networks, such as dual
encoders, adversarial attack, token shuffling, cut-
off and dropout, to obtain different embeddings for
contrasting. A more recent study DiffCSE (Chuang
et al.,2022) designed an extra MLM-based word
replacement detection task as an equivalent aug-
mentation. The purpose of data augmentation in this study is to generate samples that are both semantically similar and expressively diverse, so that models can learn to extend the semantic space of the input samples.

Figure 2: An overview of the framework. The figure depicts a training batch. Each sentence is passed through the augmentation module to generate one positive and one negative sample for the anchor; the positives generated by other sentences in the batch are also treated as negatives for the anchor.
3. Preliminaries
3.1. Sentence-level Contrastive Learning
Given a set of sentence pairs $\mathcal{D} = \{(x_i, x_i^+)\}$, where each pair $(x_i, x_i^+)$ is semantically similar and deemed a positive pair, contrastive learning aims to learn a dense representation $\mathbf{h}_i$ of a sentence $x_i$ by gathering positive samples together while pushing others apart in the latent space (Belghazi et al., 2018). In practice, the training proceeds within a mini-batch of $N$ sentence pairs. The objective is formulated as:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_j^+)/\tau}} \qquad (1)$$

where $\mathbf{h}_i$ and $\mathbf{h}_i^+$ respectively denote the representations of $x_i$ and $x_i^+$, $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function, and $\tau$ is the temperature parameter. Under the unsupervised setting, the semantically related positive pairs are not explicitly given. Augmentation methods are used to generate $x_i^+$ for the training sample $x_i$.
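To make the objective concrete, the following is a minimal PyTorch sketch of Eq. 1, assuming h and h_pos are the (N, d) matrices of anchor and positive embeddings in a mini-batch; the function name and the temperature value are illustrative, not taken from the released code.

import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, tau=0.05):
    # h, h_pos: (N, d) tensors of anchor and positive sentence embeddings.
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # sim(h_i, h_j^+) for every pair in the batch, scaled by the temperature tau.
    sim = h @ h_pos.T / tau  # (N, N)
    labels = torch.arange(h.size(0), device=h.device)
    # Row-wise cross-entropy against the diagonal recovers -log of Eq. 1,
    # averaged over the mini-batch.
    return F.cross_entropy(sim, labels)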
3.2. Unsupervised SimCSE
In a transformer model $f(\cdot)$, dropout masks are placed on fully-connected layers and attention probabilities. SimCSE² builds the positive pairs by feeding the same input $x_i$ to the encoder twice, i.e., $x_i^+ = x_i$. With different dropout masks $z_i$ and $z_i^+$, the two separate output sentence embeddings constitute a positive pair as follows:

$$\mathbf{h}_i = f_{z_i}(x_i), \quad \mathbf{h}_i^+ = f_{z_i^+}(x_i) \qquad (2)$$

²The SimCSE mentioned in this article is always under the unsupervised setting.
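As an illustration of Eq. 2, the sketch below encodes the same sentences twice while dropout stays active, so the two forward passes see different dropout masks $z_i$ and $z_i^+$; the checkpoint name and [CLS] pooling are assumptions rather than the exact SimCSE implementation.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
enc.train()  # keep dropout active so the two passes use different masks

def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    # [CLS] pooling is one common choice; an MLP head may also be added on top.
    return enc(**batch).last_hidden_state[:, 0]

sentences = ["He travelled widely in Europe."]
h = embed(sentences)      # f_{z_i}(x_i)
h_pos = embed(sentences)  # f_{z_i^+}(x_i): same input, second dropout mask

Feeding h and h_pos to the objective in Eq. 1 then yields the unsupervised SimCSE training signal.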
Method    Sentence
None      He travelled widely in Europe.
PI        He, travelled widely in Europe.
MV        He must have travelled widely in Europe.
DN        It is not the fact that he didn't travel widely in Europe.
Negation  He didn't travel widely in Europe.

Table 1: An example of different methods to generate the augmented sentence. The highlighted red texts denote changes after augmentation.
3.3. Dependency Parsing and Syntax
Tree
Dependency parsing represents the relationships
between words in a sentence in the form of depen-
dencies. Each word in the sentence is connected
to another word, indicating its grammatical role and
the type of relationship it has with other words.
Syntax trees represent the hierarchical structure
of a sentence’s grammar. They consist of nodes,
where each node represents a word or a grammat-
ical unit, and edges represent syntactic relation-
ships. The root node represents the main clause,
and branches indicate phrases and sub-clauses.
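For illustration, the snippet below inspects the dependencies of the running example with spaCy, the parser used later in Section 4; the small English model name is an assumption.

import spacy

# Requires the model to be downloaded once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("He travelled widely in Europe.")

for token in doc:
    # Each token is linked to its head word by a typed dependency relation.
    print(token.text, token.dep_, token.head.text, token.pos_)

# The token labelled "ROOT" (here, "travelled") is the main predicate verb,
# which is where a modal verb or a negation would be attached.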
4. Methodology
In this work, the augmentation module that generates positive samples for training data is denoted as $A(\cdot)$. As illustrated in Fig. 2, we utilize $A(\cdot)$ to subtly reword the original sentence, changing its representation to a limited extent on the premise that the sentence remains roughly unchanged semantically. Afterwards, Eq. 2 can be rewritten as follows:

$$\mathbf{h}_i = f_{z_i}(x_i), \quad \mathbf{h}_i^+ = f_{z_i^+}(A(x_i)) \qquad (3)$$

In practice, we utilize spaCy³ for dependency parsing.

³https://spacy.io/
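To ground $A(\cdot)$, here is a heavily simplified sketch of the three positive augmentations and the negation-based hard negative; the insertion positions, modal choices, and phrasings are illustrative only and do not reproduce the paper's exact spaCy-based rules.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def punctuation_insertion(sent):
    # PI: insert a pause-like punctuation mark at a word boundary.
    words = sent.split()
    pos = random.randint(1, len(words) - 1)
    return " ".join(words[:pos]) + ", " + " ".join(words[pos:])

def modal_verb(sent):
    # MV: place a modal construction before the root verb found by dependency parsing.
    doc = nlp(sent)
    root = next(t for t in doc if t.dep_ == "ROOT")
    modal = random.choice(["must", "may", "might"])
    return sent.replace(root.text, modal + " have " + root.text, 1)

def negation(sent):
    # Hard negative: negate the root verb of the original sentence.
    doc = nlp(sent)
    root = next(t for t in doc if t.dep_ == "ROOT")
    return sent.replace(root.text, "didn't " + root.lemma_, 1)

def double_negation(sent):
    # DN: wrap the negated sentence so that the two negatives cancel out.
    neg = negation(sent)
    return "It is not the fact that " + neg[0].lower() + neg[1:]

Applied to the sentence in Table 1, these functions produce outputs of the same form as the PI, MV, DN, and Negation rows.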