SDA: Simple Discrete Augmentation for Contrastive Sentence
Representation Learning
Dongsheng Zhu1,3,∗, Zhenyu Mao2,∗, Jinghui Lu2,∗, Rui Zhao2, Fei Tan2,†
1Baidu Inc.  2SenseTime Research  3Fudan University
dszhu20@fudan.edu.cn
{maozhenyu, lujinghui1, zhaorui, tanfei}@sensetime.com
Abstract
Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As
an essential element, data augmentation protocols, however, have not been well explored. The pioneering work
SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), surprisingly dominates
discrete augmentations such as cropping, word deletion, and synonym replacement, as reported. To understand the
underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data
augmentation methods: balance of semantic consistency and expression diversity. We then develop three simple yet
effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation.
They act as minimal noises at lexical level to produce diverse forms of sentences. Furthermore, standard negation
is capitalized on to generate negative samples for alleviating feature suppression involved in contrastive learning. We
experimented extensively with semantic textual similarity on diverse datasets. The results support the superiority of
the proposed methods consistently. Our key code is available at https://github.com/Zhudongsheng75/SDA
Keywords: Neural language representation models; Parsing, Grammar, Syntax, Treebank; Semantics;
Semi-supervised, weakly-supervised and unsupervised learning
1. Introduction
Current state-of-the-art methods utilize contrastive
learning algorithms to learn unsupervised sentence
representations (Gao et al.,2021;Yan et al.,2021).
They learn to bring similar sentences closer in the
latent space while pushing away dissimilar ones
(Hjelm et al.,2018). In the paradigm, multiple
data augmentation methods have been proposed
from different perspectives to curate different variants¹ (Oord et al., 2018; Zhu et al., 2020). The
variants and the corresponding original samples
are deemed positive pairs in the learning proce-
dure (Chen et al.,2020). Previous studies have
shown that the quality of variants largely shapes
the learning of reasonable representations (Chen
et al.,2020;Hassani and Khasahmadi,2020).
Conventional methods directly employ opera-
tions such as cropping, word deletion, and syn-
onym replacement in natural sentences (Wei and
Zou,2019;Wu et al.,2020;Meng et al.,2021). In
addition, recent studies resort to network architec-
tures for manipulating embedding vectors, such as
dropout, feature cutoff, and token shuffling (Gao
et al.,2021;Yan et al.,2021). It enables more
subtle variants of training samples in the continu-
ous latent space in a controllable way and thereby
renders better representations, which is usually evidenced by more appealing performance in typical downstream NLP tasks (e.g., semantic textual similarity) (Zhang et al., 2021; Chen et al., 2021a).

∗Equal contribution. †Corresponding author.
¹It refers to a sentence sample in this work.

Figure 1: Normalized representation visualization of different augmentation methods and the way they should be optimized.
If two sentences with large semantic gaps are
paired positively, the representation alignment is
likely to deteriorate (Wang and Isola,2020). Dis-
crete augmentation methods usually have less com-
petitive results, because they can hardly keep the
sentence semantically consistent when applied ran-
domly at lexical level. As shown in Figure 1, sen-
tence semantics become incomplete after random
masking and word deletion (Wu et al.,2020) so that
optimizing their embeddings in the direction of the
original sentence (aka anchor) leads to misunder-
standing.
To learn better representations, augmentation
methods are supposed to generate samples that
are not only diverse in representation (Tian et al.,
2020) but also similar in semantics. There is, how-
ever, a trade-off between semantic consistency and
expression diversity. Bigger differences between
vanilla samples and augmented ones also convey
less faithful semantics. Therefore, we hypothesize
that a good augmentation method in contrastive
learning should have desiderata to balance them.
Continuous methods can control semantic
changes since they utilize designed network struc-
tures to process redundant features (Huang et al.,
2021). SimCSE (Gao et al.,2021) utilizes dropout
(Srivastava et al.,2014) to obtain different embed-
dings of the same sentence to construct positive
pairs. But such continuous methods lack inter-
pretability to inspire further exploration of sentence
augmentation. To better prove our hypotheses and
find the promising direction of lexical data augmen-
tation methods, we propose three Simple Discrete
Augmentation (SDA) methods to satisfy the desider-
ata to different extents: Punctuation Insertion (PI),
Modal Verbs (MV), and Double Negation (DN).
Their impact on expression diversity increases in that order, while semantic consistency with the original sentence tends to diminish gradually. In linguistics,
punctuation usually represents pause or tone (e.g.,
comma, exclamations) which has no specific mean-
ing itself. Modal verbs act as supplements to the predicate verb of the sentence, indicating attitudes such as permission, request, and so on, which helps to reduce uncertainty in semantics.
DN helps two negatives cancel each other out and
thereby produces a strong affirmation, whereas an improperly augmented sentence risks being logically confusing.
Although the proposed augmentation methods
keep the semantic meaning by carefully adding mi-
nor noises, the generated sentences are still literally
similar to the original sentence. Recent research
(Robinson et al.,2021) has pointed out the feature
suppression problem of contrastive learning. This
phenomenon could result in shortcut solutions that
the model only learns the textual rather than se-
mantic similarity. A focus on hard examples has
been proven effective to change the scope of the
captured features (Robinson et al.,2021). Thus,
we further utilize standard negation to construct
text contradicting all or part of the meaning of the
original sentence as hard negative samples. By
doing so, the model is encouraged to learn to dif-
ferentiate sentences bearing similar lexical items
yet reversed meanings.
To summarize, the contributions of this work are as follows:
• We propose SDA methods (including standard negation) for contrastive sentence representation learning, which leverage discrete sentence modifications to enhance the performance of representation learning (Section 4).
• Comprehensive experimental results demonstrate that SDA achieves significantly better performance, advancing the state of the art from 78.49 to 79.60 (Section 5).
• Extensive ablations and in-depth analysis are conducted to investigate the underlying rationale and clarify the hyper-parameter choices (Section 6).
2. Related Works
2.1. Sentence Representation Learning
BERT (Devlin et al.,2019) has steered the trajec-
tory of sentence representation towards the tech-
nical orientation of Pre-trained Language Models
(PLM). A multitude of endeavors (Tan et al.,2020,
2021;Li et al.,2020;Su et al.,2021;Lu et al.,
2023a,b) has been dedicated to substantial im-
provements based on this paradigm, leading to
significant advancements in diverse domains. No-
tably, there is a pronounced practical demand for
sentence-level text representations (Conneau et al.,
2017;Williams et al.,2018). Consequently, learn-
ing unsupervised sentence representations based
on PLM has become a focal point in recent years
(Reimers and Gurevych,2019;Zhang et al.,2020).
Current state-of-the-art methods utilize contrastive
learning to learn sentence embeddings (Kim et al.,
2021;Yan et al.,2021;Gao et al.,2021), which
in experimental results, can even rival supervised
methods. However, to advance unsupervised con-
trastive learning methods further, data augmenta-
tion emerges as a pivotal component.
2.2. Data Augmentation in Contrastive
Learning
Early research on contrastive sentence represen-
tation learning (Zhang et al.,2020) didn’t utilize
explicit augmentation methods to generate positive
pairs. Later, methods (Giorgi et al.,2021;Wu et al.,
2020,2021) which use text augmentation methods,
such as word deletion, span deletion, reordering,
synonym substitution, and word repetition, to gen-
erate different views for each sentence achieve
better results. Compared to augmentation meth-
ods applied on text, several studies (Janson et al.,
2021;Yan et al.,2021;Gao et al.,2021;Wang
et al.,2022a) utilize neural networks, such as dual
encoders, adversarial attack, token shuffling, cut-
off and dropout, to obtain different embeddings for
contrasting. A more recent study DiffCSE (Chuang
et al.,2022) designed an extra MLM-based word
replacement detection task as an equivalent aug-
mentation. The purpose of data augmentation in this study is to generate samples that are both semantically similar and expressively diverse, so that models can learn to extend the semantic space of the input samples.

Figure 2: An overview of the framework. The figure depicts a training batch. Each sentence is passed through the augmentation module to generate one positive and one negative sample for the anchor; the positives generated by other sentences in the batch are also treated as negatives for the anchor.
3. Preliminaries
3.1. Sentence-level Contrastive Learning
Given a set of sentence pairs $\mathcal{D} = \{(x_i, x_i^+)\}$, where each pair $(x_i, x_i^+)$ is semantically similar and deemed a positive pair, contrastive learning aims to learn a dense representation $\mathbf{h}_i$ of a sentence $x_i$ by gathering positive samples together while pushing others apart in the latent space (Belghazi et al., 2018). In practice, the training proceeds within a mini-batch of $N$ sentence pairs. The objective is formulated as:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(\mathbf{h}_i, \mathbf{h}_j^+)/\tau}} \qquad (1)$$

where $\mathbf{h}_i$ and $\mathbf{h}_i^+$ respectively denote the representations of $x_i$ and $x_i^+$, $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function, and $\tau$ is the temperature parameter. Under the unsupervised setting, the semantically related positive pairs are not explicitly given. Augmentation methods are used to generate $x_i^+$ for the training sample $x_i$.
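To make the objective concrete, the following is a minimal PyTorch sketch of Eq. 1, assuming h and h_pos are the (N, d) matrices of anchor and positive embeddings in a mini-batch; the function name and the temperature value are illustrative, not taken from the released code.

import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, tau=0.05):
    # h, h_pos: (N, d) tensors of anchor and positive sentence embeddings.
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # sim(h_i, h_j^+) for every pair in the batch, scaled by the temperature tau.
    sim = h @ h_pos.T / tau  # (N, N)
    labels = torch.arange(h.size(0), device=h.device)
    # Row-wise cross-entropy against the diagonal recovers -log of Eq. 1,
    # averaged over the mini-batch.
    return F.cross_entropy(sim, labels)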
3.2. Unsupervised SimCSE
In a transformer model $f(\cdot)$, dropout masks are placed on fully-connected layers and attention probabilities. SimCSE² builds the positive pairs by feeding the same input $x_i$ to the encoder twice, i.e., $x_i^+ = x_i$. With different dropout masks $z_i$ and $z_i^+$, the two separate output sentence embeddings constitute a positive pair as follows:

$$\mathbf{h}_i = f_{z_i}(x_i), \quad \mathbf{h}_i^+ = f_{z_i^+}(x_i) \qquad (2)$$

²The SimCSE mentioned in this article is always under the unsupervised setting.
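As an illustration of Eq. 2, the sketch below encodes the same sentences twice while dropout stays active, so the two forward passes see different dropout masks $z_i$ and $z_i^+$; the checkpoint name and [CLS] pooling are assumptions rather than the exact SimCSE implementation.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
enc.train()  # keep dropout active so the two passes use different masks

def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    # [CLS] pooling is one common choice; an MLP head may also be added on top.
    return enc(**batch).last_hidden_state[:, 0]

sentences = ["He travelled widely in Europe."]
h = embed(sentences)      # f_{z_i}(x_i)
h_pos = embed(sentences)  # f_{z_i^+}(x_i): same input, second dropout mask

Feeding h and h_pos to the objective in Eq. 1 then yields the unsupervised SimCSE training signal.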
Method    Sentence
None      He travelled widely in Europe.
PI        He, travelled widely in Europe.
MV        He must have travelled widely in Europe.
DN        It is not the fact that he didn't travel widely in Europe.
Negation  He didn't travel widely in Europe.

Table 1: An example of different methods to generate the augmented sentence. The highlighted red texts denote changes after augmentation.
3.3. Dependency Parsing and Syntax
Tree
Dependency parsing represents the relationships
between words in a sentence in the form of depen-
dencies. Each word in the sentence is connected
to another word, indicating its grammatical role and
the type of relationship it has with other words.
Syntax trees represent the hierarchical structure
of a sentence’s grammar. They consist of nodes,
where each node represents a word or a grammat-
ical unit, and edges represent syntactic relation-
ships. The root node represents the main clause,
and branches indicate phrases and sub-clauses.
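For illustration, the snippet below inspects the dependencies of the running example with spaCy, the parser used later in Section 4; the small English model name is an assumption.

import spacy

# Requires the model to be downloaded once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("He travelled widely in Europe.")

for token in doc:
    # Each token is linked to its head word by a typed dependency relation.
    print(token.text, token.dep_, token.head.text, token.pos_)

# The token labelled "ROOT" (here, "travelled") is the main predicate verb,
# which is where a modal verb or a negation would be attached.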
4. Methodology
In this work, the augmentation module that generates positive samples for training data is denoted as $A(\cdot)$. As illustrated in Fig. 2, we utilize $A(\cdot)$ to subtly reword the original sentence, changing its representation to a limited extent on the premise that the sentence remains roughly unchanged semantically. Afterwards, Eq. 2 can be rewritten as follows:

$$\mathbf{h}_i = f_{z_i}(x_i), \quad \mathbf{h}_i^+ = f_{z_i^+}(A(x_i)) \qquad (3)$$

In practice, we utilize spaCy³ for dependency parsing.

³https://spacy.io/
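To ground $A(\cdot)$, here is a heavily simplified sketch of the three positive augmentations and the negation-based hard negative; the insertion positions, modal choices, and phrasings are illustrative only and do not reproduce the paper's exact spaCy-based rules.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def punctuation_insertion(sent):
    # PI: insert a pause-like punctuation mark at a word boundary.
    words = sent.split()
    pos = random.randint(1, len(words) - 1)
    return " ".join(words[:pos]) + ", " + " ".join(words[pos:])

def modal_verb(sent):
    # MV: place a modal construction before the root verb found by dependency parsing.
    doc = nlp(sent)
    root = next(t for t in doc if t.dep_ == "ROOT")
    modal = random.choice(["must", "may", "might"])
    return sent.replace(root.text, modal + " have " + root.text, 1)

def negation(sent):
    # Hard negative: negate the root verb of the original sentence.
    doc = nlp(sent)
    root = next(t for t in doc if t.dep_ == "ROOT")
    return sent.replace(root.text, "didn't " + root.lemma_, 1)

def double_negation(sent):
    # DN: wrap the negated sentence so that the two negatives cancel out.
    neg = negation(sent)
    return "It is not the fact that " + neg[0].lower() + neg[1:]

Applied to the sentence in Table 1, these functions produce outputs of the same form as the PI, MV, DN, and Negation rows.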