
Robust Double-Encoder Network for RGB-D Panoptic Segmentation
Matteo Sodano Federico Magistri Tiziano Guadagnino Jens Behley Cyrill Stachniss
Abstract— Perception is crucial for robots that act in real-
world environments, as autonomous systems need to see and
understand the world around them to act properly. Panoptic
segmentation provides an interpretation of the scene by com-
puting a pixelwise semantic label together with instance IDs. In
this paper, we address panoptic segmentation using RGB-D data
of indoor scenes. We propose a novel encoder-decoder neural
network that processes RGB and depth separately through
two encoders. The features of the individual encoders are
progressively merged at different resolutions, such that the RGB
features are enhanced using complementary depth information.
We propose a novel merging approach called ResidualExcite,
which reweighs each entry of the feature map according to
its importance. Our double-encoder architecture makes the approach robust to missing cues: the same model can be trained and used for inference on RGB-D, RGB-only, and depth-only input data, without the need to train specialized models. We evaluate
our method on publicly available datasets and show that our
approach achieves superior results compared to other common
approaches for panoptic segmentation.
I. INTRODUCTION
Holistic scene understanding is crucial in several robotics
applications. The ability to recognize objects and obtain a semantic interpretation of the surrounding environment is
one of the key capabilities of truly autonomous systems. Se-
mantic scene perception and understanding supports several
robotics tasks such as mapping [5], [28], place recognition [9],
and manipulation [36]. Panoptic segmentation [20] unifies
semantic and instance segmentation, and solves both jointly.
Its goal is to assign a semantic label and an instance ID to
each pixel of an image. The content of an image is typically
divided into two sets: things and stuff. Thing classes are
composed of countable objects (such as person, car, table),
while stuff classes are amorphous regions of space without
individual instances (such as sky, street, floor).
In this paper, we target panoptic segmentation using
RGB-D sensors. This data is especially interesting in indoor
environments where the geometric information provided by
the depth can help deal with challenging scenarios such
as cluttered scenes and dynamic objects. Additionally, we
address the problem of being robust to missing cues, i.e.,
when either the RGB or the depth image is missing. This
is a practical issue, as robots can be equipped with both RGB-D and RGB cameras, and sometimes have to operate
in poor lighting conditions in which RGB data is not reliable.
All authors are with the University of Bonn, Germany. C. Stachniss is also
with the Department of Engineering Science at the University of Oxford,
UK, and with the Lamarr Institute for Machine Learning and Artificial
Intelligence, Germany.
This work has partially been funded by the European Union’s Hori-
zon 2020 research and innovation programme under grant agreement
No 101017008 (Harmony).
Fig. 1: Our double-encoder network for RGB-D panoptic segmentation provides predictions for full RGB-D input (a), RGB-only input (b), or depth-only input (c). Dashed lines indicate a detached encoder.
Thus, a single model for handling RGB-D, RGB, and depth
data is helpful in practical applications. We investigate how
an encoder-decoder architecture with two encoders for the
RGB and depth cues can provide compelling results in indoor
scenes. Previous efforts showed that double-encoder architectures are effective for processing RGB-D data [29], [37], but they target only semantic segmentation.
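To make the layout concrete, the following is a minimal PyTorch-style sketch of a double-encoder design in which depth features are merged into the RGB branch at every encoder stage and at progressively lower resolutions. The class, argument names, and stage structure are illustrative placeholders, not the exact implementation detailed later in the paper.

```python
# Illustrative sketch only: two encoders process RGB and depth separately,
# and depth features are fused into the RGB branch at several resolutions.
import torch.nn as nn


class DoubleEncoder(nn.Module):
    def __init__(self, rgb_stages, depth_stages, merge_modules):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)      # e.g., backbone blocks
        self.depth_stages = nn.ModuleList(depth_stages)  # mirrored resolutions
        self.merges = nn.ModuleList(merge_modules)       # one fusion per stage

    def forward(self, rgb, depth):
        skips = []
        x_rgb, x_d = rgb, depth
        for rgb_stage, depth_stage, merge in zip(
                self.rgb_stages, self.depth_stages, self.merges):
            x_rgb = rgb_stage(x_rgb)
            x_d = depth_stage(x_d)
            x_rgb = merge(x_rgb, x_d)   # enhance RGB features with depth cues
            skips.append(x_rgb)         # skip connections for the decoder
        return skips
```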
The main contribution of this paper is a novel ap-
proach for RGB-D panoptic segmentation based on a double-
encoder architecture. We propose a novel feature merg-
ing strategy, called ResidualExcite, and a double-encoder
structure robust to missing cues that allows training and
inference with RGB-D, RGB-only, and depth-only data at
the same time, without the need to re-train the model (see
Fig. 1). We show that (i) our fusion mechanism performs
better with respect to other state-of-the-art fusion modules,
and (ii) our architecture allows training and inference on
RGB-D, RGB-only, and depth-only data without the need for a dedicated model for each modality. To back up these
claims, we report extensive experiments on the ScanNet [3]
and HyperSim [32] datasets. To support reproducibility, our
code and dataset splits used in this paper are published at
https://github.com/PRBonn/PS-res-excite.
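As a rough illustration of the kind of fusion ResidualExcite performs, the sketch below reweights every entry of the depth feature map with a learned gate before adding it to the RGB features in a residual fashion. The specific layers, the reduction factor, and the sigmoid gating are illustrative assumptions, not the exact formulation of our module, which is described later in the paper.

```python
# Hypothetical element-wise reweighting merge in the spirit of ResidualExcite:
# every entry of the depth feature map receives its own importance weight
# before being added to the RGB features. Layer choices are assumptions.
import torch.nn as nn


class ElementwiseExciteMerge(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),                 # per-entry weights in [0, 1]
        )

    def forward(self, feat_rgb, feat_depth):
        weights = self.gate(feat_depth)   # importance of each depth entry
        return feat_rgb + weights * feat_depth   # residual-style fusion
```

When one modality is unavailable, the corresponding encoder branch can simply be detached and the merge skipped, which is one way to read the dashed branches in Fig. 1.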