Robust Double-Encoder Network for RGB-D Panoptic Segmentation

Matteo Sodano Federico Magistri Tiziano Guadagnino Jens Behley Cyrill Stachniss
Abstract— Perception is crucial for robots that act in real-
world environments, as autonomous systems need to see and
understand the world around them to act properly. Panoptic
segmentation provides an interpretation of the scene by com-
puting a pixelwise semantic label together with instance IDs. In
this paper, we address panoptic segmentation using RGB-D data
of indoor scenes. We propose a novel encoder-decoder neural
network that processes RGB and depth separately through
two encoders. The features of the individual encoders are
progressively merged at different resolutions, such that the RGB
features are enhanced using complementary depth information.
We propose a novel merging approach called ResidualExcite,
which reweights each entry of the feature map according to
its importance. With our double-encoder architecture, we are robust to missing cues. In particular, the same model can be trained on and perform inference with RGB-D, RGB-only, and depth-only input data, without the need to train specialized models. We evaluate
our method on publicly available datasets and show that our
approach achieves superior results compared to other common
approaches for panoptic segmentation.
I. INTRODUCTION
Holistic scene understanding is crucial in several robotics
applications. The ability to recognize objects and obtain a semantic interpretation of the surrounding environment is
one of the key capabilities of truly autonomous systems. Se-
mantic scene perception and understanding supports several
robotics tasks such as mapping [5] [28], place recognition [9],
and manipulation [36]. Panoptic segmentation [20] unifies
semantic and instance segmentation, and solves both jointly.
Its goal is to assign a semantic label and an instance ID to
each pixel of an image. The content of an image is typically
divided into two sets: things and stuff. Thing classes are
composed of countable objects (such as person, car, table),
while stuff classes are amorphous regions of space without
individual instances (such as sky, street, floor).
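As a concrete illustration of this output format, the following toy NumPy example shows how a panoptic result can be represented; the class IDs and the label-times-1000-plus-instance packing follow the common COCO panoptic convention and are illustrative assumptions, not something prescribed by this paper:

```python
import numpy as np

# Toy panoptic output (hypothetical class IDs): two aligned maps,
# one with semantic labels and one with instance IDs.
H, W = 4, 6
semantic = np.zeros((H, W), dtype=np.int32)  # 0 = "floor" (stuff class)
instance = np.zeros((H, W), dtype=np.int32)  # stuff pixels keep ID 0

semantic[1:3, 0:2] = 7  # class 7 = "chair" (thing class)
instance[1:3, 0:2] = 1  # first chair instance
semantic[1:3, 4:6] = 7
instance[1:3, 4:6] = 2  # second, distinct chair instance

# Both maps can be packed into a single panoptic map, e.g., using
# the COCO panoptic convention: label * 1000 + instance ID.
panoptic = semantic * 1000 + instance
print(np.unique(panoptic))  # [0, 7001, 7002]
```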
In this paper, we target panoptic segmentation using
RGB-D sensors. This data is especially interesting in indoor
environments where the geometric information provided by
the depth can help deal with challenging scenarios such
as cluttered scenes and dynamic objects. Additionally, we
address the problem of being robust to missing cues, i.e.,
when either the RGB or the depth image is missing. This
is a practical issue, as robots can be equipped with both RGB-D and RGB cameras, and sometimes have to operate
in poor lighting conditions in which RGB data is not reliable.
All authors are with the University of Bonn, Germany. C. Stachniss is also
with the Department of Engineering Science at the University of Oxford,
UK, and with the Lamarr Institute for Machine Learning and Artificial
Intelligence, Germany.
This work has partially been funded by the European Union’s Hori-
zon 2020 research and innovation programme under grant agreement
No 101017008 (Harmony).
[Fig. 1 image: RGB and depth inputs, our network, and outputs for (a) full RGB-D, (b) RGB-only with depth missing, (c) depth-only with RGB missing.]
Fig. 1: Our double-encoder network for RGB-D panoptic segmentation is able to provide predictions for full RGB-D images (a), RGB-only input (b), or depth-only input (c). Dashed lines indicate a detached encoder.
Thus, a single model for handling RGB-D, RGB, and depth
data is helpful in practical applications. We investigate how
an encoder-decoder architecture with two encoders for the
RGB and depth cues can provide compelling results in indoor
scenes. Previous efforts showed that double-encoder archi-
tectures are effective in processing RGB-D data [29] [37],
but they target only semantic segmentation.
The main contribution of this paper is a novel ap-
proach for RGB-D panoptic segmentation based on a double-
encoder architecture. We propose a novel feature merg-
ing strategy, called ResidualExcite, and a double-encoder
structure robust to missing cues that allows training and
inference with RGB-D, RGB-only, and depth-only data at
the same time, without the need to re-train the model (see
Fig. 1). We show that (i) our fusion mechanism performs better than other state-of-the-art fusion modules, and (ii) our architecture allows training and inference on RGB-D, RGB-only, and depth-only data without the need for a dedicated model for each modality. To back up these
claims, we report extensive experiments on the ScanNet [3]
and HyperSim [32] datasets. To support reproducibility, our
code and dataset splits used in this paper are published at
https://github.com/PRBonn/PS-res-excite.
[Fig. 2 image: RGB image and depth image pass through an RGB encoder and a depth encoder; feature fusion (FF) modules merge the two streams at five stages, followed by decoders and a post-processing stage.]
Fig. 2: Our double-encoder network for RGB-D panoptic segmentation. RGB and depth images are separately processed, and their features
are merged at different output strides by the feature fusion modules (FF).
II. RELATED WORK
With the advent of deep learning, we have witnessed tremendous progress in the ability to provide scene interpretation for autonomous robots. Kirillov et al. [21] define the task of panoptic segmentation as the combination of semantic and instance segmentation. The goal of this task is to assign a class label to every pixel and to additionally segment object instances. Most of the approaches targeting panoptic
segmentation on images tackle it top-down, as they rely on
bounding box-based object proposals [15] [20]. Their goal
is to extract a number of candidate object regions [11] [17],
and then evaluate them independently. These methods are effective, but they can lead to overlapping segments in the instance prediction. In this work, we follow bottom-up
approaches [2] [8] [33], not relying on bounding boxes but
operating directly at a pixel level.
The works mentioned so far use RGB images. Panoptic segmentation is also common for LiDAR data, both in the form of range images [24] and point clouds [10]. However, when
considering RGB-D data, semantic segmentation [4] [31] and
instance segmentation [6] [18] are common, while panoptic
segmentation has received less attention so far [26] [42]. The most common ways of processing RGB-D data rely on 3D representations via truncated signed distance functions [18] or voxel grids [13]. Few works directly use RGB-D images. In our approach, we target panoptic
segmentation directly on RGB-D frames.
Double-encoder architectures are the most successful way of processing 2D representations of RGB-D frames. They process RGB and depth cues separately with individual encoders and rely on feature fusion to combine the encoder outputs [30] [37]. An alternative to the direct exploitation of RGB and depth, proposed by Gupta et al. [12], consists of pre-processing the depth to encode it with three channels per pixel, describing horizontal disparity, height above ground, and the angle between the pixel's surface normal and the gravity direction. The core idea
of all these works, however, is that RGB and depth are
processed separately and fusion happens only at a later
point in the network, after the encoding part (late fusion).
Hazirbas et al. [14], however, show that merging features at different resolutions can enhance performance (early-mid fusion). In contrast, we propose to use multi-resolution
merging at every downsampling step of the encoder.
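To make this multi-resolution fusion concrete, below is a minimal PyTorch sketch of a double-encoder backbone with a fusion module at every downsampling step, in the spirit of Fig. 2. It is an illustration of the general scheme, not the authors' exact architecture: the stage layout, channel sizes, and the simple SumFusion default are assumptions.

```python
import torch
import torch.nn as nn

class SumFusion(nn.Module):
    """Simplest merging strategy: element-wise summation (as in [14])."""
    def __init__(self, channels):
        super().__init__()

    def forward(self, rgb_feat, depth_feat):
        return rgb_feat + depth_feat

class DoubleEncoder(nn.Module):
    """Two parallel encoders whose features are merged by a fusion
    module (FF in Fig. 2) at every downsampling step. The RGB stream
    carries the fused features forward, so RGB features are enhanced
    with complementary depth information."""
    def __init__(self, channels=(32, 64, 128), fusion_cls=SumFusion):
        super().__init__()
        def stage(c_in, c_out):
            # One illustrative downsampling stage (stride-2 conv).
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))
        rgb_chs = (3,) + tuple(channels)    # RGB input has 3 channels
        depth_chs = (1,) + tuple(channels)  # depth input has 1 channel
        self.rgb_stages = nn.ModuleList(
            stage(rgb_chs[i], rgb_chs[i + 1]) for i in range(len(channels)))
        self.depth_stages = nn.ModuleList(
            stage(depth_chs[i], depth_chs[i + 1]) for i in range(len(channels)))
        self.fusions = nn.ModuleList(fusion_cls(c) for c in channels)

    def forward(self, rgb, depth):
        features = []  # multi-resolution features for the decoders
        for rgb_stage, depth_stage, fuse in zip(
                self.rgb_stages, self.depth_stages, self.fusions):
            rgb = rgb_stage(rgb)
            depth = depth_stage(depth)
            rgb = fuse(rgb, depth)  # enhance RGB features with depth cues
            features.append(rgb)
        return features

# Example: one 480x640 RGB-D frame.
net = DoubleEncoder()
outs = net(torch.rand(1, 3, 480, 640), torch.rand(1, 1, 480, 640))
```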
Different merging strategies for features of data streams are available. Summation [14] and concatenation [22] are the earliest strategies, which have the limitation of considering all features without weighting them according to their actual usefulness. More recent efforts employ Squeeze-and-Excitation modules [37] and gated fusion [43], two different channel-attention mechanisms that aim to increase the focus on the most relevant features. Other works
exploit correlations between modalities to recalibrate feature
maps based on the most informative features [38] [40]. In our
work, we build on top of channel-attention mechanisms. We
propose a new merging mechanism called ResidualExcite, in-
spired by Squeeze-and-Excitation and residual networks [16],
that aims to measure the importance of features at a more
fine-grained scale.
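The following PyTorch sketch is a hedged reconstruction of the ResidualExcite idea from the description above, not the authors' exact module. Unlike plain Squeeze-and-Excitation, which pools spatially and produces one weight per channel, this variant keeps the spatial dimensions so that every entry of the feature map gets its own weight, and it adds a residual skip connection inspired by ResNets [16]; the reduction factor and the final summation of the two streams are assumptions.

```python
import torch
import torch.nn as nn

class ResidualExciteFusion(nn.Module):
    """Hedged sketch of ResidualExcite-style fusion: per-entry
    (per-pixel, per-channel) excitation weights with a residual skip,
    reconstructed from the textual description only."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        def excite():
            # 1x1 convolutions without global pooling: per-entry weights
            # instead of the per-channel weights of plain SE.
            return nn.Sequential(
                nn.Conv2d(channels, hidden, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, 1),
                nn.Sigmoid())
        self.rgb_excite = excite()
        self.depth_excite = excite()

    def forward(self, rgb_feat, depth_feat):
        # Residual reweighting: the original features are kept, and the
        # excited (reweighted) features are added on top.
        rgb = rgb_feat + rgb_feat * self.rgb_excite(rgb_feat)
        depth = depth_feat + depth_feat * self.depth_excite(depth_feat)
        return rgb + depth
```

In the earlier double-encoder sketch, this module would be a drop-in replacement for SumFusion, since it exposes the same two-tensor interface.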
Additionally, we leverage the double-encoder structure to obtain a single model capable of training and inference on different modalities (RGB-D, RGB-only, depth-only). Multi-
modal models have been investigated in the past, but mostly
exploiting multiple “expert models” whose outputs are fused
in a single prediction, as in the work by Blum et al. [1].
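The sketch below illustrates one plausible mechanism for such single-model multi-modality, suggested by the "detached encoder" in Fig. 1; it is our assumption about the handling, not a confirmed detail of the paper: the absent stream is replaced by a zero tensor, so the same trained weights serve RGB-D, RGB-only, and depth-only inputs without retraining.

```python
import torch

def forward_with_missing_cues(model, rgb=None, depth=None):
    """Run a two-input model (e.g., the DoubleEncoder sketch above)
    when one modality is missing, by feeding zeros for the absent
    stream. Channel counts (3 for RGB, 1 for depth) are assumptions."""
    assert rgb is not None or depth is not None, "need at least one input"
    if rgb is None:
        b, _, h, w = depth.shape
        rgb = torch.zeros(b, 3, h, w, device=depth.device)
    if depth is None:
        b, _, h, w = rgb.shape
        depth = torch.zeros(b, 1, h, w, device=rgb.device)
    return model(rgb, depth)
```

During training, the same mechanism could be used to randomly drop one modality so the network learns to cope with either input alone; this training recipe is likewise an assumption on our part.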
III. APPROACH TO RGB-D PANOPTIC SEGMENTATION
Our panoptic segmentation network is an encoder-decoder
architecture that operates on RGB-D images and processes
RGB and depth data by means of two different encoders.
Encoder features are merged at different output strides,
and are sent to three decoders that restore the backbone
features to the original image resolution. The first decoder
targets semantic segmentation. The second decoder predicts
the location of object centers in the form of a probability
heatmap. The third decoder predicts an embedding vector for each pixel, which is used in the post-processing step to group pixels into object instances.
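The grouping of pixels into instances from the predicted centers and embeddings happens in the post-processing block of Fig. 2. The sketch below shows one generic way such a center-plus-embedding grouping can be implemented, in the spirit of bottom-up methods [8]; the local-maximum center extraction, the thresholds, and the nearest-center assignment are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def panoptic_post_processing(semantic_logits, center_heatmap, embeddings,
                             thing_classes, center_thresh=0.1):
    """Hedged sketch of center-plus-embedding instance grouping.
    Shapes: semantic_logits (C, H, W), center_heatmap (1, H, W),
    embeddings (E, H, W); thing_classes is a list of thing class IDs."""
    semantic = semantic_logits.argmax(dim=0)  # (H, W) class map
    # Candidate centers: local maxima of the heatmap above a threshold.
    pooled = F.max_pool2d(center_heatmap[None], 3, stride=1, padding=1)[0]
    is_center = (center_heatmap == pooled) & (center_heatmap > center_thresh)
    ys, xs = torch.nonzero(is_center[0], as_tuple=True)

    instance = torch.zeros_like(semantic)
    if len(ys) > 0:
        center_emb = embeddings[:, ys, xs]  # (E, K) embeddings at centers
        thing_mask = torch.isin(
            semantic, torch.tensor(thing_classes, device=semantic.device))
        pix_emb = embeddings[:, thing_mask]  # (E, N) thing-pixel embeddings
        # Assign each thing pixel to the nearest center in embedding space.
        dist = torch.cdist(pix_emb.T, center_emb.T)  # (N, K) distances
        instance[thing_mask] = dist.argmin(dim=1) + 1  # instance IDs from 1
    return semantic, instance
```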