
Robust Double-Encoder Network for RGB-D Panoptic Segmentation
Matteo Sodano Federico Magistri Tiziano Guadagnino Jens Behley Cyrill Stachniss
Abstract— Perception is crucial for robots that act in real-
world environments, as autonomous systems need to see and
understand the world around them to act properly. Panoptic
segmentation provides an interpretation of the scene by com-
puting a pixelwise semantic label together with instance IDs. In
this paper, we address panoptic segmentation using RGB-D data
of indoor scenes. We propose a novel encoder-decoder neural
network that processes RGB and depth separately through
two encoders. The features of the individual encoders are
progressively merged at different resolutions, such that the RGB
features are enhanced using complementary depth information.
We propose a novel merging approach called ResidualExcite,
which reweighs each entry of the feature map according to
its importance. Our double-encoder architecture makes the approach robust to missing cues: the same model can be trained and used for inference on RGB-D, RGB-only, and depth-only input data, without the need to train specialized models. We evaluate
our method on publicly available datasets and show that our
approach achieves superior results compared to other common
approaches for panoptic segmentation.
I. INTRODUCTION
Holistic scene understanding is crucial in several robotics
applications. The ability to recognize objects and obtain a semantic interpretation of the surrounding environment is
one of the key capabilities of truly autonomous systems. Se-
mantic scene perception and understanding supports several
robotics tasks such as mapping [5], [28], place recognition [9],
and manipulation [36]. Panoptic segmentation [20] unifies
semantic and instance segmentation, and solves both jointly.
Its goal is to assign a semantic label and an instance ID to
each pixel of an image. The content of an image is typically
divided into two sets: things and stuff. Thing classes are
composed of countable objects (such as person, car, table),
while stuff classes are amorphous regions of space without
individual instances (such as sky, street, floor).
In this paper, we target panoptic segmentation using
RGB-D sensors. This data is especially interesting in indoor
environments where the geometric information provided by
the depth can help deal with challenging scenarios such
as cluttered scenes and dynamic objects. Additionally, we
address the problem of being robust to missing cues, i.e.,
when either the RGB or the depth image is missing. This
is a practical issue, as robots can be equipped with both RGB-D and RGB cameras, and sometimes have to operate
in poor lighting conditions in which RGB data is not reliable.
All authors are with the University of Bonn, Germany. C. Stachniss is also
with the Department of Engineering Science at the University of Oxford,
UK, and with the Lamarr Institute for Machine Learning and Artificial
Intelligence, Germany.
This work has partially been funded by the European Union’s Hori-
zon 2020 research and innovation programme under grant agreement
No 101017008 (Harmony).
Fig. 1: Our double-encoder network for RGB-D panoptic segmentation provides predictions for full RGB-D input (a), RGB-only input (b), or depth-only input (c). Dashed lines indicate a detached encoder.
Thus, a single model for handling RGB-D, RGB, and depth
data is helpful in practical applications. We investigate how
an encoder-decoder architecture with two encoders for the
RGB and depth cues can provide compelling results in indoor
scenes. Previous efforts showed that double-encoder architectures are effective for processing RGB-D data [29], [37], but they target only semantic segmentation.
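To make the layout concrete, the following is a minimal PyTorch-style sketch of a double-encoder design in which depth features are merged into the RGB branch at every encoder stage and at progressively lower resolutions. The class, argument names, and stage structure are illustrative placeholders, not the exact implementation detailed later in the paper.

```python
# Illustrative sketch only: two encoders process RGB and depth separately,
# and depth features are fused into the RGB branch at several resolutions.
import torch.nn as nn


class DoubleEncoder(nn.Module):
    def __init__(self, rgb_stages, depth_stages, merge_modules):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)      # e.g., backbone blocks
        self.depth_stages = nn.ModuleList(depth_stages)  # mirrored resolutions
        self.merges = nn.ModuleList(merge_modules)       # one fusion per stage

    def forward(self, rgb, depth):
        skips = []
        x_rgb, x_d = rgb, depth
        for rgb_stage, depth_stage, merge in zip(
                self.rgb_stages, self.depth_stages, self.merges):
            x_rgb = rgb_stage(x_rgb)
            x_d = depth_stage(x_d)
            x_rgb = merge(x_rgb, x_d)   # enhance RGB features with depth cues
            skips.append(x_rgb)         # skip connections for the decoder
        return skips
```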
The main contribution of this paper is a novel ap-
proach for RGB-D panoptic segmentation based on a double-
encoder architecture. We propose a novel feature merg-
ing strategy, called ResidualExcite, and a double-encoder
structure robust to missing cues that allows training and
inference with RGB-D, RGB-only, and depth-only data at
the same time, without the need to re-train the model (see
Fig. 1). We show that (i) our fusion mechanism performs
better with respect to other state-of-the-art fusion modules,
and (ii) our architecture allows training and inference on
RGB-D, RGB-only, and depth-only data without the need for a dedicated model for each modality. To back up these
claims, we report extensive experiments on the ScanNet [3]
and HyperSim [32] datasets. To support reproducibility, our
code and dataset splits used in this paper are published at
https://github.com/PRBonn/PS-res-excite.
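As a rough illustration of the kind of fusion ResidualExcite performs, the sketch below reweights every entry of the depth feature map with a learned gate before adding it to the RGB features in a residual fashion. The specific layers, the reduction factor, and the sigmoid gating are illustrative assumptions, not the exact formulation of our module, which is described later in the paper.

```python
# Hypothetical element-wise reweighting merge in the spirit of ResidualExcite:
# every entry of the depth feature map receives its own importance weight
# before being added to the RGB features. Layer choices are assumptions.
import torch.nn as nn


class ElementwiseExciteMerge(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),                 # per-entry weights in [0, 1]
        )

    def forward(self, feat_rgb, feat_depth):
        weights = self.gate(feat_depth)   # importance of each depth entry
        return feat_rgb + weights * feat_depth   # residual-style fusion
```

When one modality is unavailable, the corresponding encoder branch can simply be detached and the merge skipped, which is one way to read the dashed branches in Fig. 1.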