To interact in 3D environments, robots need a flexible and lightweight representation of the scene. 3D scene graphs represent objects (location, features, reconstruction, etc.) as nodes and relationships (distance, semantics, etc.) as edges. Existing methods rely on pre-scanned environments, making them unsuitable for previously unseen scenes. We present a method that builds scene graphs incrementally from an RGB-D data stream. By extending SAM video tracking, we track all objects in the scene, including those that newly appear in or disappear from the view. These 2D tracks are lifted into 3D using depth to construct the graph, where we store per-object learned features extracted from state-of-the-art deep learning models. We demonstrate the flexibility of our approach through object reconstruction and semantic search in dynamic scenes with occlusions and distractors.
We begin by recording an RGB-D video sequence with a ZED Mini camera, capturing color and depth and extracting camera poses. The initial frame is segmented with SAM 2 (Segment Anything Model 2), which encodes object masks into a latent space for efficient video tracking, and these masks are tracked across subsequent frames using SAM 2's latent mask encoding. To keep tracking robust and to integrate not-yet-segmented parts of the video into the graph, we restart the tracking process every n frames by transferring the latest masks to a new SAM 2 tracking state. SAM's automatic mask generation then samples points to identify regions of the frame that have not been segmented yet.

For each newly detected mask, we extract SALAD feature descriptors and compare them to previously stored crops using cosine similarity. If the similarity is high and the matched object is not currently visible, we re-identify the mask as a previously seen object. Masks, camera poses, and depth are then used to unproject the 2D masks into 3D space, constructing the nodes of the scene graph. Each node stores a centroid, extracted SALAD and CLIP features, the corresponding object crops, and a node-centric point cloud.

CLIP features enable text-based queries once the scene graph is constructed, allowing semantic search over objects. For object reconstruction, we apply Iterative Closest Point (ICP) alignment to unprojected partial masks from different frames, gradually refining and updating each object's 3D representation. This approach supports dynamic scenes and resolves occlusions using a specialized re-identification module based on multi-resolution feature matching.
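A minimal sketch of the re-identification step, assuming SALAD descriptors have already been extracted as fixed-length vectors for each stored crop; the threshold value and the helper name `reidentify` are illustrative and not taken from the released code.

```python
import numpy as np

SIM_THRESHOLD = 0.75   # assumed threshold; tuned per scene in practice

def reidentify(new_desc, nodes, visible_ids):
    """Match a new mask descriptor against stored object nodes.

    new_desc    : (D,) SALAD descriptor of the new crop
    nodes       : dict {object_id: list of (D,) stored crop descriptors}
    visible_ids : set of object ids currently tracked in the frame
    """
    best_id, best_sim = None, -1.0
    q = new_desc / np.linalg.norm(new_desc)
    for obj_id, descs in nodes.items():
        if obj_id in visible_ids:      # only re-identify objects that are not visible
            continue
        D = np.stack(descs)
        D = D / np.linalg.norm(D, axis=1, keepdims=True)
        sim = float((D @ q).max())     # cosine similarity to the best stored crop
        if sim > best_sim:
            best_id, best_sim = obj_id, sim
    if best_sim >= SIM_THRESHOLD:
        return best_id                 # previously seen object
    return None                        # treat as a new object
```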
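The 2D-to-3D lifting follows standard pinhole back-projection. The sketch below assumes metric depth aligned to the color image and a known camera-to-world pose; NumPy is the only dependency.

```python
import numpy as np

def unproject_mask(mask, depth, K, T_wc):
    """Lift a binary 2D mask into a world-frame point cloud.

    mask  : (H, W) bool array, the object's segmentation mask
    depth : (H, W) float array, metric depth in meters
    K     : (3, 3) camera intrinsics
    T_wc  : (4, 4) camera-to-world pose
    """
    v, u = np.nonzero(mask)                 # pixel coordinates inside the mask
    z = depth[v, u]
    valid = z > 0                           # discard pixels without valid depth
    u, v, z = u[valid], v[valid], z[valid]

    # Back-project to camera coordinates with the pinhole model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4)

    # Transform into the world frame using the camera pose
    pts_world = (T_wc @ pts_cam.T).T[:, :3]
    return pts_world

# A node's centroid is then simply the mean of its points:
# centroid = unproject_mask(mask, depth, K, T_wc).mean(axis=0)
```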
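Text queries can then be scored against the stored per-node CLIP features. The sketch below uses the OpenAI `clip` package with a ViT-B/32 backbone as an assumed choice, since the description does not specify which CLIP variant is used.

```python
import torch
import clip   # OpenAI CLIP; the exact variant used in the method is not specified

def semantic_search(query, node_features, device="cpu", model_name="ViT-B/32"):
    """Rank scene-graph nodes by similarity to a free-text query.

    query         : text string, e.g. "a red mug"
    node_features : dict {object_id: (D,) precomputed CLIP image feature tensor}
    """
    model, _ = clip.load(model_name, device=device)
    with torch.no_grad():
        tokens = clip.tokenize([query]).to(device)
        t = model.encode_text(tokens).squeeze(0).float()
        t = t / t.norm()

    scores = {}
    for obj_id, feat in node_features.items():
        f = feat.to(device).float()
        f = f / f.norm()
        scores[obj_id] = float(f @ t)     # cosine similarity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```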
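For the reconstruction step, partial per-frame point clouds can be fused with point-to-point ICP. The sketch below uses Open3D with assumed parameter values (correspondence distance, voxel size), not the authors' exact settings.

```python
import numpy as np
import open3d as o3d

def merge_partial_views(node_pcd, new_points, voxel=0.01):
    """Fuse a new partial observation into a node's accumulated point cloud via ICP.

    node_pcd   : open3d.geometry.PointCloud accumulated so far
    new_points : (N, 3) numpy array, world-frame points unprojected from the new mask
    """
    new_pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(new_points))

    # Refine the alignment; camera poses already place both clouds in the world
    # frame, so the identity is a reasonable initialization.
    reg = o3d.pipelines.registration.registration_icp(
        new_pcd, node_pcd, 0.02, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    new_pcd.transform(reg.transformation)

    # Merge and downsample to keep the node-centric point cloud compact
    merged = node_pcd + new_pcd
    return merged.voxel_down_sample(voxel)
```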
@article{korth_anadon2025dynamic,
author = {Korth, Daniel and Anadon, Xavier and Pollefeys, Marc and Bauer, Zuria and Barath, Daniel},
title = {Dynamic 3D Scene Graphs from RGB-D},
year = {2025},
}