ConceptGraphs
Combining vision and language to help robots navigate the world.
To operate effectively in complex environments, robots need to build 3D representations of their surroundings that can be used for task planning and execution. This is the so-called scene understanding problem, which combines various fields such as computer vision, natural language processing and 3D modelling.
Existing approaches generally categorize objects using a fixed set of semantic labels, which is often insufficient for complex tasks. However, advances in multimodal foundation models now make it possible to develop more flexible "open vocabulary" solutions that address these limitations.
ConceptGraphs is a step towards robots performing tasks directly from natural language instructions. It is a mapping system that integrates the geometric information of traditional 3D mapping approaches with the rich semantic information of vision-language foundation models.
From raw sensor data, ConceptGraphs builds a 3D scene graph of objects and their relationships, in which each object's semantic features are not restricted to a predefined class label. This enables robots to perform complex navigation and object manipulation tasks, as demonstrated in a series of real-world experiments.
The input is a "scan" of the scene, specifically an RGB video with accompanying depth and camera-pose information, and the output is an incrementally constructed 3D graph structure. Each node is an object, and the edges represent the relationships between objects, for example a cup sitting "on top of" a table.
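To make this concrete, here is a minimal sketch of what such a scene graph could look like as a data structure. The class and field names (ObjectNode, SceneGraph, add_relation) are illustrative assumptions for this article, not the authors' actual code.

```python
# A minimal sketch of the kind of 3D scene graph ConceptGraphs builds.
# Names and fields are illustrative, not the project's actual code.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectNode:
    """One object in the scene, fused from detections across RGB-D frames."""
    node_id: int
    caption: str          # e.g. "a white ceramic cup"
    embedding: np.ndarray  # vision-language feature vector for the object
    points: np.ndarray     # (N, 3) fused 3D point cloud
    colors: np.ndarray     # (N, 3) per-point RGB values


@dataclass
class SceneGraph:
    """Objects plus pairwise spatial relationships between them."""
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    # Edges map (subject_id, object_id) -> relation, e.g. "on top of"
    edges: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_relation(self, subj: int, obj: int, relation: str) -> None:
        self.edges[(subj, obj)] = relation
```

As each frame of the scan arrives, new detections are either merged into an existing node, when their geometry and features overlap, or added as a new node, which is what makes the construction incremental.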
For each object, large vision-language models are used to extract vector embeddings and text captions, rather than the simple semantic class labels of previous work. The geometry and visual appearance of each object are also stored in the form of an RGB point cloud. The result is a complete 3D map of the scene, in which a user can search for objects using natural language queries such as "a plush toy" or "red sneakers". This provides robots with a wide range of perceptual and task-planning capabilities.
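To illustrate how such a query could be answered, here is a hedged sketch building on the SceneGraph structure above. It assumes each object's embedding comes from a CLIP-style model and uses the open_clip package for the text encoder; the model choice, weights, and function names are assumptions for illustration, not the project's actual implementation.

```python
# A sketch of open-vocabulary retrieval over the map: encode the text query
# into the shared vision-language feature space, then rank objects by cosine
# similarity. Assumes ObjectNode/SceneGraph from the sketch above.
import numpy as np
import open_clip
import torch

_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
_tokenizer = open_clip.get_tokenizer("ViT-B-32")


def embed_text(query: str) -> np.ndarray:
    """Encode a natural language query with the CLIP-style text encoder."""
    with torch.no_grad():
        feats = _model.encode_text(_tokenizer([query]))
    return feats[0].numpy()


def query_scene(graph: SceneGraph, query: str, top_k: int = 3) -> list[ObjectNode]:
    """Return the top_k objects whose embeddings best match the query."""
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    scores = {
        nid: float(q @ (n.embedding / np.linalg.norm(n.embedding)))
        for nid, n in graph.nodes.items()
    }
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [graph.nodes[nid] for nid in best]


# e.g. query_scene(graph, "a plush toy") or query_scene(graph, "red sneakers")
```

Because both the query and the stored object features live in the same embedding space, no fixed label set is needed: any phrase the text encoder can represent becomes a valid search term.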
ConceptGraphs is a collaboration among 8 research institutions, with 16 authors in total.
Given only a natural language description, ConceptGraphs enabled a wheeled robot to identify, locate, and navigate to 30 different objects in a cluttered environment at the Robotics and Embodied AI Lab (REAL) in Montreal.
Human annotators from Amazon Mechanical Turk rated the nodes and edges of the constructed scene graph as 71% and 88% accurate, respectively, evaluated on Meta's Replica 3D dataset.
ConceptGraphs lets us leverage the power of large vision-language models for robot world representations. This enables robots to perform some impressively abstract tasks right out of the box.