ConceptGraphs

Combining vision and language to help robots navigate the world. 

Background

To operate effectively in complex environments, robots need to build 3D representations of their surroundings that can be used for task planning and execution. This is the scene understanding problem, which draws on computer vision, natural language processing, and 3D modelling.

Existing approaches generally categorize objects using a fixed set of semantic labels, which is often insufficient for complex tasks. However, advances in multimodal foundation models now make it possible to develop more flexible "open vocabulary" solutions that address these limitations.

Objectives

ConceptGraphs is a step towards robots performing tasks directly from natural language instructions. It is a mapping system that integrates the geometric information of traditional 3D mapping approaches with the rich semantic information of vision-language foundation models.

From raw sensor data, ConceptGraphs builds a 3D scene graph of objects and their relationships, in which semantic features are not restricted to a predefined set of class labels. This enables robots to perform complex navigation and object manipulation tasks, as demonstrated in a series of real-world experiments.

About the Project

ConceptGraphs is a mapping system that uses foundation models to build open-vocabulary 3D scene graphs.

The input is a scan of the scene: an RGB video with depth and camera-pose information. The output is an incrementally constructed 3D graph in which each node is an object and each edge is a relationship between objects, for example a cup sitting "on top of" a table.
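
As a concrete sketch, the resulting structure might look like the following in Python. All class and field names here are illustrative assumptions rather than the official ConceptGraphs API; the embedding size and point-cloud layout are likewise placeholders.

```python
from dataclasses import dataclass, field

import numpy as np

# Minimal sketch of the scene-graph structure described above.
# Names and shapes are illustrative, not the official ConceptGraphs API.

@dataclass
class ObjectNode:
    node_id: int
    caption: str            # open-vocabulary description from a VLM
    embedding: np.ndarray   # semantic feature vector (e.g. CLIP-style)
    point_cloud: np.ndarray # (N, 6) array of XYZRGB points

@dataclass
class SceneGraph:
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    # Edges map a pair of node ids to a spatial relation, e.g. "on top of".
    edges: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def relate(self, a: int, b: int, relation: str) -> None:
        self.edges[(a, b)] = relation

# Example: a cup sitting on top of a table (zero vectors as placeholder features).
graph = SceneGraph()
graph.add_object(ObjectNode(0, "a wooden table", np.zeros(512), np.empty((0, 6))))
graph.add_object(ObjectNode(1, "a white ceramic cup", np.zeros(512), np.empty((0, 6))))
graph.relate(1, 0, "on top of")
```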

For each object, large vision-language models are used to extract vector embeddings and textual captions, rather than the fixed semantic class labels of previous work. The geometry and visual appearance of each object are also stored as an RGB point cloud. The result is a complete 3D map of the scene that a user can search for objects with natural language queries such as "a plush toy" or "red sneakers". This provides robots with a wide range of perception and task-planning capabilities.
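
Querying such a map can be sketched as an embedding lookup: encode the query text with the same vision-language model that produced the object features, then rank nodes by cosine similarity. The sketch below assumes the hypothetical SceneGraph structure above and a caller-supplied query embedding; it is an illustration, not code from the official release.

```python
import numpy as np

def query_objects(graph, query_embedding: np.ndarray, top_k: int = 3):
    """Return the top_k (similarity, caption) pairs best matching the query.

    `graph` is the hypothetical SceneGraph above; `query_embedding` is the
    query text (e.g. "a plush toy") encoded by the same vision-language
    model that produced the object embeddings.
    """
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    scored = []
    for node in graph.nodes.values():
        e = node.embedding / (np.linalg.norm(node.embedding) + 1e-8)
        scored.append((float(q @ e), node.caption))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```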

8 research institutions

ConceptGraphs is a collaboration among 8 research institutions, with 16 authors in total.

30 object identification and navigation tasks

Given only a natural language description, ConceptGraphs enabled a wheeled robot to identify, locate, and navigate to 30 different objects in a cluttered environment at the Montreal Robotics and Embodied AI Lab (REAL).

71% / 88% accuracy of nodes and edges

Percentage accuracy of the nodes (71%) and edges (88%) of the constructed scene graph, as rated by human annotators recruited through Amazon Mechanical Turk, evaluated on Meta's Replica 3D dataset.

ConceptGraphs lets us leverage the power of large vision language models for robot world representations. This enables robots to perform some pretty impressively abstract tasks right out of the box.

Liam Paull, Assistant Professor, Université de Montréal, Core Academic Member, Mila

Resources

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
ConceptGraphs Project Website
ConceptGraphs on GitHub (official code release)

Meet the Team

Mila Members
Liam Paull, Core Academic Member, Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research, Canada CIFAR AI Chair
Kirsty Ellis, Developer, Research Software, Innovation, Development and Technologies
Sacha Morin, PhD student, Université de Montréal
Other Members
Aditya Agarwal (Mila)
Bipasha Sen (Mila)
Joshua B. Tenenbaum
Rama Chellappa
Chuang Gan
Qiao Gu
Celso Miguel de Melo
Krishna Murthy Jatavallabhula
William Paul
Corban Rivera
Florian Shkurti
Antonio Torralba

Partners