Language and Image

Artificial intelligence (AI) systems can process data collected from multiple sources, through a variety of sensors, to help computers make predictions and decisions. Mila’s researchers are pioneers in the fields of natural language processing and computer vision, and continue to explore the intersections of both technologies.

Advances in large language models have propelled AI into a new phase, raising many new and important questions for Mila researchers. These questions concern the remaining gap between state-of-the-art AI and human cognitive abilities (including reasoning, properly understanding cause and effect, and accepting self-doubt), with important consequences for the deployment and safety of these systems.

Multimodal AI systems can make predictions and decisions based on multiple forms of data (including vision, natural language, and audio) and can, for example, deliver live captioning and answer questions about images.

Through multimodal research in machine learning, Mila’s experts are helping to develop AI systems that are more capable of understanding how humans perceive the world, which makes them better at serving the needs of society.

Featured Projects

Ubisoft-Mila Industrial Research Chair

Designed to guide technological innovation in the video game industry, the Ubisoft-Mila Industrial Research Chair explores the ethical use of AI in game development.

ConceptGraphs

ConceptGraphs is a mapping system that builds 3D scene-graphs of objects and their relationships, enabling robots to perform complex navigation and object manipulation tasks.

Multimodal research in machine learning allows us to make AI systems that are closer to how humans perceive the world, making them more suitable to serve humanity in the future. 

Aishwarya Agrawal, Assistant Professor, Université de Montréal, Core Academic Member, Mila

Resources

Understanding LLM Understanding
Co-sponsored by Mila, this summer school, held in June 2024, brought together experts from diverse fields such as computer science, neuroscience, and psychology to deepen our understanding of large language models through various lenses.
MAPL
MAPL is a multimodal AI system capable of understanding images and text, while generating free-form text as output.
ConceptGraphs
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.

Research Labs

Mila professors exploring the subject as part of their research.

Mila Faculty

David Ifeoluwa Adelani
Core Academic Member
McGill University
Canada CIFAR AI Chair

Aishwarya Agrawal
Core Academic Member
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Canada CIFAR AI Chair

Sarath Chandar
Core Academic Member
Associate Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Canada CIFAR AI Chair

Laurent Charlin
Core Academic Member
Associate Professor, HEC Montréal, Department of Decision Sciences
Canada CIFAR AI Chair

Jackie Cheung
Core Academic Member
Associate Scientific Director, Mila, Associate Professor, McGill University, School of Computer Science
Canada CIFAR AI Chair

James Clark
Associate Academic Member
Full Professor, McGill University

Maria Cutumisu
Affiliate Member
Associate Professor, McGill University

Alexandre Drouin
Associate Industry Member
Research Scientist, ServiceNow

Samira Ebrahimi Kahou
Affiliate Member
Assistant Professor, University of Calgary, Department of Electrical and Software Engineering
Canada CIFAR AI Chair

Christian Gagné
Associate Academic Member
Full Professor, Université Laval, Department of Electrical and Computer Engineering
Canada CIFAR AI Chair

Warren Gross
Associate Academic Member
Professor, McGill University, Department of Electrical and Computer Engineering

Toby Dylan Hocking
Associate Academic Member
Associate Professor, Université Sherbrooke, Department of Computer Science

Mahdi Hosseini
Affiliate Member
Assistant Professor, Concordia University

Shin (Alexandre) Koseki
Affiliate Member
Assistant Professor, Université de Montréal, School of Urban Planning and Landscape Architecture

Xue (Steve) Liu
Associate Academic Member
Full Professor, McGill University, School of Computer Science

Tegan Maharaj
Core Academic Member
Assistant Professor in Machine Learning, HEC Montréal, Department of Decision Sciences

Eilif Benjamin Muller
Associate Academic Member
Assistant Professor, Université de Montréal, Department of Neurosciences
Canada CIFAR AI Chair

Timothy O'Donnell
Core Academic Member
Assistant Professor, McGill University, Department of Linguistics
Canada CIFAR AI Chair

Chris Pal
Core Academic Member
Full Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Canada CIFAR AI Chair

Laurence Perreault-Levasseur
Associate Academic Member
Assistant Professor, Université de Montréal, Department of Physics

Pablo Piantanida
Associate Academic Member
Full Professor, Université Paris-Saclay

Guillaume Rabusseau
Core Academic Member
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Canada CIFAR AI Chair

Siamak Ravanbakhsh
Core Academic Member
Assistant Professor, McGill University, School of Computer Science
Canada CIFAR AI Chair

Siva Reddy
Core Academic Member
Assistant Professor, McGill University, School of Computer Science and Department of Linguistics
Canada CIFAR AI Chair

Ayla Rigouts Terryn
Associate Academic Member
Assistant Professor, Université de Montréal, Linguistics and Translation

Irina Rish
Core Academic Member
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Canada CIFAR AI Chair

Fabio Viola
Associate Industry Member
Senior Research Engineer, Google DeepMind

Kory Wallace Mathewson
Associate Industry Member
Research Scientist, DeepMind

Amal Zouaq
Associate Academic Member
Full Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering

Publications

IG-RL: Inductive Graph Reinforcement Learning for Massive-Scale Traffic Signal Control
François-Xavier Devailly
Denis Larocque
Scaling adaptive traffic signal control involves dealing with combinatorial state and action spaces. Multi-agent reinforcement learning attempts to address this challenge by distributing control to specialized agents. However, specialization hinders generalization and transferability, and the computational graphs underlying neural-network architectures, which dominate in the multi-agent setting, do not offer the flexibility to handle an arbitrary number of entities which changes both between road networks, and over time as vehicles traverse the network. We introduce Inductive Graph Reinforcement Learning (IG-RL) based on graph-convolutional networks which adapts to the structure of any road network, to learn detailed representations of traffic signal controllers and their surroundings. Our decentralized approach enables learning of a transferable-adaptive-traffic-signal-control policy. After being trained on an arbitrary set of road networks, our model can generalize to new road networks and traffic distributions, with no additional training and a constant number of parameters, enabling greater scalability compared to prior methods. Furthermore, our approach can exploit the granularity of available data by capturing the (dynamic) demand at both the lane level and the vehicle level. The proposed method is tested on both road networks and traffic settings never experienced during training. We compare IG-RL to multi-agent reinforcement learning and domain-specific baselines. In both synthetic road networks and in a larger experiment involving the control of the 3,971 traffic signals of Manhattan, we show that different instantiations of IG-RL outperform baselines.
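The abstract's claim that a graph-convolutional model keeps a constant number of parameters across road networks can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions, not the IG-RL architecture: each traffic signal carries a single scalar feature (imagined here as a queue length), the update is a mean over neighbours followed by a ReLU, and the names `message_passing_step`, `w_self`, and `w_neigh` are hypothetical.

```python
def message_passing_step(features, neighbors, w_self, w_neigh):
    """One mean-aggregation graph-convolution step.

    The two weights are shared by every node, so the parameter count
    does not depend on how many nodes the graph has.
    """
    updated = {}
    for node, h in features.items():
        nbrs = neighbors.get(node, [])
        # Average the neighbours' features; isolated nodes aggregate to 0.
        mean_nbr = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        updated[node] = max(0.0, w_self * h + w_neigh * mean_nbr)  # ReLU
    return updated


# The same two (hypothetical) parameters are reused on both networks.
W_SELF, W_NEIGH = 0.5, 0.25

# A small road network with two signals (assumed toy data).
small_features = {"A": 2.0, "B": 4.0}
small_neighbors = {"A": ["B"], "B": ["A"]}

# A larger road network with four signals in a line (assumed toy data).
large_features = {"A": 1.0, "B": 3.0, "C": 5.0, "D": 0.0}
large_neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

small_out = message_passing_step(small_features, small_neighbors, W_SELF, W_NEIGH)
large_out = message_passing_step(large_features, large_neighbors, W_SELF, W_NEIGH)

print(small_out)  # {'A': 2.0, 'B': 2.5}
print(large_out)  # {'A': 1.25, 'B': 2.25, 'C': 2.875, 'D': 1.25}
```

Because the update rule only refers to a node and its neighbours, it applies unchanged to a graph of any size, which is the property the abstract relies on for transfer to unseen road networks.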
MeshDiffusion: Score-based Generative 3D Mesh Modeling
Zhen Liu
Yao Feng
Michael J. Black
Weiyang Liu
We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parameterization. We demonstrate the effectiveness of our model on multiple generative tasks.
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas
Pau Rodriguez
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.
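The idea of a lightweight mapping between two frozen models' representation spaces can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released MAPL code: a single linear layer stands in for the mapping network (the abstract does not specify its architecture), the vision-encoder outputs and text embeddings are hand-written stand-ins, and every function name and dimension below is hypothetical.

```python
import random


def make_mapper(d_vision, d_lm, seed=0):
    """Create the only trainable parameters: a d_lm x d_vision matrix and a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_vision)] for _ in range(d_lm)]
    bias = [0.0] * d_lm
    return weight, bias


def map_to_lm_space(image_tokens, weight, bias):
    """Project each vision-encoder token into the language model's embedding space."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in image_tokens
    ]


def build_prompt(image_tokens, text_embeddings, weight, bias):
    """Prepend mapped image tokens to the text embeddings fed to the frozen LM."""
    return map_to_lm_space(image_tokens, weight, bias) + text_embeddings


def count_parameters(weight, bias):
    """Number of trainable values in the mapper (the two frozen models add none)."""
    return sum(len(row) for row in weight) + len(bias)


D_VISION, D_LM = 4, 6  # toy dimensions, chosen for illustration only
weight, bias = make_mapper(D_VISION, D_LM)

# Stand-ins for a frozen vision encoder's output and a frozen LM's token embeddings.
image_tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, 0.1, 0.0]]
text_embeddings = [[0.0] * D_LM for _ in range(3)]

prompt = build_prompt(image_tokens, text_embeddings, weight, bias)
print(len(prompt), len(prompt[0]), count_parameters(weight, bias))  # prints: 5 6 30
```

Only the mapper's values would be updated during training in this picture, which is why the trainable-parameter count stays small relative to the two pre-trained models it connects.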
MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
Vikram Voleti
Alexia Jolicoeur-Martineau
Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using