Mila is hosting its first quantum computing hackathon on November 21, a unique day to explore quantum and AI prototyping, collaborate on Quandela and IBM platforms, and learn, share, and network in a stimulating environment at the heart of Quebec’s AI and quantum ecosystem.
This new initiative aims to strengthen connections between Mila’s research community, its partners, and AI experts across Quebec and Canada through in-person meetings and events focused on AI adoption in industry.
Publications
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in domains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in nonstationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between parameter and update norms. We show that this approach both facilitates feature-learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
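To make the "effective learning rate" in the abstract above concrete, here is a minimal sketch. It assumes the ratio is taken as update norm divided by parameter norm, since the abstract does not state the direction. The function names and the toy vectors are hypothetical, and this is not the authors' implementation.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def effective_learning_rate(params, update):
    """Size of the update relative to the size of the parameters.

    Assumption: ratio = ||update|| / ||params||. The abstract only says
    "the ratio between parameter and update norms".
    """
    param_norm = l2_norm(params)
    if param_norm == 0.0:
        raise ValueError("parameter norm is zero; ratio is undefined")
    return l2_norm(update) / param_norm


def rescale_to_norm(params, target_norm):
    """Hypothetical helper: rescale parameters to a chosen norm."""
    scale = target_norm / l2_norm(params)
    return [p * scale for p in params]


params = [3.0, 4.0]   # norm 5
update = [0.3, 0.4]   # norm about 0.5

before = effective_learning_rate(params, update)
# Halving the parameter norm while keeping the update fixed doubles the ratio.
after = effective_learning_rate(rescale_to_norm(params, 2.5), update)
print(before, after)
```

Under this assumption, an update of a fixed size changes a small-norm network proportionally more than a large-norm one, which is one way to read "increasing the effective learning rate".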
A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although deep RL algorithms originally achieved this by storing past experiences collected from a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up works scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently by GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, PQN, and R2D2 constitute a group of value-based methods for parallel training and study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics, and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse involves directly addressing the deadly triad: Q-lambda rollouts for reducing the bias from bootstrapping, the use of LayerNorm for stabilizing function approximation, and parallelized data collection for mitigating off-policy divergence.
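The Q-lambda rollouts mentioned in the abstract above rely on lambda-returns, which blend one-step bootstrapped targets with longer multi-step returns. The following is a generic sketch of the standard backward recursion, not the paper's code. The function name and defaults are hypothetical, `next_values[t]` is assumed to hold the greedy value max_a Q(s_{t+1}, a), and episode termination is ignored for brevity.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.65):
    """Compute lambda-returns with the backward recursion

        G_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * G_{t+1}),

    where v_{t+1} = next_values[t] and the recursion is started by
    bootstrapping from the last value estimate.
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have equal length")
    returns = [0.0] * len(rewards)
    next_return = next_values[-1] if next_values else 0.0
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns


# lam = 0 recovers the one-step target r_t + gamma * v_{t+1};
# lam = 1 uses the full rollout and bootstraps only at the end.
one_step = lambda_returns([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], gamma=0.5, lam=0.0)
full_rollout = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
print(one_step, full_rollout)
```

Larger values of `lam` put more weight on observed rewards and less on the bootstrapped estimate, which is the sense in which such rollouts reduce bootstrapping bias.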
Traditionally, constrained policy optimization with Reinforcement Learning (RL) requires learning a new policy from scratch for any new environment, goal or cost function, with limited generalization to new tasks and constraints. Given the sample inefficiency of many common deep RL methods, this procedure can be impractical for many real-world scenarios, particularly when constraints or tasks are changing. As an alternative, in the unconstrained setting, various works have sought to pre-train representations from offline datasets to accelerate policy optimization upon specification of a reward.
Such methods can permit faster adaptation to new tasks in a given environment, dramatically improving sample efficiency. Recently, zero-shot policy optimization has been explored by leveraging a particular
Despite extensive safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safeguards to elicit harmful content. While prior work attributes this vulnerability to safety training limitations, the internal mechanisms by which LLMs process adversarial prompts remain poorly understood. We present a mechanistic analysis of the jailbreaking behavior in a large-scale, safety-aligned LLM, focusing on LLaMA-2-7B-chat-hf. Leveraging edge attribution patching and subnetwork probing, we systematically identify computational circuits responsible for generating affirmative responses to jailbreak prompts. Ablating these circuits during the first token prediction can reduce attack success rates by up to 80%, demonstrating their critical role in safety bypass. Our analysis uncovers key attention heads and MLP pathways that mediate adversarial prompt exploitation, revealing how important tokens propagate through these components to override safety constraints. These findings advance the understanding of adversarial vulnerabilities in aligned LLMs and pave the way for targeted, interpretable defense mechanisms based on mechanistic interpretability.
Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and the lack of labelled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa, where all the indigenous languages can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings depends not only on the amount of data but also on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings and the opportunities the multilingual pre-trained language model (PLM) offers, especially for languages unseen during pre-training and for low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual text. To address the under-representation of African languages in NLP research, we developed large-scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.