Hugo Larochelle appointed Scientific Director of Mila
An adjunct professor at the Université de Montréal and former head of Google's AI lab in Montréal, Hugo Larochelle is a pioneer in deep learning and one of Canada’s most respected researchers.
Mila’s AI for Climate Studio aims to bridge the gap between technology and impact to unlock the potential of AI in tackling the climate crisis rapidly and on a massive scale.
The program recently published its first policy brief, titled "Policy Considerations at the Intersection of Quantum Technologies and Artificial Intelligence," authored by Padmapriya Mohan.
We use cookies to analyze the browsing and usage of our website and to personalize your experience. You can disable these technologies at any time, but this may limit certain functionalities of the site. Read our Privacy Policy for more information.
Setting cookies
You can enable and disable the types of cookies you wish to accept. However certain choices you make could affect the services offered on our sites (e.g. suggestions, personalised ads, etc.).
Essential cookies
These cookies are necessary for the operation of the site and cannot be deactivated. (Still active)
Analytics cookies
Do you accept the use of cookies to measure the audience of our sites?
Multimedia Player
Do you accept the use of cookies to display and allow you to watch the video content hosted by our partners (YouTube, etc.)?
Publications
Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
In this work, we make two contributions towards understanding of in-context learning of linear models by transformers. First, we investigate… (see more) the adversarial robustness of in-context learning in transformers to hijacking attacks — a type of adversarial attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training --- done either at the pretraining or finetuning stage --- and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.
In this work, we make two contributions towards understanding of in-context learning of linear models by transformers. First, we investigate… (see more) the adversarial robustness of in-context learning in transformers to hijacking attacks — a type of adversarial attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training --- done either at the pretraining or finetuning stage --- and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.
Many real-world applications require recognition models that are robust to different operational conditions and modalities, but at the same … (see more)time run on small embedded devices, with limited hardware. While for normal size models, pre-training is known to be very beneficial in accuracy and robustness, for small models, that can be employed for embedded and edge devices, its effect is not clear. In this work, we investigate the effect of ImageNet pretraining on increasingly small backbone architectures (ultra-small models, with
Many real-world applications require recognition models that are robust to different operational conditions and modalities, but at the same … (see more)time run on small embedded devices, with limited hardware. While for normal size models, pre-training is known to be very beneficial in accuracy and robustness, for small models, that can be employed for embedded and edge devices, its effect is not clear. In this work, we investigate the effect of ImageNet pretraining on increasingly small backbone architectures (ultra-small models, with
Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this probl… (see more)em, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all of the 36 datasets that consist of statements or claims, as well as the 9 datasets that consists of data in purely paragraph form. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality, spurious correlations. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at [anonymized].
2025-08-03
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (published)
Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data… (see more) becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.