Boris Knyazev

Is Depth Heterogeneity a Barrier to Model Merging?

Model merging offers a way to combine the capabilities of several networks at test time without retraining or additional finetuning, but most merging methods assume identical architectures. Depth differences are commonly viewed as a major obstacle because they remove clear layer correspondences. We test this assumption by merging residual networks that differ only in depth, using a simple training-free pipeline based on identity expansion and permutation alignment. Across both same-task and multitask image classification experiments, heterogeneous merges closely match homogeneous ones. The results suggest that, for residual networks, depth mismatch is not the main barrier to effective model merging, and that the main difficulty in model merging comes from aligning independently trained weights in a homogeneous setting.

2026-02-28

TTU_Main_Track @ International Conference on Learning Representations (published)

openreview.net

Celo2: Towards Learned Optimization Free Lunch

Abhinav Moudgil

Boris Knyazev

Eugene Belilovsky

Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months (

2026-02-21

arXiv (preprint)

doi.org

arxiv.org

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Therien

Charles-Etienne Joseph

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

openreview.net

Towards Learned Optimization Free Lunch

Abhinav Moudgil

Boris Knyazev

Eugene Belilovsky

Learned optimizers are powerful alternatives to hand-designed rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

openreview.net

Concept-based Steering of Large Language Models for Conditional Molecular Generation

Yang Zhang

Modern LLMs, with their internet-scale pretraining and advanced human-level capabilities across specialized tasks, have demonstrated promising performance in molecular discovery using existing text-based molecular representations, such as SMILES and SELFIES. However, generating valid, unique, and high-fidelity molecules while precisely controlling for multiple properties simultaneously remains challenging. While prior works demonstrated success by fine-tuning language models on a novel corpus of molecules with property-conditioned tags, real-world applications require generating molecules from diverse property distributions, previously unseen in the training data. To this end, we present Concept-based Activation STeering (CAST), the first approach to apply activation steering to directly edit a model's internal representation for conditional molecular generation. CAST offers a lightweight, flexible alternative to fine-tuning by computing property-conditioned steering vectors via a concept network that does not require retraining the LLM. Through extensive experiments on datasets such as Therapeutics Data Commons, we show that CAST consistently outperforms existing methods on both in-distribution and out-of-distribution conditional generation tasks. We also conduct comprehensive ablation studies to highlight the extent of control our concept-guided steering provides on the molecules generated by the LLM.

2025-09-19

NeurIPS.cc/2025/Workshop/AI4Mat (poster)

openreview.net

Circuit Discovery Helps To Detect LLM Jailbreaking

Despite extensive safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safeguards to elicit harmful content. While prior work attributes this vulnerability to safety training limitations, the internal mechanisms by which LLMs process adversarial prompts remain poorly understood. We present a mechanistic analysis of the jailbreaking behavior in a large-scale, safety-aligned LLM, focusing on LLaMA-2-7B-chat-hf. Leveraging edge attribution patching and subnetwork probing, we systematically identify computational circuits responsible for generating affirmative responses to jailbreak prompts. Ablating these circuits during the first token prediction can reduce attack success rates by up to 80%, demonstrating their critical role in safety bypass. Our analysis uncovers key attention heads and MLP pathways that mediate adversarial prompt exploitation, revealing how important tokens propagate through these components to override safety constraints. These findings advance the understanding of adversarial vulnerabilities in aligned LLMs and pave the way for targeted, interpretable defense mechanisms based on mechanistic interpretability.

2025-06-29

ICML.cc/2025/Workshop/R2-FM (poster)

openreview.net

Accelerating Training with Neuron Interaction and Nowcasting Networks

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. Recently, Jang et al. (2023) proposed a simpler approach to accelerate training based on weight nowcaster networks (WNNs). In their approach, Adam is used for most of the optimization steps and periodically, only every few steps, a WNN nowcasts (predicts near future) parameters. We improve WNNs by proposing neuron interaction and nowcasting (NiNo) networks. In contrast to WNNs, NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters. We further show that in some networks, such as Transformers, modeling neuron connectivity accurately is challenging. We address this and other limitations, which allows NiNo to accelerate Adam training by up to 50% in vision and language tasks.

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Celo: Training Versatile Learned Optimizers on a Compute Diet

Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, that can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.

2025-01-21

arXiv (preprint)

doi.org

openreview.net

Can We Learn Communication-Efficient Optimizers?

Charles-Etienne Joseph

2024-12-31

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Therien

Charles-Etienne Joseph

2024-10-09

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net

Learning Optimizers for Local SGD

Charles-Etienne Joseph

2023-10-26

NeurIPS.cc/2023/Workshop/Federated_Learning (poster)

openreview.net

Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?

Boris Knyazev

Doha Hwang

Simon Lacoste-Julien

Pretraining a neural network on a large dataset is becoming a cornerstone in machine learning that is within the reach of only a few communities with large resources. We aim at an ambitious goal of democratizing pretraining. Towards that goal, we train and release a single neural network that can predict high quality ImageNet parameters of other neural networks. By using predicted parameters for initialization we are able to boost training of diverse ImageNet models available in PyTorch. When transferred to other datasets, models initialized with predicted parameters also converge faster and reach competitive final performance.

2023-07-02

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press
