Portrait de Danilo Vucetic

Danilo Vucetic

Doctorat - UdeM
Superviseur⋅e principal⋅e
Sujets de recherche
Apprentissage par renforcement
Apprentissage profond
Apprentissage sur graphes
GFlowNets
Modèles génératifs
Modèles probabilistes
Optimisation
Théorie des jeux

Publications

Soft Mellowmax Monte Carlo Planning
Soft mellowmax (SMM) recently emerged as an alternative operator in Q-learning, achieving impressive performance in games and scientific dis… (voir plus)covery tasks. Despite SMM's ability to achieve high returns and its enticing robustness, diversity, and sample efficiency characteristics, SMM has not yet been translated into a Monte Carlo tree search algorithm. To address this gap, a soft mellowmax-based Monte Carlo tree search algorithm, SMM-TS, is proposed and theoretically justified. It is empirically demonstrated that SMM-TS converges significantly faster than other tree search methods in synthetic environments, while maintaining competitive performance in games. The fast convergence of SMM-TS makes recursive self-improvement loops more scalable, while the stability gained via planning and the robustness of the operator make SMM-TS more practical for agents operating in uncertain and changing environments.
Beyond Reward Maximization: Evaluating the Diversity of Trajectories in Reinforcement Learning with Temporal Vendi Score
In domains such as scientific discovery and automated design using reinforcement learning (RL), the final task of an agent should extend bey… (voir plus)ond maximising a single scalar reward; it requires identifying diverse sets of high-quality trajectories to uncover distinct solutions that can provide novel insights on how to solve the problems of interest and transfer robustly from simulation to the real world. However, the RL literature currently lacks a holistic, domain-agnostic standard for measuring trajectory diversity. Existing metrics have been developed to improve exploration at training time but not to evaluate and compare diversity induced by different agents, rendering cross-method comparisons inconsistent and challenging. To address this, we introduce the Temporal Vendi Score (TVS), a novel metric designed to evaluate the diversity of an RL agent by computing the entropy of the eigenvalues' similarity matrix of sampled trajectories. Unlike previous approaches, our metric captures the behavioural diversity of trajectories by accounting for both the sequential nature of state visitations and the temporal structure of the underlying MDP, rather than relying on order-agnostic state comparisons. We validate the TVS on simple environments where we can control the number of different ways a problem can be solved, demonstrating that it provides a more robust, semantically meaningful ranking of diversity than standard baselines. We then show that our metric can scale to a high-dimensional, continuous environment.
Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a … (voir plus)small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
Solving Hidden Monotone Variational Inequalities with Surrogate Losses
Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minim… (voir plus)izing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.
Expected Flow Networks in Stochastic Environments and Two-Player Zero-Sum Games
Generative flow networks (GFlowNets) are sequential sampling models trained to match a given distribution. GFlowNets have been successfully … (voir plus)applied to various structured object generation tasks, sampling a diverse set of high-reward objects quickly. We propose expected flow networks (EFlowNets), which extend GFlowNets to stochastic environments. We show that EFlowNets outperform other GFlowNet formulations in stochastic tasks such as protein design. We then extend the concept of EFlowNets to adversarial environments, proposing adversarial flow networks (AFlowNets) for two-player zero-sum games. We show that AFlowNets learn to find above 80% of optimal moves in Connect-4 via self-play and outperform AlphaZero in tournaments.
Efficient Fine-Tuning of BERT Models on the Edge
Mohammadreza Tayaranian
Maryam Ziaeefard
James J. Clark
Brett H. Meyer
Warren J. Gross
Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always… (voir plus) suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, comes increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.