Publications

RainShift: A Benchmark for Precipitation Downscaling Across Geographies
Luca Schmidt
Nicole Ludwig
Matthew Chantry
Christian Lessig
Alex Hernandez-Garcia
Earth System Models (ESMs) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, which demands high-resolution observational data that is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches, including GANs and diffusion models, in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.
RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
Xiuying Wei
Anunay Yadav
Caglar Gulcehre
Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the chunk size, RAT enables flexible trade-offs, combining the strengths of RNNs and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K-token sequences and a 9× improvement in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B-parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1-point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1-point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
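The chunking idea in this abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the linked repository for that); it is a toy illustration under stated assumptions: the function names, the fixed `decay` gate, and the choice to use each chunk's final recurrent state as its summary for causal cross-chunk attention are all hypothetical simplifications.

```python
import math

def chunked_recurrence(xs, chunk_size, decay=0.5):
    """Linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t,
    with the state reset at the start of every chunk (assumed gate: constant decay)."""
    states = []
    h = []
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = [0.0] * len(x)  # new chunk: reset the recurrent state
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_chunk_attention(states, chunk_size):
    """Causal softmax attention over chunk summaries.
    Assumption: the last state of each chunk serves as query, key, and value."""
    summaries = [
        states[min(i + chunk_size, len(states)) - 1]
        for i in range(0, len(states), chunk_size)
    ]
    outputs = []
    for q_idx, q in enumerate(summaries):
        keys = summaries[: q_idx + 1]  # attend only to this and earlier chunks
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(len(q))]
        outputs.append(out)
    return outputs

# Toy input: 8 positions, 2 features, chunk size 4 gives 2 chunks.
xs = [[1.0, 0.0]] * 8
states = chunked_recurrence(xs, chunk_size=4)
chunk_outputs = cross_chunk_attention(states, chunk_size=4)
```

In this sketch the attention step sees one summary per chunk rather than one entry per token, which is where the cost saving would come from; a larger `chunk_size` shifts the design toward recurrence, and a chunk size of 1 reduces it to ordinary causal attention over the inputs.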
ReCatcher: Towards LLMs Regression Testing for Code Generation
Altaf Allah Abbassi
Leuson Da Silva
Amin Nikanjam
Rethinking Prompt Optimization: Reinforcement, Diversification, and Migration in Blackbox LLMs
MohammadReza Davari
Utkarsh Garg
Weixin Cai
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in domains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavioral cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e., combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,
Speciation of coral-associated barnacles: generalists versus specialists in the Indo-West Pacific
Lorenzo C. Halasan
Yoko Nozawa
Benny Kwok Kan Chan
On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes
Mathieu Godbout
Using machine learning algorithms to predict students' general self-efficacy in PISA 2018
Bin Tan
Hao-Yue Jin
What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle
Gharda Sokar
András György
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in nonstationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e., the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
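The notion of an effective learning rate mentioned in this abstract can be made concrete with a short sketch. It is an illustration only, not the paper's method: the abstract does not say which norm goes in the numerator, so the orientation used here (update norm divided by parameter norm) is an assumption, as are the function names and the plain SGD update.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def sgd_update(grads, lr):
    """Plain SGD step direction: update = -lr * grad (assumed optimizer)."""
    return [-lr * g for g in grads]

def effective_learning_rate(params, update):
    """Assumed definition: ||update|| / ||params||."""
    return l2_norm(update) / l2_norm(params)

params = [3.0, 4.0]   # parameter norm 5
grads = [0.6, 0.8]    # gradient norm 1

base = effective_learning_rate(params, sgd_update(grads, lr=0.1))
doubled_lr = effective_learning_rate(params, sgd_update(grads, lr=0.2))

# Halving the parameter norm with the update held fixed has the same effect.
shrunk = effective_learning_rate([p / 2 for p in params], sgd_update(grads, lr=0.1))
```

Under this definition, the ratio can be raised either by taking larger steps or by shrinking the parameter norm, which is why the same nominal learning rate can correspond to very different effective learning rates as parameter norms grow during training.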
What Matters for Maximizing Data Reuse In Value-based Deep Reinforcement Learning
Roger Creus Castanyer
A key ingredient for successfully applying deep reinforcement learning to challenging tasks is the effective use of data at scale. Although deep RL algorithms originally achieved this by storing past experiences collected from a synchronous actor in an external replay memory [DQN; Mnih et al., 2013], follow-up works scaled training by collecting data asynchronously through distributed actors [R2D2; Kapturowski et al., 2018], and more recently by GPU-optimized parallelization [PQN; Gallici et al., 2024]. We argue that DQN, PQN, and R2D2 constitute a group of value-based methods for parallel training and study them to shed light on the dynamics induced by varying data collection schemes. We conduct a thorough empirical study to better understand these dynamics, and propose the Data Replay Ratio as a novel metric for quantifying data reuse. Our findings suggest that maximizing data reuse involves directly addressing the deadly triad: Q-lambda rollouts for reducing the bias from bootstrapping, the use of LayerNorm for stabilizing function approximation, and parallelized data collection for mitigating off-policy divergence.
WiSE-OD: Benchmarking Robustness in Infrared Object Detection
Heitor Rapela Medeiros
Atif Belal
Masih Aminbeidokhti
Eric Granger