Publications

A Strong Baseline for Molecular Few-Shot Learning
Hugo Jeannin
Ismail Ben Ayed
Few-shot learning has recently attracted significant interest in drug discovery, with a recent, fast-growing literature mostly involving con… (voir plus)voluted meta-learning strategies. We revisit the more straightforward fine-tuning approach for molecular data, and propose a regularized quadratic-probe loss based on the the Mahalanobis distance. We design a dedicated block-coordinate descent optimizer, which avoid the degenerate solutions of our loss. Interestingly, our simple fine-tuning approach achieves highly competitive performances in comparison to state-of-the-art methods, while being applicable to black-box settings and removing the need for specific episodic pre-training strategies. Furthermore, we introduce a new benchmark to assess the robustness of the competing methods to domain shifts. In this setting, our fine-tuning baseline obtains consistently better results than meta-learning methods.
From Markov to Laplace: How Mamba In-Context Learns Markov Chains
Marco Bondaschi
Nived Rajaraman
Xiuying Wei
Kannan Ramchandran
Michael C. Gastpar
Ashok Vardhan Makkuva
A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies
Alicia DeVrio
Myra Cheng
Lisa Egede
A.R. Olteanu
Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies… (voir plus) like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.
Bugs in Large Language Models Generated Code: An Empirical Study
Florian Tambon
Amin Nikanjam
Michel C. Desmarais
Giuliano Antoniol
Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages … (voir plus)based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.
Galileo: Learning Global&Local Features of Many Remote Sensing Modalities
Anthony Fuller
Henry Herzog
Patrick Beukema
Favyen Bastani
James R Green
Evan Shelhamer
Hannah Kerner
We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, … (voir plus)elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the diversity of relevant data modalities, and because objects of interest vary massively in scale, from small boats (1-2 pixels and fast) to glaciers (thousands of pixels and slow). We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling. Our dual global and local contrastive losses differ in their targets (deep representations vs. shallow input projections) and masking strategies (structured vs. not). Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages
Hao Yu
Jesujoba Oluwadara Alabi
Andiswa Bukula
Zhuang Yun Jian
En-Shiun Annie Lee
Tadesse Kebede Guge
Israel Abebe Azime
Happy Buzaaba
Blessing Kudzaishe Sibanda
Godson Kalipe
Jonathan Mukiibi
S. Kabenamualu
M. Setaka
Lolwethu Ndolela
Nkiruka Bridget Odu
Rooweither Mabuya
Shamsuddeen Hassan Muhammad
Salomey Osei
Sokhar Samb
Juliet W. Murage … (voir 2 de plus)
Dietrich Klakow
Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
When training neural networks, dying neurons -- units becoming inactive or saturated -- are traditionally seen as harmful. This paper sheds … (voir plus)new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56
Perception and neural representation of intermittent odor stimuli in mice
Luis Boero
Hao Wu
Joseph D. Zak
Farhad Pashakhanloo
Siddharth Jayakumar
Bahareh Tolooshams
Demba Ba
Venkatesh N. Murthy
scCobra allows contrastive cell embedding learning with domain adaptation for single cell data integration and harmonization
Bowen Zhao
Kailu Song
Dong-Qing Wei
Yi Xiong
Author Correction: Isospin competitions and valley polarized correlated insulators in twisted double bilayer graphene
Le Liu
Shihao Zhang
Yanbang Chu
Cheng Shen
Yuan Huang
Yalong Yuan
Jinpeng Tian
Yiru Ji
Rong Yang
Kenji Watanabe
Takashi Taniguchi
Dongxia Shi
Jianpeng Liu
Wei Yang
Guangyu Zhang
Modeling Multivariable High-resolution 3D Urban Microclimate Using Localized Fourier Neural Operator
Dongxue Zhan
Dingyang Geng
Wenhui Peng
Geng Tian
Yurong Shi
Naiping Gao
Xue Liu
Liangzhu (Leon) Wang
Accurate urban microclimate analysis with wind velocity and temperature is vital for energy-efficient urban planning, supporting carbon redu… (voir plus)ction, enhancing public health and comfort, and advancing the low-altitude economy. However, traditional computational fluid dynamics (CFD) simulations that couple velocity and temperature are computationally expensive. Recent machine learning advancements offer promising alternatives for accelerating urban microclimate simulations. The Fourier neural operator (FNO) has shown efficiency and accuracy in predicting single-variable velocity magnitudes in urban wind fields. Yet, for multivariable high-resolution 3D urban microclimate prediction, FNO faces three key limitations: blurry output quality, high GPU memory demand, and substantial data requirements. To address these issues, we propose a novel localized Fourier neural operator (Local-FNO) model that employs local training, geometry encoding, and patch overlapping. Local-FNO provides accurate predictions for rapidly changing turbulence in urban microclimate over 60 seconds, four times the average turbulence integral time scale, with an average error of 0.35 m/s in velocity and 0.30 °C in temperature. It also accurately captures turbulent heat flux represented by the velocity-temperature correlation. In a 2 km by 2 km domain, Local-FNO resolves turbulence patterns down to a 10 m resolution. It provides high-resolution predictions with 150 million feature dimensions on a single 32 GB GPU at nearly 50 times the speed of a CFD solver. Compared to FNO, Local-FNO achieves a 23.9% reduction in prediction error and a 47.3% improvement in turbulent fluctuation correlation.
HiPoNet: A Multi-View Simplicial Complex Network for High Dimensional Point-Cloud and Single-Cell Data
Hiren Madhu
Dhananjay Bhaskar
David R. Johnson
Rex Ying
Christopher Tape
Ian Adelstein
Michael Perlmutter
In this paper, we propose HiPoNet, an end-to-end differentiable neural network for regression, classification, and representation learning o… (voir plus)n high-dimensional point clouds. Our work is motivated by single-cell data which can have very high-dimensionality --exceeding the capabilities of existing methods for point clouds which are mostly tailored for 3D data. Moreover, modern single-cell and spatial experiments now yield entire cohorts of datasets (i.e., one data set for every patient), necessitating models that can process large, high-dimensional point-clouds at scale. Most current approaches build a single nearest-neighbor graph, discarding important geometric and topological information. In contrast, HiPoNet models the point-cloud as a set of higher-order simplicial complexes, with each particular complex being created using a reweighting of features. This method thus generates multiple constructs corresponding to different views of high-dimensional data, which in biology offers the possibility of disentangling distinct cellular processes. It then employs simplicial wavelet transforms to extract multiscale features, capturing both local and global topology from each view. We show that geometric and topological information is preserved in this framework both theoretically and empirically. We showcase the utility of HiPoNet on point-cloud level tasks, involving classification and regression of entire point-clouds in data cohorts. Experimentally, we find that HiPoNet outperforms other point-cloud and graph-based models on single-cell data. We also apply HiPoNet to spatial transcriptomics datasets using spatial coordinates as one of the views. Overall, HiPoNet offers a robust and scalable solution for high-dimensional data analysis.