Publications

A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies

Alicia DeVrio

Myra Cheng

Lisa Egede

Su Lin Blodgett

Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies… (see more) like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.

2025-02-14

ArXiv (preprint)

doi.org

arxiv.org

Bugs in Large Language Models Generated Code: An Empirical Study

Florian Tambon

Arghavan Moradi Dakhel

Amin Nikanjam

Foutse Khomh

Michel C. Desmarais

Giuliano Antoniol

2025-02-13

Empirical Software Engineering (published)

doi.org

arxiv.org

Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models

Gabriel Tseng

A. Fuller

Marlena Reil

Henry Herzog

Patrick Beukema

Favyen Bastani

James R. Green

Evan Shelhamer

Hannah Kerner

David Rolnick

From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commona… (see more)lities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To solve this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.

2025-02-13

ArXiv (preprint)

arxiv.org

Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models

Gabriel Tseng

A. Fuller

Marlena Reil

Henry Herzog

Patrick Beukema

Favyen Bastani

James R. Green

Evan Shelhamer

Hannah Kerner

David Rolnick

From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commona… (see more)lities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To solve this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.

2025-02-13

ArXiv (preprint)

doi.org

arxiv.org

Galileo: Learning Global&Local Features of Many Remote Sensing Modalities

Gabriel Tseng

A. Fuller

Marlena Reil

Henry Herzog

Patrick Beukema

Favyen Bastani

James R. Green

Evan Shelhamer

Hannah Kerner

David Rolnick

We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, … (see more)elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the diversity of relevant data modalities, and because objects of interest vary massively in scale, from small boats (1-2 pixels and transient) to glaciers (thousands of pixels and persistent). We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling. Our dual global and local contrastive losses differ in their targets (deep representations vs. shallow input projections) and masking strategies (structured vs. not). Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.

2025-02-13

ArXiv (preprint)

arxiv.org

INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages

Hao Yu

Jesujoba Oluwadara Alabi

Andiswa Bukula

Zhuang Yun Jian

En-Shiun Annie Lee

Tadesse Kebede Guge

Israel Abebe Azime

Happy Buzaaba

Blessing Kudzaishe Sibanda

Godson Kalipe

Jonathan Mukiibi

S. Kabenamualu

M. Setaka

Lolwethu Ndolela

Nkiruka Bridget Odu

Rooweither Mabuya

Shamsuddeen Hassan Muhammad

Salomey Osei

Sokhar Samb

Juliet W. Murage … (see 2 more)

Dietrich Klakow

David Ifeoluwa Adelani

Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks o… (see more)ften exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark the fine-tuning multilingual transformer models and the prompting large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from the English language. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average performance of 26 F1-score. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuning baselines. Compared to the English language, GPT-4o and fine-tuning baselines perform similarly on intent detection, achieving an accuracy of approximately 81%. Our findings suggest that the performance of LLMs is still behind for many low-resource African languages, and more work is needed to further improve their downstream performance.

2025-02-13

ArXiv (preprint)

arxiv.org

INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages

Hao Yu

Jesujoba Oluwadara Alabi

Andiswa Bukula

Zhuang Yun Jian

En-Shiun Annie Lee

Tadesse Kebede Guge

Israel Abebe Azime

Happy Buzaaba

Blessing Kudzaishe Sibanda

Godson Kalipe

Jonathan Mukiibi

S. Kabenamualu

M. Setaka

Lolwethu Ndolela

Nkiruka Bridget Odu

Rooweither Mabuya

Shamsuddeen Hassan Muhammad

Salomey Osei

Sokhar Samb

Juliet W. Murage … (see 2 more)

Dietrich Klakow

David Ifeoluwa Adelani

2025-02-13

ArXiv (preprint)

doi.org

arxiv.org

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Simon Dufort-Labbé

Pierluca D'Oro

Evgenii Nikishin

Razvan Pascanu

Irina Rish

Pierre-Luc Bacon

Aristide Baratin

2025-02-13

TMLR (accepted)

doi.org

openreview.net

Perception and neural representation of intermittent odor stimuli in mice

Luis Boero

Hao Wu

Joseph D. Zak

Paul Masset

Farhad Pashakhanloo

Siddharth Jayakumar

Bahareh Tolooshams

Demba Ba

Venkatesh N. Murthy

2025-02-13

bioRxiv (preprint)

doi.org

scCobra allows contrastive cell embedding learning with domain adaptation for single cell data integration and harmonization

Bowen Zhao

Kailu Song

Dong-Qing Wei

Yi Xiong

Jun Ding

2025-02-13

Communications Biology (published)

doi.org

Author Correction: Isospin competitions and valley polarized correlated insulators in twisted double bilayer graphene

Le Liu

Shihao Zhang

Yanbang Chu

Cheng Shen

Yuan Huang

Yalong Yuan

Jinpeng Tian

Jian Tang

Yiru Ji

Rong Yang

Kenji Watanabe

Takashi Taniguchi

Dongxia Shi

Jianpeng Liu

Wei Yang

Guangyu Zhang

2025-02-12

Nature Communications (published)

doi.org

HiPoNet: A Multi-View Simplicial Complex Network for High Dimensional Point-Cloud and Single-Cell Data

Siddharth Viswanath

Hiren Madhu

Dhananjay Bhaskar

Jake Kovalic

David R. Johnson

Christopher Tape

Ian Adelstein

Rex Ying

Michael Perlmutter

Smita Krishnaswamy

In this paper, we propose HiPoNet, an end-to-end differentiable neural network for regression, classification, and representation learning o… (see more)n high-dimensional point clouds. Our work is motivated by single-cell data which can have very high-dimensionality --exceeding the capabilities of existing methods for point clouds which are mostly tailored for 3D data. Moreover, modern single-cell and spatial experiments now yield entire cohorts of datasets (i.e., one data set for every patient), necessitating models that can process large, high-dimensional point-clouds at scale. Most current approaches build a single nearest-neighbor graph, discarding important geometric and topological information. In contrast, HiPoNet models the point-cloud as a set of higher-order simplicial complexes, with each particular complex being created using a reweighting of features. This method thus generates multiple constructs corresponding to different views of high-dimensional data, which in biology offers the possibility of disentangling distinct cellular processes. It then employs simplicial wavelet transforms to extract multiscale features, capturing both local and global topology from each view. We show that geometric and topological information is preserved in this framework both theoretically and empirically. We showcase the utility of HiPoNet on point-cloud level tasks, involving classification and regression of entire point-clouds in data cohorts. Experimentally, we find that HiPoNet outperforms other point-cloud and graph-based models on single-cell data. We also apply HiPoNet to spatial transcriptomics datasets using spatial coordinates as one of the views. Overall, HiPoNet offers a robust and scalable solution for high-dimensional data analysis.

2025-02-11

ArXiv (preprint)

arxiv.org

AI Advantage

Strategic Priorities

Mila AI Policy Fellowship

AI Advantage

Strategic Priorities

Publications

AI Advantage

Strategic Priorities

Mila AI Policy Fellowship

AI Advantage

Strategic Priorities

Popular keywords:

Publications