Publications

A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection

Eduardo Dadalto Câmara Gomes

Pierre Colombo

Guillaume Staerman

Nathan Noiry

Pablo Piantanida

2023-06-05

ArXiv (preprint)

doi.org

arxiv.org

The Stack: 3 TB of permissively licensed source code

Denis Kocetkov

Raymond Li

Loubna Ben allal

Jia LI

Chenghao Mou

Carlos Muñoz Ferrandis

Yacine Jernite

Margaret Mitchell

Sean Hughes

Thomas Wolf

Dzmitry Bahdanau

Leandro Von Werra

Harm de Vries

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI), not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

2023-06-05

TMLR (accepted)

doi.org

openreview.net

AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms

Zana Buçinca

Chau Minh Pham

Maurice Jakesch

Marco Túlio Ribeiro

A.R. Olteanu

Saleema Amershi

While demands for change and accountability for harmful AI consequences mount, foreseeing the downstream effects of deploying AI systems remains a challenging task. We developed AHA! (Anticipating Harms of AI), a generative framework to assist AI practitioners and decision-makers in anticipating potential harms and unintended consequences of AI systems prior to development or deployment. Given an AI deployment scenario, AHA! generates descriptions of possible harms for different stakeholders. To do so, AHA! systematically considers the interplay between common problematic AI behaviors as well as their potential impacts on different stakeholders, and narrates these conditions through vignettes. These vignettes are then filled in with descriptions of possible harms by prompting crowd workers and large language models. By examining 4113 harms surfaced by AHA! for five different AI deployment scenarios, we found that AHA! generates meaningful examples of harms, with different problematic AI behaviors resulting in different types of harms. Prompting both crowds and a large language model with the vignettes resulted in more diverse examples of harms than those generated by either the crowd or the model alone. To gauge AHA!'s potential practical utility, we also conducted semi-structured interviews with responsible AI professionals (N=9). Participants found AHA!'s systematic approach to surfacing harms important for ethical reflection and discovered meaningful stakeholders and harms they believed they would not have thought of otherwise. Participants, however, differed in their opinions about whether AHA! should be used upfront or as a secondary check and noted that AHA! may shift harm anticipation from an ideation problem to a potentially demanding review problem. Drawing on our results, we discuss design implications of building tools to help practitioners envision possible harms.

2023-06-04

ArXiv (preprint)

doi.org

arxiv.org

PAC-Bayesian Learning of Aggregated Binary Activated Neural Networks with Probabilities over Representations

Louis Fortier-Dubois

Gaël Letarte

Benjamin Leblanc

François Laviolette

Pascal Germain

2023-06-04

Proceedings of the Canadian Conference on Artificial Intelligence (published)

doi.org

arxiv.org

Spatial variations in aromatic hydrocarbon emission in a dust-rich galaxy

Justin S. Spilker

Kedar A. Phadke

Manuel Aravena

Melanie Archipley

Matthew B. Bayliss

Jack E. Birkin

Matthieu Béthermin

James Burgoyne

Jared Cathey

Scott C. Chapman

Håkon Dahle

Anthony H. Gonzalez

Gayathri Gururajan

Christopher C. Hayward

Yashar D. Hezaveh

Ryley Hill

Taylor A. Hutchison

Keunho J. Kim

Seonwoo Kim

David Law

Ronan Legin

Matthew A. Malkan

Daniel P. Marrone

Eric J. Murphy

Desika Narayanan

Alex Navarre

Grace M. Olivier

Jeffrey A. Rich

Jane R. Rigby

Cassie Reuter

James E. Rhoads

Keren Sharon

J.D. T. Smith

Manuel Solimano

Nikolaus Sulzenauer

Joaquin D. Vieira

David Law

Axel Weiß

Katherine E. Whitaker

Dust grains absorb half of the radiation emitted by stars throughout the history of the universe, re-emitting this energy at infrared wavelengths. Polycyclic aromatic hydrocarbons (PAHs) are large organic molecules that trace millimeter-size dust grains and regulate the cooling of the interstellar gas within galaxies. Observations of PAH features in very distant galaxies have been difficult due to the limited sensitivity and wavelength coverage of previous infrared telescopes. Here we present JWST observations that detect the 3.3 μm PAH feature in a galaxy observed less than 1.5 billion years after the Big Bang. The high equivalent width of the PAH feature indicates that star formation, rather than black hole accretion, dominates the infrared emission throughout the galaxy. The light from PAH molecules, large dust grains, and stars and hot dust are spatially distinct from one another, leading to order-of-magnitude variations in the PAH equivalent width and the ratio of PAH to total infrared luminosity across the galaxy. The spatial variations we observe suggest either a physical offset between the PAHs and large dust grains or wide variations in the local ultraviolet radiation field. Our observations demonstrate that differences in the emission from PAH molecules and large dust grains are a complex result of localized processes within early galaxies.

2023-06-04

Nature (published)

doi.org

arxiv.org

Dialogue System with Missing Observation

Djallel Bouneffouf

Mayank Agarwal

Irina Rish

Within the domain of dialogue, the ability to orchestrate multiple independently trained dialogue agents to create a unified system is of particular importance. We define orchestration as the task of selecting a subset of skills that most appropriately answer a user input, using features extracted from both the user input and the individual skills. In this work, we study the task of online dialogue orchestration where the user feedback associated with the dialogue agent may not always be observed. To address the missing feedback setting, we propose to combine the attentive contextual bandit approach with an unsupervised learning mechanism such as clustering. By leveraging clustering to estimate missing rewards, we are able to learn from each incoming event, even those with missing rewards. Promising empirical results are obtained on proprietary conversational datasets.

2023-06-03

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

Evaluation of Categorical Generative Models - Bridging the Gap Between Real and Synthetic Data

Florence Regol

Anja Kroon

Mark J. Coates

The machine learning community has mainly relied on real data to benchmark algorithms as it provides compelling evidence of model applicability. Evaluation on synthetic datasets can be a powerful tool to provide a better understanding of a model’s strengths, weaknesses and overall capabilities. Gaining these insights can be particularly important for generative modeling as the target quantity is completely unknown. Multiple issues related to the evaluation of generative models have been reported in the literature. We argue those problems can be avoided by an evaluation based on ground truth. General criticisms of synthetic experiments are that they are too simplified and not representative of practical scenarios. As such, our experimental setting is tailored to a realistic generative task. We focus on categorical data and introduce an appropriately scalable evaluation method. Our method involves tasking a generative model to learn a distribution in a high-dimensional setting. We then successively bin the large space to obtain smaller probability spaces where meaningful statistical tests can be applied. We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks, and compare the generative models based on the highest task difficulty they can reach before being detected as being too far from the ground truth. We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.

2023-06-03

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

arxiv.org

Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study

Salah Zaiem

Robin Algayres

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

2023-06-03

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

doi.org

arxiv.org

Self-Supervised Learning for Infant Cry Analysis

Arsenii Gorin

Yusuf Cem Sübakan

Sajjad Abdoli

Junhao Wang

Samantha Latremouille

Charles Onu

In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications of more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neuro injury and cry triggers. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that using such SSL-based pre-training for adaptation to cry sounds decreases the need for labeled data of the overall system.

2023-06-03

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

doi.org

arxiv.org

ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence

Dmitriy Rivkin

Gregory Dudek

Nikhil Kakodkar

David Meger

Oliver Limoyo

Xue Liu

Francois Hogan

Our work examines the way in which large language models can be used for robotic planning and sampling in the context of automated photographic documentation. Specifically, we illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general purpose language (LM) and vision-language (VLM) models. Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. The photo portfolios generated by our method are consistently rated as more appropriate to the event by human evaluators than those generated by existing methods.

2023-06-01

2023 IEEE International Conference on Robotics and Automation (ICRA) (published)

doi.org

arxiv.org

Generating Stable and Collision-Free Policies through Lyapunov Function Learning

Alexandre Coulombe

Hsiu-Chin Lin

The need for rapid and reliable robot deployment is on the rise. Imitation Learning (IL) has become popular for producing motion planning policies from a set of demonstrations. However, many methods in IL are not guaranteed to produce stable policies. The generated policy may not converge to the robot target, reducing reliability, and may collide with its environment, reducing the safety of the system. Stable Estimator of Dynamic Systems (SEDS) produces stable policies by constraining the Lyapunov stability criteria during learning, but the Lyapunov candidate function had to be manually selected. In this work, we propose a novel method for learning a Lyapunov function and a collision-free policy using a single neural network model. The method can be equipped with an obstacle avoidance module for convex object pairs to guarantee no collisions. We demonstrated that our method is capable of finding policies in several simulation environments and of transferring to a real-world scenario.

2023-06-01

2023 IEEE International Conference on Robotics and Automation (ICRA) (published)

doi.org

arxiv.org

Improving Generalization in Task-oriented Dialogues with Workflows and Action Plans

Stefania Raimondo

Christopher Pal

Xiaotian Liu

David Vázquez

Hector Palacios

2023-06-01

ArXiv (preprint)

doi.org

arxiv.org
