Publications

Hyena Hierarchy: Towards Larger Convolutional Language Models

Michael Poli

Eric Nguyen

Daniel Y Fu

Tri Dao

Stephen Baccus

Stefano Ermon

Christopher Ré

Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

2023-02-21

ArXiv (preprint)

Unsupervised Layer-wise Score Aggregation for Textual OOD Detection

Maxime Darrin

Guillaume Staerman

Eduardo Dadalto Câmara Gomes

Jackie Cheung

Pablo Piantanida

Pierre Colombo

2023-02-20

ArXiv (preprint)

Interpret Your Care: Predicting the Evolution of Symptoms for Cancer Patients

Rupali Bhati

Jennifer Jones

Audrey Durand

Cancer treatment is an arduous process for patients and causes many side-effects during and post-treatment. The treatment can affect almost all body systems and result in pain, fatigue, sleep disturbances, cognitive impairments, etc. These conditions are often under-diagnosed or under-treated. In this paper, we use patient data to predict the evolution of their symptoms such that treatment-related impairments can be prevented or effects meaningfully ameliorated. The focus of this study is on predicting the pain and tiredness level of a patient post their diagnosis. We implement an interpretable decision tree based model called LightGBM on real-world patient data consisting of 20163 patients. There exists a class imbalance problem in the dataset which we resolve using the oversampling technique of SMOTE. Our empirical results show that the value of the previous level of a symptom is a key indicator for prediction and the weighted average deviation in prediction of pain level is 3.52 and of tiredness level is 2.27.

2023-02-19

ArXiv (preprint)

Stochastic Generative Flow Networks

Ling Pan

Dinghuai Zhang

Moksh J. Jain

Longbo Huang

Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of "inference as control". They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.

2023-02-19

ArXiv (preprint)

LAGrad: Statically Optimized Differentiable Programming in MLIR

Mai Jacob Peng

Christophe Dubach

2023-02-17

International Conference on Compiler Construction (published)

Effects of incoming particle energy and cluster size on the G-value of hydrated electrons.

Alaina Bui

H. Bekerat

Lilian Childress

Jack C Sankey

Jan Seuntjens

Shirin A. Enger

2023-02-16

Physica Medica (published)

MOT: A Multi-Omics Transformer for Multiclass Classification Tumour Types Predictions

Mazid Osseni

Prudencio Tossou

François Laviolette

Jacques Corbeil

2023-02-16

Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (published)

Refactoring practices in the context of data-intensive systems

Biruk Asmare Muse

Foutse Khomh

Giuliano Antoniol

2023-02-16

Empirical Software Engineering (published)

Learning to Substitute Ingredients in Recipes

Bahare Fatemi

Quentin Duval

Rohit Girdhar

Michal Drozdzal

Adriana Romero Soriano

Recipe personalization through ingredient substitution has the potential to help people meet their dietary needs and preferences, avoid potential allergens, and ease culinary exploration in everyone's kitchen. To address ingredient substitution, we build a benchmark, composed of a dataset of substitution pairs with standardized splits, evaluation metrics, and baselines. We further introduce Graph-based Ingredient Substitution Module (GISMo), a novel model that leverages the context of a recipe as well as generic ingredient relational information encoded within a graph to rank plausible substitutions. We show through comprehensive experimental validation that GISMo surpasses the best performing baseline by a large margin in terms of mean reciprocal rank. Finally, we highlight the benefits of GISMo by integrating it in an improved image-to-recipe generation pipeline, enabling recipe personalization through user intervention. Quantitative and qualitative results show the efficacy of our proposed system, paving the road towards truly personalized cooking and tasting experiences.

2023-02-15

ArXiv (preprint)