Publications

Sample Boosting Algorithm (SamBA) - An interpretable greedy ensemble classifier based on local expertise for fat data
Baptiste Bauvin
Cécile Capponi
Florence Clerc
Sokol Koço
Jacques Corbeil
Scaling Self-Supervised End-to-End Driving with Multi-View Attention Learning
Yi Xiao
Felipe Codevilla
Diego Porres
Antonio M. López
Screening methods for congenital anomalies in low and lower-middle income countries: A systematic review.
Justina O. Seyi-Olajide
Xiya Ma
Elena Guadagno
Adesoji Ademuyiwa
Self-Influence Guided Data Reweighting for Language Model Pre-training
Megh Thakkar
Tolga Bolukbasi
Sriram Ganapathy
Shikhar Vashishth
Partha Talukdar
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. Once the pre-training corpus has been assembled, all data samples in the corpus are treated with equal importance during LM pre-training. However, due to varying levels of relevance and quality of data, assigning equal importance to all data samples may not be the optimal choice. While data reweighting has been explored in the context of task-specific supervised learning and LM fine-tuning, model-driven reweighting for pre-training data has not been explored. We fill this important gap and propose PRESENCE, a method for jointly reweighting samples by leveraging self-influence (SI) scores as an indicator of sample importance and pre-training. PRESENCE promotes novelty and stability for model pre-training. Through extensive analysis spanning multiple model sizes, datasets, and tasks, we present PRESENCE as an important first step in the research direction of sample reweighting for pre-training language models.
SORBETmatcher results for OAEI 2023.
Francis Gosselin
Amal Zouaq
Of Stances, Themes, and Anomalies in COVID-19 Mask-Wearing Tweets
Jwen Fai Low
Farkhund Iqbal
COVID-19 is an opportunity to study public acceptance of a “new” healthcare intervention, universal masking, which, unlike vaccination, is mostly alien to the Anglosphere public despite being practiced in ages past. Using a collection of over two million tweets, we studied the ways in which proponents and opponents of masking vied for influence as well as the themes driving the discourse. Pro-mask tweets encouraging others to mask up dominated Twitter early in the pandemic, though their dominance has since been eroded by anti-mask tweets criticizing others for their masking behavior. Engagement, represented by the counts of likes, retweets, and replies, and controversiality and disagreeableness, represented by ratios of the aforementioned counts, favored pro-mask tweets initially, but anti-mask tweets slowly gained ground. Additional analysis raised the possibility of the platform owners suppressing certain parts of the mask-wearing discussion.
Stochastic Generative Flow Networks
Ling Pan
Dinghuai Zhang
Moksh J. Jain
Longbo Huang
Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of "inference as control". They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.
SUMMIT: Scaffolding OSS Issue Discussion Through Summarization
Saskia Gilmer
Avinash Bhat
Shuvam Shah
Kevin Cherry
Jinghui Cheng
Supplementary Material for MixupE
Yingtian Zou
Vikas Verma
Sarthak Mittal
Wai Hoh Tang
Hieu Pham
Juho Kannala
Arno Solin
Kenji Kawaguchi
We denote by z = (x,y) the input and output pair where x ∈ X ⊆ R and y ∈ Y ⊆ R. Let fθ(x) ∈ R be the output of the logits (i.e., the last layer before the softmax or sigmoid) of the model parameterized by θ. We use l(θ, z) = h(fθ(x)) − yfθ(x) to denote the loss function. Let g(·) be the activation function. We use x(i) to denote the i-th element of the vector x and xj to represent the j-th variable in a set. The notation list is:
A Survey of Diversification Metrics and Approaches in Retrieval Systems: From the Perspective of Search and Recommendation
Haolun Wu
Yansen Zhang
Chen Ma
Fuyuan Lyu
Diversifying search results is an important research topic in retrieval systems in order to satisfy both the various interests of customers and the equal market exposure of providers. Attention to diversity-aware research has grown in recent years, accompanied by a proliferation of literature on methods to promote diversity in search and recommendation. However, diversity-aware studies in retrieval systems lack a systematic organization and are rather fragmented. In this survey, we are the first to propose a unified taxonomy for classifying the metrics and approaches of diversification in both search and recommendation, which are two of the most extensively researched fields of retrieval systems. We begin the survey with a brief discussion of why diversity is important in retrieval systems