Publications

StereoSet: Measuring stereotypical bias in pretrained language models

Moin Nadeem

Anna Bethke

A stereotype is an over-generalized belief about a particular group of people, e.g., Asians are good at math or African Americans are athletic. Such beliefs (biases) are known to hurt target groups. Since pretrained language models are trained on large real-world data, they are known to capture stereotypical biases. It is important to quantify to what extent these biases are present in them. Although this is a rapidly growing area of research, existing literature is lacking in two important respects: 1) it mainly evaluates the bias of pretrained language models on a small set of artificial sentences, even though these models are trained on natural data; 2) current evaluations focus on measuring bias without considering the language modeling ability of a model, which could lead to misplaced trust in a model even if it is a poor language model. We address both these problems. We present StereoSet, a large-scale natural English dataset to measure stereotypical biases in four domains: gender, profession, race, and religion. We contrast both stereotypical bias and language modeling ability of popular models like BERT, GPT-2, RoBERTa, and XLNet. We show that these models exhibit strong stereotypical biases. Our data and code are available at https://stereoset.mit.edu.

2021-08-01

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (published)

doi.org

arxiv.org

Supervised multi-specialist topic model with applications on large-scale electronic health record data

Ziyang Song

Xavier Sumba Toral

Yixin Xu

Aihua Liu

Liming Guo

Guido Powell

Aman Verma

David Buckeridge

Ariane Marelli

Yue Li

Motivation: Electronic health record (EHR) data provides a new avenue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. Materials and Methods: We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on the patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. Results: We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. Availability and implementation: MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS

2021-08-01

Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (published)

doi.org

arxiv.org

A systematic analysis of ICSD-3 diagnostic criteria and proposal for further structured iteration.

Christophe Gauld

Régis Lopez

Pierre A. Geoffroy

Charles Morin

Kelly Guichard

Elodie Giroux

Yves Dauvilliers

Guillaume Dumas

Pierre Philip

Jean‐Arthur Micoulaud‐Franchi

2021-08-01

Sleep Medicine Reviews (published)

doi.org

Temporal Profiles of Social Attention Are Different Across Development in Autistic and Neurotypical People.

Teresa Del Bianco

Luke Mason

Tony Charman

Julianne Tillman

Eva Loth

Hannah Hayward

F. Shic

Jan K. Buitelaar

Mark Johnson

Emily J. H. Jones

Jumana Ahmad

Sara Ambrosino

Tobias Banaschewski

Simon Baron-Cohen

Sarah Baumeister

Christian Beckmann

Sven Bölte

Thomas Bourgeron

Carsten Bours

M. Brammer

Daniel Brandeis

Claudia Brogna

Yvette de Bruijn

Ineke Cornelissen

Daisy Crawley

Flavio Dell’Acqua

Guillaume Dumas

Sarah Durston

Christine Ecker

Jessica Faulkner

Vincent Frouin

Pilar Garcés

David Goyard

Lindsay Ham

Joerg F. Hipp

Rosemary Holt

Meng-Chuan Lai

Xavier Liogier D’ardhuy

Michael V. Lombardo

David J. Lythgoe

René Mandl

Andre Marquand

Maarten Mennes

Andreas Meyer-Lindenberg

Carolin Moessnang

Nico Mueller

Declan Murphy

Beth Oakley

Laurence O’Dwyer

Marianne Oldehinkel

Bob Oranje

Gahan Pandina

Antonio Persico

Barbara Ruggeri

Amber N. V. Ruigrok

Jessica Sabet

Roberto Sacco

Antonia San José Cáceres

Emily Simonoff

Will Spooren

Roberto Toro

Heike Tost

Jack Waldman

Steve C. R. Williams

Caroline Wooldridge

Marcel P. Zwiers

2021-08-01

Biological Psychiatry: Cognitive Neuroscience and Neuroimaging (published)

doi.org

Why do sleep disorders belong to mental disorder classifications? A network analysis of the "Sleep-Wake Disorders" section of the DSM-5.

Christophe Gauld

Régis Lopez

Charles Morin

Julien Maquet

Aileen McGonigal

Pierre A. Geoffroy

Eric Fakra

Pierre Philip

Guillaume Dumas

Jean‐Arthur Micoulaud‐Franchi

2021-08-01

Journal of Psychiatric Research (published)

doi.org

Human brain anatomy reflects separable genetic and environmental components of socioeconomic status

H. Kweon

Gökhan Aydogan

Alain Dagher

Danilo Bzdok

C. Ruff

Gideon Nave

Martha J Farah

Philipp Koellinger

Recent studies report that socioeconomic status (SES) correlates with brain structure. Yet, such findings are variable and little is known about underlying causes. We present a well-powered voxel-based analysis of grey matter volume (GMV) across levels of SES, finding many small SES effects widely distributed across the brain, including cortical, subcortical and cerebellar regions. We also construct a polygenic index of SES to control for the additive effects of common genetic variation related to SES, which attenuates observed SES-GMV relations, to different degrees in different areas. Remaining variance, which may be attributable to environmental factors, is substantially accounted for by body mass index, a marker for lifestyle related to SES. In sum, SES affects multiple brain regions through measurable genetic and environmental effects. One-sentence summary: Socioeconomic status is linked with brain anatomy through a varying balance of genetic and environmental influences.

2021-07-29

bioRxiv (preprint)

doi.org

Local Structure Matters Most: Perturbation Study in NLU

Louis Clouâtre

Prasanna Parthasarathi

Amal Zouaq

Sarath Chandar Anbil Parthipan

Recent research analyzing the sensitivity of natural language understanding models to word-order perturbations has shown that neural models are surprisingly insensitive to the order of words. In this paper, we investigate this phenomenon by developing order-altering perturbations on the order of words, subwords, and characters to analyze their effect on neural models’ performance on language understanding tasks. We experiment with measuring the impact of perturbations to the local neighborhood of characters and global position of characters in the perturbed texts and observe that perturbation functions found in prior literature only affect the global ordering while the local ordering remains relatively unperturbed. We empirically show that neural models, regardless of their inductive biases, pretraining scheme, or choice of tokenization, mostly rely on the local structure of text to build understanding and make limited use of the global structure.

2021-07-29

ArXiv (preprint)

doi.org

arxiv.org

Clones in deep learning code: what, where, and why?

Hadhemi Jebnoun

Md. Saidur Rahman

Foutse Khomh

Biruk Asmare Muse

2021-07-28

ArXiv (preprint)

doi.org

arxiv.org

Automated Data-Driven Generation of Personalized Pedagogical Interventions in Intelligent Tutoring Systems

Ekaterina Kochmar

Dung D. Vu

Robert Belfer

Varun Gupta

Iulian V. Serban

Joelle Pineau

2021-07-27

International Journal of Artificial Intelligence in Education (published)

doi.org

Geographical concentration of COVID-19 cases by social determinants of health in 16 large metropolitan areas in Canada - a cross-sectional study

Yiqing Xia

Huiting Ma

Gary Moloney

Héctor A. Velásquez García

Monica Sirski

Naveed Janjua

David Vickers

Tyler Williamson

Alan Katz

Kristy Yu

Rafal Kustra

David Buckeridge

Marc Brisson

Stefan Baral

Sharmistha Mishra

Mathieu Maheu-Giroux

Background: There is a growing recognition that strategies to reduce SARS-CoV-2 transmission should be responsive to local transmission dynamics. Studies have revealed inequalities along social determinants of health, but few have investigated geographic concentration within cities. We quantified social determinants of geographic concentration of COVID-19 cases across sixteen census metropolitan areas (CMA) in four Canadian provinces. Methods: We used surveillance data on confirmed COVID-19 cases at the level of dissemination area. Gini (co-Gini) coefficients were calculated by CMA based on the proportion of the population in ranks of diagnosed cases and each social determinant using census data (income, education, visible minority, recent immigration, suitable housing, and essential workers) and the corresponding share of cases. Heterogeneity was visualized using Lorenz (concentration) curves. Results: Geographic concentration was observed in all CMAs (half of the cumulative cases were concentrated among 21-35% of each city's population), with the greatest geographic heterogeneity in Ontario CMAs (Gini coefficients, 0.32-0.47), followed by British Columbia (0.23-0.36), Manitoba (0.32), and Quebec (0.28-0.37). Cases were disproportionately concentrated in areas with lower income, educational attainment, and suitable housing, and with higher proportions of visible minorities, recent immigrants, and essential workers. Although concentration by proportion of visible minorities was a consistent feature across CMAs, the magnitude of concentration by social determinants varied across CMAs. Interpretation: Geographical concentration of COVID-19 cases was consistent across CMAs, but the pattern by social determinants varied. Geographically prioritized allocation of resources and services should be tailored to the local drivers of inequalities in transmission in response to SARS-CoV-2's resurgence.

2021-07-26

medRxiv (preprint)

doi.org
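
The Gini coefficient and Lorenz curve mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function names `lorenz_curve` and `gini`, the toy inputs, and the choice to rank areas by cases per capita are assumptions made for illustration only, and the co-Gini variant (ranking areas by a social determinant instead) is not covered.

```python
from typing import List, Sequence, Tuple


def lorenz_curve(
    cases: Sequence[float], population: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Cumulative population share vs. cumulative case share.

    Areas are ranked from lowest to highest cases per capita
    (an assumed ranking; the paper's exact procedure may differ).
    """
    areas = sorted(zip(cases, population), key=lambda cp: cp[0] / cp[1])
    total_cases = sum(cases)
    total_pop = sum(population)
    pop_share, case_share = [0.0], [0.0]
    cum_pop = 0.0
    cum_cases = 0.0
    for c, p in areas:
        cum_pop += p
        cum_cases += c
        pop_share.append(cum_pop / total_pop)
        case_share.append(cum_cases / total_cases)
    return pop_share, case_share


def gini(cases: Sequence[float], population: Sequence[float]) -> float:
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    x, y = lorenz_curve(cases, population)
    # Trapezoid rule over the piecewise-linear Lorenz curve.
    area = sum(
        (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x))
    )
    return 1 - 2 * area


# Toy example with hypothetical areas (not data from the study).
even = gini(cases=[10, 10], population=[100, 100])
skewed = gini(cases=[0, 10], population=[100, 100])
```

A value of 0 means cases are spread in proportion to population, and larger values mean cases are concentrated in a smaller share of the population.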

Modelling Latent Translations for Cross-Lingual Transfer

Edoardo Ponti

Julia Kreutzer

Ivan Vulic

Siva Reddy

2021-07-23

ArXiv (preprint)

arxiv.org
