Researchers carry out the largest-ever psychedelics study using natural language processing tools

Brigitte Tousignant
Long stigmatized, psychedelic drugs such as psilocybin and MDMA are increasingly popular and rising in profile in clinical research as potential treatment options for major mental illnesses, including post-traumatic stress disorder, depression, and schizophrenia. This area of research is booming at a time when COVID-19 has raised global awareness of the heavy toll of mental health conditions.

While this new wave of research is promising, much remains to be understood about how these drugs alter perception and consciousness, and how those changes are rooted in the brain.

In the world’s largest study on psychedelics and the brain to date, Danilo Bzdok—Professor at Mila and McGill University and researcher at The Neuro—and colleagues from the Broad Institute at Harvard/MIT and SUNY Downstate Health Sciences University reveal how drug-induced changes in conscious awareness are anatomically rooted in distinct neurotransmitter receptor systems, by translating natural language processing (NLP) tools from machine learning.


Using NLP to map changes of conscious awareness to neurotransmitter systems in the brain

Psychedelics and other hallucinogenic drugs have been part of the underground culture for a long time. Such consciousness-altering experiences can have such a profound impact that users often feel compelled to share their personal stories and detailed experiences, whether positive or negative. This has given rise to numerous publicly available first-hand drug experience reports. Erowid Center, a not-for-profit organization, is one such portal that hosts an educational library with over 38,000 testimonials on the effects of psychoactive drugs.

The researchers mined 6,850 high-quality testimonials from Erowid Center covering 27 different drugs, in which people openly described their personal hallucinogenic experiences.

To enable receptor-experience modeling, Dr. Bzdok and his team constructed a bag-of-words encoding of the text in each testimonial, tabulating the count of each word per testimonial (see “Natural language processing pipeline” in the paper). This representation directly captured how individuals articulated changes in their conscious awareness, touching on thinking, perception, emotion, and other psychological alterations. Before applying the bag-of-words encoding, the researchers cleaned the text: they removed punctuation marks and special characters, discarded words shorter than two characters, removed words that occurred fewer than seven times in the entire corpus of experience reports, and stripped out common stop words, such as pronouns, prepositions, articles, and drug names. All remaining words were then lower-cased for consistency. Applied to the cleaned testimonials, this encoding tactic yielded a testimonial-word matrix M with over 14,000 unique words.

To process the sparse testimonial-word matrix M, the team applied the term-frequency inverse-document-frequency (tf-idf) transformation commonly used in NLP (1). This step weights the frequency of a word in a given testimonial against its global prevalence in the entire corpus of experience reports. A bag-of-words representation is naive about word order and thus ignores the sequence of events in each report; nevertheless, this encoding scheme captures fine-grained semantic information in many text-mining applications (2).
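A minimal tf-idf sketch, assuming scikit-learn's `TfidfTransformer` as a stand-in for the transformation used in the study, with a toy count matrix in place of the real M:

```python
# tf-idf: weight each word's frequency in a testimonial by its inverse
# prevalence across the corpus, so corpus-wide common words are down-weighted.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy testimonial-by-word count matrix (3 testimonials, 4 words)
M = np.array([
    [3, 0, 1, 0],
    [1, 2, 0, 0],
    [0, 1, 1, 2],
])

tfidf = TfidfTransformer()
M_tfidf = tfidf.fit_transform(M)

# Word 3 occurs in only one testimonial, so its idf weight exceeds that of
# word 0, which occurs in two.
print(M_tfidf.toarray().round(2))
```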

To automatically search and organize the space of semantic representations in the experience reports, the researchers turned to latent semantic analysis (LSA) to identify patterns (3). LSA let them reliably detect groups of words that co-occur within testimonials to the extent that they share a common semantic context. Applying LSA to the tf-idf-transformed bag-of-words matrix M extracted a set of semantic components, ordered from most to least important by the variance they explain in word usage combinations.
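LSA is typically implemented as a truncated singular value decomposition of the tf-idf matrix; a sketch using scikit-learn's `TruncatedSVD` (an assumption; shapes and `k` here are illustrative stand-ins):

```python
# LSA via truncated SVD: project testimonials from word space into a
# lower-dimensional space of semantic components, ordered by importance.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
M_tfidf = rng.random((50, 200))  # stand-in for the testimonial-word matrix

k = 10  # number of semantic components to keep (800 in the study)
lsa = TruncatedSVD(n_components=k, random_state=0)
X = lsa.fit_transform(M_tfidf)   # testimonial-by-semantic-component matrix

print(X.shape)
# Singular values come out in decreasing order, i.e. components are ranked
# from most to least important.
print(np.all(np.diff(lsa.singular_values_) <= 0))
```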

Two critical factors guided the choice of drugs whose experience reports were selected: each drug had to have a well-characterized receptor binding affinity profile (4, 5) and to mediate its effects through multiple receptor systems. From these binding affinities, the team built a normalized vector for each selected drug capturing its binding strengths \( K_i \) across 40 targets: G protein-coupled receptors (GPCRs), molecular transporters, and ion channels.
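Building such a vector can be sketched as follows; the receptor names and \( K_i \) values below are invented placeholders, not the study's actual binding data, and the 1/\( K_i \) convention is an assumption for illustration:

```python
# Sketch: turn per-receptor Ki values for one drug into a normalized
# affinity vector. Lower Ki means stronger binding, so affinities are
# expressed as 1/Ki here before normalization (placeholder convention).
import numpy as np

ki_nM = {"5HT2A": 10.0, "5HT1A": 120.0, "D2": 5000.0, "SERT": 300.0}

affinity = np.array([1.0 / v for v in ki_nM.values()])
normalized = affinity / np.linalg.norm(affinity)  # unit-length vector

print(normalized.round(3))
```

In the study this is done for all 27 drugs across 40 targets, yielding one affinity profile per drug.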

The team wanted a data-driven strategy to find the dominant factors — “modes” of joint variation — that explain how semantic components emerging from word usage patterns are inter-linked with binding affinities at 40 neurotransmitter receptor subclasses. They concluded that canonical correlation analysis (CCA) was ideally suited to interrogate the possible existence of such a multi-modal correspondence between two high-dimensional variable sets (6, 7).

The researchers kept the top \( k = 800 \) semantic components from the LSA-derived re-representation of the experience reports, forming the first variable set \( X_{|\text{testimonials}| \times k} \). The second variable set \( Y_{|\text{testimonials}| \times 40} \) was constructed from each drug’s known pharmacological affinity to the neurotransmitter receptors.

CCA computes projection vectors a and b that maximize the association between a linear combination of semantic contexts (\(X\)) and a linear combination of receptor affinity profiles (\(Y\)) across testimonials. In effect, CCA searches a large space of possible combinations for the two projections \(Xa\) and \(Yb\) that yield the maximal association between the semantic features of the drug experience and the brain neurotransmitter receptors to which the drug binds.

Combining NLP tools with joint modeling via CCA allowed the researchers to relate the drugs’ relative receptor binding strengths to the semantic components extracted from testimonials, and thereby to map how hallucinogenic compounds modulate neuronal activity throughout the cortex during subjective “trips.”

This machine learning approach to the brain basis of psychedelics is an exciting first step. Future work in this area could lead to machine learning systems that better predict which neurotransmitter receptor combinations need to be stimulated to induce a specific state of conscious experience in a given person.

This study was published in the journal Science Advances on March 16, 2022.



  1. R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval (ACM Press, 1999), vol. 463.
  2. R. Baeza-Yates, B. Ribeiro-Neto, et al., Modern Information Retrieval (ACM Press, New York, 1999), vol. 463.
  3. T. K. Landauer, Latent Semantic Analysis. Encyclopedia of Cognitive Science (2006), doi:10.1002/0470018860.s00561.
  4. T. S. Ray, Psychedelics and the human receptorome. PLoS One 5, e9019 (2010).
  5. A. Rickli et al., Receptor interaction profiles of novel N-2-methoxybenzyl (NBOMe) derivatives of 2,5-dimethoxy-substituted phenethylamines (2C drugs). Neuropharmacology 99, 546–553 (2015).
  6. S. M. Smith et al., A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nat. Neurosci. 18, 1565–1567 (2015).
  7. H.-T. Wang et al., Finding the needle in a high-dimensional haystack: Canonical correlation analysis for neuroscientists. Neuroimage 216, 116745 (2020).