Vladimir Makarenkov

Affiliate Member
Full Professor, UQAM, Department of Computer Science
Research Topics
Clustering
Computational Biology
Deep Learning
Medical Machine Learning

Biography

Vladimir Makarenkov is a full professor and director of the graduate program in bioinformatics at Université du Québec à Montréal (UQAM). He holds a master's degree in applied mathematics from Lomonosov Moscow State University and a PhD in computer science and mathematics from the École des hautes études en sciences sociales (EHESS) in Paris. Before joining the computer science department at UQAM, he completed a three-year postdoctoral fellowship at the Digital Ecology Lab at Université de Montréal.

He is the author of 80 journal articles and 67 conference papers, and the recipient of the prestigious Simon Régnier Prize and the Chikio Hayashi Prize awarded by the International Society for Mathematical Classification. His research focuses on AI, bioinformatics and data mining. This encompasses the design and development of novel unsupervised and supervised machine learning methods, as well as the use of machine learning techniques, including clustering and deep learning, for the analysis of biological and biomedical data.

Makarenkov’s current research also involves developing an automated, deep learning-based system that recommends the best clustering algorithm for a given input dataset. He is also working on a generic machine learning model to define the concept of a cluster, and on comparing various auto-encoding approaches and clustering algorithms to achieve better clustering results.

Publications

Applying graph neural networks to predict fungal disease occurrences in precision agriculture
Stéphane Samson
Étienne Lord
Odile Carisse
Purpose: Fungal diseases remain among the leading causes of global crop losses, with management still heavily reliant on fungicide applications. While traditional decision support systems and machine learning models offer valuable predictive insights, they often overlook the spatial and relational dynamics underlying pathogen spread. This study evaluates the feasibility and advantages of Graph Neural Networks (GNNs) for predicting fungal disease occurrence in three key crops, namely onion (Botrytis squamosa), lettuce (Botrytis lactucae), and carrot (Cercospora carotae), to enhance precision agriculture decision-making. Methods: Field observations from farms in southern Quebec were used to build plant-level graphs, with nodes representing plants enriched by biological and weather features, and edges defined by spatial proximity. Graph convolutional networks were trained for binary fungal disease occurrence classification and benchmarked against machine learning and deep learning baselines. Graph augmentation techniques and robustness tests under missing and noisy features were applied to assess the GNN’s stability. Results: Across the three pathosystems, GNNs achieved the strongest overall predictive performance. For onions (B. squamosa), Random Forest slightly outperformed the GNN on the complete feature set (accuracy = 76.4% and F1-score = 0.77); here, the GNN provided lower but comparable metric scores (accuracy = 74.8% and F1-score = 0.73). For lettuce (B. lactucae), the GNN achieved the highest metric scores, with an accuracy of 90.4% and an F1-score of 0.90, surpassing all other baselines. For carrot (C. carotae), GNNs reached an accuracy of 75.8% and an F1-score of 0.77, clearly outperforming Decision Tree, Random Forest, k-NN, and Feed-Forward Neural Networks (FFNs). Graph augmentation further improved the GNN results: random walk sampling increased the model’s accuracy on onion data to 79.3% and F1-score to 0.79, and on lettuce data to 93.9% and 0.94, respectively, while node/edge perturbation improved the model’s accuracy on carrot data to 78.6% and F1-score to 0.80. Furthermore, the results of the robustness experiments suggest that GNNs can still track overall field-level infection trends with up to 75% of features masked or 50% replaced by noise. Conclusion: GNNs offer clear advantages for fungal disease occurrence prediction by incorporating spatial and relational plant patterns, thus improving both the accuracy and robustness of predicted outcomes.
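The graph construction step described in this abstract (plants as nodes, edges defined by spatial proximity) can be illustrated with a minimal sketch. The distance threshold `radius`, the coordinate format, and both function names are assumptions made for illustration; the abstract does not say how proximity was actually defined.

```python
from math import dist


def build_proximity_edges(positions, radius):
    """Return undirected edges (i, j), with i < j, linking plants at most `radius` apart.

    `positions` is a list of (x, y) field coordinates, one per plant.
    The fixed-radius rule is an assumed stand-in for "spatial proximity".
    """
    edges = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if dist(positions[i], positions[j]) <= radius:
                edges.append((i, j))
    return edges


def neighbor_lists(n_nodes, edges):
    """Convert an edge list into per-node neighbor lists (adjacency lists)."""
    neighbors = [[] for _ in range(n_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors
```

A graph convolutional layer would then aggregate each plant's biological and weather features over its neighbor list; that part is omitted here.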
Assessing the impact of dimensionality reduction on clustering performance -- a systematic study
Ousmane Assani Amate
Mohammadreza Bakhtyari
Émilie Roy
Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results without and with dimensionality reduction at different reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of a careful selection of the dimensionality reduction technique and the dimensionality reduction level that should be tailored to intrinsic data geometry and clustering algorithms under consideration.
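For readers unfamiliar with the Adjusted Rand Index used as the quality measure in this study, the sketch below computes it from two label lists using the standard pair-counting formula. It uses only the Python standard library; the function name is illustrative and is not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs of items grouped together in both, in the first, and in the second partition.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (together_both - expected) / (maximum - expected)
```

An ARI of 1 means the two partitions are identical up to label renaming, values near 0 correspond to chance-level agreement, and negative values indicate less agreement than expected by chance.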
On the Objective and Feature Weights of Minkowski Weighted k-Means
Renato Cordeiro De Amorim
The Minkowski weighted k-means (mwk-means) algorithm extends classical k-means by incorporating feature weights and a Minkowski distance. Despite its empirical success, its theoretical properties remain insufficiently understood. We show that the mwk-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent p. This formulation reveals how p controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features. Finally, we establish convergence of the algorithm and provide a unified theoretical interpretation of its behaviour.
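The dependence of the feature weights on dispersion ratios can be sketched with the weight update commonly given for mwk-means in earlier literature: for a cluster with per-feature dispersions D_v, the weight of feature v is 1 / Σ_u (D_v / D_u)^(1/(p-1)), for p > 1. This is an assumption about the formulation; the abstract does not state the formula, and the paper's notation may differ. The function name is hypothetical.

```python
def mwk_feature_weights(dispersions, p):
    """Feature weights for one cluster from its per-feature dispersions.

    Assumed rule: w_v = 1 / sum_u (D_v / D_u) ** (1 / (p - 1)), valid for p > 1
    and strictly positive dispersions. The weights sum to 1.
    """
    if p <= 1:
        raise ValueError("p must be greater than 1")
    if any(d <= 0 for d in dispersions):
        raise ValueError("dispersions must be strictly positive")
    exponent = 1.0 / (p - 1.0)
    return [
        1.0 / sum((d_v / d_u) ** exponent for d_u in dispersions)
        for d_v in dispersions
    ]
```

Under this rule the ratio of two weights equals the inverse dispersion ratio raised to the power 1/(p-1), so a larger p flattens the weights toward uniform while a p close to 1 concentrates weight on the lowest-dispersion features.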
Soil microbiome prediction using traditional machine learning and deep learning models
Zahia Aouabed
Vincent Therrien
Mohamed Achraf Bouaoune
Mohammadreza Bakhtyari
Mohamed Hijri
The accuracy of macrobiological community predictions largely depends on the taxonomic scale considered. Nowadays, the applicability of such predictions remains an important challenge when extended to microbial soil communities. This is not only due to the lack of reliable benchmark data, but also to a greater diversity of the soil microorganisms compared to other environments. In this study, we use six traditional machine learning regression models and one deep learning regressor to predict relative frequencies of bacterial and fungal communities within the soil microbiome based on environmental factors. We analyze the data from two publicly available soil microbiome datasets: (1) Data collected by Averill and co-authors and analyzed in a recent Nature Ecology and Evolution article, and (2) Data extracted from the NEON database, to estimate the composition of bacterial and fungal communities at the functional (i.e. functional group level) and taxonomic scales (i.e. phylum, class, order, family, and genus levels). Our findings suggest the presence of a general pattern across the observed taxonomic scales according to which the predictability of the soil microbiome increases with taxonomic scale. However, a notable exception occurs when machine learning models are applied to predict bacterial communities at the functional group level for Averill et al.’s data, where all of them fail to provide accurate prediction results. The best overall results obtained include the value of the coefficient of determination
Similarity-based transfer learning with deep learning networks for accurate CRISPR-Cas9 off-target prediction.
Transfer learning has emerged as a powerful tool for enhancing predictive accuracy in complex tasks, particularly in scenarios where data is limited or imbalanced. This study explores the use of similarity-based pre-evaluation as a methodology to identify optimal source datasets for transfer learning, addressing the dual challenge of efficient source-target dataset pairing and off-target prediction in CRISPR-Cas9. Existing transfer learning applications in the field of gene editing often lack a principled method for source dataset selection. We use cosine, Euclidean, and Manhattan distances to evaluate similarity between the source and target datasets used in our transfer learning experiments. Four deep learning network architectures, i.e. Multilayer Perceptron (MLP), Convolutional Neural Networks (CNNs), Feedforward Neural Networks (FNNs), and Recurrent Neural Networks (RNNs), and two traditional machine learning models, i.e. Logistic Regression (LR) and Random Forest (RF), were tested and compared in our simulations. The results suggest that similarity scores are reliable indicators for pre-selecting source datasets in CRISPR-Cas9 transfer learning experiments, with cosine distance proving to be a more effective dataset comparison metric than either Euclidean or Manhattan distances. An RNN-GRU, a 5-layer FNN, and two MLP variants provided the best overall prediction results in our simulations. By integrating similarity-based source pre-selection with machine learning outcomes, we propose a dual-layered framework that not only streamlines the transfer learning process but also significantly improves off-target prediction accuracy. The code and data used in this study are freely available at: https://github.com/dagrate/transferlearning_offtargets.
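As a rough illustration of similarity-based source pre-selection, the sketch below ranks candidate source datasets by their distance to a target under each of the three metrics named in the abstract. It assumes each dataset has already been summarised as a fixed-length numeric vector; that representation, the function names, and the example values are hypothetical and do not come from the paper or its repository.

```python
from math import sqrt


def cosine_distance(a, b):
    """1 minus the cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_sources(target, sources, distance):
    """Return source names ordered from most to least similar to `target`."""
    return sorted(sources, key=lambda name: distance(sources[name], target))


# Hypothetical dataset summaries: the metrics can disagree on the best source.
target = [1.0, 1.0]
sources = {"A": [2.0, 2.0], "B": [1.0, 0.0]}
by_cosine = rank_sources(target, sources, cosine_distance)
by_euclidean = rank_sources(target, sources, euclidean_distance)
```

In this toy case, cosine distance prefers source A (same direction as the target, different scale) while Euclidean distance prefers source B, which shows why the choice of metric matters for pre-selection.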
ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
Mohammadreza Bakhtyari
Renato Cordeiro De Amorim
Towards an Interpretable Machine Learning Model for Predicting Antimicrobial Resistance
Mohamed Mediouni
Abdoulaye Banire Diallo
Quantifying antimicrobial resistance in food-producing animals in North America
Mohamed Mediouni
Abdoulaye Banire Diallo
The global misuse of antimicrobial medication has further exacerbated the problem of antimicrobial resistance (AMR), enriching the pool of genetic mechanisms previously adopted by bacteria to evade antimicrobial drugs. AMR can be either intrinsic or acquired. It can be acquired either by selective genetic modification or by horizontal gene transfer that allows microorganisms to incorporate novel genes from other organisms or environments into their genomes. To avoid an eventual antimicrobial mistreatment, the use of antimicrobials in farm animals has been recently reconsidered in many countries. We present a systematic review of the literature discussing the cases of AMR and the related restrictions applied in North American countries (including Canada, Mexico, and the USA). The Google Scholar, PubMed, Embase, Web of Science, and Cochrane databases were searched to find plausible information on antimicrobial use and resistance in food-producing animals, covering the time period from 2015 to 2024. A total of 580 articles addressing the issue of antibiotic resistance in food-producing animals in North America met our inclusion criteria. Different AMR rates, depending on the bacterium being observed, the antibiotic class being used, and the farm animal being considered, have been identified. We determined that the highest average AMR rates have been observed for pigs (60.63% on average), intermediate rates for cattle (48.94% on average), and the lowest for poultry (28.43% on average). We also found that Cephalosporins, Penicillins, and Tetracyclines are the antibiotic classes with the highest average AMR rates (65.86%, 61.32%, and 58.82%, respectively), whereas the use of Sulfonamides and Quinolones leads to the lowest average AMR (21.59% and 28.07%, respectively). Moreover, our analysis of antibiotic-resistant bacteria shows that Streptococcus suis (S. suis) and S. aureus provide the highest average AMR rates (71.81% and 69.48%, respectively), whereas Campylobacter spp. provides the lowest one (29.75%). The highest average AMR percentage, 57.46%, was observed in Mexico, followed by Canada at 45.22%, and the USA at 42.25%, which is most probably due to the presence of various AMR control strategies, such as stewardship programs and AMR surveillance bodies, existing in Canada and the USA. Our review highlights the need for better strategies and regulations to control the spread of AMR in North America.
Improving clustering quality evaluation in noisy Gaussian mixtures
Renato Cordeiro De Amorim
BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging
Moloud Abdar
Mohammadreza Bakhtyari
Test-time augmentation (TTA) is a well-known technique employed during the testing phase of computer vision tasks. It involves aggregating multiple augmented versions of input data. Combining predictions using a simple average formulation is a common and straightforward approach after performing TTA. This paper introduces a novel framework for optimizing TTA, called BayTTA (Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First, we generate a model list associated with different variations of the input data created through TTA. Then, we use BMA to combine model predictions weighted by their respective posterior probabilities. Such an approach allows one to take into account model uncertainty, and thus to enhance the predictive performance of the related machine learning or deep learning model. We evaluate the performance of BayTTA on various public data, including three medical image datasets comprising skin cancer, breast cancer, and chest X-ray images and two well-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental results indicate that BayTTA can be effectively integrated into state-of-the-art deep learning models used in medical image analysis as well as into some popular pre-trained CNN models such as VGG-16, MobileNetV2, DenseNet201, ResNet152V2, and InceptionResNetV2, leading to the enhancement in their accuracy and robustness performance.
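The difference between simple averaging and posterior-weighted averaging of TTA predictions can be sketched as follows. The abstract does not say how BayTTA estimates posterior model probabilities, so this example assumes a per-model log-evidence score and a uniform prior; both function names are hypothetical and this is not the authors' implementation.

```python
from math import exp


def posterior_weights(log_evidences):
    """Normalised posterior model probabilities under a uniform prior.

    Assumption: each model (one per augmented view) comes with a log-evidence
    score. Subtracting the maximum keeps the exponentials numerically stable.
    """
    highest = max(log_evidences)
    unnormalised = [exp(value - highest) for value in log_evidences]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def bma_combine(predictions, weights):
    """Weighted average of per-model class-probability vectors."""
    n_classes = len(predictions[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, predictions))
        for c in range(n_classes)
    ]
```

With equal log-evidence scores the weights are uniform and the result reduces to the simple average mentioned in the abstract; unequal scores shift the combined prediction toward the better-supported models.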
A self-attention-based CNN-Bi-LSTM model for accurate state-of-charge estimation of lithium-ion batteries