
Benjamin Fung

Associate Academic Member
Associate Professor, McGill University, School of Information Studies
Research Topics
AI for Software Engineering
Applied Machine Learning
Cybersecurity
Data Mining
Deep Learning
Information Retrieval
Misinformation
Privacy
Representation Learning

Biography

Benjamin Fung is a Canada Research Chair in Data Mining for Cybersecurity, as well as a full professor at the School of Information Studies and associate member of the School of Computer Science, McGill University.

Fung serves as an associate editor of IEEE Transactions on Knowledge and Data Engineering and Sustainable Cities and Society. He received his PhD in computing science from Simon Fraser University in 2007.

Dr. Fung has over 150 refereed publications to his credit and more than 14,000 citations (h-index 57) spanning the fields of data mining, machine learning, privacy, cybersecurity, and building engineering. His findings in the fields of data mining for crime investigations and authorship analysis have been reported by the media worldwide.

Publications

MalGPT: A Generative Explainable Model for Malware Binaries
Mohd Saqib
Steven H. H. Ding
Philippe Charland
Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity
Charles E. Gagnon
Steven H. H. Ding
Philippe Charland
Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method achieves recall@1 of 42% and 62% on cross-architecture and cross-optimization tasks, respectively, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.
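To illustrate the retrieval idea the abstract describes, here is a minimal sketch of searching human-readable function features with an inverted index instead of nearest-neighbor search over embeddings. The feature strings and function IDs are hypothetical examples, not the paper's actual feature schema or implementation.

```python
# Sketch: inverted-index retrieval over human-readable features.
# Feature names and function IDs below are illustrative only.
from collections import defaultdict

# Hypothetical extracted features for three assembly functions.
functions = {
    "f1": {"returns:int", "const:0xEDB88320", "intent:crc32"},
    "f2": {"returns:ptr", "side_effect:heap_alloc", "intent:strdup"},
    "f3": {"returns:int", "const:0xEDB88320", "intent:checksum"},
}

# Build the inverted index: feature string -> set of function IDs.
index = defaultdict(set)
for fid, feats in functions.items():
    for feat in feats:
        index[feat].add(fid)

def search(query_feats):
    """Rank indexed functions by the number of features shared with the query."""
    scores = defaultdict(int)
    for feat in query_feats:
        for fid in index.get(feat, ()):
            scores[fid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Query built from features extracted from an unknown binary:
# both f1 and f3 share two features with it, so they rank first.
results = search({"returns:int", "const:0xEDB88320"})
```

Because features are plain strings, every match is directly verifiable by an analyst, and the index scales with standard text-retrieval machinery rather than approximate high-dimensional search.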
Responsible AI Day
Ebrahim Bagheri
Faezeh Ensan
Calvin Hillis
Robin Cohen
Sébastien Gambs
Diminished social memory and hippocampal correlates of social interactions in chronic social defeat stress susceptibility
Amanda Larosa
Tian Rui Zhang
Alice S. Wong
Cyrus Y.H. Fung
Xiong Ling Yun (Jenny) Long
Prabhjeet Singh
Tak Pan Wong
PAC-X: Fuzzy Explainable AI for Multi-Class Malware Detection
Mohd Saqib
Philippe Charland
CSGraph2Vec: Distributed Graph-Based Representation Learning for Assembly Functions
Wael J. Alhashemi
Adel Abusitta
Claude Fachkha
AugmenToxic: Leveraging Reinforcement Learning to Optimize LLM Instruction Fine-Tuning for Data Augmentation to Enhance Toxicity Detection
Arezo Bodaghi
Ketra A. Schmitt
A Literature Review on Detecting, Verifying, and Mitigating Online Misinformation
Arezo Bodaghi
Ketra A. Schmitt
Pierre Watine
Social media use has transformed communication and made social interaction more accessible. Public microblogs allow people to share and access news through existing and social-media-created social connections and access to public news sources. These benefits also create opportunities for the spread of false information. False information online can mislead people, decrease the benefits derived from social media, and reduce trust in genuine news. We divide false information into two categories: unintentional false information, also known as misinformation; and intentionally false information, also known as disinformation and fake news. Given the increasing prevalence of misinformation, it is imperative to address its dissemination on social media platforms. This survey focuses on six key aspects related to misinformation: 1) clarify the definition of misinformation to differentiate it from intentional forms of false information; 2) categorize proposed approaches to manage misinformation into three types: detection, verification, and mitigation; 3) review the platforms and languages for which these techniques have been proposed and tested; 4) describe the specific features that are considered in each category; 5) compare public datasets created to address misinformation and categorize them into prelabeled content-only datasets and those including users and their connections; and 6) survey fact-checking websites that can be used to verify the accuracy of information. This survey offers a comprehensive and unprecedented review of misinformation, integrating various methodological approaches, datasets, and content-, user-, and network-based approaches, which will undoubtedly benefit future research in this field.