Benjamin Fung

Associate Academic Member
Associate Professor, McGill University, School of Information Studies
Research Topics
Applied Machine Learning
Representation Learning
Deep Learning
Cybersecurity
Disinformation
Data Mining
AI for Software Engineering
Information Retrieval
Privacy

Biography

Benjamin Fung holds a Canada Research Chair in Data Mining for Cybersecurity. He is an associate professor in the School of Information Studies and an associate member of the School of Computer Science at McGill University, an associate editor of IEEE Transactions on Knowledge and Data Engineering, and an associate editor of Elsevier's Sustainable Cities and Society (SCS). He received a Ph.D. in computer science from Simon Fraser University in 2007. He has more than 150 refereed publications and more than 14,000 citations (h-index 57) spanning data mining, machine learning, privacy protection, cybersecurity, and building engineering. His data mining work in crime investigation and authorship analysis has been reported by media worldwide.

Publications

VDGraph2Vec: Vulnerability Detection in Assembly Code using Message Passing Neural Networks
Ashita Diwan
Miles Q. Li
Software vulnerability detection is one of the most challenging tasks faced by reverse engineers. Recently, vulnerability detection has received a lot of attention due to a drastic increase in the volume and complexity of software. Reverse engineering is a time-consuming and labor-intensive process for detecting malware and software vulnerabilities. However, with the advent of deep learning and machine learning, it has become possible for researchers to automate the process of identifying potential security breaches in software by developing more intelligent technologies. In this research, we propose VDGraph2Vec, an automated deep learning method to generate representations of assembly code for the task of vulnerability detection. Previous approaches failed to attend to topological characteristics of assembly code while discovering the weakness in the software. VDGraph2Vec embeds the control flow and semantic information of assembly code effectively using the expressive capabilities of message passing neural networks and the RoBERTa model. Our model is able to learn the important features that help distinguish between vulnerable and non-vulnerable software. We carry out our experimental analysis for performance benchmark on three of the most common weaknesses and demonstrate that our model can identify vulnerabilities with high accuracy and outperforms the current state-of-the-art binary vulnerability detection models.
Towards Adaptive Cybersecurity for Green IoT
Talal Halabi
Martine Bellaiche
The Internet of Things (IoT) paradigm has led to an explosion in the number of IoT devices and an exponential rise in carbon footprint incurred by overburdened IoT networks and pervasive cloud/edge communications. Hence, there is a growing interest in industry and academia to enable the efficient use of computing infrastructures by optimizing the management of data center and IoT resources (hardware, software, network, and data) and reducing operational costs to slash greenhouse gas emissions and create healthy environments. Cybersecurity has also been considered in such efforts as a contributor to these environmental issues. Nonetheless, most green security approaches focus on designing low-overhead encryption schemes and do not emphasize energy-efficient security from architectural and deployment viewpoints. This paper sheds light on the emerging paradigm of adaptive cybersecurity as one of the research directions to support sustainable computing in green IoT. It presents three potential research directions and their associated methods for designing and deploying adaptive security in green computing and resource-constrained IoT environments to save on energy consumption. Such efforts will transform the development of data-driven IoT security solutions to be greener and more environment-friendly.
The generalizability of pre-processing techniques on the accuracy and fairness of data-driven building models: a case study
Ying Sun
Fariborz Haghighat
H4rm0ny: A Competitive Zero-Sum Two-Player Markov Game for Multi-Agent Learning on Evasive Malware Generation and Detection
Christopher Molloy
Steven H. H. Ding
Philippe Charland
To combat the increasingly versatile and mutable modern malware, Machine Learning (ML) is now a popular and effective complement to the existing signature-based techniques for malware triage and identification. However, ML is also a readily available tool for adversaries. Recent studies have shown that malware can be modified by deep Reinforcement Learning (RL) techniques to bypass AI-based and signature-based anti-virus systems without altering their original malicious functionalities. These studies only focus on generating evasive samples and assume a static detection system as the enemy. Malware detection and evasion essentially form a two-party cat-and-mouse game. Simulating the real-life scenarios, in this paper we present the first two-player competitive game for evasive malware detection and generation, following the zero-sum Multi-Agent Reinforcement Learning (MARL) paradigm. Our experiments on recent malware show that the produced malware detection agent is more robust against adversarial attacks. Also, the produced malware modification agent is able to generate more evasive samples fooling both AI-based and other anti-malware techniques.
On the Effectiveness of Interpretable Feedforward Neural Network
Miles Q. Li
Adel Abusitta
Deep learning models have achieved state-of-the-art performance in many classification tasks. However, most of them cannot provide an explanation for their classification results. Machine learning models that are interpretable are usually linear or piecewise linear and yield inferior performance. Non-linear models achieve much better classification performance, but it is usually hard to explain their classification results. As a counter-example, an interpretable feedforward neural network (IFFNN) is proposed to achieve both high classification performance and interpretability for malware detection. If the IFFNN can perform well in a more flexible and general form for other classification tasks while providing meaningful explanations, it may be of great interest to the applied machine learning community. In this paper, we propose a way to generalize the interpretable feedforward neural network to multi-class classification scenarios and any type of feedforward neural networks, and evaluate its classification performance and interpretability on interpretable datasets. We conclude by finding that the generalized IFFNNs achieve comparable classification performance to their normal feedforward neural network counterparts and provide meaningful explanations. Thus, this kind of neural network architecture has great practical use.
Incentivized Security-Aware Computation Offloading for Large-Scale Internet of Things Applications
Talal Halabi
Adel Abusitta
Glaucio H.S. Carvalho
JARV1S: Phenotype Clone Search for Rapid Zero-Day Malware Triage and Functional Decomposition for Cyber Threat Intelligence
Christopher Molloy
Philippe Charland
Steven H. H. Ding
Cyber threat intelligence (CTI) has become a critical component of the defense of organizations against the steady surge of cyber attacks. Malware is one of the most challenging problems for CTI, due to its prevalence, the massive number of variants, and the constantly changing threat actor behaviors. Currently, Malpedia has indexed 2,390 unique malware families, while the AVTEST Institute has recorded more than 166 million new unique malware samples in 2021. There exists a vast number of variants per malware family. Consequently, the signature-based representation of patterns and knowledge of legacy systems can no longer be generalized to detect future malware attacks. Machine learning-based solutions can match more variants. However, as a black-box approach, they lack the explainability and maintainability required by incident response teams. There is thus an urgent need for a data-driven system that can abstract a future-proof, human-friendly, systematic, actionable, and dependable knowledge representation from software artifacts from the past for more effective and insightful malware triage. In this paper, we present the first phenotype-based malware decomposition system for quick malware triage that is effective against malware variants. We define phenotypes as directly observable characteristics such as code fragments, constants, functions, and strings. Malware development rarely starts from scratch, and there are many reused components and code fragments. The target under investigation is decomposed into known phenotypes that are mapped to known malware families, malware behaviors, and Advanced Persistent Threat (APT) groups. The implemented system provides visualizable phenotypes through an interactive tree map, helping the cyber analysts to navigate through the decomposition results. We evaluated our system on 200,000 malware samples, 100,000 benign samples, and a malware family with over 27,284 variants. The results indicate our system is scalable, efficient, and effective against zero-day malware and new variants of known families.
Learning Inter-Modal Correspondence and Phenotypes From Multi-Modal Electronic Health Records
Kejing Yin
William K. Cheung
Jonathan Poon
Non-negative tensor factorization has been shown to be a practical solution to automatically discover phenotypes from the electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.
The Topic Confusion Task: A Novel Evaluation Scenario for Authorship Attribution
Malik H. Altakrori
Trade-off Between Accuracy and Fairness of Data-driven Building and Indoor Environment Models: A Comparative Study of Pre-processing Methods
Ying Sun
Fariborz Haghighat