Portrait of Benjamin Fung

Benjamin Fung

Associate Academic Member
Associate Professor, McGill University, School of Information Studies
Research Topics
Data Mining

Biography

Benjamin Fung is a Canada Research Chair in Data Mining for Cybersecurity, as well as a full professor at the School of Information Studies and associate member of the School of Computer Science, McGill University.

Fung serves as an associate editor of IEEE Transactions of Knowledge and Data Engineering and Sustainable Cities and Society. He received his PhD in computing science from Simon Fraser University in 2007.

Dr. Fung has over 150 refereed publications to his credit and and more than 14,000 citations (h-index 57) spanning the fields of data mining, machine learning, privacy, cybersecurity and building engineering. His findings in the fields of data mining for crime investigations and authorship analysis have been reported by the media worldwide.

Publications

JARV1S: Phenotype Clone Search for Rapid Zero-Day Malware Triage and Functional Decomposition for Cyber Threat Intelligence
Christopher Molloy
Philippe Charland
Steven H. H. Ding
Cyber threat intelligence (CTI) has become a critical component of the defense of organizations against the steady surge of cyber attacks. M… (see more)alware is one of the most challenging problems for CTI, due to its prevalence, the massive number of variants, and the constantly changing threat actor behaviors. Currently, Malpedia has indexed 2,390 unique malware families, while the AVTEST Institute has recorded more than 166 million new unique malware samples in 2021. There exists a vast number of variants per malware family. Consequently, the signature-based representation of patterns and knowledge of legacy systems can no longer be generalized to detect future malware attacks. Machine learning-based solutions can match more variants. However, as a black-box approach, they lack the explainability and maintainability required by incident response teams.There is thus an urgent need for a data-driven system that can abstract a future-proof, human-friendly, systematic, actionable, and dependable knowledge representation from software artifacts from the past for more effective and insightful malware triage. In this paper, we present the first phenotype-based malware decomposition system for quick malware triage that is effective against malware variants. We define phenotypes as directly observable characteristics such as code fragments, constants, functions, and strings. Malware development rarely starts from scratch, and there are many reused components and code fragments. The target under investigation is decomposed into known phenotypes that are mapped to known malware families, malware behaviors, and Advanced Persistent Threat (APT) groups. The implemented system provides visualizable phenotypes through an interactive tree map, helping the cyber analysts to navigate through the decomposition results. We evaluated our system on 200,000 malware samples, 100,000 benign samples, and a malware family with over 27,284 variants. The results indicate our system is scalable, efficient, and effective against zero-day malware and new variants of known families.
The generalizability of pre-processing techniques on the accuracy and fairness of data-driven building models: a case study
Ying Sun
Fariborz Haghighat
Learning Inter-Modal Correspondence and Phenotypes From Multi-Modal Electronic Health Records
Kejing Yin
William K. Cheung
Jonathan Poon
Non-negative tensor factorization has been shown a practical solution to automatically discover phenotypes from the electronic health record… (see more)s (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors, and leads to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is marginalization of the unobserved inter-modal correspondence, which are reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.
On the Effectiveness of Interpretable Feedforward Neural Network
Miles Q. Li
Adel Abusitta
Deep learning models have achieved state-of-the-art performance in many classification tasks. However, most of them cannot provide an explan… (see more)ation for their classification results. Machine learning models that are interpretable are usually linear or piecewise linear and yield inferior performance. Non-linear models achieve much better classification performance, but it is usually hard to explain their classification results. As a counter-example, an interpretable feedforward neural network (IFFNN) is proposed to achieve both high classification performance and interpretability for malware detection. If the IFFNN can perform well in a more flexible and general form for other classification tasks while providing meaningful explanations, it may be of great interest to the applied machine learning community. In this paper, we propose a way to generalize the interpretable feedforward neural network to multi-class classification scenarios and any type of feedforward neural networks, and evaluate its classification performance and interpretability on interpretable datasets. We conclude by finding that the generalized IFFNNs achieve comparable classification performance to their normal feedforward neural network counterparts and provide meaningful explanations. Thus, this kind of neural network architecture has great practical use.
The Topic Confusion Task: A Novel Evaluation Scenario for Authorship Attribution
Malik H. Altakrori
Trade-off Between Accuracy and Fairness of Data-driven Building and Indoor Environment Models: A Comparative Study of Pre-processing Methods
Ying Sun
Fariborz Haghighat
Trade-off Between Accuracy and Fairness of Data-driven Building and Indoor Environment Models: A Comparative Study of Pre-processing Methods
Ying Sun
Fariborz Haghighat
A Data Mining Analysis of Cross-Regional Study of Apparel Consumption
Osmud Rahman
A Data Mining Analysis of Cross-Regional Study of Apparel Consumption
Osmud Rahman
A Novel Neural Network-Based Malware Severity Classification System
Miles Q. Li
A Novel Neural Network-Based Malware Severity Classification System
Miles Q. Li
The Topic Confusion Task: A Novel Scenario for Authorship Attribution
Malik H. Altakrori
Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researc… (see more)hers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by failure to capture authorship style, by the topic shift or by other factors. Motivated by this, we propose the topic confusion task, where we switch the author-topic config-uration between training and testing set. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the models’ confusion by the switch because the features capture the topics, and one caused by the features’ inability to capture the writing styles, leading to weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level n - grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task, and are outperformed by simple n -gram features.