
Benjamin Fung

Associate Academic Member
Associate Professor, McGill University, School of Information Studies
Research Topics
Applied Machine Learning
Representation Learning
Deep Learning
Cybersecurity
Misinformation
Data Mining
AI for Software Engineering
Information Retrieval
Privacy

Biography

Benjamin Fung holds a Canada Research Chair in Data Mining for Cybersecurity and is an Associate Professor at the School of Information Studies and an Associate Member of the School of Computer Science at McGill University, as well as an Associate Editor of IEEE Transactions on Knowledge and Data Engineering and of Elsevier Sustainable Cities and Society (SCS). He received his Ph.D. in computing science from Simon Fraser University in 2007. He has over 150 refereed publications and more than 14,000 citations (h-index 57) spanning the fields of data mining, machine learning, privacy protection, cybersecurity, and building engineering. His data mining work in crime investigation and authorship analysis has been reported by media worldwide.

Publications

Image Dehazing in Disproportionate Haze Distributions
Shih-Chia Huang
Da-Wei Jaw
Wenli Li
Zhihui Lu
Sy-Yen Kuo
Bo-Hao Chen
Thanisa Numnonda
Haze removal techniques employed to increase the visibility level of an image play an important role in many vision-based systems. Several traditional dark channel prior-based methods have been proposed to remove haze formation and thereby enhance the robustness of these systems. However, when the captured images contain disproportionate haze distributions, these methods usually fail to attain effective restoration in the restored image. Specifically, disproportionate haze distribution in an image means that the background region possesses heavy haze density and the foreground region possesses little haze density. This phenomenon usually occurs in a hazy image with a deep depth of field. In response, a novel hybrid transmission map-based haze removal method that specifically targets this situation is proposed in this work to achieve clear visibility restoration and effective information maintenance. Experimental results via both qualitative and quantitative evaluations demonstrate that the proposed method is capable of performing with higher efficacy when compared with other state-of-the-art methods, with respect to both background regions and foreground regions of restored test images captured in real-world environments.
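The dark channel prior that this line of work builds on can be illustrated in a few lines. The sketch below is a generic, minimal implementation of the dark channel computation only (not the paper's hybrid transmission-map method), assuming the input image is an RGB NumPy array with values in [0, 1]; the function name and patch size are illustrative choices.

```python
import numpy as np

def dark_channel(image, patch=15):
    """Compute the dark channel of an RGB image.

    For each pixel, take the minimum intensity across the three colour
    channels, then take the minimum over a local square patch. In haze-free
    outdoor images this value tends toward zero (the dark channel prior);
    large values indicate dense haze.
    """
    h, w, _ = image.shape
    min_rgb = image.min(axis=2)            # per-pixel minimum over channels
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    dark = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

# A uniformly bright (hazy) region yields a high dark-channel value,
# while a region containing one dark pixel yields a low one.
img = np.full((20, 20, 3), 0.9)
img[10, 10] = (0.05, 0.05, 0.05)
dc = dark_channel(img, patch=5)
print(dc[10, 10])  # 0.05: the dark pixel dominates its patch
print(dc[0, 0])    # 0.9: uniformly bright patch
```

Dehazing methods then estimate a transmission map from this dark channel and invert the atmospheric scattering model; that part is method-specific and omitted here.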
A Novel and Dedicated Machine Learning Model for Malware Classification
Miles Q. Li
Philippe Charland
Steven H. H. Ding
Malicious executables are comprised of functions that can be represented in assembly code. In the assembly code mining literature, many software reverse engineering tools have been created to disassemble executables, search function clones, and find vulnerabilities, among others. The development of a machine learning-based malware classification model that can simultaneously achieve excellent classification performance and provide insightful interpretation for the classification results remains a hot research topic. In this paper, we propose a novel and dedicated machine learning model for the research problem of malware classification. Our proposed model generates assembly code function clusters based on function representation learning and provides excellent interpretability for the classification results. It does not require a large or balanced training dataset, which matches real-life scenarios. Experiments show that our proposed approach outperforms previous state-of-the-art malware classification models and provides meaningful interpretation of classification results.
A Novel Neural Network-Based Malware Severity Classification System
Miles Q. Li
The Topic Confusion Task: A Novel Scenario for Authorship Attribution
Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by failure to capture authorship style, by the topic shift, or by other factors. Motivated by this, we propose the topic confusion task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the models' confusion by the switch because the features capture the topics, and one caused by the features' inability to capture the writing styles, leading to weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level n-grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task, and are outperformed by simple n-gram features.
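The idea of attributing authorship from part-of-speech n-gram profiles can be sketched minimally. The example below is a generic illustration, not the paper's evaluated pipeline: it assumes the texts have already been POS-tagged (the tag sequences are hypothetical tagger output), builds bigram frequency profiles, and attributes a query text to the candidate with the most similar profile under cosine similarity.

```python
from collections import Counter
from math import sqrt

def pos_ngrams(tags, n=2):
    """Frequency profile of part-of-speech tag n-grams."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[k] * q[k] for k in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy POS sequences standing in for tagger output (hypothetical data).
author_a = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "ADV"]
author_b = ["PRON", "VERB", "ADJ", "NOUN", "PRON", "VERB", "ADJ"]
unknown  = ["DET", "NOUN", "VERB", "ADV", "DET", "NOUN"]

profiles = {"A": pos_ngrams(author_a), "B": pos_ngrams(author_b)}
query = pos_ngrams(unknown)
best = max(profiles, key=lambda a: cosine(profiles[a], query))
print(best)  # "A": attribution by most similar POS-bigram profile
```

Because POS tags abstract away content words, such profiles are less tied to topic than word n-grams, which is the intuition the paper tests.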
Toward Tweet-Mining Framework for Extracting Terrorist Attack-Related Information and Reporting
Farkhund Iqbal
Rabia Batool
Saiqa Aleem
Ahmed Abbasi
Abdul Rehman Javed
The widespread popularity of social networking is leading to the adoption of Twitter as an information dissemination tool. Existing research has shown that information dissemination over Twitter has a much broader reach than traditional media and can be used for effective post-incident measures. People use informal language on Twitter, including acronyms, misspelled words, synonyms, transliteration, and ambiguous terms. This makes incident-related information extraction a non-trivial task. However, this information can be valuable for public safety organizations that need to respond in an emergency. This paper proposes an early event-related information extraction and reporting framework that monitors Twitter streams, synthesizes event-specific information, e.g., a terrorist attack, and alerts law enforcement, emergency services, and media outlets. Specifically, the proposed framework, Tweet-to-Act (T2A), employs word embedding to transform tweets into a vector space model and then utilizes the Word Mover's Distance (WMD) to cluster tweets for the identification of incidents. To extract reliable and valuable information from a large dataset of short and informal tweets, the proposed framework employs sequence labeling with bidirectional Long Short-Term Memory based Recurrent Neural Networks (bLSTM-RNN). Extensive experimental results suggest that our proposed framework, T2A, outperforms other state-of-the-art methods that use vector space modeling and distance calculation techniques, e.g., Euclidean and Cosine distance. T2A achieves an accuracy of 96% and an F1-score of 86.2% on real-life datasets.
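The vector space model with cosine distance that serves as a baseline here can be shown in a few lines. This is a sketch of that generic baseline only (not T2A, which uses word embeddings and WMD), with hypothetical tweets: two reporting the same incident and one unrelated.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector (term -> count) for a tweet."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Cosine distance between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

# Hypothetical tweets; the first two describe the same incident.
tweets = [
    "explosion reported near the main station",
    "huge explosion near station many injured",
    "great coffee at the new cafe downtown",
]
vecs = [bow(t) for t in tweets]
# Incident-related tweets are closer to each other than to the unrelated one.
print(cosine_distance(vecs[0], vecs[1]) < cosine_distance(vecs[0], vecs[2]))  # True
```

The paper's point is that exact-term matching like this fails on informal spellings and synonyms, which is why T2A moves to embeddings and Word Mover's Distance.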
Learning Inter-Modal Correspondence and Phenotypes From Multi-Modal Electronic Health Records
Kejing Yin
William K. Cheung
Jonathan Poon
Non-negative tensor factorization has been shown a practical solution to automatically discover phenotypes from the electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.
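As background, the non-negative factorization building block underlying such phenotyping methods can be sketched with classic Lee-Seung multiplicative updates. This is a minimal matrix (not tensor) factorization on synthetic data, not the cHITF model itself; the rank, iteration count, and data are illustrative assumptions.

```python
import numpy as np

def nmf(X, rank, iters=500, seed=0):
    """Non-negative matrix factorization X ~ W @ H via multiplicative updates.

    Each column of W can be read as a 'phenotype' (a non-negative
    combination of features); H gives per-sample loadings. Multiplicative
    updates keep W and H non-negative throughout.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)   # Lee-Seung update for H
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)   # Lee-Seung update for W
    return W, H

# Synthetic non-negative data with an exact rank-2 structure.
rng = np.random.default_rng(1)
X = rng.random((30, 2)) @ rng.random((2, 20))
W, H = nmf(X, rank=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err < 0.1)  # reconstruction error is small for rank-2 data
```

cHITF extends this family by treating the inter-modal correspondence itself as a latent object to be inferred jointly with the factors, rather than requiring a pre-built interaction tensor.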
Trends and Applications in Knowledge Discovery and Data Mining
Lida Rashidi
Can Wang