
Benjamin Fung

Associate Academic Member
Associate Professor, McGill University, School of Information Studies
Research Topics
Applied Machine Learning
Representation Learning
Deep Learning
Cybersecurity
Misinformation
Data Mining
AI for Software Engineering
Information Retrieval
Privacy

Biography

Benjamin Fung holds a Canada Research Chair in Data Mining for Cybersecurity and is an Associate Professor at the School of Information Studies and an Associate Member of the School of Computer Science at McGill University, as well as an Associate Editor of IEEE Transactions on Knowledge and Data Engineering and of Elsevier Sustainable Cities and Society (SCS). He received his Ph.D. in computing science from Simon Fraser University in 2007. He has over 150 refereed publications and more than 14,000 citations (h-index 57) spanning the fields of data mining, machine learning, privacy protection, cybersecurity, and building engineering. His data mining work in crime investigation and authorship analysis has been reported by media worldwide.

Publications

Image Dehazing in Disproportionate Haze Distributions
Shih-Chia Huang
Da-Wei Jaw
Wenli Li
Zhihui Lu
Sy-Yen Kuo
Bo-Hao Chen
Thanisa Numnonda
Haze removal techniques employed to increase the visibility level of an image play an important role in many vision-based systems. Several traditional dark channel prior-based methods have been proposed to remove haze formation and thereby enhance the robustness of these systems. However, when the captured images contain disproportionate haze distributions, these methods usually fail to attain effective restoration in the restored image. Specifically, disproportionate haze distribution in an image means that the background region possesses heavy haze density and the foreground region possesses little haze density. This phenomenon usually occurs in a hazy image with a deep depth of field. In response, a novel hybrid transmission map-based haze removal method that specifically targets this situation is proposed in this work to achieve clear visibility restoration and effective information maintenance. Experimental results via both qualitative and quantitative evaluations demonstrate that the proposed method is capable of performing with higher efficacy when compared with other state-of-the-art methods, with respect to both background regions and foreground regions of restored test images captured in real-world environments.
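The dark channel prior that this line of work builds on can be illustrated in a few lines. The sketch below is a generic, minimal implementation of the dark channel computation only (not the paper's hybrid transmission-map method), assuming the input image is an RGB NumPy array with values in [0, 1]; the function name and patch size are illustrative choices.

```python
import numpy as np

def dark_channel(image, patch=15):
    """Compute the dark channel of an RGB image.

    For each pixel, take the minimum intensity across the three colour
    channels, then take the minimum over a local square patch. In haze-free
    outdoor images this value tends toward zero (the dark channel prior);
    large values indicate dense haze.
    """
    h, w, _ = image.shape
    min_rgb = image.min(axis=2)            # per-pixel minimum over channels
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    dark = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

# A uniformly bright (hazy) region yields a high dark-channel value,
# while a region containing one dark pixel yields a low one.
img = np.full((20, 20, 3), 0.9)
img[10, 10] = (0.05, 0.05, 0.05)
dc = dark_channel(img, patch=5)
print(dc[10, 10])  # 0.05: the dark pixel dominates its patch
print(dc[0, 0])    # 0.9: uniformly bright patch
```

Dehazing methods then estimate a transmission map from this dark channel and invert the atmospheric scattering model; that part is method-specific and omitted here.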
A Novel and Dedicated Machine Learning Model for Malware Classification
Miles Q. Li
Philippe Charland
Steven H. H. Ding
Malicious executables are comprised of functions that can be represented in assembly code. In the assembly code mining literature, many software reverse engineering tools have been created to disassemble executables, search function clones, and find vulnerabilities, among others. The development of a machine learning-based malware classification model that can simultaneously achieve excellent classification performance and provide insightful interpretation for the classification results remains a hot research topic. In this paper, we propose a novel and dedicated machine learning model for the research problem of malware classification. Our proposed model generates assembly code function clusters based on function representation learning and provides excellent interpretability for the classification results. It does not require a large or balanced training dataset, which matches real-life scenarios. Experiments show that our proposed approach outperforms previous state-of-the-art malware classification models and provides meaningful interpretation of classification results.
A Novel Neural Network-Based Malware Severity Classification System
Miles Q. Li
The Topic Confusion Task: A Novel Scenario for Authorship Attribution
Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by failure to capture authorship style, by the topic shift, or by other factors. Motivated by this, we propose the topic confusion task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the models' confusion by the switch because the features capture the topics, and one caused by the features' inability to capture the writing styles, leading to weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level n-grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task, and are outperformed by simple n-gram features.
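The idea of attributing authorship from part-of-speech n-gram profiles can be sketched minimally. The example below is a generic illustration, not the paper's evaluated pipeline: it assumes the texts have already been POS-tagged (the tag sequences are hypothetical tagger output), builds bigram frequency profiles, and attributes a query text to the candidate with the most similar profile under cosine similarity.

```python
from collections import Counter
from math import sqrt

def pos_ngrams(tags, n=2):
    """Frequency profile of part-of-speech tag n-grams."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[k] * q[k] for k in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy POS sequences standing in for tagger output (hypothetical data).
author_a = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "ADV"]
author_b = ["PRON", "VERB", "ADJ", "NOUN", "PRON", "VERB", "ADJ"]
unknown  = ["DET", "NOUN", "VERB", "ADV", "DET", "NOUN"]

profiles = {"A": pos_ngrams(author_a), "B": pos_ngrams(author_b)}
query = pos_ngrams(unknown)
best = max(profiles, key=lambda a: cosine(profiles[a], query))
print(best)  # "A": attribution by most similar POS-bigram profile
```

Because POS tags abstract away content words, such profiles are less tied to topic than word n-grams, which is the intuition the paper tests.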
Toward Tweet-Mining Framework for Extracting Terrorist Attack-Related Information and Reporting
Farkhund Iqbal
Rabia Batool
Saiqa Aleem
Ahmed Abbasi
Abdul Rehman Javed
The widespread popularity of social networking is leading to the adoption of Twitter as an information dissemination tool. Existing research has shown that information dissemination over Twitter has a much broader reach than traditional media and can be used for effective post-incident measures. People use informal language on Twitter, including acronyms, misspelled words, synonyms, transliteration, and ambiguous terms. This makes incident-related information extraction a non-trivial task. However, this information can be valuable for public safety organizations that need to respond in an emergency. This paper proposes an early event-related information extraction and reporting framework that monitors Twitter streams, synthesizes event-specific information, e.g., a terrorist attack, and alerts law enforcement, emergency services, and media outlets. Specifically, the proposed framework, Tweet-to-Act (T2A), employs word embedding to transform tweets into a vector space model and then utilizes the Word Mover's Distance (WMD) to cluster tweets for the identification of incidents. To extract reliable and valuable information from a large dataset of short and informal tweets, the proposed framework employs sequence labeling with bidirectional Long Short-Term Memory based Recurrent Neural Networks (bLSTM-RNN). Extensive experimental results suggest that our proposed framework, T2A, outperforms other state-of-the-art methods that use vector space modeling and distance calculation techniques, e.g., Euclidean and Cosine distance. T2A achieves an accuracy of 96% and an F1-score of 86.2% on real-life datasets.
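The vector space model with cosine distance that serves as a baseline here can be shown in a few lines. This is a sketch of that generic baseline only (not T2A, which uses word embeddings and WMD), with hypothetical tweets: two reporting the same incident and one unrelated.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector (term -> count) for a tweet."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Cosine distance between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

# Hypothetical tweets; the first two describe the same incident.
tweets = [
    "explosion reported near the main station",
    "huge explosion near station many injured",
    "great coffee at the new cafe downtown",
]
vecs = [bow(t) for t in tweets]
# Incident-related tweets are closer to each other than to the unrelated one.
print(cosine_distance(vecs[0], vecs[1]) < cosine_distance(vecs[0], vecs[2]))  # True
```

The paper's point is that exact-term matching like this fails on informal spellings and synonyms, which is why T2A moves to embeddings and Word Mover's Distance.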
Learning Inter-Modal Correspondence and Phenotypes From Multi-Modal Electronic Health Records
Kejing Yin
William K. Cheung
Jonathan Poon
Non-negative tensor factorization has been shown a practical solution to automatically discover phenotypes from the electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.
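As background, the non-negative factorization building block underlying such phenotyping methods can be sketched with classic Lee-Seung multiplicative updates. This is a minimal matrix (not tensor) factorization on synthetic data, not the cHITF model itself; the rank, iteration count, and data are illustrative assumptions.

```python
import numpy as np

def nmf(X, rank, iters=500, seed=0):
    """Non-negative matrix factorization X ~ W @ H via multiplicative updates.

    Each column of W can be read as a 'phenotype' (a non-negative
    combination of features); H gives per-sample loadings. Multiplicative
    updates keep W and H non-negative throughout.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)   # Lee-Seung update for H
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)   # Lee-Seung update for W
    return W, H

# Synthetic non-negative data with an exact rank-2 structure.
rng = np.random.default_rng(1)
X = rng.random((30, 2)) @ rng.random((2, 20))
W, H = nmf(X, rank=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err < 0.1)  # reconstruction error is small for rank-2 data
```

cHITF extends this family by treating the inter-modal correspondence itself as a latent object to be inferred jointly with the factors, rather than requiring a pre-built interaction tensor.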
Trends and Applications in Knowledge Discovery and Data Mining
Lida Rashidi
Can Wang