Portrait of Warren Gross

Warren Gross

Associate Academic Member
Professor, McGill University, Department of Electrical and Computer Engineering
Research Topics
Computer Systems
Deep Learning
Information Theory
Natural Language Processing
Optimization

Biography

Warren Gross is a James McGill Professor and chair of the Department of Electrical and Computer Engineering at McGill University.

His research interests lie in bridging algorithms and implementation in machine learning and digital communications. His work focuses on efficient deep learning models, hardware for machine learning, stochastic computing, hardware-aware design-space exploration for neural networks, machine learning for digital communications, and efficient decoding algorithms and hardware for error-correcting codes.

Current Students

Publications

Decoding of Polar Codes Using Quadratic Unconstrained Binary Optimization
Huayi Zhou
Ryan Seah
Marwan Jalaleddine
Stochastic Simulated Quantum Annealing for Fast Solution of Combinatorial Optimization Problems
Naoya Onizawa
Ryoma Sasaki
Duckgyu Shin
Takahiro Hanyu
In this paper, we introduce stochastic simulated quantum annealing (SSQA) for large-scale combinatorial optimization problems. SSQA is desig… (see more)ned based on stochastic computing and quantum Monte Carlo, which can simulate quantum annealing (QA) by using multiple replicas of spins (probabilistic bits) in classical computing. The use of stochastic computing leads to an efficient parallel spin-state update algorithm, enabling quick search for a solution around the global minimum energy. Therefore, SSQA realizes quantum-like annealing for large-scale problems and can handle fully connected models in combinatorial optimization, unlike QA. The proposed method is evaluated in MATLAB on graph isomorphism problems, which are typical combinatorial optimization problems. The proposed method achieves a convergence speed an order of magnitude faster than a conventional stochastic simulaated annealing method. Additionally, it can handle a 100-times larger problem size compared to QA and a 25-times larger problem size compared to a traditional SA method, respectively, for similar convergence probabilities.
Step-GRAND: A Low Latency Universal Soft-Input Decoder
Syed Mohsin Abbas
Marwan Jalaleddine
Chi-Ying Tsui
GRAND features both soft-input and hard-input variants that are well suited to efficient hardware implementations that can be characterized … (see more)with achievable average and worst-case decoding latency. This paper introduces step-GRAND, a soft-input variant of GRAND that, in addition to achieving appealing average decoding latency, also reduces the worst-case decoding latency of the corresponding hardware implementation. The hardware implementation results demonstrate that the proposed step-GRAND can decode CA-polar code (128,105+11) with an average information throughput of 47.7 Gbps at the target FER of
xSA: A Binary Cross-Entropy Simulated Annealing Polar Decoder
Ryan Seah
Huayi Zhou
Marwan Jalaleddine
Polar decoders such as successive-cancellation and successive-cancellation list decoders are limited by their sequential nature, which leads… (see more) to a linear increase in latency with the codeword length. Heuristic based decoders such as quantum annealing have been proposed to overcome this limitation. However, these decoders have shown poor performance when decoding polar codes with more than eight bits. In this paper, we developed new meta-heuristic based polar decoder, called xSA, which uses a new receiver constraint modeled by the binary cross-entropy function. We also propose a method to determine the weights used in a quadratic unconstrained binary optimization (QUBO) function. The polar code is assumed to have been sent across an AWGN channel and we conducted our experiments and simulations using PyQUBO and dwave-neal. Our results show that xSA is able to decode codes of length 16 and 32 with a near-ML FER performance, presenting itself as a promising alternative to traditional polar decoders for real world applications and next generation cellular communications.
Efficient 1D Grouped Convolution for PyTorch a Case Study: Fast On-Device Fine-Tuning for SqueezeBERT
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
Grouped convolution has been observed to be an effective approximation for convolution in many DNN applications. For example, SqueezeBERT, w… (see more)hich is a light and fast BERT language processing model, utilizes 1D grouped convolutions. Though SqueezeBERT is well-optimized for inference on edge devices, it suffers from poor memory management during fine-tuning (training). This results in longer fine-tuning time on resource-limited GPUs compared to the original BERT model, BERT-base, despite being specifically designed for edge devices. We study this behavior and show that this poor memory management originates from the use of 1D grouped convolutions in SqueezeBERT. We re-implement 1D grouped convolutions using fully-connected layers, addressing the poor memory allocation and data locality of 1D grouped convolutions. We show that our method is well-suited for edge devices with limited memory; further, it has a negligible effect on inference speed. When utilizing our method, we observe a 42 % reduction in fine-tuning time for SqueezeBERT on edge devices.
Partial Ordered Statistics Decoding with Enhanced Error Patterns
Marwan Jalaleddine
Huayi Zhou
Jiajie Li
Guessing Random Additive Noise Decoding (GRAND) excels at decoding high-rate codes but struggles to decode low-rate codes with reasonable co… (see more)mplexity. Ordered Statistics Decoding (OSD) specifically excels in decoding short codes irrespective of rates; however, OSD necessitates the use of Gaussian elimination which introduces additional time, space and computational complexity. Partial Ordered Statistics Decoding (POSD) was proposed to reduce the time, space, and computational complexity of OSD; however, the current partition-based POSD has poor decoding performance since it does not generate test error patterns across partitions. In this paper, we propose to improve the decoding performance of POSD by incorporating test error patterns inspired by GRAND methods. This work offers a trade-off between performance and complexity compared to existing decoders such as GRAND and OSD. We enhance POSD by optimizing the scheduling of Test Error Patterns (TEPs) and show that our technique can be applied to any code in a standard form. At a target BER 10−4 with eBCH (128,64) the enhanced error patterns achieve more than 0.6 dB gain in performance compared to the POSD with partition-based error patterns. Moreover, at a target frame error rate of 10−5, POSD uses 10× less binary operations compared to GRAND when decoding eBCH (128,64) and RLC(128,64) codes. With BCH (127,29) and RLC(128,32), at a target frame error rate of 10−2, POSD with enhanced error patterns with a maximum number of queries (MQ) of 104 achieves up to a 2 dB gain to its GRAND equivalent which is using 107 maximum number of queries.
High-Throughput Edge Inference for BERT Models via Neural Architecture Search and Pipeline
Hung-Yang Chang
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
There has been growing interest in improving the BERT inference throughput on resource-constrained edge devices for a satisfactory user expe… (see more)rience. One methodology is to employ heterogeneous computing, which utilizes multiple processing elements to accelerate inference. Another methodology is to deploy Neural Architecture Search (NAS) to find optimal solutions in accuracy-throughput design space. In this paper, for the first time, we incorporate NAS with pipelining for BERT models. We show that performing NAS with pipelining achieves on average 53% higher throughput, compared to NAS with a homogeneous system. Additionally, we propose a NAS algorithm that incorporates hardware performance feedback to accelerate the NAS process. Our proposed NAS algorithm speeds up the search process by ~4x, and 5.5x on the design space of the BERT and CNNs, respectively. Also, by exploring the accuracy-throughput design space of BERT models, we demonstrate that performing pipelining then NAS (Pipeline-then-NAS) can lead to solutions with up to 9x higher inference throughput, compared to running homogeneous inference on the BERT-base model, with only a 1.3% decrease in accuracy.
Fast Fine-Tuning Using Curriculum Domain Adaptation
Lulan Shen
Ibtihel Amara
Ruofeng Li
Brett Meyer
James J. Clark
Current deep neural networks (DNNs) have achieved remarkable accuracy in various downstream tasks. However, their training and fine-tuning a… (see more)re challenging due to several factors, such as limited computational resources, extended training and fine-tuning times, and over-fitting due to small datasets. To address these challenges, we propose a three-stage fast fine-tuning method that efficiently trains DNNs for edge devices. Our method combines curriculum learning and domain adaptation techniques to accelerate training while achieving comparable performance. First, we develop a data curriculum approach, which ranks the dataset according to difficulty and split it into the source domain (containing easy data) and the target domain (containing difficult data). Second, we adapt the pretrained model from the source domain to the target domain using an unsupervised domain adaptation (UDA) method called Deep CORAL. Finally, we continue training the adapted model on the source domain with fewer epochs. Our method achieves high accuracy quickly on various modern neural network architectures and datasets such as CIFAR-10, CIFAR-100, and CINIC-10.
Hybrid GRAND Sphere Decoding: Accelerated GRAND for Low-Rate Codes
Huayi Zhou
Guessing random additive noise decoding (GRAND) and sphere decoding (SD) are two algorithms that can achieve maximum likelihood decoding. In… (see more) this paper, a hybrid GRAND-SD (HGRAND) scheme is proposed to extend GRAND to low-rate codes. An accelerated GRAND decoder, assisted by a sphere decoder running in parallel and giving hints to it to allow skipping of certain candidates allows HGRAND to achieve a latency below the minimum latency of the individual component decoders while guaranteeing error-correction performance.
Training Acceleration of Frequency Domain CNNs Using Activation Compression
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
Reducing the complexity of training convolutional neural networks results in lower energy consumption expended during training, or higher ac… (see more)curacy by admitting a greater number of training epochs within a training time budget. During backpropagation, a considerable amount of temporary data is offloaded from GPU memory to CPU memory, increasing training time. In this paper, we address this training time overhead by introducing an activation compression technique for frequency domain convolutional neural networks. Applying this compression technique on frequency domain AlexNet results in activation compression of 57.7%, and a reduction of training time by 23%, with a negligible effect on classification accuracy.
Low-Complexity Sphere Decoding for Polar-Coded MIMO Systems
Huayi Zhou
Jian Zheng
Minhua Yang
Xiaohu You
Chuan Zhang
For polar-coded MIMO systems, separate detection and decoding (SDD) is the traditional scheme. In SDD systems, sphere decoding (SD) is one o… (see more)f the competitive MIMO detection schemes. However, SD may not utilize the coding information sufficiently in SDD systems, causing an error-correction performance loss. The existed joint detection and decoding using breadth-first SD (BSD) improves the performance than SDD, whereas the limited search space still causes a performance loss. In this paper, we propose joint detection and decoding based on SD (SD JDD) for polar-coded MIMO systems to reach maximum likelihood (ML) bound. Subsequently, two approaches are further proposed to reduce the computational complexity. The first approach reduces the layers of the SD search tree by exploiting symbol synchro sets, which could accelerate the convergence of SD JDD. The second efficient approach performs multiple tree searches. A small initial radius of the sphere for the first search is assigned to reduce the search space. The ML optimality could be preserved by the following multiple tree searches with increasing radius. It is shown from the numerical results that the proposed JDD outperforms SDD by 3.1 dB at FER
SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic Segmentation
Olivier Therrien
Marihan Amein
Zhuoran Xiong
Brett Meyer
We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. I… (see more)t uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form 2.88 * 10^17 possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.