Portrait de Warren Gross

Warren Gross

Membre académique associé
Professeur, McGill University, Département de génie électrique et informatique

Biographie

Warren Gross est professeur titulaire de la chaire James McGill et directeur du Département de génie électrique et informatique de l'Université McGill. Dans ses recherches, il s’intéresse au rapprochement entre les algorithmes et leur mise en œuvre dans les domaines de l'apprentissage automatique et des communications numériques. Ses travaux portent sur les modèles efficaces d'apprentissage profond, le matériel pour l'apprentissage automatique, l'informatique stochastique, l'exploration matérielle de l'espace de conception pour les réseaux neuronaux, l'apprentissage automatique pour les communications numériques, ainsi que les algorithmes de décodage efficaces et le matériel pour les codes correcteurs d'erreurs.

Publications

Step-GRAND: A Low Latency Universal Soft-Input Decoder
Syed Mohsin Abbas
Marwan Jalaleddine
Chi-Ying Tsui
GRAND features both soft-input and hard-input variants that are well suited to efficient hardware implementations that can be characterized … (voir plus)with achievable average and worst-case decoding latency. This paper introduces step-GRAND, a soft-input variant of GRAND that, in addition to achieving appealing average decoding latency, also reduces the worst-case decoding latency of the corresponding hardware implementation. The hardware implementation results demonstrate that the proposed step-GRAND can decode CA-polar code (128,105+11) with an average information throughput of 47.7 Gbps at the target FER of
xSA: A Binary Cross-Entropy Simulated Annealing Polar Decoder
Ryan Seah
Huayi Zhou
Marwan Jalaleddine
Polar decoders such as successive-cancellation and successive-cancellation list decoders are limited by their sequential nature, which leads… (voir plus) to a linear increase in latency with the codeword length. Heuristic based decoders such as quantum annealing have been proposed to overcome this limitation. However, these decoders have shown poor performance when decoding polar codes with more than eight bits. In this paper, we developed new meta-heuristic based polar decoder, called xSA, which uses a new receiver constraint modeled by the binary cross-entropy function. We also propose a method to determine the weights used in a quadratic unconstrained binary optimization (QUBO) function. The polar code is assumed to have been sent across an AWGN channel and we conducted our experiments and simulations using PyQUBO and dwave-neal. Our results show that xSA is able to decode codes of length 16 and 32 with a near-ML FER performance, presenting itself as a promising alternative to traditional polar decoders for real world applications and next generation cellular communications.
Efficient 1D Grouped Convolution for PyTorch a Case Study: Fast On-Device Fine-Tuning for SqueezeBERT
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
Grouped convolution has been observed to be an effective approximation for convolution in many DNN applications. For example, SqueezeBERT, w… (voir plus)hich is a light and fast BERT language processing model, utilizes 1D grouped convolutions. Though SqueezeBERT is well-optimized for inference on edge devices, it suffers from poor memory management during fine-tuning (training). This results in longer fine-tuning time on resource-limited GPUs compared to the original BERT model, BERT-base, despite being specifically designed for edge devices. We study this behavior and show that this poor memory management originates from the use of 1D grouped convolutions in SqueezeBERT. We re-implement 1D grouped convolutions using fully-connected layers, addressing the poor memory allocation and data locality of 1D grouped convolutions. We show that our method is well-suited for edge devices with limited memory; further, it has a negligible effect on inference speed. When utilizing our method, we observe a 42 % reduction in fine-tuning time for SqueezeBERT on edge devices.
Partial Ordered Statistics Decoding with Enhanced Error Patterns
Marwan Jalaleddine
Huayi Zhou
Jiajie Li
Guessing Random Additive Noise Decoding (GRAND) excels at decoding high-rate codes but struggles to decode low-rate codes with reasonable co… (voir plus)mplexity. Ordered Statistics Decoding (OSD) specifically excels in decoding short codes irrespective of rates; however, OSD necessitates the use of Gaussian elimination which introduces additional time, space and computational complexity. Partial Ordered Statistics Decoding (POSD) was proposed to reduce the time, space, and computational complexity of OSD; however, the current partition-based POSD has poor decoding performance since it does not generate test error patterns across partitions. In this paper, we propose to improve the decoding performance of POSD by incorporating test error patterns inspired by GRAND methods. This work offers a trade-off between performance and complexity compared to existing decoders such as GRAND and OSD. We enhance POSD by optimizing the scheduling of Test Error Patterns (TEPs) and show that our technique can be applied to any code in a standard form. At a target BER 10−4 with eBCH (128,64) the enhanced error patterns achieve more than 0.6 dB gain in performance compared to the POSD with partition-based error patterns. Moreover, at a target frame error rate of 10−5, POSD uses 10× less binary operations compared to GRAND when decoding eBCH (128,64) and RLC(128,64) codes. With BCH (127,29) and RLC(128,32), at a target frame error rate of 10−2, POSD with enhanced error patterns with a maximum number of queries (MQ) of 104 achieves up to a 2 dB gain to its GRAND equivalent which is using 107 maximum number of queries.
High-Throughput Edge Inference for BERT Models via Neural Architecture Search and Pipeline
Hung-Yang Chang
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
There has been growing interest in improving the BERT inference throughput on resource-constrained edge devices for a satisfactory user expe… (voir plus)rience. One methodology is to employ heterogeneous computing, which utilizes multiple processing elements to accelerate inference. Another methodology is to deploy Neural Architecture Search (NAS) to find optimal solutions in accuracy-throughput design space. In this paper, for the first time, we incorporate NAS with pipelining for BERT models. We show that performing NAS with pipelining achieves on average 53% higher throughput, compared to NAS with a homogeneous system. Additionally, we propose a NAS algorithm that incorporates hardware performance feedback to accelerate the NAS process. Our proposed NAS algorithm speeds up the search process by ~4x, and 5.5x on the design space of the BERT and CNNs, respectively. Also, by exploring the accuracy-throughput design space of BERT models, we demonstrate that performing pipelining then NAS (Pipeline-then-NAS) can lead to solutions with up to 9x higher inference throughput, compared to running homogeneous inference on the BERT-base model, with only a 1.3% decrease in accuracy.
Fast Fine-Tuning Using Curriculum Domain Adaptation
Lulan Shen
Ibtihel Amara
Ruofeng Li
Brett Meyer
James J. Clark
Current deep neural networks (DNNs) have achieved remarkable accuracy in various downstream tasks. However, their training and fine-tuning a… (voir plus)re challenging due to several factors, such as limited computational resources, extended training and fine-tuning times, and over-fitting due to small datasets. To address these challenges, we propose a three-stage fast fine-tuning method that efficiently trains DNNs for edge devices. Our method combines curriculum learning and domain adaptation techniques to accelerate training while achieving comparable performance. First, we develop a data curriculum approach, which ranks the dataset according to difficulty and split it into the source domain (containing easy data) and the target domain (containing difficult data). Second, we adapt the pretrained model from the source domain to the target domain using an unsupervised domain adaptation (UDA) method called Deep CORAL. Finally, we continue training the adapted model on the source domain with fewer epochs. Our method achieves high accuracy quickly on various modern neural network architectures and datasets such as CIFAR-10, CIFAR-100, and CINIC-10.
Hybrid GRAND Sphere Decoding: Accelerated GRAND for Low-Rate Codes
Huayi Zhou
Guessing random additive noise decoding (GRAND) and sphere decoding (SD) are two algorithms that can achieve maximum likelihood decoding. In… (voir plus) this paper, a hybrid GRAND-SD (HGRAND) scheme is proposed to extend GRAND to low-rate codes. An accelerated GRAND decoder, assisted by a sphere decoder running in parallel and giving hints to it to allow skipping of certain candidates allows HGRAND to achieve a latency below the minimum latency of the individual component decoders while guaranteeing error-correction performance.
Training Acceleration of Frequency Domain CNNs Using Activation Compression
Seyyed Hasan Mozafari
James J. Clark
Brett Meyer
Reducing the complexity of training convolutional neural networks results in lower energy consumption expended during training, or higher ac… (voir plus)curacy by admitting a greater number of training epochs within a training time budget. During backpropagation, a considerable amount of temporary data is offloaded from GPU memory to CPU memory, increasing training time. In this paper, we address this training time overhead by introducing an activation compression technique for frequency domain convolutional neural networks. Applying this compression technique on frequency domain AlexNet results in activation compression of 57.7%, and a reduction of training time by 23%, with a negligible effect on classification accuracy.
Low-Complexity Sphere Decoding for Polar-Coded MIMO Systems
Huayi Zhou
Jian Zheng
Minhua Yang
Xiaohu You
Chuan Zhang
For polar-coded MIMO systems, separate detection and decoding (SDD) is the traditional scheme. In SDD systems, sphere decoding (SD) is one o… (voir plus)f the competitive MIMO detection schemes. However, SD may not utilize the coding information sufficiently in SDD systems, causing an error-correction performance loss. The existed joint detection and decoding using breadth-first SD (BSD) improves the performance than SDD, whereas the limited search space still causes a performance loss. In this paper, we propose joint detection and decoding based on SD (SD JDD) for polar-coded MIMO systems to reach maximum likelihood (ML) bound. Subsequently, two approaches are further proposed to reduce the computational complexity. The first approach reduces the layers of the SD search tree by exploiting symbol synchro sets, which could accelerate the convergence of SD JDD. The second efficient approach performs multiple tree searches. A small initial radius of the sphere for the first search is assigned to reduce the search space. The ML optimality could be preserved by the following multiple tree searches with increasing radius. It is shown from the numerical results that the proposed JDD outperforms SDD by 3.1 dB at FER
SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic Segmentation
O. Therrien
Marihan Amein
Zhuoran Xiong
Brett Meyer
We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. I… (voir plus)t uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form 2.88 * 10^17 possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.
FMAS: Fast Multi-Objective SuperNet Architecture Search for Semantic Segmentation
Zhuoran Xiong
Marihan Amein
O. Therrien
Brett Meyer
Memory-Efficient FPGA Implementation of Stochastic Simulated Annealing
Duckgyu Shin
Naoya Onizawa
Takahiro Hanyu
Simulated annealing (SA) is a well-known algorithm for solving combinatorial optimization problems. However, the computation time of SA incr… (voir plus)eases rapidly, as the size of the problem grows. Recently, a stochastic simulated annealing (SSA) algorithm that converges faster than conventional SA has been reported. In this paper, we present a hardware-aware SSA (HA-SSA) algorithm for memory-efficient FPGA implementations. HA-SSA can reduce the memory usage of storing intermediate results while maintaining the computing speed of SSA. For evaluation purposes, the proposed algorithm is compared with the conventional SSA and SA approaches on maximum cut combinatorial optimization problems. HA-SSA achieves a convergence speed that is up to 114-times faster than that of the conventional SA algorithm depending on the maximum cut problem selected from the G-set which is a dataset of the maximum cut problems. HA-SSA is implemented on a field-programmable gate array (FPGA) (Xilinx Kintex-7), and it achieves up to 6-times the memory efficiency of conventional SSA while maintaining high solution quality for optimization problems.