Portrait of Warren Gross

Warren Gross

Associate Academic Member
Professor, McGill University, Department of Electrical and Computer Engineering
Research Topics
Deep Learning
Information Theory
Natural Language Processing
Optimization

Biography

Warren Gross is a James McGill Professor and chair of the Department of Electrical and Computer Engineering at McGill University.

His research interests lie in bridging algorithms and implementation in machine learning and digital communications. His work focuses on efficient deep learning models, hardware for machine learning, stochastic computing, hardware-aware design-space exploration for neural networks, machine learning for digital communications, and efficient decoding algorithms and hardware for error-correcting codes.

Current Students

Publications

High-Throughput Energy-Efficient LDPC Decoders Using Differential Binary Message Passing
Kevin Cushon
Saied Hemati
Camille Leroux
Shie Mannor
In this paper, we present energy-efficient architectures for decoders of low-density parity check (LDPC) codes using the differential decodi… (see more)ng with binary message passing (DD-BMP) algorithm and its modified variant (MDD-BMP). We also propose an improved differential binary (IDB) decoding algorithm. These algorithms offer significant intrinsic advantages in the energy domain: simple computations, low interconnect complexity, and very high throughput, while achieving error correction performance up to within 0.25 dB of the offset min-sum algorithm. We report on fully parallel decoder implementations of (273, 191), (1023, 781), and (4095, 3367) finite geometry-based LDPC codes in 65 nm CMOS. Using the MDD-BMP algorithm, these decoders achieve respective areas of 0.28 mm2, 1.38 mm2, and 15.37 mm2, average throughputs of 37 Gbps, 75 Gbps, and 141 Gbps, and energy efficiencies of 4.9 pJ/bit, 13.2 pJ/bit, and 37.9 pJ/bit with a 1.0 V supply voltage in post-layout simulations. At a reduced supply voltage of 0.8 V, these decoders achieve respective throughputs of 26 Gbps, 54 Gbps, and 94 Gbps, and energy efficiencies of 3.1 pJ/bit, 8.2 pJ/bit, and 23.5 pJ/bit. We also report on a fully parallel implementation of IDB for the (2048, 1723) LDPC code specified in the IEEE 802.3an (10GBASE-T) standard. This decoder achieves an area of 1.44 mm2, average throughput of 172 Gbps, and an energy efficiency of 2.8 pJ/bit with a 1.0 V supply voltage; at 0.8 V, it achieves throughput of 116 Gbps and energy efficiency of 1.7 pJ/bit.
High-Throughput Energy-Efficient LDPC Decoders Using Differential Binary Message Passing
Kevin Cushon
Saied Hemati
Camille Leroux
Shie Mannor
In this paper, we present energy-efficient architectures for decoders of low-density parity check (LDPC) codes using the differential decodi… (see more)ng with binary message passing (DD-BMP) algorithm and its modified variant (MDD-BMP). We also propose an improved differential binary (IDB) decoding algorithm. These algorithms offer significant intrinsic advantages in the energy domain: simple computations, low interconnect complexity, and very high throughput, while achieving error correction performance up to within 0.25 dB of the offset min-sum algorithm. We report on fully parallel decoder implementations of (273, 191), (1023, 781), and (4095, 3367) finite geometry-based LDPC codes in 65 nm CMOS. Using the MDD-BMP algorithm, these decoders achieve respective areas of 0.28 mm2, 1.38 mm2, and 15.37 mm2, average throughputs of 37 Gbps, 75 Gbps, and 141 Gbps, and energy efficiencies of 4.9 pJ/bit, 13.2 pJ/bit, and 37.9 pJ/bit with a 1.0 V supply voltage in post-layout simulations. At a reduced supply voltage of 0.8 V, these decoders achieve respective throughputs of 26 Gbps, 54 Gbps, and 94 Gbps, and energy efficiencies of 3.1 pJ/bit, 8.2 pJ/bit, and 23.5 pJ/bit. We also report on a fully parallel implementation of IDB for the (2048, 1723) LDPC code specified in the IEEE 802.3an (10GBASE-T) standard. This decoder achieves an area of 1.44 mm2, average throughput of 172 Gbps, and an energy efficiency of 2.8 pJ/bit with a 1.0 V supply voltage; at 0.8 V, it achieves throughput of 116 Gbps and energy efficiency of 1.7 pJ/bit.
Adaptive Multiset Stochastic Decoding of Non-Binary LDPC Codes
Alexandru Ciobanu
Saied Hemati
We propose a non-binary stochastic decoding algorithm for low-density parity-check (LDPC) codes over GF(q) with degree two variable nodes, c… (see more)alled Adaptive Multiset Stochastic Algorithm (AMSA). The algorithm uses multisets, an extension of sets that allows multiple occurrences of an element, to represent probability mass functions that simplifies the structure of the variable nodes. The run-time complexity of one decoding cycle using AMSA is O(q) for conventional memory architectures, and O(1) if a custom memory architecture is used. Two fully-parallel AMSA decoders are implemented on FPGA for two (192,96) (2,4)-regular codes over GF(64) and GF(256), both achieving a maximum clock frequency of 108 MHz. The GF(64) decoder has a coded throughput of 65 Mb/s at Eb/N0=2.4 dB when using conventional memory, while a decoder using the custom memory version can achieve 698 Mb/s at the same Eb/N0. At a frame error rate (FER) of 2×10-6 the GF(64) version of the algorithm is only 0.04 dB away from the floating-point SPA performance, and for the GF(256) code the difference is 0.2 dB. To the best of our knowledge, this is the first fully parallel non-binary LDPC decoder over GF(256) reported in the literature.
Adaptive Multiset Stochastic Decoding of Non-Binary LDPC Codes
Alexandru Sorin Ciobanu
Saied Hemati
We propose a non-binary stochastic decoding algorithm for low-density parity-check (LDPC) codes over GF(q) with degree two variable nodes, c… (see more)alled Adaptive Multiset Stochastic Algorithm (AMSA). The algorithm uses multisets, an extension of sets that allows multiple occurrences of an element, to represent probability mass functions that simplifies the structure of the variable nodes. The run-time complexity of one decoding cycle using AMSA is O(q) for conventional memory architectures, and O(1) if a custom memory architecture is used. Two fully-parallel AMSA decoders are implemented on FPGA for two (192,96) (2,4)-regular codes over GF(64) and GF(256), both achieving a maximum clock frequency of 108 MHz. The GF(64) decoder has a coded throughput of 65 Mb/s at Eb/N0=2.4 dB when using conventional memory, while a decoder using the custom memory version can achieve 698 Mb/s at the same Eb/N0. At a frame error rate (FER) of 2×10-6 the GF(64) version of the algorithm is only 0.04 dB away from the floating-point SPA performance, and for the GF(256) code the difference is 0.2 dB. To the best of our knowledge, this is the first fully parallel non-binary LDPC decoder over GF(256) reported in the literature.
A Scalable Successive-Cancellation Decoder for Polar Codes
Alexandre J. Raymond
Polar codes are the first error-correcting codes to provably achieve channel capacity, asymptotically in code length, with an explicit const… (see more)ruction. However, under successive-cancellation decoding, polar codes require very long code lengths to compete with existing modern codes. Nonetheless, the successive cancellation algorithm enables very-low-complexity implementations in hardware, due to the regular structure exhibited by polar codes. In this paper, we present an improved architecture for successive-cancellation decoding of polar codes, making use of a novel semi-parallel, encoder-based partial-sum computation module. We also provide quantization results for realistic code length N=215, and explore various optimization techniques such as a chained processing element and a variable quantization scheme. This design is shown to scale to code lengths of up to N=221, enabled by its low logic use, low register use and simple datapaths, limited almost exclusively by the amount of available SRAM. It also supports an overlapped loading of frames, allowing full-throughput decoding with a single set of input buffers.
A Semi-Parallel Successive-Cancellation Decoder for Polar Codes
Camille Leroux
Alexandre J. Raymond
Gabi Sarkis
Polar codes are a recently discovered family of capacity-achieving codes that are seen as a major breakthrough in coding theory. Motivated b… (see more)y the recent rapid progress in the theory of polar codes, we propose a semi-parallel architecture for the implementation of successive cancellation decoding. We take advantage of the recursive structure of polar codes to make efficient use of processing resources. The derived architecture has a very low processing complexity while the memory complexity remains similar to that of previous architectures. This drastic reduction in processing complexity allows very large polar code decoders to be implemented in hardware. An N=217 polar code successive cancellation decoder is implemented in an FPGA. We also report synthesis results for ASIC.
A Semi-Parallel Successive-Cancellation Decoder for Polar Codes
Camille Leroux
Alexandre J. Raymond
Gabi Sarkis
Polar codes are a recently discovered family of capacity-achieving codes that are seen as a major breakthrough in coding theory. Motivated b… (see more)y the recent rapid progress in the theory of polar codes, we propose a semi-parallel architecture for the implementation of successive cancellation decoding. We take advantage of the recursive structure of polar codes to make efficient use of processing resources. The derived architecture has a very low processing complexity while the memory complexity remains similar to that of previous architectures. This drastic reduction in processing complexity allows very large polar code decoders to be implemented in hardware. An N=217 polar code successive cancellation decoder is implemented in an FPGA. We also report synthesis results for ASIC.
Delayed Stochastic Decoding of LDPC Codes
Ali Naderi
Shie Mannor
Mohamad Sawan
A new stochastic decoding algorithm, called Delayed Stochastic (DS) decoding, is introduced to implement low-density-parity-check (LDPC) dec… (see more)oders. The delayed stochastic decoding uses an alternative method to track probability values, which results in reduction of hardware complexity and memory requirement of the stochastic decoders. It is therefore suitable for fully-parallel implementation of long LDPC codes with applications in optical communications. Two decoders are implemented using the DS algorithm for medium (2048, 1723) and long (32768, 26624) LDPC codes. The decoders occupy 3.93- mm2 and 56.5- mm2 silicon area using 90-nm CMOS technology and provide maximum core throughputs of 172.4 and 477.7 Gb/s at [(Eb)/(No)]=5.5 and 4.8 dB, respectively.
Delayed Stochastic Decoding of LDPC Codes
Ali Naderi
Shie Mannor
M. Sawan
A new stochastic decoding algorithm, called Delayed Stochastic (DS) decoding, is introduced to implement low-density-parity-check (LDPC) dec… (see more)oders. The delayed stochastic decoding uses an alternative method to track probability values, which results in reduction of hardware complexity and memory requirement of the stochastic decoders. It is therefore suitable for fully-parallel implementation of long LDPC codes with applications in optical communications. Two decoders are implemented using the DS algorithm for medium (2048, 1723) and long (32768, 26624) LDPC codes. The decoders occupy 3.93- mm2 and 56.5- mm2 silicon area using 90-nm CMOS technology and provide maximum core throughputs of 172.4 and 477.7 Gb/s at [(Eb)/(No)]=5.5 and 4.8 dB, respectively.
Stochastic Multiple Stream Decoding of Cortex Codes
Matthieu Arzel
Cyril Lahuec
Christophe Jego
Yvain Bruned
Being one of the most efficient solutions to implement forward error correction (FEC) decoders based on belief propagation, stochastic proce… (see more)ssing is thus a method worthy of consideration when addressing the decoding of emerging codes such as Cortex codes. This code family offers short block codes with large Hamming distances. Unfortunately, their construction introduces many hidden variables making them difficult to be efficiently decoded with digital circuits implementing the Sum-Product algorithm. With the introduction of multiple stochastic streams, the proposed solution alleviates the hidden variables problem thus yielding decoding performances close to optimal. Morevover, this new stochastic architecture is more efficient in terms of complexity-throughput ratio compared to recently published stochastic decoders using either edge or tracking forecast memories.
Stochastic Multiple Stream Decoding of Cortex Codes
Matthieu Arzel
Cyril Lahuec
Christophe Jego
Yvain Bruned
Being one of the most efficient solutions to implement forward error correction (FEC) decoders based on belief propagation, stochastic proce… (see more)ssing is thus a method worthy of consideration when addressing the decoding of emerging codes such as Cortex codes. This code family offers short block codes with large Hamming distances. Unfortunately, their construction introduces many hidden variables making them difficult to be efficiently decoded with digital circuits implementing the Sum-Product algorithm. With the introduction of multiple stochastic streams, the proposed solution alleviates the hidden variables problem thus yielding decoding performances close to optimal. Morevover, this new stochastic architecture is more efficient in terms of complexity-throughput ratio compared to recently published stochastic decoders using either edge or tracking forecast memories.
Stochastic Decoding of Turbo Codes
Quang Trung Dong
Matthieu Arzel
Christophe Jego
Stochastic computation is a technique in which operations on probabilities are performed on random bit streams. Stochastic decoding of forwa… (see more)rd error-correction (FEC) codes is inspired by this technique. This paper extends the application of the stochastic decoding approach to the families of convolutional codes and turbo codes. It demonstrates that stochastic computation is a promising solution to improve the data throughput of turbo decoders with very simple implementations. Stochastic fully-parallel turbo decoders are shown to achieve the error correction performance of conventional a posteriori probability (APP) decoders. To our knowledge, this is the first stochastic turbo decoder which decodes a state-of-the-art turbo code. Additionally, an innovative systematic technique is proposed to cope with stochastic additions, responsible for the throughput bottleneck.