Christophe Dubach

A Functional Approach to Synthesizing Routable Programmable Accelerators for Neural Networks

Paul Teng

T V Paul

Producing optimized accelerators is tedious, as even modern HDLs (Hardware Description Languages) such as Chisel, require reasoning about lo… (see more)w-level concepts. Recent functional approaches, such as Aetherling and SHIR, treat hardware as composition of pure operators. This raises the abstraction level, allowing for systematic optimizations through rewriterules for FPGAs (Field Programmable Gate Arrays).

2026-06-10

Proceedings of the 27th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (published)

doi.org

Sirop: A Small IR for HLS with Parallel Patterns

L.T. Hildebrand

Christophe Dubach

Designers of custom streaming accelerators traditionally use HDLs (Hardware Description Languages), but this is time-consuming and requires … (see more)advanced hardware expertise. C-based HLS (High-Level Synthesis) offers a higher level of abstraction and faster design time, but still requires some hardware expertise and performance is often left on the table. A promising direction is to use HLS with high-level functional parallel patterns such as map and reduce. Prior works have shown that high performance is achievable this way. However, designing such compiler systems is challenging because the optimizer must handle a large number of language primitives and interactions between them.

2026-06-10

Proceedings of the 27th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (published)

doi.org

SkeleShare: Algorithmic Skeletons and Equality Saturation for Hardware Resource Sharing

Jonathan Van der Cruysse

Tzung-Han Juang

Shakiba Bolbolian Khah

Christophe Dubach

Compiling functional programs into efficient Field Programmable Gate Array (FPGA) designs is difficult. Hardware resources must be explicitl… (see more)y allocated and shared to maximize resource efficiency. This requires careful orchestration of several transformations to expose and exploit sharing opportunities.This paper introduces SkeleShare, a novel approach that automates the problem of resource allocation and sharing. It leverages equality saturation and algorithmic skeletons to expose sharing opportunities across abstraction levels. A solver-based extractor then selects a design that consolidates computations, meeting resource constraints while maintaining performance.This approach is evaluated on neural networks and image processing targeting a real FPGA. The paper shows how SkeleShare is used to express the various algorithmic patterns and transformation rules inherent in neural network operators. The experimental evaluation demonstrates that SkeleShare’s fully automated resource allocation and sharing matches and exceeds the performance of prior work, which involves expert manual extraction of sharing opportunities.

2026-01-30

IEEE/ACM Symposium on Code Generation and Optimization (published)

doi.org

Synthesizing Specialized Sparse Tensor Accelerators for FPGAs via High-Level Functional Abstractions

Hamza Javed

Christophe Dubach

Sparsity is inherent in many applications such as machine learning and graph analytics. However, achieving high efficiency in sparse computa… (see more)tions requires specialized hardware accelerators like FPGAs, as traditional accelerators typically cater to dense data. While high level synthesis enables the automatic generation of FPGA-based accelerators, generic solutions produced via C-based synthesis flows often demand extensive development time, leading designers to prioritize broad applicability over fine-grained structural specialization. Consequently, these accelerators fail to fully exploit FPGA’s reconfigurablility, leaving substantial performance and efficiency gains untapped.This paper pushes the boundary by automatically generating specialized accelerators that match a given fixed sparse structure (e.g., in static graph analytics and pruned neural networks). It achieves this by leveraging functional abstractions within high level synthesis, an approach that has already proven effective in automating the generation of specialized dense tensor accelerator. Tensor shapes are encoded directly in the type system and specialized primitives for irregular data are introduced. Together, these innovations enable a concise specification of sparse accelerators and drive advanced optimizations—including dynamic partitioning and vector sharding—to produce hardware precisely tailored to the sparsity pattern of the underlying tensors.Compared to state-of-the-art generic accelerators (HiSparse, HiSpMV and GraphLily), the approach achieves up to a 2.8× improvement in bandwidth efficiency for sparse matrix computations and a 1.8× speedup on graph algorithms. Against the hls4ml neural network acceleration framework, it achieves up to a 1.8× improvement in throughput with a 4× reduction in resource usage, enabling scaling to larger networks. These results establish this approach as a flexible, powerful, and rapid solution for designing high-performance specialized sparse accelerators.

2026-01-30

IEEE/ACM Symposium on Code Generation and Optimization (published)

doi.org

Parallel and Customizable Equality Saturation

Jonathan Van der Cruysse

Abd-El-Aziz Zayed

Mai Jacob Peng

Christophe Dubach

Equality saturation enables compilers to explore many semantically equivalent program variants, deferring optimization decisions to a final … (see more)extraction phase. However, existing frameworks exhibit sequential execution and hard-coded saturation loops. This limits scalability and requires significant engineering effort to customize saturation behavior. This paper addresses these limitations using three novel techniques. First, it shows how saturation can be parallelized thanks to the use of thread-safe data structures and the notion of deferred e-graph updates. Second, it provides an extensible mechanism to express custom and composable saturation strategies. Third, it generalizes e-graph metadata to support custom e-graph annotations. The implementation, written in Scala, is evaluated on four use-cases: classical program optimization, idiom recognition, scalability strategies and incremental equality saturation. The results show that it outperforms several existing equality saturation engines, including the highly optimized egglog library. When used to reimplement an existing idiom recognition technique, the new design finds higher-quality idioms, 16× faster. Additionally, the design is able to natively express state-of-the-art custom equality saturation behavior such as incremental equality saturation and multi-phase rewriting strategies without any modification to the core library.

2026-01-27

ACM SIGPLAN International Conference on Compiler Construction (published)

doi.org

Automatic Recursion Elimination using Recurrence Relations for Synthesis of Stack-free Hardware

Adam Musa

Christophe Dubach

High-level Synthesis (HLS) eases hardware design by offering a higher level of abstraction. However, high-level programming concepts, such a… (see more)s recursion, are costly to synthesize, if at all possible. Recursion typically relies on a dynamic call stack, whose hardware implementation is resource-intensive and inefficient. Existing approaches solve this issue by replacing recursion with iteration using explicit stack arrays or by detecting specific patterns (e.g., tail recursion) to avoid using the stack. This paper introduces a novel technique for transforming recursive functions into equivalent stack-free iterative implementations. Using static analysis, a recurrence relation is extracted from the function, representing it as a sequence bounded by the order of the relation. This relation is then used to optimize the process of incrementalization, constructing a synthesizable, stackfree version of the function that uses a bounded static array. This approach is evaluated on a set of recursive benchmarks used in prior work. It eliminates recursion from 9 out of 19 benchmarks and achieves a

2026-01-18

Asia and South Pacific Design Automation Conference (published)

doi.org

Sound and Modular Activity Analysis for Automatic Differentiation in MLIR

Mai Jacob Peng

William S. Moses

Oleksandr Zinenko

Christophe Dubach

Computing derivatives is paramount for multiple domains ranging from training neural networks to precise climate simulations. While derivati… (see more)ves can be generated by Automatic Differentiation (AD) tools, they often require aggressive optimization to avoid compromising program performance. One of the central optimizations consists of identifying inactive operations that do not contribute to the partial derivatives of interest. Multiple tools provide activity analyses for a variety of input languages, though often with only informal correctness guarantees. This paper formally defines activity analysis for AD as an abstract interpretation, proves its soundness, and implements it within the MLIR compiler infrastructure. To account for MLIR’s genericity, a subset of MLIR’s internal representation amenable to AD is formalized for the first time. Furthermore, the paper proposes a sound intraprocedural approximation of the whole-program activity analysis via function summaries along with a mechanism to automatically derive these summaries from function definitions. The implementation is evaluated on a differentiation-specific benchmark suite. It achieves a 1.24 geometric mean speedup on CPU and a 1.7 geometric mean speedup on GPU in the runtime of generated programs, when compared to a baseline that does not use activity analysis. The evaluation also demonstrates that the intraprocedural analysis with function summaries proves inactive 100% of instructions proven inactive by the whole-program analysis.

2025-10-08

Proceedings of the ACM on Programming Languages (published)

doi.org

Hardware Synthesizable Exceptions using Continuations

Paul Teng

Christophe Dubach

2025-03-03

Proceedings of the 30th Asia and South Pacific Design Automation Conference (published)

doi.org

DialEgg: Dialect-Agnostic MLIR Optimizer using Equality Saturation with Egglog.

Abd-El-Aziz Zayed

Christophe Dubach

MLIR’s ability to optimize programs at multiple levels of abstraction is key to enabling domain-specific optimizing compilers. However, ex… (see more)pressing optimizations remains tedious. Optimizations can interact in unexpected ways, making it hard to unleash full performance. Equality saturation promises to solve these challenges. First, it simplifies the expression of optimizations using rewrite rules. Secondly, it considers all possible optimization interactions, through saturation, selecting the best program variant. Despite these advantages, equality saturation remains absent from production compilers such as MLIR. This paper proposes to integrate Egglog, a recent equality saturation engine, with MLIR, in a dialect-agnostic manner. This paper shows how the main MLIR constructs such as operations, types or attributes can be modeled in Egglog. It also presents DialEgg, a tool that pre-defines a large set of common MLIR constructs in Egglog and automatically translates between the MLIR and Egglog program representations. Using a few use-cases, this paper demonstrates the potential for combining equality saturation and MLIR.

2025-02-28

IEEE/ACM International Symposium on Code Generation and Optimization (published)

doi.org

Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning.

Tzung-Han Juang

Christophe Dubach

While traditional High-Level Synthesis (HLS) converts “high-level” C-like programs into hardware automatically, producing high-performan… (see more)ce designs still requires hardware expertise. Optimizations such as data partitioning can have a large impact on performance since they directly affect data reuse patterns and the ability to reuse hardware. However, optimizing partitioning is a difficult process since minor changes in the parameter choices can lead to totally unpredictable performance. Functional array-based languages have been proposed instead of C-based approaches, as they offer stronger performance guarantees. This article proposes to follow a similar approach and exposes a divide-and-conquer primitive at the algorithmic level to let users partition any arbitrary computation. The compiler is then free to explore different partition shapes to maximize both data and hardware reuse automatically. The main challenge remains that the impact of partitioning is only known much later in the compilation flow. This is due to the hard-to-predict effects of the many optimizations applied during compilation. To solve this problem, the partitioning is expressed using a set of symbolic tunable parameters, introduced early in the compilation pipeline. A symbolic performance model is then used in the last compilation stage to predict performance based on the possible values of the tunable parameters. Using this approach, a design space exploration is conducted on an Intel Arria 10 Field Programmable Gate Arrays (FPGAs), and competitive performance is achieved on the classical VGG and TinyYolo neural networks.

2024-12-31

ACM Trans. Archit. Code Optim. (published)

doi.org

Efficient Assignment with Time Constraints for Heterogeneous DSP Systems.

Jiajie Li

Christophe Dubach

Warren J. Gross

High-level synthesis (HLS) produces hardware au-tomatically by scheduling and assigning resources based on an input control/data-flow graph.… (see more) One particular aspect of HLS for the digital signal processing (DSP) architecture is the het-erogeneous assignment problem (HAP) which maps operations into different types of functional units available in the electronic design automation tools to build efficient implementations. An optimal solution to this assignment problem can be found by formulating the problem as integer linear programming (ILP) and using a solver. However, given the slow nature of this process, heuristics tend to be used instead leading to sub-optimal designs. This paper revisits the classical ILP formulation of the HAP with time constraints for the DSP architecture by identifying redundant constraints. This paper proves theoretically, and demonstrates experimentally, that removing these constraints does not affect the obtained solution. This technique achieves speedups of more than 100 × in terms of runtime and reductions of more than 50 × in terms of memory usage of the solver. Also, this work proposes an updated heuristic that keeps reducing the latency of a path instead of finding a new critical path after giving a new node assignment. Runtime reductions (more than another 10×) due to reduced numbers of critical path searches are observed while returning similar results.

2024-11-03

IEEE Workshop on Signal Processing Systems (published)

doi.org

Latent Idiom Recognition for a Minimalist Functional Array Language Using Equality Saturation

Jonathan Van der Cruysse

Christophe Dubach

Accelerating programs is typically done by recognizing code idioms matching high-performance libraries or hardware interfaces. However, reco… (see more)gnizing such idioms automatically is challenging. The idiom recognition machinery is difficult to write and requires expert knowledge. In addition, slight variations in the input program might hide the idiom and defeat the recognizer. This paper advocates for the use of a minimalist functional array language supporting a small, but expressive, set of operators. The minimalist design leads to a tiny sets of rewrite rules, which encode the language semantics. Crucially, the same minimalist language is also used to encode idioms. This removes the need for hand-crafted analysis passes, or for having to learn a complex domain-specific language to define the idioms. Coupled with equality saturation, this approach is able to match the core functions from the BLAS and PyTorch libraries on a set of computational kernels. Compared to reference C kernel implementations, the approach produces a geometric mean speedup of 1.46Ã— for C programs using BLAS, when generating such programs from the high-level minimalist language.

2024-03-01

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (published)

doi.org

arxiv.org

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Christophe Dubach

Biography

Current Students

Publications

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Popular keywords:

Christophe Dubach

Biography

Current Students

Publications