Compiling Strassen-like Matrix Multiplication Algorithms to Fast CUDA Kernels

Matrix multiplication is a key operation in scientific computing and machine learning, with GPU libraries like NVIDIA Cutlass and cuBLAS providing optimized implementations of the three nested loop cubic algorithm. While sub-cubic algorithms, like the Strassen algorithm and its variants, are theoretically faster, their recursive structure makes it challenging to implement efficient GPU kernels. This is why existing approaches either do excessive memory accesses or do not effectively overlap memory accesses and computations, leading to sub-optimal performance compared to theoretical expectations.

This paper presents SubCuber, a domain-specific compiler that generates efficient CUDA kernels for Strassen-like matrix multiplication algorithms. SubCuber contains two novel CUDA kernels that are designed to minimize memory input loads and effectively overlap computation with memory loads. To generate efficient code, SubCuber constructs the dependency graph of a Strassen-like algorithm, selects efficient kernel schedules, and applies fusion strategies tailored to a recursion level, matrix sizes, and GPU. Our evaluation on NVIDIA A100 and H200 GPUs shows that for both single- and half-precision floating point matrix multiplications, SubCuber’s generated code outperforms state-of-the-art CUDA implementations for matrix multiplication and the Strassen algorithm. SubCuber is up to 12% faster for one recursion level and 22% for two recursion levels over Cutlass and cuBLAS, while existing approaches are only up to 8% faster for one-level and 16% faster for two-levels. Furthermore, SubCuber makes matrix multiplication in language models like Phi-4 14B, Qwen-3 32B, and LLaMA-3 405b, up to 16% faster for inference scenarios.