Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance
Field programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks — as long as floating-point arithmetic is not required. Fueled by the advance of Moore’s Law, FPGAs are rapidly reaching sufficient densities to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three of the basic linear algebra subroutine (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. A comparison of microprocessors, FPGAs, and Reconfigurable Computing platforms is performed for each operation. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. This analysis considers the historical context of the last six years and is extrapolated for the next six years.
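To make the bandwidth question concrete, the sketch below estimates arithmetic intensity (flops per byte of external-memory traffic) for the three kernels, and the bandwidth needed to sustain a given peak rate. The peak rate and block size here are illustrative assumptions, not figures from the paper.

```c
/* Sketch: arithmetic intensity of double-precision DOT, GEMV, and
 * blocked GEMM, and the external bandwidth required to sustain an
 * assumed peak rate. All parameter values are illustrative only. */
#include <stdio.h>

int main(void) {
    const double BYTES = 8.0;    /* bytes per double            */
    const double peak  = 10.0;   /* assumed peak rate, GFLOP/s  */
    const double b     = 64.0;   /* assumed on-chip block size  */

    /* DOT:  2n flops, 2n operands streamed from memory.           */
    double ai_dot  = 2.0 / (2.0 * BYTES);      /* 0.125 flop/byte  */
    /* GEMV: 2n^2 flops, ~n^2 elements of A streamed from memory.  */
    double ai_gemv = 2.0 / BYTES;              /* 0.25 flop/byte   */
    /* GEMM: 2n^3 flops; b x b blocking in on-chip storage cuts
     * traffic to ~2n^3/b elements, so intensity grows with b.     */
    double ai_gemm = b / BYTES;                /* 8 flop/byte here */

    /* Bandwidth needed to sustain the assumed peak: BW = peak/AI. */
    printf("DOT : %.3f flop/B -> %6.2f GB/s\n", ai_dot,  peak / ai_dot);
    printf("GEMV: %.3f flop/B -> %6.2f GB/s\n", ai_gemv, peak / ai_gemv);
    printf("GEMM: %.3f flop/B -> %6.2f GB/s\n", ai_gemm, peak / ai_gemm);
    return 0;
}
```

The pattern this exposes is consistent with the abstract's emphasis: DOT and GEMV are bandwidth-bound regardless of available compute, while GEMM's intensity grows with the on-chip block size, so internal storage determines whether peak can be sustained.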
Summary
This paper analyzes how close modern FPGAs can come to CPUs in delivering sustained floating-point performance for core BLAS kernels (dot product, matrix-vector multiply, matrix multiply). It quantifies the memory bandwidth and on-chip storage requirements that determine whether FPGA peak FLOPS can be realized in practice and compares microprocessors, FPGAs, and reconfigurable platforms.
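For orientation, here are the three kernels in their textbook (unblocked, row-major) C form. These are the standard BLAS definitions, not the paper's FPGA implementations:

```c
/* Textbook forms of the three BLAS kernels discussed (row-major,
 * square matrices, no blocking or BLAS calling conventions). */
double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];                      /* 2n flops, 2n loads */
    return s;
}

void gemv(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];          /* each A element used once */
        y[i] = s;
    }
}

void gemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];  /* O(n) reuse per element */
            C[i * n + j] = s;
        }
}
```

The comments flag the data-reuse difference that drives the analysis: DOT and GEMV touch each operand a constant number of times, while GEMM offers O(n) reuse per element.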
Key Takeaways
- Estimate the memory bandwidth and on-chip storage needed to sustain peak floating-point throughput on FPGA implementations of DOT, GEMV, and GEMM.
- Quantify and compare sustained GFLOPS (not just peak) achievable on CPUs, FPGAs, and reconfigurable platforms for basic linear-algebra kernels (see the roofline-style sketch after this list).
- Assess architectural trade-offs among compute pipeline depth, internal buffering, and external memory accesses to maximize sustained performance.
- Apply the paper's methodology to determine when mapping floating-point BLAS to FPGAs is beneficial versus using general-purpose CPUs.
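One way to read the second takeaway is through a simple roofline-style bound: sustained throughput is min(peak, bandwidth × intensity). A minimal sketch, with all parameter values assumed for illustration and intensities taken from the estimate after the abstract:

```c
/* Sketch: roofline-style bound on sustained GFLOP/s.
 * All parameter values are illustrative assumptions. */
#include <stdio.h>

static double sustained(double peak_gflops, double bw_gbytes, double ai) {
    double mem_bound = bw_gbytes * ai;    /* bandwidth-limited rate */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak = 10.0;   /* assumed FPGA peak, GFLOP/s       */
    double bw   = 6.4;    /* assumed external bandwidth, GB/s */

    /* With these assumptions: DOT -> 0.80, GEMV -> 1.60, GEMM -> 10.00 */
    printf("DOT  (0.125 flop/B): %.2f GFLOP/s\n", sustained(peak, bw, 0.125));
    printf("GEMV (0.25  flop/B): %.2f GFLOP/s\n", sustained(peak, bw, 0.25));
    printf("GEMM (8     flop/B): %.2f GFLOP/s\n", sustained(peak, bw, 8.0));
    return 0;
}
```

Under these assumed numbers, DOT and GEMV saturate the memory system long before the compute pipelines do, while GEMM reaches the assumed peak.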
Who Should Read This
Advanced DSP/HPC engineers, FPGA architects, algorithm developers, and system designers who need to map floating-point linear-algebra workloads onto FPGAs or heterogeneous systems and want to understand the attendant memory and performance constraints.
Related Documents
- A New Approach to Linear Filtering and Prediction Problems
- A Quadrature Signals Tutorial: Complex, But Not Complicated
- An Introduction To Compressive Sampling
- Lecture Notes on Elliptic Filter Design
- Computing FFT Twiddle Factors