BLAS Benchmarks

This page contains benchmark results of BLAS implementations on certain hardware.

The Basic Linear Algebra Subprograms (BLAS) are an API for (and a Fortran implementation of) basic linear algebra operations. Level 1 BLAS perform vector-vector operations, Level 2 BLAS perform matrix-vector operations, and Level 3 BLAS perform matrix-matrix operations.
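To make the three levels concrete, here is a minimal sketch of one representative operation per level (AXPY, GEMV, GEMM) in plain Python. The function signatures are simplified assumptions for illustration: the real BLAS routines take additional arguments (dimensions, strides, transpose flags, leading dimensions) and update their output argument in place, whereas these versions return new lists.

```python
# Simplified reference versions of one routine per BLAS level.
# Assumption: plain Python lists, row-major matrices, no strides or
# transpose flags (the real BLAS routines take more parameters).

def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha*A*x + beta*y."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]

def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha*A*B + beta*C."""
    cols_b = list(zip(*b))  # columns of B, so each inner product is a zip
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(cols_b, crow)]
            for row, crow in zip(a, c)]
```

For n-by-n operands, the amount of arithmetic grows from roughly 2n operations (Level 1) to 2n^2 (Level 2) to 2n^3 (Level 3), which is why the levels behave so differently in benchmarks.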

Apart from the generic Fortran implementation of the BLAS interface, there are hand-tuned (mostly assembler) implementations such as Intel's MKL, as well as semi-automatically tuned, generic implementations such as ATLAS.

General Matrix-Matrix Multiplication

The Level 3 BLAS routine GEMM implements general matrix-matrix multiplication. On most hardware, and when blocked algorithms are used, GEMM is CPU-bound, i.e., its speed is limited by the floating-point throughput (FLOPS) of the CPU. DGEMM is the double-precision (80/64-bit) variant and SGEMM the single-precision (32-bit) variant.
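The reason GEMM can be CPU-bound is its high ratio of arithmetic to data. The following sketch works this out, assuming the conventional count of one multiply and one add per inner-product term and the idealization that each matrix moves through memory only once (which blocked algorithms try to approach). The helper names gemm_flops, gemm_words and mflops are hypothetical and are not taken from the benchmark code behind this page.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for C (m x n) = A (m x k) * B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k

def gemm_words(m, n, k):
    """Matrix elements touched if A, B and C each pass through memory once
    (an idealization, not a measured quantity)."""
    return m * k + k * n + m * n

def mflops(flops, seconds):
    """Convert an operation count and a run time into MFLOPS."""
    return flops / seconds / 1e6

n = 1000
intensity = gemm_flops(n, n, n) / gemm_words(n, n, n)
print(f"n={n}: {intensity:.1f} flops per matrix element")
```

For square matrices the ratio is 2n/3, so it grows with the matrix size; with n = 1000 there are several hundred floating-point operations for every matrix element, leaving the memory system plenty of time to keep the CPU fed.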

DGEMM results
DGEMM curves

SGEMM results
SGEMM curves

The legend contains the CPU type, the memory type, the ATLAS version number, and the SIMD instruction set used.

General Matrix-Vector Multiplication

The Level 2 BLAS routine GEMV implements general matrix-vector multiplication. On most current GHz-class CPUs, GEMV is memory-bound for matrices that do not fit into the L2 cache. The Athlon XP 1700+ with PC266 DDR-SDRAM, for example, can crunch numbers much faster (theoretical peak = 2932 MFLOPS in double and 5864 MFLOPS in single precision) than it can read them from memory (theoretical peak = 262.5M 64-bit numbers per second and 525M 32-bit numbers per second).
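The Athlon figures above translate directly into an upper bound on GEMV speed. The sketch below assumes that every matrix element is streamed from main memory exactly once, that each element takes part in one multiply and one add, and that the traffic for the vectors is negligible. It is a back-of-the-envelope estimate, not a measured result, and the function name is hypothetical.

```python
# Theoretical peaks quoted in the text for the Athlon XP 1700+ with PC266 DDR.
PEAK_MFLOPS = {"double": 2932.0, "single": 5864.0}
PEAK_MWORDS_PER_S = {"double": 262.5, "single": 525.0}

# Assumption: one multiply and one add per matrix element in y = A*x.
FLOPS_PER_MATRIX_ELEMENT = 2

def gemv_memory_bound_mflops(precision):
    """Upper bound on GEMV MFLOPS when the matrix must come from memory."""
    return FLOPS_PER_MATRIX_ELEMENT * PEAK_MWORDS_PER_S[precision]

for precision in ("double", "single"):
    bound = gemv_memory_bound_mflops(precision)
    share = bound / PEAK_MFLOPS[precision]
    print(f"{precision}: at most {bound:.0f} MFLOPS from memory, "
          f"{share:.0%} of the CPU peak")
```

Under these assumptions, memory bandwidth caps GEMV at 525 MFLOPS in double and 1050 MFLOPS in single precision, i.e. less than a fifth of the theoretical CPU peak in both cases.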

DGEMV results
DGEMV curves

SGEMV results
SGEMV curves


last reviewed: February 28, 2002, Stefan Jaschke