# BLAS Benchmarks

This page contains benchmark results for several BLAS implementations on a
range of hardware.

The Basic Linear Algebra Subprograms (BLAS) are an API (and a reference
Fortran implementation) for linear algebraic operations. Level 1 BLAS
routines perform vector-vector operations, Level 2 routines perform
matrix-vector operations, and Level 3 routines perform matrix-matrix
operations.
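The three levels can be illustrated with one representative routine each.
A minimal sketch using SciPy's low-level BLAS wrappers (assuming NumPy and
SciPy are installed; the particular routines DAXPY, DGEMV, and DGEMM are
standard BLAS, but the wrapper calling convention shown is SciPy's):

```python
import numpy as np
from scipy.linalg import blas

x = np.arange(4.0)            # [0, 1, 2, 3]
y = np.ones(4)
A = np.eye(4) * 2.0
B = np.full((4, 4), 0.5)

# Level 1 (vector-vector): DAXPY computes y := a*x + y
y1 = blas.daxpy(x, y, a=3.0)

# Level 2 (matrix-vector): DGEMV computes alpha*A@x
v = blas.dgemv(1.0, A, x)

# Level 3 (matrix-matrix): DGEMM computes alpha*A@B
C = blas.dgemm(1.0, A, B)

print(y1)  # [ 1.  4.  7. 10.]
print(v)   # [0. 2. 4. 6.]
```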

Apart from the generic Fortran implementation of the BLAS interface,
there are hand-tuned (mostly assembler) implementations such as Intel's
MKL, as well as semi-automatically tuned, generic implementations such as
ATLAS.

## General Matrix-Matrix Multiplication

The BLAS 3 routine GEMM implements general matrix-matrix
multiplication. On most hardware, a blocked GEMM is CPU-bound, i.e.,
limited by the CPU's floating-point throughput rather than by memory
bandwidth. DGEMM is the double-precision (80/64 bit) variant and SGEMM the
single-precision (32 bit) variant.
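The full GEMM operation is C := alpha\*A\*B + beta\*C. A short sketch using
SciPy's DGEMM wrapper (the matrix sizes and coefficients below are arbitrary
illustration values), together with the flop count that explains why GEMM is
CPU-bound:

```python
import numpy as np
from scipy.linalg import blas

m, k, n = 200, 300, 100
rng = np.random.default_rng(0)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))

# DGEMM computes C := alpha*A@B + beta*C in a single call
result = blas.dgemm(alpha=2.0, a=A, b=B, beta=0.5, c=C)

# GEMM performs ~2*m*n*k flops on only m*k + k*n + m*n numbers, so the
# flop-to-memory-traffic ratio of a blocked implementation grows with the
# matrix sizes -- the routine becomes limited by CPU throughput, not memory.
flops = 2 * m * n * k
print(flops)  # 12000000
```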

**DGEMM results**

**SGEMM results**

The legend contains the CPU type, the memory type, the ATLAS version
number, and the SIMD instruction set used.

## General Matrix-Vector Multiplication

The BLAS 2 routine GEMV implements general matrix-vector
multiplication. On most current GHz-class CPUs it is memory-bound for
matrices that do not fit into the L2 cache. The Athlon XP 1700+ with PC266
DDR-SDRAM, for example, can crunch numbers much faster (theoretical peak:
2932 MFLOPS double and 5864 MFLOPS single precision) than it can read them
from memory (theoretical peak: 262.5M 64-bit numbers per second and 525M
32-bit numbers per second).
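A back-of-the-envelope check of these peaks (assumed figures: a clock of
about 1466 MHz for the Athlon XP 1700+, 2 flops per cycle, and a PC266
DDR-SDRAM peak bandwidth of 2.1 GB/s; these assumptions are mine, not stated
on this page):

```python
# Theoretical double-precision compute peak: clock * flops-per-cycle
clock_hz = 1466e6        # Athlon XP 1700+ clock (assumption)
flops_per_cycle = 2      # one add + one multiply per cycle (assumption)
peak_mflops_double = clock_hz * flops_per_cycle / 1e6

# Theoretical memory peak: PC266 DDR-SDRAM bandwidth of 2.1 GB/s
bandwidth_bytes = 2.1e9
doubles_per_sec = bandwidth_bytes / 8   # 64-bit numbers
floats_per_sec = bandwidth_bytes / 4    # 32-bit numbers

print(peak_mflops_double)      # 2932.0
print(doubles_per_sec / 1e6)   # 262.5
print(floats_per_sec / 1e6)    # 525.0
```

GEMV performs only ~2 flops per matrix element read, so once the matrix
exceeds the L2 cache the memory figure, not the compute figure, is the
binding limit.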

**DGEMV results**

**SGEMV results**

last reviewed: February 28, 2002, Stefan Jaschke