A Quick-And-Dirty Development Application Benchmark

The Problem

Back in 1997, I felt that none of the Quake frame-rate and Windows/Office benchmarks measured what I wanted to measure. I was fed up with the unavailability of SPEC results (for years there had been no SPECint95 numbers for the AMD processors, for example) and with the problems of porting other benchmarks, such as the BYTE magazine Unix benchmarks. So I decided to roll my own.

For most applications, like e-mail, editing, and web browsing, computational speed is almost irrelevant. Very few applications benefit from faster hardware. In my current work, these are

  1. LaTeX,
  2. compiling C++, and
  3. once in a while some numerical computations.

So I decided to test exactly those applications. Since the numerical computations usually run unattended, it doesn't matter much whether they take 2 or 3 hours. LaTeX and C++, however, directly cost development time, so I focus on the first two applications.

The source of the benchmark should be small and as easily portable as possible. Wall clock time should always be measured, rather than CPU or system time.
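The difference between wall clock time and CPU time can be illustrated with a minimal C sketch. It is not taken from the benchmark source: the names `timing`, `time_workload`, and `spin_workload` are hypothetical, and it assumes a POSIX system that provides `clock_gettime` with `CLOCK_MONOTONIC`. It also only times a function inside the same process; timing an external program such as latex or a compiler would need a different mechanism.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* ask for clock_gettime (POSIX assumption) */
#endif
#include <time.h>

/* Elapsed (wall clock) and CPU seconds for one workload run. */
typedef struct {
    double wall_seconds;
    double cpu_seconds;
} timing;

/* Current reading of a monotonic wall clock, in seconds. */
static double wall_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Run `workload` once and report both kinds of time. */
timing time_workload(void (*workload)(void))
{
    timing t;
    double wall_start = wall_now();
    clock_t cpu_start = clock();       /* CPU time used by this process */

    workload();

    t.cpu_seconds = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds = wall_now() - wall_start;
    return t;
}

/* Hypothetical stand-in workload: a short busy loop. */
void spin_workload(void)
{
    volatile double sink = 0.0;   /* volatile keeps the loop from being removed */
    long i;
    for (i = 0; i < 1000000L; i++)
        sink += (double)i;
}
```

Calling `time_workload(spin_workload)` from a `main` function returns both figures. On a loaded machine, or when the workload waits for disk or NFS, the wall clock figure grows while the CPU figure does not, which is why the two are reported separately.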

The Test

The LaTeX test consists of running a slightly edited version of Descartes' "A Discourse on Method" (dcart10.txt of the Gutenberg project, 10 times) through latex. This produces a dvi file with 450 pages. Current computers are so fast that the speed of file I/O can play a role when writing the dvi file. As I wanted to be independent of NFS and hard disk performance, the dvi file is written to /dev/null. The process needs about 1.6MB of memory and thus benefits greatly from fast caches. The operations are integer operations, so this test measures localized integer performance.

The C++ compile test compiles a small C++ program that makes extensive use of the Standard Template Library. The process needs about 16MB during syntax analysis, so on most machines caches don't matter much. This test measures non-localized integer performance.

The problem with numerical applications is that they usually come as thousands of lines of source code. In a first version of this benchmark I used the linear optimization solver PCx on a specific problem, as I happen to work a lot on LP problems. This wasn't as portable as I would have liked, so I switched to a synthetic benchmark. The test consists of a simple (dense) matrix-vector multiplication, written in ANSI C. The dimensions are chosen such that the process takes about 19MB of memory. Explicit loop unrolling (x16) is done to ensure that all the pipelines in multi-pipelined CPUs are used. The test measures non-localized floating point performance.
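As an illustration of what a dense matrix-vector multiplication with explicit 16-fold unrolling can look like, here is a minimal sketch. It is not the code from the benchmark source: the function name `matvec`, the row-major layout, and the use of four partial sums are assumptions made for this example. The `NOLU` macro mirrors the -DNOLU switch mentioned in the loop unrolling table further down.

```c
#include <stddef.h>

/* y = A*x for a dense row-major matrix A with `rows` rows and `cols` columns.
 * Compile with -DNOLU to get the plain loop instead of the unrolled one. */
void matvec(const double *a, const double *x, double *y,
            size_t rows, size_t cols)
{
    size_t i, j;
    for (i = 0; i < rows; i++) {
        const double *row = a + i * cols;
        double sum = 0.0;
#ifdef NOLU
        for (j = 0; j < cols; j++)
            sum += row[j] * x[j];
#else
        /* Four independent partial sums, 16 products per iteration, so that
         * several floating point pipelines can work at the same time. */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (j = 0; j + 16 <= cols; j += 16) {
            s0 += row[j]      * x[j]      + row[j + 1]  * x[j + 1]
                + row[j + 2]  * x[j + 2]  + row[j + 3]  * x[j + 3];
            s1 += row[j + 4]  * x[j + 4]  + row[j + 5]  * x[j + 5]
                + row[j + 6]  * x[j + 6]  + row[j + 7]  * x[j + 7];
            s2 += row[j + 8]  * x[j + 8]  + row[j + 9]  * x[j + 9]
                + row[j + 10] * x[j + 10] + row[j + 11] * x[j + 11];
            s3 += row[j + 12] * x[j + 12] + row[j + 13] * x[j + 13]
                + row[j + 14] * x[j + 14] + row[j + 15] * x[j + 15];
        }
        sum = (s0 + s1) + (s2 + s3);
        /* Remaining columns when cols is not a multiple of 16. */
        for (; j < cols; j++)
            sum += row[j] * x[j];
#endif
        y[i] = sum;
    }
}
```

The unrolled variant adds the products in a different order than the plain loop, so the two results can differ in the last bits for general floating point input.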

Here you can grab the source (57KB).


Some Results

The following table contains some results from the computers I have (or had) access to. The rows are sorted by C++ compile performance. The numbers are elapsed (wall clock) times in seconds; numbers in parentheses are CPU times in seconds. Everything below 2 seconds can be considered relatively fast and everything above 5 seconds slow (in Q4 2001).

name | CPU / system | Cache | RAM | LaTeX (s) | C++ (s) | Numerics (s)
antares-2003 | AMD Athlon XP 1700+ (1466MHz) | 128K L1, 256K L2 | 256MB PC266 DDR SDRAM (CL2) | 1.101 (1.1) | 1.194 (1.19) | 1.268 (1.27)
antares | AMD Athlon 64 Newcastle 3000+ (2000MHz) | 128K L1, 512K L2 | 1GB PC400 DDR SDRAM (CL2.5) | 0.726 (0.714) | 1.215 (1.161) | 0.425 (0.414)
jack | AMD Athlon Thunderbird 1.33GHz | 128K L1, 256KB L2 | 256MB PC266 DDR-SDRAM | 1.239 (1.24) | 1.441 (1.39) | 1.633 (1.63)
fubini2 | Pentium 4 (Xeon) 1.8GHz / Dell PowerEdge 2650 | 512KB L2 | 1GB | 1.265 (1.27) | 2.204 (2.1) | 0.665 (0.61)
antares-2001 | AMD Athlon Thunderbird 750MHz | 128K L1, 256K L2 | 128MB PC100 SDRAM | 2.158 (2.12) | 2.267 (2.21) | 2.458 (2.380)
antares-2000 | AMD Athlon 550MHz | 128K L1, 512K L2 | 64MB PC100 SDRAM | 3.02 (2.98) | 2.45 (2.34) | 3.60 (3.52)
rao | Pentium III (Coppermine) 600MHz, 2 CPU | 256KB L2 | 512MB PC100 SDRAM | 2.865 (2.820) | 2.852 (2.730) | 2.605 (2.480)
capricorn | AMD K6-III 450MHz | 1MB L3 | 128MB PC100 SDRAM | 3.73 (3.69) | 2.9 (2.88) | 5.72 (5.66)
jensen | Alpha ? ? / Compaq XP1000 | ? | 1GB | 4.5 (4.3) | 3.0 (2.0) | 1.5 (1.4)
marvel | Alpha EV7 1000MHz, 8 CPU / Compaq GS1280 | ? | 8GB | 3.0 (2.8) | 3.1 (2.5) | 0.6 (0.6)
inspiron | Intel mobile Pentium-III-M 1GHz / Dell Inspiron 4100 | 512KB L2 | 512MB PC133 SDRAM | 1.56 (1.54) | 3.216 (3.16) | 2.382 (2.31)
markov | Pentium III (Katmai) 500MHz, 2 CPU | 512KB | 256MB | 3.51 (3.47) | 3.24 (3.05) | 4.51 (4.44)
moser | Alpha 21264A 731MHz, 8 CPU / Compaq GS80 | 4MB L2 | 8GB | 4.0 (3.9) | 4.3 (3.6) | 1.5 (1.4)
a211n1 | Pentium II 400MHz (root on NFS) | 512KB L2 | 64MB | 4.51 (4.32) | 4.70 (3.82) | 4.78 (4.78)
rao-1999 | AMD K6-2 300MHz (100MHz front side bus) | 512KB L2 | 64MB | 6.14 (6.12) | 5.24 (5.23) | 10.07 (10.02)
grad | Alpha 21264 500MHz / Compaq | ? | 512MB | 7.8 (7.6) | 5.8 (5.1) | 3.8 (3.7)
cramer | Pentium II 350MHz | 512KB L2 | 64MB | 4.96 (4.81) | 6.55 (4.4) | 5.32 (5.27)
bernhard | Alpha 21164 533MHz / dcp | L2:96K, L3:2MB | 128MB | 5.91 (5.83) | 7.58 (6.59) | 5.62 (5.62)
capricorn-1999 | AMD K6 200MHz | 512KB L2 | 64MB EDO | 9.32 (9.26) | 8.07 (8.02) | 16.02 (15.94)
black | AMD K6 300MHz (66MHz front side bus) | 512KB L2 | 64MB | 6.90 (6.85) | 8.36 (6.82) | 13.21 (13.19)
rao-1997 | Pentium Pro 200MHz | 256KB L2 | 96MB | 7.21 (?) | 13.87 (?) | 12.86 (?)
doob | Power2SC 135MHz / model 595 | L1:128KB/32KB, no L2 | 256MB | 10.56 (?) | 18.67 (?) | 3.18 (?)
capricorn-1997 | AMD K5-PR100 | 256KB L2 | 32MB EDO | 13.33 (13.3) | 20.78 (15.64) | 36.75 (36.76)
satellite | Pentium 100MHz / Toshiba Satellite 110cs | 0 | 40MB | 24.66 (24.6) | 24.26 (24.04) | 28.10 (28.1)
st | Power2 67MHz / IBM model 3CT | L1:128KB/32KB, 1-2MB? L2 | 256MB | 20.46 (20.29) | 32.60 (32.01) | 6.09 (5.98)

Integer versus Floating Point Performance

It is amazing how much the ratio between integer and floating point performance differs between the AMD processors, the Intel CPUs, and the IBM RS/6000 world. AMD's CPUs used to beat comparable Intel CPUs in C++ compile performance while lagging in floating point performance. (This changed with the Athlon, whose integer/floating point performance ratio is similar to that of Intel's Coppermine Pentium III. See also the chess benchmarks.) At the other extreme are IBM's workstations. The Model 3CT (st) is roughly comparable to an Intel Pentium-100 in terms of integer performance. But it is twice as fast as a Pentium Pro and can still compete with a 400MHz K6 in terms of (pure) floating point performance.

However, since even numerical applications don't consist of 100% floating point operations, fast floating point performance doesn't help much when the integer performance lags behind. Indeed, when solving a certain LP problem, rao-1997 and st needed exactly the same time.
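One way to make the comparison concrete is to divide a machine's C++ compile time by its numerics time. The following C sketch does this for four rows of the results table (elapsed times). The struct and function names are hypothetical and chosen only for this illustration; a ratio well above 1 indicates a machine that is relatively strong in floating point, a ratio well below 1 one that is relatively strong in integer work.

```c
#include <stddef.h>
#include <string.h>

/* One row of the results table: elapsed seconds for two of the tests. */
typedef struct {
    const char *name;
    double cxx_seconds;
    double numerics_seconds;
} machine;

/* Values copied from the results table (elapsed time, not CPU time). */
static const machine machines[] = {
    { "capricorn", 2.90,  5.72  },   /* AMD K6-III 450MHz */
    { "markov",    3.24,  4.51  },   /* Pentium III (Katmai) 500MHz */
    { "rao-1997",  13.87, 12.86 },   /* Pentium Pro 200MHz */
    { "st",        32.60, 6.09  },   /* Power2 67MHz */
};

/* C++ compile time divided by numerics time. */
double cxx_to_numerics_ratio(const machine *m)
{
    return m->cxx_seconds / m->numerics_seconds;
}

/* Look up a machine by name; returns NULL if it is not in the list. */
const machine *find_machine(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof machines / sizeof machines[0]; i++)
        if (strcmp(machines[i].name, name) == 0)
            return &machines[i];
    return NULL;
}
```

With these numbers, st comes out above 5 and capricorn at about 0.5, while the two Intel machines lie in between.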

The Effect of Loop Unrolling

A remark on the loop unrolling in the numerical test is necessary. I tested the floating point performance on an RS/6000 model 3CT (Power2) with several compiler options and explicit loop unrolling switched on and off:

Wall clock time in seconds

C source | gcc -O2 | gcc -O2 -mcpu=rios2 -funroll-loops | xlc -O2 | xlc -O3 -qstrict -qtune=pwr2
no explicit loop unrolling (-DNOLU) | 17.40 | 8.68 | 8.23 | 6.24
explicit loop unrolling in the source | 6.04 | 6.05 | 8.22 | 7.07

The Power2 CPU (in IBM's RS/6000) has several FPU pipelines. Using gcc -O2 on the source without explicit loop unrolling shows that performance drops substantially if the pipelines are not filled properly. xlc seems to unroll loops even with -O2.


last reviewed: February 22, 2002, Stefan Jaschke