Back in 1997, I felt that none of the Quake frame-rate and Windows/Office benchmarks measured what I wanted to measure. I was fed up with the non-availability of the SPEC benchmark - for years, there hadn't been any SPECint95 results for the AMD processors, for example - and with the problems of porting other benchmarks such as the BYTE magazine Unix benchmarks. So, I decided to roll my own.
For most applications, like e-mail, editing, and web browsing, computational speed is almost irrelevant. There are very few applications that benefit from faster hardware. In my current work, this is
So, I decided to test exactly those applications. Since the numerical computations usually run unattended, it doesn't matter much whether they take 2 or 3 hours. LaTeX and C++, however, directly cost development time. So, I would focus on the first two applications.
The source of the benchmark should be small and as easily portable as possible. Wall-clock time should always be measured, rather than CPU or system time.
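The difference between wall-clock time and CPU time can be illustrated with a minimal C sketch. It assumes a POSIX system (for `gettimeofday`); the helper names `wall_seconds`, `cpu_seconds`, `time_run` and the stand-in workload `sample_work` are hypothetical and are not taken from the benchmark source.

```c
#include <stddef.h>    /* NULL */
#include <time.h>      /* clock(), CLOCKS_PER_SEC */
#include <sys/time.h>  /* gettimeofday() (POSIX) */

/* Wall-clock time in seconds, with microsecond resolution. */
double wall_seconds(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in workload; a real test would go here. */
void sample_work(void)
{
    volatile double acc = 0.0;
    long i;

    for (i = 0; i < 1000000L; i++)
        acc += (double)i * 0.5;
}

/* Run work() once and report elapsed (wall-clock) and CPU time. */
void time_run(void (*work)(void), double *elapsed, double *cpu)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();

    work();

    *cpu = cpu_seconds() - c0;
    *elapsed = wall_seconds() - w0;
}
```

Calling `time_run(sample_work, &elapsed, &cpu)` fills in both figures. A large gap between the two suggests that the process spent time waiting (for I/O or for other processes) rather than computing, which is why the elapsed time is the figure a user actually experiences.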
The LaTeX test consists of running a slightly edited version of Descartes' "A Discourse on Method" (dcart10.txt from Project Gutenberg, repeated 10 times) through latex. This produces a dvi file with 450 pages. Current computers are so fast that the speed of file I/O can play a role when writing the dvi file. As I wanted to be independent of NFS and hard disk performance, the dvi file is written to /dev/null. The process needs about 1.6MB of memory and thus benefits greatly from fast caches. The operations are integer operations, so this test measures localized integer performance.
The C++ compile test compiles a small C++ program that makes extensive use of the Standard Template Library. The process needs about 16MB during syntax analysis, so on most machines caches don't matter much. This test measures non-localized integer performance.
The problem with numerical applications is that they usually come with thousands of lines of source code. In a first version of this benchmark I used the linear optimization solver PCx on a specific problem, as I happen to work a lot on LP problems. This wasn't as portable as I would have liked, so I switched to a synthetic benchmark. The test consists of a simple (dense) matrix-vector multiplication, written in ANSI C. The dimensions are chosen such that the process takes about 19MB of memory. Explicit loop unrolling (x16) is done to ensure that all the pipelines in multi-pipelined CPUs are used. The test measures non-localized floating point performance.
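To illustrate what a 16-fold explicitly unrolled dense matrix-vector product can look like, here is a minimal ANSI C sketch. It is not the benchmark's actual source: the function name `matvec`, the row-by-row storage layout and the use of four partial sums are assumptions made for this example. Only the macro name `NOLU` (as in `-DNOLU`, which switches the explicit unrolling off) is taken from the text.

```c
#include <stddef.h>  /* size_t */

/*
 * Compute y = A * x for a dense rows x cols matrix stored row by row.
 * Hypothetical sketch: compile with -DNOLU to get the plain loop only.
 */
void matvec(const double *a, const double *x, double *y,
            size_t rows, size_t cols)
{
    size_t i, j;

    for (i = 0; i < rows; i++) {
        const double *row = a + i * cols;
        double sum = 0.0;

        j = 0;
#ifndef NOLU
        {
            /* 16 products per pass, spread over four independent partial
               sums so that consecutive additions do not wait on each other. */
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

            for (; j + 16 <= cols; j += 16) {
                s0 += row[j]      * x[j]      + row[j + 1]  * x[j + 1]
                    + row[j + 2]  * x[j + 2]  + row[j + 3]  * x[j + 3];
                s1 += row[j + 4]  * x[j + 4]  + row[j + 5]  * x[j + 5]
                    + row[j + 6]  * x[j + 6]  + row[j + 7]  * x[j + 7];
                s2 += row[j + 8]  * x[j + 8]  + row[j + 9]  * x[j + 9]
                    + row[j + 10] * x[j + 10] + row[j + 11] * x[j + 11];
                s3 += row[j + 12] * x[j + 12] + row[j + 13] * x[j + 13]
                    + row[j + 14] * x[j + 14] + row[j + 15] * x[j + 15];
            }
            sum = (s0 + s1) + (s2 + s3);
        }
#endif
        /* Remaining columns (with -DNOLU: the whole row). */
        for (; j < cols; j++)
            sum += row[j] * x[j];

        y[i] = sum;
    }
}
```

With `double` entries, a 1500 x 1500 matrix occupies 18,000,000 bytes, which is in the region of the memory footprint quoted above; the dimensions the benchmark really uses are not stated here. Because the unrolled version adds the products in a different order, its result can differ from the plain loop in the last digits for general floating point data.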
Here you can grab the source (57KB).
The following table contains some results from the computers I have (had) access to. They are sorted with respect to their C++ compile performance. The numbers are elapsed time in seconds. Numbers in parentheses are cpu time in seconds. Everything below 2 seconds can be considered relatively fast and everything above 5 seconds slow (in Q4 2001).
|Machine||CPU||Cache||Memory||LaTeX||C++ compile||Numerical|
|antares-2003||AMD Athlon XP 1700+ (1466MHz)||128K L1, 256K L2||256MB PC266 DDR SDRAM (CL2)||1.101 (1.1)||1.194 (1.19)||1.268 (1.27)|
|antares||AMD Athlon 64 Newcastle 3000+ (2000MHz)||128K L1, 512K L2||1GB PC400 DDR SDRAM (CL2.5)||0.726 (0.714)||1.215 (1.161)||0.425 (0.414)|
|jack||AMD Athlon Thunderbird 1.33GHz||128K L1, 256KB L2||256MB PC266 DDR-SDRAM||1.239 (1.24)||1.441 (1.39)||1.633 (1.63)|
|fubini2||Pentium 4 (Xeon) 1.8GHz, Dell PowerEdge 2650||512KB L2||1GB||1.265 (1.27)||2.204 (2.1)||0.665 (0.61)|
|antares-2001||AMD Athlon Thunderbird 750MHz||128K L1, 256K L2||128MB PC100 SDRAM||2.158 (2.12)||2.267 (2.21)||2.458 (2.380)|
|antares-2000||AMD Athlon 550MHz||128K L1, 512K L2||64MB PC100 SDRAM||3.02 (2.98)||2.45 (2.34)||3.60 (3.52)|
|rao||Pentium III (Coppermine) 600MHz, 2 CPU||256KB L2||512MB PC100 SDRAM||2.865 (2.820)||2.852 (2.730)||2.605 (2.480)|
|capricorn||AMD K6-III 450MHz||1MB L3||128MB PC100 SDRAM||3.73 (3.69)||2.9 (2.88)||5.72 (5.66)|
|jensen||Alpha ? ?||?||1GB||4.5 (4.3)||3.0 (2.0)||1.5 (1.4)|
|marvel||Alpha EV7 1000MHz, 8 CPU||?||8GB||3.0 (2.8)||3.1 (2.5)||0.6 (0.6)|
|inspiron||Intel mobile Pentium-III-M 1GHz, Dell Inspiron 4100||512KB L2||512MB PC133 SDRAM||1.56 (1.54)||3.216 (3.16)||2.382 (2.31)|
|markov||Pentium III (Katmai) 500MHz, 2 CPU||512KB||256MB||3.51 (3.47)||3.24 (3.05)||4.51 (4.44)|
|moser||Alpha 21264A 731MHz, 8 CPU||4MB L2||8GB||4.0 (3.9)||4.3 (3.6)||1.5 (1.4)|
|a211n1||Pentium II 400MHz (root on NFS)||512KB L2||64MB||4.51 (4.32)||4.70 (3.82)||4.78 (4.78)|
|rao-1999||AMD K6-2 300MHz (100MHz front side bus)||512KB L2||64MB||6.14 (6.12)||5.24 (5.23)||10.07 (10.02)|
|grad||Alpha 21264 500MHz||?||512MB||7.8 (7.6)||5.8 (5.1)||3.8 (3.7)|
|cramer||Pentium II 350MHz||512KB L2||64MB||4.96 (4.81)||6.55 (4.4)||5.32 (5.27)|
|bernhard||Alpha 21164 533MHz||L2:96K, L3:2MB||128MB||5.91 (5.83)||7.58 (6.59)||5.62 (5.62)|
|capricorn-1999||AMD K6 200MHz||512KB L2||64MB EDO||9.32 (9.26)||8.07 (8.02)||16.02 (15.94)|
|black||AMD K6 300MHz (66MHz front side bus)||512KB L2||64MB||6.90 (6.85)||8.36 (6.82)||13.21 (13.19)|
|rao-1997||Pentium Pro 200MHz||256KB L2||96MB||7.21 (?)||13.87 (?)||12.86 (?)|
|?||?||L1:128KB/32KB, no L2||256MB||10.56 (?)||18.67 (?)||3.18 (?)|
|capricorn-1997||AMD K5-PR100||256KB L2||32MB EDO||13.33 (13.3)||20.78 (15.64)||36.75 (36.76)|
|?||Toshiba Satellite 110cs||0||40MB||24.66 (24.6)||24.26 (24.04)||28.10 (28.1)|
|st||IBM model 3CT||L1:128KB/32KB, 1-2MB? L2||256MB||20.46 (20.29)||32.60 (32.01)||6.09 (5.98)|
It is amazing how much the ratio between integer and floating point performance differs among the AMD processors, the Intel CPUs and the IBM RS/6000 world. AMD's CPUs used to beat comparable Intel CPUs in C++ compile performance while lagging in floating point performance. (This changed with the Athlon, whose integer/floating point performance ratio is similar to that of Intel's Coppermine Pentium III. See also the chess benchmarks.) At the other extreme are IBM's workstations. The Model 3CT (st) is roughly comparable to an Intel Pentium-100 in terms of integer performance, but it is twice as fast as a Pentium Pro and can still compete with a 400MHz K6 in terms of (pure) floating point performance.
However, since even numerical applications don't consist of 100% floating point operations, fast floating point performance doesn't help much when the integer performance lags behind. Indeed, when solving a certain LP problem, rao-1997 and st needed exactly the same time.
A remark on the loop unrolling in the numerical test is necessary. I tested the floating point performance on an RS/6000 model 3CT (Power2) with several compiler options and explicit loop unrolling switched on and off:
|no explicit loop unrolling (-DNOLU)||17.40||8.68||8.23||6.24|
|explicit loop unrolling in the source||6.04||6.05||8.22||7.07|
The Power2 CPU (in IBM's RS/6000) has several FPU
gcc -O2 on the source with no
explicit loop unrolling reveals that the performance substantially
decreases if the pipelines are not filled properly.
xlc seems to do loop unrolling even with