HPCC Record: IBM Blue Gene/P 32768 PowerPC 450 0.85 GHz

HPC Challenge Benchmark Record

System Information
Affiliation:	Argonne National Lab - LCF	URL:	www.alcf.anl.gov
Location:	USA, Illinois, Argonne	System Use:	Government
System Manufacturer:	IBM	System Name:	Blue Gene/P
Interconnect Manufacturer:	IBM	Interconnect Type:	Torus
Operating System:	Blue Gene CNK	MPI:	MPICH 2
MPI Wtick:	0.000000001176471	BLAS:	ESSL 4.3
Language:	C	Compiler:	IBM XL C/C++ 9.00
Compiler Flags:	-DHPCC_MEMALLCTR -g -O3 -qhot -qsmp=omp -qmaxmem=-1 -DBGPOPT	Processor Type:	PowerPC 450
Processor Speed:	0.85 GHz	Total Processors:	40960
Processors Entered:	32768	Processors determined:	131072
Cores per chip:	4	HPL Processes:	32768
MPI Processes:	32768	Threads Entered:	4
Threads determined:	4	FLOPs per cycle:
Theoretical peak:	557 TFlop/s	Total memory:	81920 GiB
FFT library:
Explain Optimizations:
The RandomAccess algorithm is similar in principle to the one submitted for HPCC in 2005: it employs software routing and aggregation on a 3D-torus topology, routes every update along the three dimensions in dimension order, and ensures that the application never holds more than 1024 updates at any point in time. The significant changes for Blue Gene/P are as follows. Since an MPI node has 4 cores at its service, the work is distributed across them: core 0 generates the updates and routes them along the X dimension, core 1 receives from X and routes along Y, core 2 receives from Y and routes along Z, and core 3 receives from Z and applies the updates to the local table. The code bypasses MPI and directly uses the lower-level DMA-based SPI communication layer (available as a standard software library on Blue Gene/P).

The IBM Blue Gene/P system supports direct use of the messaging DMA hardware in parallel with MPI for application messaging. To enable this direct-use mode, an initialization call that sets up the DMA FIFOs must be executed before MPI_Init is invoked. The optimized HPCC code introduces a function, dma_init, which is called just before MPI_Init for this purpose. This method was put in place to support special messaging situations and is used in a number of production codes, including QCD. It is also well documented in the Blue Gene redbook.

For MPIFFT, we changed the parallel 1D FFT algorithm from the 9-step FFT to the basic 6-step algorithm. We then modified the parallelized FFT code under the HPCC_pzfft1d function to fit the Blue Gene system. We modified all the functions in "fft235.c" to SIMDize the radix-2, 3, 4, 5, and 8 FFT routines, using intrinsic functions of the IBM XL C compiler to generate the appropriate double FPU instructions.
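The dimension-ordered routing described above can be illustrated with a minimal sketch. It is not the submitted code: the names Coord, ring_step, next_hop and route_hops are hypothetical, the choice of the shorter way around each ring is an assumption, and the aggregation, the 1024-update limit, the per-core pipeline and the DMA/SPI layer are all left out. It only shows the order in which an update would travel: all X hops first, then Y, then Z.

```c
#include <assert.h>

/* Hypothetical node coordinate (also reused for the torus extents). */
typedef struct { int x, y, z; } Coord;

/* Step (-1, 0 or +1) that moves a toward b along the shorter way
 * around a ring of `size` nodes. Ties go in the + direction. */
static int ring_step(int a, int b, int size) {
    assert(size > 0);
    int fwd = ((b - a) % size + size) % size;  /* hops in + direction */
    if (fwd == 0) return 0;
    return (fwd <= size - fwd) ? 1 : -1;
}

/* Next hop under dimension-ordered routing: resolve X completely,
 * then Y, then Z. Returns cur unchanged once cur equals dst. */
static Coord next_hop(Coord cur, Coord dst, Coord dims) {
    int s;
    if ((s = ring_step(cur.x, dst.x, dims.x)) != 0)
        cur.x = (cur.x + s + dims.x) % dims.x;
    else if ((s = ring_step(cur.y, dst.y, dims.y)) != 0)
        cur.y = (cur.y + s + dims.y) % dims.y;
    else if ((s = ring_step(cur.z, dst.z, dims.z)) != 0)
        cur.z = (cur.z + s + dims.z) % dims.z;
    return cur;
}

/* Number of hops an update takes from src to dst. */
static int route_hops(Coord src, Coord dst, Coord dims) {
    int hops = 0;
    while (src.x != dst.x || src.y != dst.y || src.z != dst.z) {
        src = next_hop(src, dst, dims);
        hops++;
    }
    return hops;
}
```

On an assumed 8 x 8 x 8 torus, route_hops from (0,0,0) to (7,2,4) takes one hop in X (wrapping backwards), two in Y and four in Z. In the scheme the record describes, each of those three legs would be handled by a different core of the receiving nodes.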

HPL
HPL:	173.362 Tflop/s	HPL time:	57080.7 seconds
HPL eps:	1.11022e-16	HPL Rnorm1:	0.00000115715
HPL Anorm1:	615309	HPL AnormI:	615249
HPL Xnorm1:	4574500	HPL XnormI:	11.9858
HPL N:	2457601	HPL NB:	120
HPL NProw:	128	HPL NPcol:	256
HPL depth:	1	HPL NBdiv:	6
HPL NBmin:	6	HPL CPfact:	C
HPL CRfact:	R	HPL CPtop:	3
HPL order:	R
HPL dMach EPS:	1.110223e-16	HPL sMach EPS:	0.00000005960464
HPL dMach sfMin:	0	HPL sMach sfMin:	1.175494e-38
HPL dMach Base:	2	HPL sMach Base:	2
HPL dMach Prec:	2.220446e-16	HPL sMach Prec:	0.0000001192093
HPL dMach mLen:	53	HPL sMach mLen:	24
HPL dMach Rnd:	1	HPL sMach Rnd:	1
HPL dMach eMin:	-1021	HPL sMach eMin:	-125
HPL dMach rMin:	0	HPL sMach rMin:	1.175494e-38
HPL dMach eMax:	1024	HPL sMach eMax:	128
HPL dMach rMax:	1.797693e308	HPL sMach rMax:	3.402823e38
dweps:	1.110223e-16	sweps:	0.00000005960464
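As a cross-check on the HPL figures, the sketch below recomputes the rate from HPL N and HPL time using the standard HPL operation count, 2/3 N^3 + 3/2 N^2, and compares it with a peak estimate. The function names are hypothetical. The value of 4 flops per cycle per core is an assumption, since the "FLOPs per cycle" field of this record is empty; it is chosen because 40960 nodes x 4 cores x 0.85 GHz x 4 reproduces the stated 557 TFlop/s.

```c
#include <assert.h>

/* Standard HPL operation count for an N x N solve. */
static double hpl_flops(double n) {
    return (2.0 / 3.0) * n * n * n + 1.5 * n * n;
}

/* Achieved rate in TFlop/s for problem size n and wall time in seconds. */
static double hpl_rate_tflops(double n, double seconds) {
    assert(seconds > 0.0);
    return hpl_flops(n) / seconds / 1e12;
}

/* Peak in TFlop/s. flops_per_cycle is an assumed value (see lead-in),
 * not a field taken from the record. */
static double peak_tflops(double cores, double ghz, double flops_per_cycle) {
    return cores * ghz * flops_per_cycle / 1e3;
}
```

With N = 2457601 and 57080.7 seconds, hpl_rate_tflops gives about 173.36 TFlop/s, in line with the reported value. Under the 4 flops/cycle assumption, the 131072 cores used in this run peak at about 445.6 TFlop/s, which would put HPL efficiency at roughly 39% of the partition used, or roughly 31% of the 557 TFlop/s quoted for the full system.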

PTRANS
PTRANS:	625.204 GB/s	PTRANS time:	18.948 seconds
PTRANS residual:	0	PTRANS N:	1228800
PTRANS NB:	120	PTRANS NProw:	128
PTRANS NPcol:	256

STREAM
S-STREAM Copy:	5.43815 GB/s	S-STREAM Scale:	3.62631 GB/s
S-STREAM Add:	3.97957 GB/s	S-STREAM Triad:	3.97997 GB/s
EP-STREAM Copy:	5.43754 GB/s	EP-STREAM Scale:	3.6263 GB/s
EP-STREAM Add:	3.9796 GB/s	EP-STREAM Triad:	3.97996 GB/s
STREAM Vector Size:	61440050	STREAM Threads:	4

RandomAccess
S-RandomAccess:	0.0096932 Gup/s	EP-RandomAccess:	0.00969341 Gup/s
G-RandomAccess:	103.18 Gup/s	G-RandomAccess N:	4398046511104
G-RandomAccess time:	170.5 seconds	G-RandomAccess Check Time:	1009.14 seconds
G-RandomAccess Errors:	0	G-RandomAccess Errors Fraction:	0
G-RandomAccess TimeBound:	-1	G-RandomAccess ExeUpdates:	17592186044416
RandomAccess N:	134217728
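The G-RandomAccess figures are related by simple arithmetic, sketched below with hypothetical function names. HPCC performs four updates per table entry, so ExeUpdates is 4 x G-RandomAccess N, and the Gup/s figure is the update count divided by the elapsed time. Fixed 64-bit types are used because this record reports long as 4 bytes on the system.

```c
#include <assert.h>
#include <stdint.h>

/* HPCC RandomAccess issues 4 updates per table entry. */
static uint64_t ra_updates(uint64_t table_entries) {
    return 4u * table_entries;
}

/* Giga-updates per second from an update count and elapsed seconds. */
static double ra_gups(uint64_t updates, double seconds) {
    assert(seconds > 0.0);
    return (double)updates / seconds / 1e9;
}
```

For this record, ra_updates(4398046511104) equals the reported 17592186044416, and dividing that by 170.5 seconds gives about 103.18 Gup/s, matching the G-RandomAccess entry.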

FFT
S-FFT:	1.21389 GFlop/s	EP-FFT:	1.21354 GFlop/s
MPIFFT:	5079.59 GFlop/s	MPIFFT N:	549755813888
MPIFFT Max Error:	0.0000000000000024651	MPIFFT time0:	0.397244 seconds
MPIFFT time1:	4.26304 seconds	MPIFFT time2:	2.08924 seconds
MPIFFT time3:	5.30936 seconds	MPIFFT time4:	3.88742 seconds
MPIFFT time5:	4.96885 seconds	MPIFFT time6:	0.189394 seconds
FFTEnblk:	16	FFTEnp:	8
FFTEl2size:	1048576
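The MPIFFT rate can be reproduced from the other fields under one assumption: that the reported GFlop/s is the conventional 5 N log2 N operation count divided by the sum of the seven phase timings (time0 through time6). The record does not state this, but the arithmetic agrees with the reported value. The function names below are hypothetical, and the integer log2 helper is only valid here because N is a power of two (2^39).

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(n) for n >= 1; exact when n is a power of two. */
static unsigned ilog2_u64(uint64_t n) {
    assert(n > 0);
    unsigned k = 0;
    while (n >>= 1) k++;
    return k;
}

/* GFlop/s using the 5 * N * log2(N) count over the summed phase times. */
static double fft_gflops(uint64_t n, const double *times, int count) {
    double total = 0.0;
    for (int i = 0; i < count; i++) total += times[i];
    assert(total > 0.0);
    return 5.0 * (double)n * (double)ilog2_u64(n) / total / 1e9;
}
```

Summing the seven timings in this record gives about 21.10 seconds, and with N = 549755813888 the function returns approximately 5079.6 GFlop/s, consistent with the MPIFFT entry.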

DGEMM
S-DGEMM:	9.67524 GFlop/s	EP-DGEMM:	9.67646 GFlop/s
DGEMM N:	7837

RandomRing Latency/Bandwidth
RandomRing Latency:	6.23889 usec	RandomRing Bandwidth:	0.0219922 GB/s

NaturalRing Latency/Bandwidth
NaturalRing Latency:	4.85518 usec	NaturalRing Bandwidth:	0.743607 GB/s

PingPong Latency/Bandwidth
Maximum PingPong Latency:	6.61654 usec	Maximum PingPong Bandwidth:	0.385704 GB/s
Minimum PingPong Latency:	3.58265 usec	Minimum PingPong Bandwidth:	0.379582 GB/s
Average PingPong Latency:	5.06575 usec	Average PingPong Bandwidth:	0.385048 GB/s

Size of Data Types
char:	1 byte	short:	2 bytes
int:	4 bytes	long:	4 bytes
void ptr:	4 bytes	float:	4 bytes
double:	8 bytes	size_t:	4 bytes
s64Int:	8 bytes	u64Int:	8 bytes

OpenMP
M OpenMP:	200505	OpenMP Num Threads:	4
OpenMP Num Procs:	4	OpenMP Max Threads:	4

Memory
MemProc:	-1	MemSpec:	-1
MemVal:	-1

CPS
CPS_HPCC_FFT_235:	0	CPS_HPCC_FFTW_ESTIMATE:	0
CPS_HPCC_MEMALLCTR:	1	CPS_HPL_USE_GETPROCESSTIMES:	0
CPS_RA_SANDIA_NOPT:	0	CPS_RA_SANDIA_OPT2:	0

Version: 1.2.0.f - Run Type: opt - Parent ID: 317
Created: 2008-11-17 - Exported: Thu Jun 23 15:54:52 2022