An OpenMP Implementation of a Spectral Method Based CFD Solver on the cc-NUMA Origin 2000

Fady Najjar1  
CSAR
Urbana, IL 61801
f-najjar@csar.uiuc.edu
http://www.csar.uiuc.edu/~f-najjar
and
Rajat Mittal2
Mechanical Engineering
University of Florida
Gainesville, FL 32611
mittal@gollum.me.ufl.edu
http://gollum.me.ufl.edu/~mittal

Extended Abstract

Keywords: OpenMP Compliant, Spectral Methods, Computational Fluid Dynamics, cc-NUMA Origin 2000

Abstract

The high accuracy of spectral methods have made them popular in direct numerical simulations (DNSs) of turbulent flows [1]. However, the high per-point computational cost of spectral method based solvers has to date limited these DNSs to low Reynolds numbers and to relatively simple flow configurations. Parallel computing provides the only avenue of extending this methodology to higher Reynolds numbers and more complex flow configurations which are of greater relevance to practical problems. However, due to the dominance of matrix-matrix multiplications and Fast Fourier Transforms in the computations, parallelization of a spectral code presents some unique challenges and these have to met in order to construct an efficient parallel solver. A Fourier-Chebyshev spectral collocation method for simulating 3-D, unsteady, viscous, incompressible flow past spheres and spheroids [2] has been ported from the CRAY C-90 to the distributed-shared (DSM) memory cc-NUMA (Cache-Coherent Non-Uniform Access Memory) Origin 2000 system using OpenMP. Our objective in porting this code to the parallel platform is to simulate and analyze the flow past spheres and spheroids upto Re=10000 and we expect to employ meshes with upto 12 million points for these simulations. A number of test cases have been run in order to characterize the parallel efficiency and speed of the solver. The tests indicate a parallel efficiency of at least 50% with a solver operating at upwards of 4 GFLOPS on 32 processors. This paper will describe the key issues faced in parallelizing this code and present in detail the results of these performace tests.

Computational Methodology

The basic shape that is considered is a prolate spheroid. The governing equations are the unsteady, incompressible Navier Stokes equations. The above equations are transformed to the prolate spheroidal coordinate system and the resulting system of equations is discretized on a body fitted spheroidal mesh the surface grid for which is shown in Figure 1. In the following discussion N1, N2 and N3 are the number of mesh points in the wall-normal (xi), wall-tangential (eta) and azimuthal (phi) directions respectively. The azimuthal direction phi is periodic over 2*pi and this allows us to use a Fourier collocation method in this direction. In the eta-direction, which is referred to as the wall-tangential direction, the flow is periodic not over pi but over 2*pi, and this allows us to use a Fourier collocation method for computing derivatives along this direction. The semi-infinite flow domain is truncated to a large but finite distance and a Chebyshev collocation method is used in this non-periodic direction. Thus, spectral discretization is used along all the three directions and this results in a highly accurate computation of the derivatives. A two-step time-split[3] scheme is used for advancing the solution in time. The intermediate velocity field is obtained first by advancing through the advection-diffusion equation. A second-order accurate, semi-implicit method is used for this step where all convective terms are treated explicitly and diffusion terms are treated implicitly. Details regarding implicit treatment of diffusion terms will be provided in section 2.3. A 2nd-order Adams-Bashforth (AB) is employed for the explicit terms and the implicit terms are discretized using a Crank-Nicolson (CN) scheme. The next step is pressure-correction and this requires the solution of the pressure Poisson equation which is solved with a homogeneous Neumann condition on the boundaries. Finally the pressure correction is added to the intermediate velocity and a divergence-free velocity field is obtained.

Figure 1. Surface Mesh showing Prolate Spheroidal Coordinate System

Single Processor Optimizations

The basic algorithm was developed on a single CRAY vector processor, C-90. Although the algorithm was efficient and optimized for Parallel Vector Processor (PVP) platforms, steps were needed to enhance the performance on RISC-based processors with distributed-shared memory hierarchy. The primary issues in this paradigm are to minimize primary (L1) and secondary (L2) data cache misses, reduce Table Look-aside Buffer (TLB) misses and optimize data locality. On the Origin 2000, access to data residing in cache is 100 times faster than access to memory. The compiler optimized for out-of-data cache provides prefetching functionality allowing enhanced cache performance. The porting effort from CRAY to the Origin 2000 accounted for minimal changes in the code, primarily removing EQUIVALENCE statements to permit data layout directives, loop interchange as to access the leftmost index the fastest, and upgrading all floating variables to double precision. Although the loop interchange is a procedure handled efficiently by the compiler with an aggressive optimization level (-O3), the loop structure was modified manually throughtout the code to permit an efficient implement on DSM systems for any optimization level. Further, the promotion of the variables from single-precision to double precision was handled in most cases with the compiler flag (-r8). The final step in the porting effort focused on modifications of the library calls to the Origin optimized libraries in particular the Fast Fourier Transform (FFT) and upgrading the library calls to double precision. For instance, the matrix-matrix multiplication routine is called on the CRAY in single precision as SGEMM, this call was modified to double precision as DGEMM. Unfortunately, this mechanism is not handled automatically by the compiler and care was taken to propagate these modifications throughout the code. Having completed this task, the code was tested on both platforms and was found to give answers within machine-accuracy.

Further, the resulting code design was found to have an efficient implementation for the RISC-based processors with minimal cache misses and a cache hit rate of 87 and 96 percentile for primary and secondary data caches, respectively. The TLB misses were seen to contribute minimally to the overall CPU time as a result of the algorithm structure described below. The main idea was to minimize large-stride array access and maintain the computations local to the threads in the leftmost index range.

Shared-Memory Parallel Programming Implementation

Although the SGI compiler provides features allowing automatic loop parallelism, several programming constructs inhibit this process including library calls, and reduction operations, to name a few. Loops containing such constructs can be concurrentized manually using compiler directives. It was decided in this step to focus on a portable scalable programming paradigm that runs efficiently on distributed-shared memory systems. This lead us to develop a parallel version based on the newly-devised industry standard for shared-memory systems, OpenMP [4]. OpenMP was designed to be a flexible standard easily implemented across different platforms and comprises of four distinct parts, control structures, data environment, synchronization and run-time library. Design issues are described with comparisons to distributed-memory message passing models in Dagum and Menon [5]. Further, the recent SGI compiler provides a native implementation of the OpenMP standard resulting in the most efficient design for the Origin 2000. The current code was structured in a modular manner to allow easy extensions in the future to various programming paradigms.

The shared-memory programming paradigm was implemented in the code via the OpenMP compiler directives with Parallel Regions (OMP$ PARALLEL) and Parallel Do Loops (OMP$ DO or OMP$ PARALLEL DO). The algorithm makes extensive use of the SGI extension to the OpenMP standard in the form of data layout directives, (C$DISTRIBUTE)[6]. The variable field consists of the 3 velocity, the pressure and the non-linear term components, with the data partitioned along a single direction as follows: C$DISTRIBUTE U(*,*,BLOCK). Hence, the two leftmost indices are distributed in a serial manner onto the local memory/cache of the thread while the third index is spread across the various threads in a block form. This will optimize the inter-processor communications across the network fabric and is found to be benefical for FFT and Matrix-Matrix Multiplications. For such computations, the array dimensions pertaining to intensive computations are done locally resulting in a high MFLOP rate (in some cases close to the CPU peak performance for the Matrix-Matrix Multiplications). However, communications are incurred when the computations are performed along the third dimension (the rightmost index). In this case, a global array motion is required to transpose the array onto the local memory of the thread. For instance, the algorithm to calculate the Fast Fourier Transform (FFT) is summarized as follows:

(i) Perform data transpose by bringing the rightmost index onto the left most index:

REAL*8 ut(N1,N2,N3)
C$DISTRIBUTE ut(*,*,BLOCK)
do j = 1, N2
     do i = 1, N1
         do k = 1, N3
              ut(k,i,j) = u(i,j,k)
         end do ! k
     end do ! i
end do ! j

(ii) Perform FFT computations on ut with optimized libraries

(iii) Transpose the data back to its original distribution:

do k = 1, N3
   do j = 1, N2
       do i = 1, N1
           u(i,j,k) = ut(k,i,j)
       end do ! i
    end do ! j
end do ! k

Steps (i) and (iii) represent inter-processor communication where global array motion is incurred. These steps are currently performed within a Parallel Do loop (OMP$ PARALLEL DO) construct and it is anticipated in the future that these communication constructs will be handled efficiently with more sophisticated libraries. Step (ii) corresponds to the computation step where various optimized library calls can be invoked at this stage. Such programming paradigm was also found to be optimal on Massively Parallel Processing Distributed-memory systems such as the CM-5 [7,8].

It is anticipated that more efficient implementations of the data layout will be devised to permit moderate to large problem size to ran onto larger partitions. Currently, the machine dimension onto which a grid can be mapped is limited by the second and third indices since they are the "BLOCK" dimensions in the data layout. Thus, the present parallel scalability study will focus on upto a 32-processor Origin 2000 partition. The operation count for the overall algorithm is obtained from C1 O(N4)+ C2 O(N3 log2(N))+ C3 O(N3) where N is a representative dimension for the 3-D arrays.

Performance:

The performance of the shared-memory OpenMP programming paradigm described above is evaluated in a model problem, the unsteady flow past a sphere at zero angle of attack. The operating system (OS) is IRIX6.5SE and the code is compiled with F90 Version 7.2.1MR and linked to the SGI/CRAY Scientific Library, SCSL 1.0. Computations are performed with 64-bit precision and 64-bit ABI. The timings reported correspond to the system clock counters obtained from irtc functionality. The single code performance of the PVP CRAY C-90 and T-90 is computed to be approximately 0.5 and 1.2 GFLOPS, respectively. The bulk of the computations is taken by the matrix-matrix multiplication routines for real and complex arrays, SGEMM and CGEMM.

Figure 2 shows the MFLOPS (Million Floating Point Operations per Second) rate on various processor partitions for three problem sizes of 128x128x128, 256x128x128, & 192x192x192 mesh points. It is observed that the single processor performance is approximately 250MFLOPS representing 2/3 of the peak CPU speed. The performance is seen to increase linearly (on a logarithmic scale) with increase in number of processors. This means that the performance rate can be defined as follows MFLOPS = C (NP)K, where K varies between 1.7 and 1.8. A peak performance of 4.3GFLOPS is achieved for the largest problem size on 32-processor partition. Also presented in Figure 2 is the MFLOPS normalized by the number of processors. It is seen that the rate decreases with increasing processor size due to overhead incurred from communication costs. However, the slope drops faster for the smaller problem size (128x128x128) compared to the larger one of 192x192x192 at higher processor partition sizes. In general, the MFLOPS normalized by partition size exceeds 130MFLOPS (i.e. 1/3 of the CPU Peak Performance) on 16 and 32-processor partitions for the two smallest problem sizes and the large one, respectively. Hence, the current algorithm recovers the performance obtained on a single-processor PVP T90 with only 8-processor Origin 2000. Further, for the largest problem size considered of 7-Million grid points, the CPU requirement on a 32-processor Origin 2000 is 6microseconds/time-step/grid-point.

Figure 2. MFLOPS versus number of Processors (NCPUs)

The parallel efficiency of the algorithm is evaluated by computing the speed-up factor. The speed-up factor, sigma, is defined as the ratio of the algorithm performance on a partition of size N over the algorithm performance on a single processor and is computed as sigma = (CPU)N/(CPU)1. Figure 1 presents the speed-up factors against the number of processors for the three problem sizes considered. Also shown is the theoretical limit defined as the partition size. At low processor size, the speed-up factor is close to the theoretical limit. It is seen that the parallel efficiency decreases from 81% to 54% for the 192x192x192 mesh problem because of the increased communication costs.

Figure 3. Parallel Scaleup factor versus number of Processor.

Conclusions:

A portable scalable algorithm for spectral based method CFD solver has been described. The implementation focused on the cc-NUMA Origin 2000 with a shared-memory programming paradigm based on the newly-devised OpenMP. A native compiler implementation on the Origin 2000 allows us to take advantage of the low overhead cost incurred from the API layer. Parallel performance exceeding 4GFLOPS have been obtained for a problem size of 7 Million grid point on a 32-processor partition. The code achieves a performance of 1/3 the CPU peak performance on a large partition size.

References:

1.
C. Canuto, M.Y. Hussaini, A. Quarteroni, A. and T.A. Zang.
Spectral Methods in Fluid Dynamics, Springer-Verlag, 1988.

2.
R. Mittal
"A Fourier-Chebyshev Spectral Collocation Method for Simulating Flow Past Spheres and Spheroids,"
submitted for publications, 1998.

3.
R. Mittal, and S. Balachandar.
"Direct Numerical Simulations of Flow Past Elliptic Cylinders,"
J. Comp. Phys., Vol. 124, 351-367, 1996.

4.
OpenMP Tutorial, http:/www.openmp.org.

5.
Dagum, L., and R. Menon.
"OpenMP: An Industry-Standard API for Shared-Memory Programming,"
IEEE Computational Science & Engineering, Jan-March 1998, pp. 46-55.

6.
MIPSpro 7 Fortran 90 Commands and Directives Reference Manual, SGI Technical Publications, http://techpubs.sgi.com/library.

7.
Cortese, T.A., and S. Balachandar
"High- Performance Spectral Simulation of Turbulent Flows in Massively Parallel Machines with Distributed Memory",
International Journal of Supercomputer Applications, Volume 9, No. 3, Fall, pp. 187-204.

8.
F.M. Najjar, and S.P. Vanka.
"Simulations of Unsteady Fluid Flows on the CM-5",
ASME Symposium on Advances in Computational Methods in Fluid Dynamics, FED-Vol. 196, pp. 277-285, 1994.

Footnotes

 1Center for Simulation of Advanced Rockets, University of Illinois at Urbana-Champaign, 1304 West Springfield Avenue, Urbana, IL 61801. Phone: (217) 333-8895. Fax: (217) 333-1910

 2Mechanical Engineering, University of Florida, P.O. Box 116300, Gainesville, FL 32611-6300. Phone: (352) 392-6751. Fax: (352) 392-1071