Testing and Implementing Some New Algorithms Using the FFTW Library on Massively Parallel Supercomputers

Guarrasi, Massimiliano; Ning, Li; Frigio, Sandro; Emerson, Andrew; Erbacci, Giovanni

doi:10.3233/978-1-61499-381-0-375

This paper presents a strategy for overcoming the limits of codes that employ the FFTW library, by implementing a more powerful parallel domain decomposition algorithm and by refining the auto-tuning mechanism already implemented in the library. In the first part of the paper we identify some of the major performance bottlenecks in the current FFTW implementation, in particular in its auto-tuning mechanism. To do this, we tested for the first time on a Blue Gene/Q system the 2D parallel domain decomposition algorithm provided by the 2DECOMP&FFT library. We found that on massively parallel supercomputers such as Blue Gene/Q clusters this algorithm performs significantly better than the current FFTW implementation. To demonstrate the benefits of the algorithm in a real application, we integrated the library into a CFD code, BlowupNS, where we found a marked improvement in parallel scalability.
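
As an illustration of why a 2D domain decomposition can scale further than a 1D one, the following minimal sketch compares how an n x n x n grid is distributed over MPI ranks in the two schemes. It is not taken from the paper, from FFTW, or from 2DECOMP&FFT: the function names (split_evenly, slab_local_shape, pencil_local_shape, idle_ranks), the grid size, and the process counts are all hypothetical and chosen only for illustration. The sketch assumes a plain block distribution and does no communication or FFT work.

```python
# Illustrative sketch only: hypothetical helper names, no real FFT or MPI calls.

def split_evenly(n, parts):
    """Return the block sizes obtained by dividing n points among `parts` ranks."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def slab_local_shape(n, nprocs, rank):
    """1D (slab) decomposition: only the first axis of an n^3 grid is split."""
    return (split_evenly(n, nprocs)[rank], n, n)


def pencil_local_shape(n, p_row, p_col, rank):
    """2D (pencil) decomposition: two axes are split over a p_row x p_col grid."""
    r, c = divmod(rank, p_col)
    return (n, split_evenly(n, p_row)[r], split_evenly(n, p_col)[c])


def idle_ranks(shapes):
    """Count ranks whose local block contains no grid points."""
    return sum(1 for shape in shapes if 0 in shape)


if __name__ == "__main__":
    n, nprocs = 64, 256  # assumed sizes, chosen so that nprocs > n
    slabs = [slab_local_shape(n, nprocs, r) for r in range(nprocs)]
    pencils = [pencil_local_shape(n, 16, 16, r) for r in range(nprocs)]
    print("slab idle ranks:  ", idle_ranks(slabs))
    print("pencil idle ranks:", idle_ranks(pencils))
```

With these assumed sizes, the slab scheme leaves 192 of the 256 ranks without data, because a single axis of 64 points cannot be shared among more than 64 ranks, while the 16 x 16 pencil layout gives every rank a 64 x 4 x 4 block. In general a 1D decomposition can use at most n ranks and a 2D decomposition at most n^2, which is one reason the latter is attractive on machines with very large core counts.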