EnuMath 2019

10:40 MS39: Flexible software design and performance tuning for modern HPC architectures (Part 2)
Chair: Dominik Goeddeke

10:40 25 mins	Better, faster and more portable: how to code differently George Bosilca Abstract: The next generation of HPC platforms presents a variety of challenges, including a growing need for asynchrony, increased hardware diversity, performance non-uniformity, and decreased hardware reliability. These threaten the maintainability of increasingly complex scientific code bases and call into question the continued viability of message passing as a model for direct, application-level interactions. This talk presents an extension to the PaRSEC task-based runtime that provides users with a Domain Specific Language for the dynamic description of dataflow applications. This approach singularly improves the performance portability and maintainability of codes. Qualitative data about the new programming model and some of its applications, as well as quantitative performance data, will be presented.
11:05 25 mins	Balancing Run-Time Customization and Compile-Time Optimization in HPC Karl Rupp Abstract: Many publications in high performance computing deal with the optimization of one or a few specific computational kernels on a given hardware architecture. The typical implicit assumption at the beginning of such research is that the hardware and software stack is known a priori. Thus, hardware details such as cache sizes are explicitly known and software switches such as compiler flags can be optimized without constraints. When software for high performance computing is distributed, either as standalone packages or as libraries, the target machine is likely to consist of different hardware and a different software stack. Thus, providing optimal performance for an a priori unknown target machine is a significant challenge, even if optimal implementations for several hardware and software stacks are known to the developers of the software. On the other hand, typical research involves a high degree of experimentation; for example, different preconditioners for the solution of large sparse systems of equations should be comparable with reasonable effort, which requires the ability to customize functionality at run-time. Such run-time customization sooner or later conflicts with compile-time optimizations, where indirections should be avoided whenever reasonable. This talk explores which trade-offs between run-time customization and compile-time optimizations can be considered reasonable. While there is no general answer to this question, this talk discusses some lessons learnt during the development of the free open source libraries PETSc and ViennaCL. In particular, the current approaches and future directions in dealing with a broad range of graphics processing units as well as vector extensions in central processing units are presented.
11:30 25 mins	Fully algebraic two-level overlapping Schwarz preconditioners for elasticity problems Alexander Heinlein, Christian Hochmuth, Axel Klawonn Abstract: Different parallel two-level overlapping Schwarz preconditioners with Generalized Dryja–Smith–Widlund (GDSW) and Reduced dimension GDSW (RGDSW) coarse spaces for elasticity problems are considered. GDSW-type coarse spaces can be constructed from the fully assembled system matrix, but they additionally need the index set of the interface of the corresponding nonoverlapping domain decomposition and the null space of the elasticity operator, i.e., the rigid body motions. In this paper, fully algebraic variants, which are constructed solely from the uniquely distributed system matrix, are compared to the classical variants, which make use of this additional information; the fully algebraic variants use an approximation of the interface and an incomplete algebraic null space. Nevertheless, the parallel performance of the fully algebraic variants is competitive with that of the classical variants for a stationary homogeneous model problem and a dynamic heterogeneous model problem with coefficient jumps in the shear modulus; the largest parallel computations were performed on 4,096 MPI (Message Passing Interface) ranks. The parallel implementations are based on the Trilinos package FROSch.
11:55 25 mins	Simple and fast software approaches for building differential equation solvers Garth Wells Abstract: There are now many excellent, freely available software libraries in the research community that support application and methods research, as well as research into the details of software and hardware approaches and implementations. The diversity of approaches and the willingness to share have enhanced the research community. I will reflect upon my involvement with the FEniCS Project (https://fenicsproject.org) over a number of years, and in particular on approaches and strategies that have been successful as well as approaches that proved less fruitful. Recently, the range of available software tools has blossomed (particularly with respect to high-level, just-in-time compiled languages), and this has led to changes in direction and in ideas for developing simple yet fast tools. I will present some examples, including ones that show how recent developments have changed my views on how scientific software can best be developed and shared.