# The Wonderful People Who Pay My Salary

- 2015-2017, $400,000, Intel Parallel Computing Center

Extending PETSc with Adaptive Mesh Refinement and Optimal Solvers, Applied to PFLOTRAN, and Optimized for Modern Intel ProcessorsWe will extend and optimize PETSc for structured adaptive mesh refinement (SAMR) discretizations with initial deployment to the premier open source, state-of-the-art massively parallel subsurface flow and reactive transport code – PFLOTRAN. This work is significant in that it radically changes algorithms, data access, and low level computational organization in order to maximize performance and scalability on modern Intel architectures, and encapsulates this knowledge in the PETSc libraries for the broadest possible impact.

PETSc will provide a new solver interface to SAMR, enabling the efficient representation of multi-scale phenomenon, while maintaining the simplicity of structured grid kernel computations. These kernel computations are amenable to the high number of vector lanes characteristic of current and future Intel architectures, and we will aggressively vectorize the existing code. Moreover, we propose to use the most asymptotically efficient solvers for equations of interests to PFLOTRAN such as Richards’ equations: matrix-free full approximation scheme (FAS) nonlinear full multigrid (FMG) methods. FMG has the potential to solve nonlinear systems of equations to truncation error accuracy with a few (5-20) work units or residual calculations. These formulations have a unique opportunity to leverage modern architectures to deliver fast, accurate, versatile solvers for subsurface flows and other PETSc applications.

- 2012-2015, $117,710, NSF SI2-SSE: 1147680

SPIKE --- An Implementation of a Recursive Divide-and-Conquer Parallel Strategy for Solving Large Systems of Linear EquationsDrs. Negrut, Sameh, and Knepley will investigate, produce, and maintain a methodology and its software implementation that leverage emerging heterogeneous hardware architectures to solve billion-unknowns linear systems in a robust, scalable, and efficient fashion. The two classes of problems targeted under this project are banded dense and sparse general linear systems.

This project is motivated by the observation that the task of solving a linear system is one of the most ubiquitous ingredients in the numerical solution of Applied Mathematics problems. It is relied upon for the implicit integration of Ordinary Differential Equation (ODE) and Differential Algebraic Equation (DAE) problems, in the numerical solution of Partial Differential Equation (PDE) problems, in interior point optimization methods, in least squares approximations, in solving eigenvalue problems, and in data analysis. In fact, the vast majority of nonlinear problems in Scientific Computing are solved iteratively by drawing on local linearizations of nonlinear operators and the solution of linear systems. Recent advances in (a) hardware architecture; i.e., the emergence of General Purpose Graphics Processing Unit (GP-GPU) cards, and (b) scalable solution algorithms, provide an opportunity to develop a new class of parallel algorithms, called SPIKE, which can robustly and efficiently solve very large linear systems of equations.

Drawing on its divide-and-conquer paradigm, SPIKE builds on several algorithmic primitives: matrix reordering strategies, dense linear algebra operations, sparse direct solvers, and Krylov subspace methods. It provides a scalable solution that can be deployed in a heterogeneous hardware ecosystem and has the potential to solve billion-unknown linear systems in the cloud or on tomorrow?s exascale supercomputers. Its high degree of scalability and improved efficiency stem from (i) optimized memory access pattern owing to an aggressive pre-processing stage that reduces a generic sparse matrix to a banded one through a novel reordering strategy; (ii) good exposure of coarse and fine grain parallelism owing to a recursive, divide-and-conquer solution strategy; (iii) efficient vectorization in evaluating the coupling terms in the divide-and-conquer stage owing to a CPU+GPU heterogeneous computing approach; and (iv) algorithmic polymorphism, given that SPIKE can serve both as a direct solver or an effective preconditioner in an iterative Krylov-type method.

In Engineering, SPIKE will provide the Computer Aided Engineering (CAE) community with a key component; i.e., fast solution of linear systems, required by the analysis of complex problems through computer simulation. Examples of applications that would benefit from this technology are Structural Mechanics problems (Finite Element Analysis in car crash simulation), Computational Fluid Dynamics problems (solving Navier-Stokes equations in the simulation of turbulent flow around a wing profile), and Computational Multibody Dynamics problems (solving Newton-Euler equations in large granular dynamics problems).

SPIKE will also be interfaced to the Portable, Extensible Toolkit for Scientific Computation (PETSc), a two decades old flexible and scalable framework for solving Science and Engineering problems on supercomputers. Through PETSc, SPIKE will be made available to a High Performance Computing user community with more than 20,000 members worldwide. PETSc users will be able to run SPIKE without any modifications on vastly different supercomputer architectures such as the IBM BlueGene/P and BlueGene/Q, or the Cray XT5. SPIKE will thus run scalably on the largest machines in the world and will be tuned for very different network and hardware topologies while maintaining a simple code base.

The experience collected and lessons learned in this project will augment a graduate level class, ?High Performance Computing for Engineering Applications? taught at the University of Wisconsin-Madison. A SPIKE tutorial and research outcomes will be presented each year at the International Conference for High Performance Computing, Networking, Storage and Analysis. A one day High Performance Computing Boot Camp will be organized each year in conjunction with the American Society of Mechanical Engineers (ASME) conference and used to disseminate the software outcomes of this effort. Finally, this project will shape the research agendas of two graduate students working on advanced degrees in Computational Science.

- 2012-2015, $240,000, DOE Applied Math Research, U.S. DOE Contract DE-AC02-06CH11357

Extending PETSc's Composable Hierarchically Nested Linear SolversWe propose to focus on the following four complementary subprojects, which have the common theme of addressing issues in scalability and robustness through hierarchical algorithms:

- Scalable preconditioners for Stokes and Karush-Kuhn-Tucker (KKT) systems;
- Unstructured geometric-algebraic multigrid methods;
- Hierarchical Krylov methods designed to scale Krylov solvers up to million-core simulations; and
- Interactive eigenanalysis of composite linear solvers, allowing one to understand what portions of the solution space are being handled well by the solver and which require additional preconditioning procedures.

- 2010-2015, $650,000, Subcontract, NSF EAR-0949446

The Computational Infrastructure for GeodynamicsCIG is fully described at http://www.geodynamics.org.

- 2009-2014, $550,000, DOE Math-CS Institute, U.S. DOE Contract DE-AC02-06CH11357

Nonlinear Algorithms to Circumvent the Memory Bandwidth Limitations of Implicit PDE SimulationsSparse matrix computations, which form the computational kernel of many DOE simulation codes, are severely memory bandwidth limited. This bottleneck holds true for single core systems, multicore systems, the new general-purpose graphics processing systems (GP GPUs), and the slightly more exotic IBM Roadrunner Cell processor. Under the best of circumstances one cannot achieve more than 15 to 20 percent of the processors' performance in sparse matrix computations because the processors are waiting on memory loads up to 85 percent of the time. With current architectural trends this situation is unlikely to change; if anything, it will get worse.

Sparse matrix computations arise because the paradigm for solving nonlinear partial differential equations (PDEs) is to iteratively perform a global linearization and then solve a sparse matrix linear problem (that is, Newton's method). The linearization involves computing a sparse matrix approximation to a nonlinear problem, which is (because of its large size) stored in main memory. The linear solution process requires constantly moving the sparse matrix entries from main memory into the CPU, where a small number of computations are performed using those values, before more values need to be loaded, thus allowing no register or cache reuse of the sparse matrix entries.

To better utilize computer systems, we need to

**eliminate**the "global linearize, then linear solve" paradigm and replace it with a "partially linearize, then solve" paradigm that acknowledges the memory hierarchy and moves the best possible CPU utilization, from 15% to over 75%, resulting in run times that are one-fifth as long. To achieve this transformation will require the joint efforts of mathematicians and computer scientists. The full approximation scheme (FAS), also called nonlinear multigrid, provides an algorithmic framework for the "partially linearize, then solve" approach and will form the starting point for our work. Loosely speaking, we propose a transition from Newton-multigrid algorithms to multigrid-Newton.**This project will develop the mathematical algorithms, analysis, and computer science techniques (including parsing, code generation, code optimization, and automatic differentiation) to move the "partially linearize, then solve" paradigm from concept to practice.**This research will result in PDE-based simulations running**up to five times faster**than traditional linear algebra approaches on the next generation of computer systems. Preliminary work indicates the "partially linearize, then solve" approach has great potential. The two key research questions we will answer under this work are as follows:- How will we maintain roughly the same convergence rates with the nonlinear schemes as with Newton-multigrid? Obviously if the convergence rates decay so badly that we end up doing many more floating-point operations, then the higher utilization is not a good measure of performance.
- How will we provide a software system that end users will accept and use? This likely means not requiring a radical shifting of the way application scientists develop code. But it may involve a radical shift in the way we develop ``libraries'' for implicit methods for PDEs.

- 2010-2012, $45,000, Subcontract, Army Research Office W911NF-09-1-0488

Classical Density Functional Theory of Fluids: Ions at a Dielectric InterfaceThe overall (three-year) goal of the proposed work is to develop a classical density functional theory (DFT) of ions and charged polymers near dielectric interfaces in a three-dimensional system. Over the 3 years of the grant, we have had great success and some setbacks, as might be expected for any research project. These include:

- The biggest success was publication of the paper describing the general theory in Journal of Physical Chemistry Letters. This is the big theory paper to come out of this project and represents the most important goal of the proposed work (i.e., constructing the functional itself). In the year since publication, it has already been cited 3 times, indicating an impact of the work.
- Numerical implementation of an efficient algorithm to solve the dielectric DFT equations is still being implemented and work on this front will continue beyond the completion of the grant. This is the biggest setback of the project. The finding of an efficient algorithm to numerically solve the dielectric DFT equations was much more difficult than aniticipated. The root of the problem is that the reaction field is not a radial interaction between ions. Because of this, fast Fourier transforms (FFTs) can no longer be used and one must be clever in making the program scale with \(N \log N\) rather than \(N^2\) where \(N\) is the number of grid points to be solved.
- On the other hand, one great numerical success was in implementing fast computational methods for the reaction in 3D. Matt Knepley at the Computational Institute of the University of Chicago, through a subcontract of this grant, has implemented a Fast Multipole Method (FMM) and published a paper on this topic.
- An exciting offshoot of the DFT work was a collaboration with Dezso Boda in Hungary that resulted an alternative method (not using DFT at all) of ions at dielectric interfaces that we developed. The specific goal of the proposed work is to develop a DFT of ions at a dielectric interface and this has been done (as just mentioned). The broader goal of the proposed to work is develop methods that any scientist can use when working with ions at a dielectric interface, especially in three dimensional geometries. We have started to develop another method with Dr. Boda, a long-time collaborator, called the Local Equilibrium Grand Canonical Monte Carlo (LE-GCMC). Because we have already developed a fast method to include dielectrics in Monte Carlo, this method will allow for computing ion currents (currently at the drift-diffusion level, but not limited to that) in complex 3D geometries that will be very challenging with DFT. The grant sponsored Dr. Bodaís one month visit to the PIís institution to work on this. It is important to note that this work is in addition to the DFT development; LE-GCMC came out of discussions that Dr. Boda and the PI had in discussing the dielectric DFT. Moreover, Dr. Boda's visit helped greatly in moving toward an efficient algorithm to numerically solve the dielectric DFT equations.
- Another offshoot of this work is the hope of coupling the new dielectric DFT functional to experiments. This work, which will continue beyond the completion of the grant, is with experimentalists working on nanoáuidic devices. This resulted in 2 papers (a third is in preparation), including one in the prestigeous Nano Letters.

The final report is available here.

- 2001-2011, $1,200,000, DOE SciDAC ISIC

Towards Optimal Petascale SimulationsTOPS is fully described at http://www.scidac.gov/math/TOPS.html.

- 2009-2011, $90,000, NSF Grant OCI-0850680

Mechanical Transformation of Knowledge to LibrariesThe proposed project approaches the problem of library development by a fundamentally different approach -- one in which the knowledge abstractions at an appropriate level are represented and then this knowledge is converted into implementations optimized for different target platforms and circumstances. The primary intellectual merit here is the automation of high quality numerical code generation.

The broader impact is that abstraction driven automation for library developers can significantly cut down the "software gap", delay between development of new architectures and new software that maximizes its use. The impact of this will be broad across all disciplines that use HPC.

- 2005-2010, $650,000, Subcontract, NSF EAR-0426271

The Computational Infrastructure for GeodynamicsCIG is fully described at http://www.geodynamics.org.

- 2005-2007, $210,000, Subcontract, DOE Reactor Core Modeling