# Submitted Papers

- Finite Element Integration with Quadrature on Accelerators, Matthew G. Knepley, Karl Rupp and Andy Terrel, 2014.
We present a novel, quadrature-based finite element integration method for low-order elements on GPUs, using a pattern we call

*thread transposition*to avoid reductions while vectorizing aggressively. On the NVIDIA GTX580, which has a nominal single precision peak flop rate of 1.5 TF/s and a memory bandwidth of 192 GB/s, we achieve more than 300 GF/s for element integration on first-order discretization of the Laplacian operator, which is 90% of our measured achievable bandwidth peak of 310 GF/s. We also present results for variable coefficient operators.

# Journal Papers

- Composing Scalable Nonlinear
Algebraic Solvers, Peter R. Brune, Matthew G. Knepley, Barry F. Smith, and Xuemin Tu, SIAM Review, to appear, 2015. [ANL]
Most efficient linear solvers use composable algorithmic components, with the most common model being the combination of a Krylov accelerator and one or more preconditioners. A similar set of concepts may be used for nonlinear algebraic systems, where nonlinear composition of different nonlinear solvers may significantly improve the time to solution. We describe the basic concepts of nonlinear composite combination and preconditioning and present a number of solvers applicable to nonlinear partial differential equations. We have developed a software framework in order to easily explore the possible combinations of solvers. We show that the performance gains from using composed solvers can be substantial compared with gains from standard Newton-Krylov methods.

- Analytical Nonlocal
Electrostatics Using Eigenfunction Expansions of Boundary-Integral Operators, Jaydeep P. Bardhan, Matthew G. Knepley, and
Peter R. Brune, Molecular Based Mathematical Biology,
**3**, 1, 2015. [MBMB] [arXiv]In this paper, we present an exact, infinite-series solution to Lorentz nonlocal continuum electrostatics for an arbitrary charge distribution in a spherical solute. Our approach relies on two key steps: (1) re-formulating the PDE problem using boundary-integral equations, and (2) diagonalizing the boundary-integral operators using the fact that their eigenfunctions are the surface spherical harmonics. To introduce this uncommon approach for calculations in separable geometries, we first re-derive Kirkwood's classic results for a protein surrounded concentrically by a pure-water ion-exclusion layer and then a dilute electrolyte (modeled with the linearized Poisson-Boltzmann equation). The eigenfunction-expansion approach provides a computationally efficient way to test some implications of nonlocal models; after estimating the reasonable range of the nonlocal length-scale parameter \(\lambda\), we conduct simple p\(K_\mathrm{a}\) calculations. Our results suggest that nonlocal solvent response may help to reduce the need for very high dielectric constants in calculating pH-dependent protein behavior, though more sophisticated nonlocal models are needed to resolve this question in full. An open-source MATLAB implementation of our approach is freely available online, and is included as supplemental information.

- Modeling Charge-Sign Asymmetric Solvation Free Energies With Nonlinear Boundary
Conditions, Jaydeep P. Bardhan and Matthew G. Knepley, Journal of Chemical Physics,
**141**, 131103, 2014. [JCP] [arXiv]We show that charge-sign-dependent asymmetric hydration can be modeled accurately using linear Poisson theory but replacing the standard electric-displacement boundary condition with a simple

**nonlinear boundary condition**. Using a single multiplicative scaling factor to determine atomic radii from molecular dynamics Lennard-Jones parameters, the new model accurately reproduces MD free-energy calculations of hydration asymmetries for (i) monatomic ions, (ii) titratable amino acids in both their protonated and unprotonated states, and (iii) the Mobley ``bracelet'' and ``rod'' test problems. Remarkably, the model also justifies the use of linear response expressions for charging free energies. Our boundary-element method implementation demonstrates the ease with which other continuum-electrostatic solvers can be extended to include asymmetry. - Run-time extensibility and
librarization of simulation software, Jed Brown, Matthew G. Knepley and Barry F. Smith, IEEE Computing in
Science and Engineering, 10.1109/MCSE.2014.95, 2014. [arXiv] [IEEE]
Build-time configuration and environment assumptions are hampering progress and usability in scientific software. That which would be utterly unacceptable in non-scientific software somehow passes for the norm in scientific packages. The community needs reusable software packages that are easy use and flexible enough to accommodate next-generation simulation and analysis demands.

- A Domain Decomposition
Approach to Implementing Fault Slip in Finite-Element Models of Quasi-static and Dynamic Crustal Deformation,
Brad Aagaard, Matthew G. Knepley, and Charles Williams, Journal of Geophysical Research,
**118**(6), 3059-3079, 2013. [JGR] [arXiv]We employ a domain decomposition approach with Lagrange multipliers to implement fault slip in a finite-element code, PyLith, for use in both quasi-static and dynamic crustal deformation applications. This integrated approach to solving both quasi-static and dynamic simulations leverages common finite-element data structures and implementations of various boundary conditions, discretization schemes, and bulk and fault rheologies. We have developed a custom preconditioner for the Lagrange multiplier portion of the system of equations that provides excellent scalability with problem size compared to conventional additive Schwarz methods. We demonstrate application of this approach using benchmarks for both quasi-static viscoelastic deformation and dynamic spontaneous rupture propagation that verify the numerical implementation in PyLith.

This paper concerns the fault friction implementation. Each aspect of the model, algorithms, and code is examined to show design decisions that were made to support the fault model. We show good algorithmic scaling, although not constant yet, and good scalability outside of the ML AMG setup phase.

- Analysis of fast boundary-integral approximations for modeling electrostatic contributions of molecular binding, Amy Kreienkamp, Lucy Y. Liu, Mona S. Minkara, Matthew G. Knepley, Jaydeep P. Bardhan, and Mala L. Radhakrishnan, Molecular-Based Mathematical Biology,
**1**, pp.124-150, 2013. [Mol. Based Math. Bio.]We analyze and suggest improvements to a recently developed approximate continuum-electrostatic model for proteins. The model, called BIBEE/I (boundary-integral based electrostatics estimation with interpolation), was able to estimate electrostatic solvation free energies to within a mean unsigned error of 4% on a test set of more than 600 proteins---a significant improvement over previous BIBEE models. In this work, we tested the BIBEE/I model for its capability to predict residue-by-residue interactions in protein-protein binding, using the widely studied model system of trypsin and bovine pancreatic trypsin inhibitor (BPTI). Finding that the BIBEE/I model performs surprisingly less well in this task than simpler BIBEE models, we seek to explain this behavior in terms of the models' differing spectral approximations of the exact boundary-integral operator. Calculations of analytically solvable systems (spheres and tri-axial ellipsoids) suggest two possibilities for improvement. The first is a modified BIBEE/I approach that captures the asymptotic eigenvalue limit correctly, and the second involves the dipole and quadrupole modes for ellipsoidal approximations of protein geometries. Our analysis suggests that fast, rigorous approximate models derived from reduced-basis approximation of boundary-integral equations might reach unprecedented accuracy, if the dipole and quadrupole modes can be captured quickly for general shapes.

- Implementation of a multigrid
solver on GPU for Stokes equations with strongly variable viscosity based on Matlab and CUDA, Liang Zheng,
Taras Gerya, Matthew G. Knepley, David A. Yuen, Huai Zhang, and Yaolin Shi, IJHPCA, 2013. [IJHPCA]
The Stokes equations are frequently used to simulate geodynamic processes, including mantle convection, lithospheric dynamics, lava flow, and among others. In this study, the multigrid (MG) method is adopted to solve Stokes and continuity equations with strongly temperature-dependent viscosity. By taking advantage of the enhanced computing power of graphics processing units (GPUs) and the new version of Matlab 2010b, MG codes are optimized through Compute Unified Device Architecture (CUDA). Herein, we illustrate the approach that implements a GPU-based MG solver with Red-Black Gauss-Seidel (RBGS) smoother for the three-dimensional Stokes and continuity equations, in a hope that it helps solve the synthetic incompressible sinking problem in a cubic domain with strongly variable viscosity, and finally analyze our solver's efficiency on a GPU.

- Unstructured Geometric Multigrid in Two
and Three Dimensions on Complex and Graded Meshes, Peter R. Brune, Matthew G. Knepley, and L. Ridgway Scott, SIAM Journal on Scientific Computing,
**35**(1), A173-A191, 2013. [SIAM] [arXiv]The use of multigrid and related preconditioners with the finite element method is often limited by the difficulty of applying the algorithm effectively to a problem, especially when the domain has a complex shape or the mesh has adaptive refinement. We introduce a simplification of a general topologically-motivated mesh coarsening algorithm for use in creating hierarchies of meshes for geometric unstructured multigrid methods. The connections between the guarantees of this technique and the quality criteria necessary for multigrid methods for non-quasi-uniform problems are noted. The implementation details, in particular those related to coarsening, remeshing, and interpolation, are discussed. Computational tests on pathological test cases from adaptive finite element methods show the performance of the technique.

- Computational science and
re-discovery: open-source implementations of ellipsoidal harmonics for problems in potential theory, Jaydeep
P. Bardhan and Matthew G. Knepley, Computational Science & Discovery,
**5**, 014006, 2012. [CSD] [arXiv]We present two open-source (BSD) implementations of ellipsoidal harmonic expansions for solving problems of potential theory using separation of variables. Ellipsoidal harmonics are used surprisingly infrequently, considering their substantial value for problems ranging in scale from molecules to the entire solar system. In this article, we suggest two possible reasons for the paucity relative to spherical harmonics. The first is essentially historical---ellipsoidal harmonics developed during the late 19th century and early 20th, when it was found that only the lowest-order harmonics are expressible in closed form. Each higher-order term requires the solution of an eigenvalue problem, and tedious manual computation seems to have discouraged applications and theoretical studies. The second explanation is practical: even with modern computers and accurate eigenvalue algorithms, expansions in ellipsoidal harmonics are significantly more challenging to compute than those in Cartesian or spherical coordinates. The present implementations reduce the "barrier to entry" by providing an easy and free way for the community to begin using ellipsoidal harmonics in actual research. We demonstrate our implementation using the specific and physiologically crucial problem of how charged proteins interact with their environment, and ask: what other analytical tools await re-discovery in an era of inexpensive computation?

- Finite Element
Integration on GPUs, Matthew G. Knepley and Andy R. Terrel, Transactions on Mathematical Software,
39 (2), 2012. [ACM] [arXiv]We present a novel finite element integration method for low order elements on GPUs. We achieve more than 100GF for element integration on first order discretizations of both the Laplacian and Elasticity operators on an NVIDIA GTX285, which has a nominal single precision peak flop rate of 1 TF/s and bandwidth of 159 GB/s, corresponding to a bandwidth limited peak of 40 GF/s.

- Accessible, Extensible,
Scalable Tools for Wave Propagation Problems, David I. Ketcheson, Kyle T. Mandli, Aron Ahmadia, Amal
Alghamdi, Manuel Quezada, Matteo Parsani, Matthew G. Knepley, and Matthew Emmett, SIAM Journal on Scientific Computing,
**34**(4), pp.C210-C231, 2012. [arXiv] [SISC]Development of scientific software involves tradeoffs between ease of use, generality, and performance. We describe the design of a general hyperbolic PDE solver that can be operated with the convenience of MATLAB yet achieves efficiency near that of hand-coded Fortran and scales to the largest supercomputers. This is achieved by using Python for most of the code while employing automatically-wrapped Fortran kernels for computationally intensive routines, and using Python bindings to interface with a parallel computing library and other numerical packages. The software described here is PyClaw, a Python-based structured grid solver for general systems of hyperbolic PDEs. PyClaw provides a powerful and intuitive interface to the algorithms of the existing Fortran codes Clawpack and SharpClaw, simplifying code development and use while providing massive parallelism and scalable solvers via the PETSc library. The package is further augmented by use of PyWENO for generation of efficient high-order weighted essentially non-oscillatory reconstruction code. The simplicity, capability, and performance of this approach are demonstrated through application to example problems in shallow water flow, compressible flow and elasticity.

- Mathematical Analysis of the
BIBEE Approximation for Molecular Solvation: Exact Results for Spherical Inclusions, Jaydeep P. Bardhan and
Matthew G. Knepley, Journal of Chemical Physics,
**135**(12), pp.124107-124117, 2011. [arXiv] [JCP]We analyze the mathematically rigorous BIBEE (boundary-integral based electrostatics estimation) approximation of the mixed-dielectric continuum model of molecular electrostatics, using the analytically solvable case of a spherical solute containing an arbitrary charge distribution. Our analysis, which builds on Kirkwood's solution using spherical harmonics, clarifies important aspects of the approximation and its relationship to Generalized Born models. First, our results suggest a new perspective for analyzing fast electrostatic models: the separation of variables between material properties (the dielectric constants) and geometry (the solute dielectric boundary and charge distribution). Second, we find that the eigenfunctions of the reaction-potential operator are exactly preserved in the BIBEE model for the sphere, which supports the use of this approximation for analyzing charge-charge interactions in molecular binding. Third, a comparison of BIBEE to the recent GB\(\epsilon\) theory suggests a modified BIBEE model capable of predicting electrostatic solvation free energies to within 4% of a full numerical Poisson calculation. This modified model leads to a projection-framework understanding of BIBEE and suggests opportunities for future improvements.

- Optimal, scalable forward models for computing gravity anomalies, Dave A. May and Matthew G. Knepley,
Geophysical Journal International,
**187**(1), pp.161-177, 2011. [arXiv] [GJI]We describe three approaches for computing a gravity signal from a density anomaly. The first approach consists of the classical "summation" technique, whilst the remaining two methods solve the Poisson problem for the gravitational potential using either a Finite Element (FE) discretization employing a multilevel preconditioner, or a Green's function evaluated with the Fast Multipole Method (FMM). The methods utilizing the PDE formulation described here differ from previously published approaches used in gravity modeling in that they are optimal, implying that both the memory and computational time required scale linearly with respect to the number of unknowns in the potential field. Additionally, all of the implementations presented here are developed such that the computations can be performed in a massively parallel, distributed memory computing environment. Through numerical experiments, we compare the methods on the basis of their discretization error, CPU time and parallel scalability. We demonstrate the parallel scalability of all these techniques by running forward models with up to 108 voxels on 1000's of cores.

- Biomolecular Electrostatics
Simulation by an FMM-based BEM on 512 GPUs, Rio Yokota, Jaydeep P. Bardhan, Matthew G. Knepley, L. A. Barba,
and Tsuyoshi Hamada, Computer Physics Communications,
**182**(6), pp.1272-1283, 2011. [arXiv] [CPC]We present teraflop-scale calculations of biomolecular electrostatics enabled by the combination of algorithmic and hardware acceleration. The algorithmic acceleration is achieved with the fast multipole method (FMM) in conjunction with a boundary element method (BEM) formulation of the continuum electrostatic model, as well as the BIBEE approximation to BEM. The hardware acceleration is achieved through graphics processors, GPUs. We demonstrate the power of our algorithms and software for the calculation of the electrostatic interactions between biological molecules in solution. The applications demonstrated include the electrostatics of protein--drug binding and several multi-million atom systems consisting of hundreds to thousands of copies of lysozyme molecules. The parallel scalability of the software was studied in a cluster at the Nagasaki Advanced Computing Center, using 128 nodes, each with 4 GPUs. Delicate tuning has resulted in strong scaling with parallel efficiency of 0.8 for 256 and 0.5 for 512 GPUs. The largest application run, with over 20 million atoms and one billion unknowns, required only one minute on 512 GPUs. We are currently adapting our BEM software to solve the linearized Poisson-Boltzmann equation for dilute ionic solutions, and it is also designed to be flexible enough to be extended for a variety of integral equation problems, ranging from Poisson problems to Helmholtz problems in electromagnetics and acoustics to high Reynolds number flow.

- PetFMM: A dynamically
load-balancing parallel fast multipole library, Felipe A. Cruz, Matthew G. Knepley, and L.A. Barba,
International Journal of Numerical Methods in Engineering,
**85**(4), pp. 403-428, 2010. [arXiv] [IJNME]Fast algorithms for the computation of \(N\)-body problems can be broadly classified into mesh-based interpolation methods, and hierarchical or multiresolution methods. To this last class belongs the well-known fast multipole method (FMM), which offers \(\mathcal{O}(N)\) complexity. This paper presents an extensible parallel library for \(N\)-body interactions utilizing the FMM algorithm, built on the framework of PETSc. A prominent feature of this library is that it is designed to be extensible, with a view to unifying efforts involving many algorithms based on the same principles as the FMM and enabling easy development of scientific application codes. The paper also details an exhaustive model for the computation of tree-based \(N\)-body algorithms in parallel, including both work estimates and communications estimates. With this model, we are able to implement a method to provide automatic, a priori load balancing of the parallel execution, achieving optimal distribution of the computational work among processors and minimal inter-processor communications. Using a client application that performs the calculation of velocity induced by N vortex particles, ample verification and testing of the library was performed. Strong scaling results are presented with close to a million particles in up to 64 processors, including both speedup and parallel efficiency. The library is currently able to achieve over 85% parallel efficiency for 64 processors. The software library is open source under the PETSc license; this guarantees the maximum impact to the scientific community and encourages peer-based collaboration for the extensions and applications.

- PetRBF: A parallel \(\mathcal{O}(N)\) algorithm for
radial basis function interpolation, Rio Yokota, L.A. Barba, and Matthew G. Knepley, Computer Methods in
Applied Mechanics and Engineering,
**199**(25-28), pp. 1793-1804, 2010. [arXiv] [CMAME]We have developed a parallel algorithm for radial basis function (RBF) interpolation that exhibits \(\mathcal{O}(N)\) complexity,requires \(\mathcal{O}(N)\) storage, and scales excellently up to a thousand processes. The algorithm uses a GMRES iterative solver with a restricted additive Schwarz method (RASM) as a preconditioner and a fast matrix-vector algorithm. Previous fast RBF methods, achieving at most \(\mathcal{O}(N \log N)\) complexity, were developed using multiquadric and polyharmonic basis functions. In contrast, the present method uses Gaussians with a small variance (a common choice in particle methods for fluid simulation, our main target application). The fast decay of the Gaussian basis function allows rapid convergence of the iterative solver even when the subdomains in the RASM are very small. The present method was implemented in parallel using the PETSc library (developer version). Numerical experiments demonstrate its capability in problems of RBF interpolation with more than 50 million data points, timing at 106 seconds (19 iterations for an error tolerance of \(10^{-15}\) on 1024 processors of a Blue Gene/L (700 MHz PowerPC processors). The parallel code is freely available in the open-source model.

- An Efficient Algorithm for Classical
Density Functional Theory in Three Dimensions: Ionic Solutions, Matthew G. Knepley, Dmitry A. Karpeev, Seth
Davidovits, Robert S. Eisenberg, and Dirk Gillespie, Journal of Chemical Physics,
**132**(12), pp.124101-124111, 2010. [arXiv] [JCP]Classical density functional theory (DFT) of fluids is a valuable tool to analyze inhomogeneous fluids. However, few numerical solution algorithms for three-dimensional systems exist. Here we present an efficient numerical scheme for fluids of charged, hard spheres that uses \(\mathcal{O}(N \log N)\) operations and \(\mathcal{O}(N)\) memory, where N is the number of grid points. This system-size scaling is significant because of the very large \(N\) required for three-dimensional systems. The algorithm uses fast Fourier transforms (FFT) to evaluate the convolutions of the DFT Euler-Lagrange equations and Picard (iterative substitution) iteration with line search to solve the equations. The pros and cons of this FFT/Picard technique are compared to those of alternative solution methods that use real-space integration of the convolutions instead of FFTs and Newton iteration instead of Picard. For the hard-sphere DFT we use Fundamental Measure Theory. For the electrostatic DFT we present two algorithms. One is for the "bulk-fluid" functional of Rosenfeld [Y. Rosenfeld, 1993] that uses \(\mathcal{O}(N \log N)\) operations. The other is for the "reference fluid" (RFD) functional [D. Gillespie et al., 2002]. This functional is significantly more accurate than the bulk-fluid functional, but the RFD algorithm requires \(\mathcal{O}(N^2)\) operations.

- Reproducible Research,
Stodden et.al., Computing in Science and Engineering,
**12**(5), pp. 8-13, 2010. [CS&E]Roundtable participants identified ways of making computational research details readily available, which is a crucial step in addressing the current credibility crisis.

- Mesh Algorithms for PDE with
Sieve I: Mesh Distribution, Matthew G. Knepley and Dmitry A. Karpeev, Scientific Programming,
**17**(3), pp.215-230, 2009. [arXiv] [SP]We have developed a new programming framework, called Sieve, to support parallel numerical PDE algorithms operating over distributed meshes. We have also developed a reference implementation of Sieve in C++ as a library of generic algorithms operating on distributed containers conforming to the Sieve interface. Sieve makes instances of the incidence relation, or

*arrows*, the conceptual first-class objects represented in the containers. Further, generic algorithms acting on this arrow container are systematically used to provide natural geometric operations on the topology and also, through duality, on the data. Finally, coverings and duality are used to encode not only individual meshes, but all types of hierarchies underlying PDE data structures, including multigrid and mesh partitions. In order to demonstrate the usefulness of the framework, we show how the mesh partition data can be represented and manipulated using the same fundamental mechanisms used to represent meshes. We present the complete description of an algorithm to encode a mesh partition and then distribute a mesh, which is independent of the mesh dimension, element shape, or embedding. Moreover, data associated with the mesh can be similarly distributed with exactly the same algorithm. The use of a high level of abstraction within the Sieve leads to several benefits in terms of code reuse, simplicity, and extensibility. We discuss these benefits and compare our approach to other existing mesh libraries. - Bounding the Electrostatic
Free Energies Associated with Linear Continuum Models of Molecular Solvation, Jaydeep P. Bardhan, Matthew
G. Knepley, and Mihai Anitescu, Journal of Chemical Physics,
**130**(10), 2008. Selected for the March 15, 2009 issue of Virtual Journal of Biological Physics Research. [JCP]The importance of electrostatic interactions in molecular biology has driven extensive research toward the development of accurate and efficient theoretical and computational models. Linear continuum electrostatic theory has been surprisingly successful, but the computational costs associated with solving the associated partial differential equations(PDEs) preclude the theory’s use in most dynamical simulations. Modern generalized-Born models for electrostatics can reproduce PDE-based calculations to within a few percent and are extremely computationally efficient but do not always faithfully reproduce interactions between chemical groups. Recent work has shown that a boundary-integral-equation formulation of the PDE problem leads naturally to a new approach called boundary-integral-based electrostatics estimation (BIBEE) to approximate electrostatic interactions. In the present paper, we prove that the BIBEE method can be used to rigorously bound the actual continuum-theory electrostaticfree energy. The bounds are validated using a set of more than 600 proteins. Detailed numerical results are presented for structures of the peptide met-enkephalin taken from a molecular-dynamics simulation. These bounds, in combination with our demonstration that the BIBEE methods accurately reproduce pairwise interactions, suggest a new approach toward building a highly accurate yet computationally tractable electrostatic model.

- Automated FEM Discretizations for the Stokes Equation, Andy R. Terrel, L. Ridgway Scott, Matthew
G. Knepley, and Robert C. Kirby, BIT,
**48**(2), 2008. [pdf] [BIT]Current FEM software projects have made significant advances in various automated modeling techniques. We present some of the mathematical abstractions employed by these projects that allow a user to switch between finite elements, linear solvers, mesh refinement and geometry, and weak forms with very few modifications to the code. To evaluate the modularity provided by one of these abstractions, namely switching finite elements, we provide a numerical study based upon the many different discretizations of the Stokes equations.

- Numerical
simulation of geodynamic processes with the Portable Extensible Toolkit for Scientific Computation, Richard
F. Katz, Matthew G. Knepley, Barry F. Smith, Marc Spiegelman, and Ethan Coon, Phys. Earth Planet. In.,
**163**, pp.52-68, 2007. [PEPI]Geodynamics simulations are characterized by rheological nonlinearity, localization, three-dimensional effects, and separate but interacting length scales. These features represent a challenge for computational science. We discuss how a leading software framework for advanced scientific computing (the Portable Extensible Toolkit for Scientific Computation, PETSc) can facilitate the development of geodynamics simulations. To illustrate our use of PETSc, we describe simulations of (i) steady-state, non-Newtonian passive flow and thermal structure beneath a mid-ocean ridge, (ii) magmatic solitary waves in the mantle, and (iii) the formation of localized bands of high porosity in a two-phase medium being deformed under simple shear. We highlight two supplementary features of PETSc, structured storage of application parameters and self-documenting output, that are especially useful for geodynamics simulations.

- Optimizing the
Evaluation of Finite Element Matrices, Robert C. Kirby, Matthew G. Knepley, and L. Ridgway Scott, SIAM Journal
on Scientific Computing,
**27**(3), pp.741-758, 2005. [arXiv] [SISC]Assembling stiffness matrices represents a significant cost in many finite element computations. We address the question of optimizing the evaluation of these matrices. By finding redundant computations, we are able to significantly reduce the cost of building local stiffness matrices for the Laplace operator and for the trilinear form for Navier-Stokes. For the Laplace operator in two space dimensions, we have developed a heuristic graph algorithm that searches for such redundancies and generates code for computing the local stiffness matrices. Up to cubics, we are able to build the stiffness matrix on any triangle in less than one multiply-add pair per entry. Up to sixth degree, we can do it in less than about two. Preliminary low-degree results for Poisson and Navier-Stokes operators in three dimensions are also promising.

- Search for disoriented chiral
condensate at the Fermilab Tevatron, Minimax Collaboration, Physical Review D,
**61**(3), 2000. [arXiv] [PRD]We present results from MiniMax (Fermilab T-864), a small test/experiment at the Tevatron designed to search for the production of disoriented chiral condensate (DCC) in \(p-\bar p\) collisions at \(\sqrt{s} = 1.8\) TeV in the forward direction, \(\sim 3.4 < \eta < \sim 4.2\). Data, consisting of \(1.3 \times 10^6\) events, are analyzed using the robust observables developed in an earlier paper. The results are consistent with generic, binomial-distribution partition of pions into charged and neutral species. Limits on DCC production in various models are presented.

- Analysis of Charged Particle/Photon
Correlations in Hadronic Multiparticle Production, Minimax Collaboration, Physical Review D,
**55**(9), 1997. [PRD]In order to analyze data on joint charged-particle–photon distributions from an experimental search (T-864, MiniMax) for disoriented chiral condensate (DCC) at the Fermilab Tevatron collider, we have identified robust observables, ratios of normalized bivariate factorial moments, with many desirable properties. These include insensitivity to many efficiency corrections and the details of the modeling of the primary pion production, and sensitivity to the production of DCC, as opposed to the generic, binomial-distribution partition of pions into charged and neutral species. The relevant formalism is developed and tested in Monte Carlo simulations of the MiniMax experimental conditions.

- MiniMax: What has been learned thus far, Mary E. Convery, W. L. Davis, Ken W. Del Signore, Tom L. Jenkins, Erik Kangas, Matthew G. Knepley, Ken L. Kowalski, Cyrus C. Taylor, C. H. Wang, S. H. Oh, W. D. Walker, P. L. Colestock, B. Hanna, M. Martens, J. Streets, R. C. Ball, H. R. Gustafson, L. W. Jones, M. J. Longo, J. D. Bjorken, N. Morgan, and C. A. Pruneau, Nuovo Cimento,
**19**(1), pp. 1045-1049, 1996. [Nuovo]A small experiment, MiniMax, has been set up in the C0 intersection region of the Fermilab Tevatron to seek evidence for disoriented chiral condensates and to study other forward physics phenomena. The experiment consists of a proportional wire chamber telescope accompanied by scintillation (trigger) counters, a lead converter, and followed by an electromagnetic calorimeter. The solid angle accepted is a cone centered at pseudorapidity (\(\eta\)) of 4.1 and of radius (in \(\eta-\phi\) space) of about 0.6. Over 2.5 million events thus far have demonstrated the successful operation of the apparatus, however to date the analysis has not progressed sufficiently to permit any conclusions concerning disoriented chiral condensates.

- Closed strings with low
harmonics and kinks, Robert W. Brown, Mary Convery, Scott Hotes, Matthew G. Knepley, and Labros Petropolous,
Physical Review D,
**48**(6), 1993. [arXiv] [PRD]Low-harmonic formulas for closed relativistic strings are given. General parametrizations are presented for the addition of second- and third-harmonic waves to the fundamental wave. The method of determination of the parametrizations is based upon a product representation found for the finite Fourier series of string motion in which the constraints are automatically satisfied. The construction of strings with kinks is discussed, including examples. A procedure is laid out for the representation of kinks that arise from self-intersection, and subsequent intercommutation, for harmonically parametrized cosmic strings.

# Peer-Reviewed Conference Papers

- Preliminary Implementation of PETSc
Using GPUs, Victor Minden, Barry F. Smith and Matthew G. Knepley, GPU Solutions to Multi-scale Problems in Science and
Engineering, Lecture Notes in Earth System Sciences, pp. 131-140, 2013.
[Springer]
[ANL Preprint]
PETSc is a scalable solver library for the solution of algebraic equations arising from the discretization of partial differential equations and related problems. PETSc is organized as a class library with classes for vectors, matrices, Krylov methods, preconditioners, nonlinear solvers, and differential equation integrators. A new subclass of the vector class has been introduced that performs its operations on NVIDIA GPU processors. In addition, a new sparse matrix subclass that performs matrix-vector products on the GPU was introduced. The Krylov methods, nonlinear solvers, and integrators in PETSc run unchanged in parallel using these new subclasses. These can be used transparently from existing PETSc application codes in C, C++, Fortran, or Python. The implementation is done with the Thrust and Cusp C++ packages from NVIDIA.

- Accurate Evaluation of Local Averages
on GPGPUs, Dmitry A. Karpeev, Matthew G. Knepley, and Peter R. Brune, GPU Solutions to Multi-scale Problems in Science and
Engineering, Lecture Notes in Earth System Sciences, pp. 487-501, 2013.
[Springer]
We discuss fast and accurate evaluation of local averages on GPGPU. This work was motivated by the need to calculate reference fluid densities in the classical density functional theory (DFT) of electrolytes proposed in Gillespie et al. (2002). In Knepley et al. (2010) we developed efficient algorithms for the minimization of three-dimensional DFT models of biological ion channel permeation and selectivity. One of the essential bottlenecks of 3D DFT models is the evaluation of local screening averages of the chemical species’ densities. But the problem has wider applicability and fast evaluation of averages over the local spherical screening neighborhood of every grid point are typically inaccurate due to the use of collocation approximations of densities on Cartesian grids. Accurate evaluations based spectral quadrature were proposed and used in Knepley et al. (2010), but they are significantly more computationally expensive because of their nonlocal nature in the Fourier space. Here we show that the passage to the Fourier space can, in fact, make the averaging calculation much more amenable to efficient implementation on GPGPU architectures. This allows us to take advantage of both improved accuracy and hardware acceleration to arrive at a fast and accurate screening calculations.

- GPU Implementation of Multigrid
Solver for Stokes equation with strongly variable viscosity, Larry Zheng, Taras V. Gerya, Matthew G. Knepley, David
A.Yuen, Huai Zhang, and Yaolin Shi, GPU Solutions to Multi-scale Problems in Science and Engineering, Lecture
Notes in Earth System Sciences, pp. 321-333, 2013.
[Springer]
[MSI]
Solving Stokes flow problem is commonplace for numerical modeling of geodynamic processes, because the lithosphere and mantle can be always regarded as incompressible flow for long geological time scales. For Stokes flow, the Reynold Number is effectively zero so that one can ignore the advective transport of momentum equation thus resulting in the slowly creeping flow. Because of the ill-conditioned matrix due to the saddle points problem that coupling mass and momentum partial differential equations together, it is still extremely to efficiently solve this elliptic PDE system, especially with the strongly variable coefficients due to rheological structure of the earth. However, since NVIDIA issued the CUDA programming framework in 2007, scientists can use commodity CPU-GPU system to do such geodynamic simulation efficiently with the advantage of CPU and GPU respectively. In this paper, we try to implement a GPU solver for Stokes Equations with variable viscosity based on CUDA using geometric multigrid methods on the staggered grids. For 2D version, we used a mixture of Jacobi and Gauss-Seidel iteration with conservative finite difference as the smoother. For 3D version, we called the GPU smoother which is rewritten with the Red-Black Gauss-Seidel updating method to avoid the problem of disordered threads with Matlab 2010b.

- Why scientists and engineers need gpus
today, Matthew G. Knepley and David A. Yuen, GPU Solutions to Multi-scale Problems in Science and
Engineering, Lecture Notes in Earth System Sciences, pp. 3-11, 2013.
[Springer]
Recently, a paradigm shift in computer architecture has offered computational science the prospect of a vast increase in capability at relatively little cost. The tremendous computational power of graphics processors (GPU) provides a great opportunity for those willing to rethink algorithms and rewrite existing simulation codes. In this introduction, we give a brief survey of GPU computing, and its potential capabilities, intended for the general scientific and engineering audience. We will also review some challenges facing the users in adapting the large toolbox of scientific computing to these changes in computer architecture, and what the community can expect in the near future.

- Composable linear
solvers for multiphysics, Jed Brown, Matthew G. Knepley, Dave A. May, Lois C. McInnes, Barry F. Smith,
11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), 2012. Also Preprint
ANL/MCS-P2017-0112, Argonne National Laboratory.
[ANL]
The Portable, Extensible Toolkit for Scientific computing (PETSc), which focuses on the scalable solution of problems based on partial differential equations, now incorporates new components that allow full composability of solvers for multiphysics and multilevel methods. Through strong encapsulation, we achieve arbitrary, dynamic com- position of hierarchical methods for coupled problems and allow customization of all components in composite solvers. For example, we support block decompositions with nested multigrid as well as multigrid on the fully coupled system with block-decomposed smoothers. This paper provides an overview of PETSc’s new multiphysics capabilities, which have been used in parallel applications including lithosphere dynamics, sub- duction and mantle convection, ice sheet dynamics, subsurface reactive flow, fusion, mesoscale materials modeling, and power networks.

- PetClaw: A scalable parallel
nonlinear wave propagation solver for Python, Amal Alghamdi, Aron Ahmadia, David I. Ketcheson, Matthew
G. Knepley, Kyle T. Mandli, and Lisandro Dalcin, In Proceedings of SpringSim 2011. ACM, 2011.
- Numerical simulation of reservoir
stimulation - a variational approach, Blaise Bourdin, Matthew G. Knepley, and C. Maurini, In Proceedings
of the 37th Stanford Geothermal Workshop, Stanford, CA, 2010.
[SGW]
We present recent results on the modeling and nu- merical simulation of reservoir stimulation in Hot Dry Rocks. Our approach is based on a mechanistically faithful yet mathematically sound model, generalizing Griffith's idea of competition between bulk and surface energies. At each time increment, the fracture and displacement configuration of a reservoir is sought as the minimizer of a global energy. In doing so, the variational approach allows full crack path identification, interaction between multiple cracks, crack initiation and branching in two and three space dimensions. The numerical approach is based on building a regularized energy where cracks are represented by a smooth function, similar in spirit to a phase-field approach. In this paper, we focus on

*thermal stimulation*. - Secondary thermal cracks in EGS:
a variational approach, Blaise Bourdin, Matthew G. Knepley, and C. Maurini, In Proceedings of the 34th
Annual Meeting of the Geothermal Resources Council, Sacramento, CA, 2010.
[LSU]
We present a mechanistically faithful and mathematically sound approach to the numerical simulation of secondary thermal cracks propagation in EGS based on the variational approach to fracture [1-4]. While remaining compatible with classical theories, this approach provides a unified framework to crack nucleation, full path identification, including interaction between multiple cracks, branching or kinking in 2D and 3D. We present large scale numerical experiments, and compare our results with the literature.

- Fast multipole method for particle
interactions: an open source parallel library component, Felipe A. Cruz, Lorena A. Barba, and Matthew
G. Knepley, In Tromeur-Dervout et. al., editor, Proceedings of ParCFD2008. Elsevier, 2008.
[Springer]
The fast multipole method is used in many scientific computing applications such as astrophysics, fluid dynamics, electrostatics and others. It is capable of greatly accelerating calculations involving pair-wise interactions, but one impediment to a more widespread use is the algorithmic complexity and programming effort required to implement this method. We are developing an open source, parallel implementation of the fast multipole method, to be made available as a component of the PETSc library. In this process, we also contribute to the understanding of how the accuracy of the multipole approximation depends on the parameter choices available to the user. Moreover, the proposed parallelization strategy provides optimizations for automatic data decomposition and load balancing.

- Modeling of multiple earthquake cycles in southern
california using the SCEC community fault model, Charles A. Williams, Carl Gable, Bradford H. Hager,
Brendan Meade, Brad Aagaard, and Matthew G. Knepley, In Proceedings of Geosciences '08, Wellington, NZ,
November 2008.
We are using the CIG finite element code PyLith [Aagaard et al., 2008] to investigate the effects of various rheologies on the predicted deformation field for models of multiple earthquake cycles in southern California. Our present models include 55 of the faults defined by the SCEC Community Fault Model (CFM). We use the LaGriT meshing package [meshing.lanl.gov] to mesh the complex geometry provided by the CFM, and we then run the models through several thousand years of simulated earthquakes. We use rotation poles determined from the block model of Meade and Hager [2005] to provide coseismic slip distributions and velocity boundary conditions for the external boundaries of the mesh.

In our initial models, we investigate the effects of variations in elastic properties by using the properties defined by the SCEC Community Velocity Model (CVM-H), and we also consider the effects of viscoelastic behavior using both a standard Maxwell model and a generalized Maxwell model (two or more Maxwell models in parallel). All of these effects significantly alter the predicted deformation field compared to a homogeneous elastic model (see Figure). Our present models use regular recurrence times for each fault segment. We plan to use this same method to "spin up" our models, and then to include known earthquake sequences. This will allow us to determine which set of properties yields results most consistent with geodetic observations.

- Multilevel preconditioning for parallel CFD,
Matthew G. Knepley, Vivek Sarin, and Ahmed H. Sameh, In International Conference On Preconditioning Techniques
For Large Sparse Matrix Problems In Industrial Applications, Minneapolis, MN, June 1999.
I really cannot remember what is in this paper.

- Design of large scale parallel simulations,
Matthew Knepley, Ahmed H. Sameh, and Vivek Sarin, In Proceedings of Parallel CFD '99. Elsevier, 1999.
[UMN]
We present an overview of the design of software packages called {\em Particle Movers} that have been developed to simulate the motion of particles in two and three dimensional domains. These simulations require the solution of nonlinear Navier−Stokes equations for fluids coupled with Newton’s equations for particle dynamics. Furthermore, realistic simulations are extremely computationally intensive, and are feasible only with algorithms that can exploit parallelism effectively. We describe the computational structure of the simulation as well as the data objects required in these packages. We present a brief description of a particle mover code in this framework, and concentrating on the following features: design modularity, portability, extensibility, and parallelism. Simulations on the SGI Origin2000 demonstrate very good speedup on a large number of processors.

- Algorithm development for
large scale computing, Matthew Knepley and Vivek Sarin, In Proceedings of the SIAM Workshop on Object
Oriented Methods for Inter-operable Scientific and Engineering Computing. SIAM, 1999.
[Google Books]
We present a case study examining the creation of an efficient preconditioner for solution of incompressible flow problems. We trace algorithm development from theoretical considerations, through initial implementation and testing, to parallel performance for large-scale systems. We demonstrate that early coupling of the development process with existing code can reduce debugging time and ease the transition from toy problems to real applications.

- Parallel building blocks
for finite element simulations: Application to solid-liquid mixture flows, Matthew Knepley and Denis
Vanderstraeten, In Proceedings of Parallel CFD '97, pages 281-287. Elsevier, 1998.
This article is lost in the mists of time.

- Parallel simulation of particulate flows,
Matthew Knepley, Ahmed H. Sameh, and Vivek Sarin, In Solving Irregularly Structured Problems in Parallel, volume
1457 of Lecture Notes in Computer Science, pages 226-237, 1998.
[Springer]
Simulation of particles in fluids requires the solution of nonlinear Navier-Stokes equations for fluids coupled with Newton's equations for particle dynamics, in which the most time consuming part is the solution of nonsymmetric and indefinite sparse linear systems. In this paper, we present a comprehensive algorithm for the simulation of particulate flows in two dimensional domains. A backward Euler method is used for time evolution, and a variant of Newton's method is used to solve the nonlinear systems. The linear systems are solved efficiently by a novel multilevel algorithm that generates discrete divergence-free space for the incompressible fluid. Unlike incomplete factorization preconditioners, our technique has the desirable properties of robust and effective preconditioning along with efficient implementation on parallel computers. We present experiments on the SGI Origin2000 that demonstrate the parallel performance of our algorithm, and discuss various aspects of the simulation package and the associated software design.

- Preliminary Results from a Search for Disoriented
Chiral Condensates at MiniMax, T.C.Brooks, M.E.Convery, W.L.Davis, K.W.Delsignore, T.L.Jenkins, E.Kangas,
M.G.Knepley, K.L.Kowalski, C.C.Taylor, S.H.Oh, W.D.Walker, P.L.Colestock, B.Hanna, M.Martens, J.Streets, R.Ball,
H.R.Gustafson, L.W.Jones, M.J.Longo, J.D.Bjorken, A.Abashian, N.Morgan, C.A.Pruneau, in proceedings of DPF 96.
We report on the progress of a search for disoriented chiral condensates (DCCs) in the far forward region at \(\sqrt{s}=1.8\) TeV. MiniMax is a small collider experiment situated at the C0 interaction region of the Tevatron and has been designed to measure the ratio of charged to neutral pions produced at \(\eta \sim 4.1\). The distribution of this ratio is expected to be very different for the generic binomial and DCC particle production models. We present a method used for the search using the factorial moments of particle generating functions, which can be extracted directly from the data. A preliminary comparison of robust ratios of measured values and those expected from the different models is also presented.

# Book Chapters

- Numerical Methods for Mantle
Convection, Treatise on Geophysics, Second Edition, Elsevier, 2015. [Amazon]
The governing equations for mantle convection are derived from conservation laws of mass, momentum and energy. The nonlinear nature of mantle rheology with its strong temperature- and stress-dependence and nonlinear coupling between flow velocity and temperature in the energy equation require that numerical methods be used to solve these governing equations. Understanding the dynamical effects of phase transitions (e.g., olivine to spinel phase transition) and multi-component flow also demands numerical methods. Numerical modeling of mantle convection has a rich history since the late 1960's [e.g. Torrance and Turcotte, 1971; McKenzie et al., 1974]. Great progress in computer architecture along with improved numerical techniques has helped advance the field of mantle convection into its own niche in geophysical fluid dynamics [e.g. Yuen et al., 2000].

In this chapter, we will present several commonly used numerical methods in studies of mantle convection with the primary aim of reaching out to students and new researchers in the field. First, we will present the governing equations, boundary and initial conditions for a given problem in mantle convection, and discuss the general efficient strategy to solve this problem numerically (section 2). We will then briefly discuss finite-differences, finite-volume and spectral methods in section 3. Since finite-elements have attained very high popularity in the user community, we will discuss finite-elements in greater details as the most basic numerical tool (section 4). For simplicity and clarity, we will focus our discussion on homogeneous, incompressible fluids with the Boussinesq approximation. However, we will also describe methods for more complicated and realistic mantle situations by including compressibility, non-Newtonian rheology, solid-state phase transitions, and thermo-chemical (i.e. multi-component) convection (section 5). Finally in section 6 we will discuss some new developments in computational sciences, such as software development and visualization, which may impact our future studies of mantle convection modeling.

- Programming Languages for Scientific
Computing, Encyclopedia of Applied and Computational Mathematics, Springer,
2012. [arXiv]
Scientific computation is a discipline that combines numerical analysis, physical understanding, algorithm development, and structured programming. Several yottacycles per year on the world's largest computers are spent simulating problems as diverse as weather prediction, the properties of material composites, the behavior of biomolecules in solution, and the quantum nature of chemical compounds. This article is intended to review specfic languages features and their use in computational science. We will review the strengths and weaknesses of different programming styles, with examples taken from widely used scientific codes.

- Finite elements for
incompressible fluids, A.R. Terrel, R.C. Kirby, M.G. Knepley, L.R. Scott, and G.N. Wells, in A. Logg,
K.A. Mardal, and G. N. Wells, editors. Automated solutions of differential equations by the finite element
mehod. Springer-Verlag. Lecture Notes in Computational Science and Engineering, Vol 84,
2011. [Springer]
The structure of the finite element method offers a user a range of choices. This is especially true for solving incompressible fluid problems, where theory points to a number of stable finite element formulations. Using automation tools, we implement and examine various stable formulations for the steady-state Stokes equations. It is demonstrated that the expressiveness of the FEniCS Project components allows solvers for the Stokes problem that use various element formulations to be created with ease.

- Discrete optimization of finite
element matrix evaluation, R.C. Kirby, M.G. Knepley, A. Logg, L.R. Scott, and A.R. Terrel, in A. Logg,
K.A. Mardal, and G. N. Wells, editors. Automated solutions of differential equations by the finite element
method. Springer-Verlag. Lecture Notes in Computational Science and Engineering, Vol 84,
2011. [Springer]
[Simula]
The tensor contraction structure for the computation of the element tensor AT obtained in Chapter 8,enables not only the construction of a compiler for variational forms,but an optimizing compiler.For typical variational forms,the reference tensor A0 has significant structure that allows the element tensor AT to be computed on an arbitrary cell T at a lower computational cost.

- Developing a geodynamics
simulator with PETSc, Matthew G. Knepley, Richard F. Katz, and Barry Smith, in Are Magnus Bruaset and Aslak
Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture
Notes in Computational Science and Engineering, pages 413-438. Springer Berlin Heidelberg,
2006. [Springer]
Most high-performance simulation codes are not written from scratch but begin as desktop experiments and are subsequently migrated to a scalable, parallel paradigm. This transition can be painful, however, because the restructuring required in conversion forces most authors to abandon their serial code and begin an entirely new parallel code. Starting a parallel code from scratch has many disadvantages, such as the loss of the original test suite and the introduction of new bugs. We present a disciplined, incremental approach to parallelization of existing scientific code using the PETSc framework. In addition to the parallelization, it allows the addition of more physics (in this case strong nonlinearities) without the user having to program anything beyond the new pieces of discretization code. Our approach permits users to easily develop and experiment on the desktop with the same code that scales efficiently to large clusters with excellent parallel performance. As a motivating example, we present work integrating PETSc into an existing plate tectonic subduction code.

# Other Conference Papers and Technical Reports

- PETSc users manual,
Satish Balay, Jed Brown, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley,
Lois Curfman McInnes, Barry F. Smith, and Hong Zhang, Technical Report ANL-95/11 - Revision 3.5, Argonne
National Laboratory, 2014. [ANL]
This is the constantly updated PETSc manual.

- Achieving High Performance with Unified Residual
Evaluation, Matthew G. Knepley, Jed Brown, Karl Rupp, and Barry Smith, 2013. [arXiv]
We examine residual evaluation, perhaps the most basic operation in numerical simulation. By raising the level of abstraction in this operation, we can eliminate specialized code, enable optimization, and greatly increase the extensibility of existing code.

- PETSc's software strategy
for the design space of composable extreme-scale solvers, Barry Smith, Lois Curfman McInnes, Emil
Constantinescu, Mark Adams, Satish Balay, Jed Brown, Matthew G. Knepley, and Hong Zhang, Preprint
ANL/MCS-P2059-0312, Argonne National Laboratory, 2012. DOE Exascale Research Conference, April 16-18, 2012,
Portland, OR. [ANL]
Emerging extreme-scale architectures present new opportunities for broader scope of simulations as well as new challenges in algorithms and software to exploit unprecedented levels of parallelism. Composable, hierarchical solver algorithms and carefully designed portable software are crucial to the success of extreme- scale simulations, because solver phases often dominate overall simulation time. This paper presents the PETSc design philogophy and recent advances in the library that enable application scientists to investi- gate the design space of composable linear, nonlinear, and timestepping solvers. In particular, separation of the control logic of the algorithms from the computational kernels of the solvers is crucial to allow in- jecting new hardware-specific computational kernels without having to rewrite the entire solver software library. Progress in this direction has already begun in PETSc, with new support for pthreads, OpenMP, and GPUs as a first step toward hardware-independent, high-level control logic for computational kernels. This multipronged software strategy for composable extreme-scale solvers will help exploit unprecedented extreme-scale computational power in two important ways: by facilitating the injection of newly developed scalable algorithms and data structures into fundamental components, and by providing the underlying foundation for a paradigm shift that raises the level of abstraction from simulation of complex systems to the design and uncertainty quantification of these systems.

- Composing scalable nonlinear
solvers, Peter Brune, Matthew Knepley, Barry Smith, and Xuemin Tu, Preprint ANL/MCS-P2010-0112, Argonne
National Laboratory, 2012. [ANL]
Most efficient linear solvers use composable algorithmic components, with the most common model being the combination of a Krylov accelerator and one or more preconditioners. A similar set of concepts may be used for nonlinear algebraic systems, where nonlinear composition of different nonlinear solvers may significantly improve the time to solution. We describe the basic concepts of nonlinear composite combination and preconditioning and present a number of solvers applicable to nonlinear partial differential equations. We have developed a software framework in order to easily explore the possible combinations of solvers. We show that the performance gains from using composed solvers can be substantial compared with gains from standard Newton-Krylov methods.

- Exponential grids in high-dimensional
space, Peter R. Brune, Matthew G. Knepley, and L. Ridgway Scott, TR-2011-07, Computer Science Department,
University of Chicago, 2011. [UC]
We consider the approximation of functions that are localized in space. We show that it is possible to define meshes to approximate such functions with the property that the number of vertices grows only linearly in dimension. In one dimension, we discuss the optimal mesh for approximating exponentially decreasing functions. We review the use of Cartesian product grids in multiple dimensions introduced in a paper of Bank and Scott.

- Conference review: High performance
computing and hybrid programming concepts for hyperbolic pde codes, David I. Ketcheson, Aron Ahmadia, and
Matthew G. Knepley, SIAM News, 44(7), September 2011. [SIAM]
Review of the HPC3 Conference held at KAUST in 2011.

- Pylith: A finite-element code for modeling
quasi-static and dynamic crustal deformation, Charles A. Williams, Brad Aagaard, and Matthew G. Knepley,
In Eos Transactions of the AGU. American Geophysical Union, 2011. Fall Meeting Supplemental, Abstract DI14A-08.
We have developed open-source finite-element software for 2-D and 3-D dynamic and quasi-static modeling of crustal deformation. This software, PyLith (current release is version 1.6) can be used for quasi-static viscoelastic modeling, dynamic spontaneous rupture and/or ground-motion modeling. Unstructured and structured finite-element discretizations allow for spatial scales ranging from tens of meters to hundreds of kilometers with temporal scales in dynamic problems ranging from milliseconds to minutes and temporal scales in quasi-static problems ranging from minutes to thousands of years. PyLith development is part of the NSF funded Computational Infrastructure for Geodynamics (CIG) and the software runs on a wide variety of platforms (laptops, workstations, and Beowulf clusters). Binaries (Linux, Darwin, and Windows systems) and source code are available from geodynamics.org. PyLith uses a suite of general, parallel, graph data structures called Sieve for storing and manipulating finite-element meshes. This permits use of a variety of 2-D and 3-D cell types including triangles, quadrilaterals, hexahedra, and tetrahedra. Current PyLith features include prescribed fault ruptures with multiple earthquakes and aseismic creep, spontaneous fault ruptures with a variety of fault constitutive models, time-dependent Dirichlet and Neumann boundary conditions, absorbing boundary conditions, time-dependent point forces, and gravitational body forces. PyLith supports infinitesimal and small strain formulations for linear elastic rheologies, linear and generalized Maxwell viscoelastic rheologies, power-law viscoelastic rheologies, and Drucker-Prager elastoplastic rheologies. Current software development focuses on coupling quasi-static and dynamic simulations to resolve multi-scale deformation across the entire seismic cycle and the coupling of elasticity to heat and/or fluid flow.

- Removing the Barrier to
Scalability in Parallel FMM, Matthew G. Knepley, 2010. [arXiv]
The Fast Multipole Method (FMM) is well known to possess a bottleneck arising from decreasing workload on higher levels of the FMM tree [Greengard and Gropp, 1990]. We show that this potential bottleneck can be eliminated by overlapping multipole and local expansion computations with direct kernel evaluations on the finest level grid.

- Languages and compilers
for variational forms, Robert C. Kirby, Matthew G. Knepley, and L. Ridgway Scott, Technical Report TR-2010-09,
University of Chicago, October 2010. [UC]
We discuss a language for variational forms that provides a way to specify the spaces used to define boundary conditions in the finite element method. This is linked with a geometry language that allows natural definitions of projections onto subspaces that incorporate boundary conditions.

- Evaluation of the action of
finite element operators, Robert C. Kirby, Matthew G. Knepley, and L. Ridgway Scott, Technical Report
TR-2010-08, University of Chicago, October 2010.
[UC]
The Krylov methods frequently used to solve linear systems associated with finite element discretizations of PDE rely only on the matrix-vector product. We consider the relative costs, in terms both of floating point operations and memory traffic, of several approaches to computing the matrix action. These include forming and using a global sparse matrix, building local element matrices and using them with a local-to-global indexing, and computing the action of the local matrices directly by numerical quadrature. Which approach is most efficient depends on several factors, including the relative cost of computation to memory access, how quickly local element matrices may be formed, and how quickly a function expressed in a finite element basis may be differentiated at the quadrature points.

- Energetics of Calcium Selectivity: A
Three-Dimensional Classical Density Functional Theory Approach, M. G. Knepley, D. A. Karpeev, R. S. Eisenberg,
and D. Gillespie, Biophysical Journal, 96:661, Feb 2009.
Selectivity of a calcium channel is explored with three-dimensional density functional theory of fluids (DFT). The model pore has millimolar Ca2+ affinity similar to the ryanodine receptor calcium channel. The four flexible aspartate side chains in the selectivity filter are modeled as four carboxyl groups (each as two independent, half-charged oxygen atoms) that are free to move within the selectivity filter, but cannot leave it. These oxygens coordinate the permeating ions. We examine how the ions are coordinated by computing radial correlation functions around the permeating ions. We also examine how this coordination changes in wide and narrow selectivity filters. The energetics of selectivity are computed and their components (e.g., electrostatics, excluded volume) show that the coordination of the ions by the oxygens determines the Ca2+ selectivity of the pore by the charge/space competition mechanism. In this mechanism, selectivity is determined by a balance of electrostatic and steric interactions of ions and amino acid side chains in the crowded selectivity filter. Our approach of combining three-dimensional DFT with state of the art computational techniques is unique in channel selectivity. Moreover, the DFT approach allows a natural decomposition of the energies involved in selectivity. The convolution-type calculations at the heart of the DFT are computed using a combination of fast transforms and analytical results. The software itself is built upon the PETSc framework from Argonne National Laboratory.

- CitcomSX: Robust
preconditioning in CitcomS via PETSc, Dave A. May, Matthew G. Knepley, and Michael Gurnis, In Eos Transactions
of the AGU. American Geophysical Union, 2009. Fall Meeting Supplemental, Abstract P31A-A1241.
The Citcom family of mantle convection codes are in wide spread use through out the geodynamics community. Since the inception of the original Cartesian version written in the early 1990's, many variants have been developed. Two important contributions were made by Shijie Zhong in the form of the parallel 3D Cartesian version and the parallel, full spherical version. Maintenance and support of CitcomS through CIG has seen a further increase in the development and usage of this particular version of Citcom. Such improvements have primarily been concerned with the introduction of new physics (rheology, compressibility), coupling with other software, including additional geologically relevant input/outputs and improved portability. Today, CitcomS is routinely used to solve mantle convection and subduction models with approximately one hundred million unknowns on large distributed memory clusters. Many advances have been made in both numerical linear algebra and the software encapsulating these concepts since the time of the original Citcom, however, the solver used by all Citcom software has remained largely unchanged from the original version. Incorporating modern techniques into CitcomS has the potential to greatly improve the flexibility and robustness of the method used to solve the underlying saddle point problem. Here we describe how PETSc (www.mcs.anl.gov/petsc), a flexible linear algebra package, has been integrated into CitcomS in a non-invasive fashion which i) preserves all the pre-existing functionality and ii) enables a rich infrastructure of preconditioned Krylov methods to be used to solve the discrete Stokes flow problem. The "extension" of the solver capabilities in CitcomS has prompted this version to be referred to as CitcomSX. We demonstrate the advantages of CitcomSX by comparing the convergence rate and solution time of the new Stokes solver with the original CitcomS approach. The BFBt preconditioner we utilise is robust and yields convergence rates largely independent of the element resolution and the viscosity contrast. Using this preconditioner, we can accommodate higher viscosity contrasts than were possible with the original CitcomS solver. The test problems used for this comparison include variable viscosity mantle convection (regional, full spherical) and slab subduction (regional).

- GPU implementation of
Stokes equation with strongly variable coefficients, Liang Zheng, Taras Gerya, David A. Yuen, Matthew
G. Knepley, Huai Zhang, and Yaolin Shi, In Eos Transactions of the AGU. American Geophysical Union, 2010. Fall
Meeting Supplemental, Abstract IN41A-1350.
Solving Stokes flow problem is commonplace for numerical modeling of geodynamical processes as the lithosphere and mantle can be always regarded as incompressible for long geological time scales. For Stokes flow the Reynold Number is effectively zero so that we can ignore advective transport of momentum thus resulting in the slowly creeping flow called Stokes' equation. Because of the ill-conditioned matrix due to the saddle points in the matrix system coupling mass and momentum partial differential equations it is still extremely to efficiently solve this elliptic PDE system with strongly variable coefficients due to rheology which is coupled to the conservation of mass equation. Since NVIDIA issued the CUDA programming framework in 2007 scientists can use commodity CPU-GPU system to do such geodynamical simulation efficiently with the advantage of CPU and GPU respectively. We implemented a finite difference solver for Stokes Equations with variable viscosity based on CUDA using geometry multi-grid methods in staggered grids. In 2D version we use a mixture of Jacobi and Gauss-Seidel iteration with SOR as the smoother. In 3D version we use the Red-Black updating method to avoid the problem of disordered threads. We are also considering the deployment of the multigrid solver as a preconditioner for Krylov subspace scheme within the PETSc.

- The coming role of GPU in computational
geodynamics, David A. Yuen, Matthew G. Knepley, Gordon Erlebacher, and Grady B. Wright, In Eos
Transactions of the AGU. American Geophysical Union, 2009. Fall Meeting Supplemental, Abstract DI22A-05.
With the proliferation of GPU ( graphics accelerator board) the computing landscape has changed enormously in the last 3 years. The new additional capabilities of the GPU , such as larger shared memories and load-store operations , allow it to be considered as a viable stand-alone computational and visualization engine. Today the massive threading and computing capability of GPU can deliver at least an order of magnitude performance over the multi-core CPU architecture. The cost of a GPU system is also considerably cheaper than a CPU cluster by more than an order of magnitude.The introduction of CUDA and ancillary software aids, such as Jackets, have allowed the rapid translation of many venerable codes into software usable on GPU. We will discuss our experience acquired over the past year in attacking five different computational problems in the geosciences, using the GPU. They include (1.) 3-D seismic wave propagation with the spectral-element method (2.)2-D shallow water equation as applied to tsunami wave propagation, using finite-differences (3.) 3-D mantle convection with constant viscosity using a 4th order compact finite-difference operator (4.) non-linear heat-diffusion equation in 2-D using a collocation method based on radial basis functions over an elliptical area . Grid points are divided so as to lie on a centroidal Voronoi mesh . Derivatives are calculated at each grid point using a point-dependent stencil computed from the nearest neighbors .(5.) Stokes flow with variable viscosity by means of a pre-conditioner calculated on the GPU based on the vortex method using Green’s functions, along with the radial basis functions and the fast multi-pole method. The Krylov method is used on the CPU for the final iterative step .We will discuss the relative speed-ups of the GPU over the CPU in each of these cases. We will point out the need to go to more computationally intensive mode with multiple GPUs, which calls for key CPUs to control the message passing between the different computational domains by means of MPI. In our CPU-GPU system a separate GPU, the Nvidia GTX 295 , is also devoted for visualizing the ongoing computed results. According to some circles, the future roadmap of GPU seems to be at least one and a half order of magnitude brighter than the CPU in the next 6 years.

- PyLith: A finite-element code for modeling
quasi-static and dynamic crustal deformation, Charles A. Williams, Brad Aagaard, and Matthew G. Knepley, In Eos
Transactions of the AGU. American Geophysical Union, 2009. Fall Meeting Supplemental, Abstract T21B-1798.
We have developed open-source finite-element software for 2-D and 3-D dynamic and quasi-static modeling of crustal deformation. This software, PyLith (current release is version 1.4), combines the quasi-static viscoelastic modeling functionality of PyLith 0.8 and its predecessors (LithoMop and Tecton) and the wave propagation modeling functionality of EqSim. The target applications contain spatial scales ranging from tens of meters to hundreds of kilometers with temporal scales for dynamic modeling ranging from milliseconds to minutes and temporal scales for quasi-static modeling ranging from minutes to thousands of years. PyLith development is part of the NSF funded Computational Infrastructure for Geodynamics (CIG) and the software runs on a wide variety of platforms (laptops, workstations, and Beowulf clusters). Binaries and source code are available from geodynamics.org. It uses a suite of general, parallel, graph data structures called Sieve for storing and manipulating finite-element meshes. This permits use of a variety of 2-D and 3-D cell types including triangles, quadrilaterals, hexahedra, and tetrahedra. Current features include kinematic fault ruptures with multiple sequential earthquakes and aseismic creep, time-dependent Dirichlet and Neumann boundary conditions, absorbing boundary conditions, time-dependent point forces, linear elastic rheologies, generalized Maxwell and Maxwell linear viscoelastic rheologies, power-law rheologies, and gravitational body forces. Current development focuses on implementing dynamic fault interface conditions (employing fault constitutive models) and additional viscoelastic and viscoplastic materials. Future development plans include support for large deformation and automated calculation of suites of Green's functions. We also plan to extend PyLith to allow coupling multiple simultaneous simulations. For example, this could include (1) coupling an interseismic deformation simulation to a spontaneous earthquake rupture simulation (each using subsets of the software), (2) coupling a spontaneous earthquake rupture simulation to a global wave propagation simulation, or (3) coupling a short-term crustal deformation simulation to a mantle convection simulation and an orogenesis and basin formation simulation.

- Two new approaches in solving the
nonlinear shallow water equations for tsunamis, C. Zhang, M. G. Knepley, D. A. Yuen, and Y. Shi, Preprint
ANL/MCS-P1459-0907, ANL, September 2007.
[ANL]
One key component of tsunami research is numerical simulation of tsunamis, which helps us to better understand the fundamental physics and phenomena and leads to better mitigation decisions. However, writing the simulation program itself imposes a large burden on the user. In this survey, we review some of the basic ideas behind the numerical simulation of tsunamis, and introduce two new approaches to construct the simulation using powerful, general-purpose software kits, PETSc and FEPG. PETSc and FEPG support various discretization methods such as finite-difference, finite-element and finite-volume, and provide a stable solution to the numerical problem. Our application uses the nonlinear shallow-water equations in Cartesian coordinates as the governing equations of tsunami wave propagation.

- Mesh algorithms for PDE
with Sieve I: Mesh distribution, Matthew G. Knepley and Dmitry A. Karpeev, Technical Report
ANL/MCS-P1455-0907, Argonne National Laboratory, February 2007.
[ANL]
We have developed a new programming framework, called Sieve, to support parallel numerical PDE algorithms operating over dis- tributed meshes. We have also developed a reference implementation of Sieve in C++ as a library of generic algorithms operating on distributed containers conforming to the Sieve interface. Sieve makes instances of the incidence relation, or arrows, the conceptual first-class objects represented in the containers. Further, generic algorithms acting on this arrow container are systematically used to provide natural geometric operations on the topology and also, through duality, on the data. Finally, coverings and duality are used to encode not only individual meshes, but all types of hierarchies underlying PDE data structures, including multigrid and mesh partitions.

In order to demonstrate the usefulness of the framework, we show how the mesh partition data can be represented and manipulated using the same fundamental mechanisms used to represent meshes. We present the complete description of an algorithm to encode a mesh partition and then distribute a mesh, which is independent of the mesh dimension, element shape, or embedding. Moreover, data associated with the mesh can be similarly distributed with exactly the same algorithm. The use of a high level of abstraction within the Sieve leads to several benefits in terms of code reuse, simplicity, and extensibility. We discuss these benefits and compare our approach to other existing mesh libraries.

- Development of software
for studying earthquakes across multiple spatial and temporal scales by coupling quasi-static and dynamic
simulations, Charles A. Williams, Brad Aagaard, and Matthew G. Knepley, In Eos Transactions of the AGU,
volume 86. American Geophysical Union, 2005. Fall Meeting Supplemental, Abstract S53A-1072.
Earthquake dynamics involves processes with different physical mechanisms operating on several different spatial and temporal scales. In the past, the tendency has been to focus on one set of mechanisms pertinent to the problem, and approximate the effects due to the other mechanisms. With increasingly powerful computational abilities, however, it is now becoming more feasible to address the entire scope of these complex problems. As a step toward this goal, we are developing a finite element code capable of modeling both quasi-static and dynamic behavior in the solid earth. The quasi-static code, presently known as LithoMop, has evolved from our previous version of the TECTON finite element code. We plan to combine this code with the EqSim dynamic rupture propagation code to provide a new package known as PyLith. This combined package will be able to simulate crustal behavior over a wide range of spatial and temporal scales. For example, it will be possible to simulate stress evolution over numerous earthquake cycles (a quasi-static problem) as well as the rapid stress changes occurring during each earthquake in the series (a dynamic problem). We describe here the current development status of the PyLith components, provide a roadmap for code development, and demonstrate the usage of the LithoMop component of the package. The merged code will make use of the Pyre simulation framework, allowing it to couple with other modeling codes and, thus, extend the range of physical mechanisms it is able to simulate. Code parallelization is accomplished using the PETSc parallel libraries. The package also makes use of a powerful and flexible new method of representing computational meshes, Sieve, presently being developed as a part of PETSc. Sieve greatly simplifies the task of parallelizing the code and will make it much easier to generalize the code to different dimensions and element types.

- Flexible representation
of computational meshes, Matthew G. Knepley and Dmitry A. Karpeev, Technical Report ANL/MCS-P1295-1005,
Argonne National Laboratory, October 2005.
[ANL]
A new representation of computational meshes is proposed in terms of a covering relation defined by discrete topological objects we call sieves. Fields over a mesh are handled locally by using the notion of refinement, dual to covering, and are later reassembled. In this approach fields are modeled by sections of a fiber bundle over a sieve. This approach cleanly separates the topology of the mesh from its geometry and other value-storage mechanisms. With these abstractions, finite element calculations are expressed using algorithms that are independent of mesh dimension, global topology, element shapes, and the finite element itself. Extensions and other applications are discussed.

- Optimal evaluation of
finite element matrices, Robert C. Kirby, Matthew G. Knepley, and L. Ridgway Scott, Technical Report
TR-2004-04, University of Chicago, May 2004.
[UC]
Assembling stiffness matrices represents a significant cost in many finite element computations. We address the question of optimizing the evaluation of these matrices. By finding redundant computations, we are able to significantly reduce the cost of building local stiffness matrices for the Laplace operator and for the trilinear form for Navier-Stokes. For the Laplace operator in two space dimensions, we have developed a heuristic graph algorithm that searches for such redundancies and generates code for computing the local stiffness matrices. Up to cubics, we are able to build the stiffness matrix on any triangle in less than one multiply-add pair per entry. Up to sixth degree, we can do it in less than about two. Preliminary low-degree results for Poisson and Navier-Stokes operators in three dimensions are also promising.

- Solvers as operators,
Andrew Cleary and Matthew G. Knepley, Technical Report UCRL-ID-135342, Lawrence Livermore National Laboratory,
1999. [LLNL]
This proposal details an interface hierarchy for objects whose major function is the mapping of a finite dimensional vector space of dimension \(m\) to another vector space of dimension \(n\). This includes many important objects in a solver library, including matrices, their transposes and inverses. solvers. preconditioners, iterative methods, and nonlinear maps. A unifying framework for finite dimensional operators and solvers is proposed which utilizes the composttton operation from the operator algebra to achieve great functionality while reducing the size of the interface and complexity of the class structure. A second composition operation is introduced to handle the composition of approximate solution techniques, and related to several common preconditioning techniques.

# In Progress

- A new modularization for physics
simulators, Matthew G. Knepley and Jed Brown.
We present an algorithmic architecture and implementation for a more modular formulation of both the finite element method (FEM) and the finite volume method (FVM). We allow the user to specify only pointwise equations and pointwise boundary conditions, from which we derive a complete schemes, such as a second-order TVD finite volume method on unstructured polyhedral grids in both 2D and 3D. We present representative results for several physical systems.

- Finite Element Integration with Quadrature on the GPU, Matthew G. Knepley, Karl Rupp, and Andy Terrel
- Unstructured Overlapping Mesh Distribution in Parallel, with Michael Lange and Gerard Gorman
- Synchronization avoiding Krylov Methods with Hannah Morgan, Sean Laguna, Jed Brown, and Ridgway Scott
- Scalable Solvation with Asymmetry and Saturation with Jay Bardhan
- Reduced basis BIBEE with Jay Bardhan
- Scalable surface approximation for integral equations with Jay Bardhan
- Practical Library Extensibility with Jed Brown
- Foundations of Classical DFT with Dirk Gillespie
- Multigrid for matrices with a powerlaw graph using Fan Chung's work
- Flexible FEM using DMPlex, with Andy Terrel, Peter Brune, and Ridgway Scott
- PetFMM paper overlapping direct and tree computation, with Dave May
- Optimality of PetFMM partitioning