Large-Scale Systems Group (LSSG) @ University of Chicago
GVR (Global View Resilience) is a user-level library that enables portable, efficient, application-controlled resilience. Its primary target is HPC applications that require extreme scalability and performance as well as resilience. GVR's key approaches include independent versioning of application arrays, efficient partial or whole restoration, and open resilience, which maximizes the number of errors that can be handled (minimizing fail-stop occurrences). Application knowledge can be exploited to control overhead, maximize error coverage, and maximize the number of recoverable errors.

The latest release, GVR 1.0.0, was made available under a BSD license on October 4, 2014, and includes the following features:
  • Portable application-controlled resilience and recovery with incremental code change
  • Versioned distributed arrays with global naming (a portable abstraction)
  • Reliable storage of the versioned arrays in memory, local disk/SSD, or global file system
  • Whole version navigation and efficient restoration
  • Efficient partial version restoration (incremental "materialization")
  • Independent array versioning (each at its own pace)
  • Open Resilience framework to maximize cross-layer error handling
    • application-defined error handling
    • unified application and system error descriptors
    • attribute-based composition for easy extensibility at application, operating system, and hardware levels
  • C native APIs and Fortran bindings
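The versioning features above can be pictured with a small, purely conceptual model. The sketch below is not the GVR API (GVR exposes C and Fortran interfaces and stores distributed arrays across MPI processes). It is a single-process Python toy, and every name in it (VersionedArray, version_inc, restore, restore_range) is a hypothetical chosen for illustration. It shows only the ideas of independent versioning, whole-version restoration, and partial restoration.

```python
class VersionedArray:
    """Toy in-memory model of one versioned array (hypothetical, not the GVR API)."""

    def __init__(self, name, size):
        self.name = name          # stands in for a globally visible array name
        self.data = [0.0] * size  # current working contents
        self.versions = []        # immutable snapshots, oldest first

    def version_inc(self):
        """Snapshot the current contents and return the new version number."""
        self.versions.append(tuple(self.data))
        return len(self.versions) - 1

    def restore(self, version):
        """Whole restoration: replace all current contents with a snapshot."""
        self.data = list(self.versions[version])

    def restore_range(self, version, start, stop):
        """Partial restoration: copy back only elements [start, stop)."""
        self.data[start:stop] = self.versions[version][start:stop]


# Two arrays versioned independently, each at its own pace.
temperature = VersionedArray("temperature", 4)
pressure = VersionedArray("pressure", 4)

for step in range(6):
    temperature.data = [float(step)] * 4
    pressure.data = [10.0 * step] * 4
    temperature.version_inc()      # snapshot on every step
    if step % 3 == 0:
        pressure.version_inc()     # snapshot on every third step only

# Simulate corruption of two elements, then repair just that range
# from the most recent snapshot instead of restoring the whole array.
temperature.data[1] = float("nan")
temperature.data[2] = float("nan")
temperature.restore_range(len(temperature.versions) - 1, 1, 3)

# Roll the second array all the way back to its first snapshot.
pressure.restore(0)
```

In this toy, the application decides how often each array is snapshotted and how much of it to restore. That is the kind of application-level control the feature list describes. The real library additionally handles distribution and reliable storage of the versions.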
Release requirements:
  • Requires only an MPI library compatible with the MPI-3 standard
  • Standard "autotools" preparation
  • Requires no root privilege
  • Runs on several platforms, including x86-64 Linux clusters, Cray XC30, and IBM Blue Gene/Q
The documentation is available as "Global View Resilience (GVR) Documentation, Release 1.0", University of Chicago, Computer Science Technical Report 2014-10.
GVR has been developed by the University of Chicago and Argonne National Laboratory, under the leadership of Prof. Andrew A. Chien and Dr. Pavan Balaji. It has been supported by the U.S. Department of Energy, Office of Science / ASCR under awards DE-SC0008603/57K68-00-145.

