John Reppy's Publications

Manticore


From Folklore to Fact: Comparing Implementations of Stacks and Continuations.
Kavon Farvardin and John Reppy. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020), pages 75--90, New York, NY, June 2020. ACM. Awarded Distinguished Paper.
[ bib | DOI | .pdf ]

The efficient implementation of function calls and non-local control transfers is a critical part of modern language implementations and is important in the implementation of everything from recursion, higher-order functions, concurrency and coroutines, to task-based parallelism. In a compiler, these features can be supported by a variety of mechanisms, including call stacks, segmented stacks, and heap-allocated continuation closures.
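
A minimal OCaml sketch (an illustration, not code from the paper) of the heap-allocated continuation strategy: in continuation-passing style, each return point is reified as an ordinary heap-allocated closure, so no contiguous call stack is needed.

    (* Direct-style factorial: each call pushes a frame on the call stack. *)
    let rec fact n = if n = 0 then 1 else n * fact (n - 1)

    (* CPS factorial: each step instead allocates a continuation closure on
       the heap that remembers the pending multiplication. *)
    let rec fact_cps n (k : int -> int) =
      if n = 0 then k 1
      else fact_cps (n - 1) (fun r -> k (n * r))

    let () = Printf.printf "10! = %d\n" (fact_cps 10 (fun r -> r))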

An implementor of a high-level language with advanced control features might ask the question “what is the best choice for my implementation?” Unfortunately, the current literature does not provide much guidance, since previous studies suffer from various flaws in methodology and are outdated for modern hardware. In the absence of recent, well-normalized measurements and a holistic overview of their implementation specifics, the path of least resistance when choosing a strategy is to trust folklore, but the folklore is also suspect.

This paper attempts to remedy this situation by providing an “apples-to-apples” comparison of six different approaches to implementing call stacks and continuations. This comparison uses the same source language, compiler pipeline, LLVM backend, and runtime system, with the only differences being those required by each implementation strategy. We compare the implementation challenges of the different approaches, their sequential performance, and their suitability to support advanced control mechanisms, including supporting heavily threaded code. In addition to the comparison of implementation strategies, the paper's contributions also include a number of useful implementation techniques that we discovered along the way.

Compiling with Continuations and LLVM.
Kavon Farvardin and John Reppy. In Kenichi Asai and Mark Shinwell, editors, Proceedings of the ML Family Workshop / OCaml Users and Developers Workshops (September 22-23, 2016), Volume 285 of Electronic Proceedings in Theoretical Computer Science, pages 131--142. Open Publishing Association, 2018.
[ bib | DOI | .pdf ]

LLVM is an infrastructure for code generation and low-level optimizations, which has been gaining popularity as a backend for both research and industrial compilers, including many compilers for functional languages. While LLVM provides a relatively easy path to high-quality native code, its design is based on a traditional runtime model which is not well suited to alternative compilation strategies used in high-level language compilers, such as the use of heap-allocated continuation closures.

This paper describes a new LLVM-based backend that supports heap-allocated continuation closures, which enables constant-time callcc and very-lightweight multithreading. The backend has been implemented in the Parallel ML compiler, which is part of the Manticore system, but the results should be useful for other compilers, such as Standard ML of New Jersey, that use heap-allocated continuation closures.
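
The following OCaml sketch (our own encoding, not the compiler's code) shows why callcc becomes a constant-time operation once continuations are heap-allocated values: capturing the current continuation just hands an existing closure to the body, with nothing to copy.

    (* The type of CPS computations producing an 'a with answer type 'ans. *)
    type ('a, 'ans) cps = ('a -> 'ans) -> 'ans

    (* callcc passes the current continuation k to its body; invoking the
       captured continuation discards whatever continuation is current. *)
    let callcc (f : ('a -> ('b, 'ans) cps) -> ('a, 'ans) cps) : ('a, 'ans) cps =
      fun k -> f (fun x _k' -> k x) k

    let return x = fun k -> k x
    let ( >>= ) m f = fun k -> m (fun x -> f x k)

    (* Escaping from the middle of a computation. *)
    let example =
      callcc (fun exit ->
        return 1 >>= fun a ->
        exit (a + 10) >>= fun _ ->    (* jumps straight out with 11 *)
        return 999)

    let () = example (Printf.printf "result = %d\n")   (* prints 11 *)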

Compiling with Continuations and LLVM.
Kavon Farvardin and John Reppy. In Proceedings of the 2016 ACM SIGPLAN Workshop on ML, September 2016.
[ bib | .pdf ]

This paper describes a new LLVM-based backend for the Parallel ML compiler (part of the Manticore system). This backend is novel in that it supports heap-allocated first-class continuations (a first for LLVM), which, in turn, enables language features such as callcc, lightweight concurrency mechanisms, and PML's parallelism features.

Practical and Effective Higher-Order Optimizations.
Lars Bergstrom, Matthew Fluet, Mike Rainey, Matthew Le, John Reppy, and Nora Sandler. In Proceedings of the 19th ACM SIGPLAN International Conference on Functional Programming (ICFP 2014), New York, NY, September 2014. ACM.
[ bib | DOI | .pdf ]

Inlining is an optimization that replaces a call to a function with that function's body. This optimization not only reduces the overhead of a function call, but can expose additional optimization opportunities to the compiler, such as removing redundant operations or unused conditional branches. Another optimization, copy propagation, replaces a redundant copy of a still-live variable with the original. Copy propagation can reduce the total number of live variables, reducing register pressure and memory usage, and possibly eliminating redundant memory-to-memory copies. In practice, both of these optimizations are implemented in nearly every modern compiler.

These two optimizations are practical to implement and effective in first-order languages, but in languages with lexically-scoped first-class functions (a.k.a. closures), these optimizations are not available to code programmed in a higher-order style. With higher-order functions, the analysis challenge has been that the environment at the call site must be the same as at the closure capture location, up to the free variables, or the meaning of the program may change. Olin Shivers' 1991 dissertation called this family of optimizations Super-β and he proposed one analysis technique, called reflow, to support these optimizations. Unfortunately, reflow has proven too expensive to implement in practice. Because these higher-order optimizations are not available in functional-language compilers, programmers studiously avoid uses of higher-order values that cannot be optimized (particularly in compiler benchmarks).
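
The environment condition can be seen in a small OCaml example (ours, not from the paper): each application of make_adder creates a fresh binding of n, so a call site that receives "some adder" cannot be rewritten to n + m unless the analysis proves which binding of n is in scope there.

    (* Each closure captures a *different* binding of n. *)
    let make_adder n = fun m -> n + m

    (* At the call site f x, naively inlining the body n + x is unsound:
       there is no n in scope here, let alone the one that f captured. *)
    let apply_all fs x = List.map (fun f -> f x) fs

    let () =
      let adders = List.map make_adder [1; 2; 3] in
      List.iter (Printf.printf "%d ") (apply_all adders 10);  (* 11 12 13 *)
      print_newline ()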

This paper provides the first practical and effective technique for Super-β (higher-order) inlining and copy propagation, which we call unchanged variable analysis. We show that this technique is practical by implementing it in the context of a real compiler for an ML-family language and showing that the required analyses have costs below 3% of the total compilation time. This technique's effectiveness is shown through a set of benchmarks and example programs, where this analysis exposes additional potential optimization sites.

Data-Only Flattening for Nested Data Parallelism.
Lars Bergstrom, Matthew Fluet, Mike Rainey, John Reppy, Stephen Rosen, and Adam Shaw. In Proceedings of the 2013 ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP 2013), pages 81--92, New York, NY, February 2013. ACM.
[ bib | DOI | .pdf ]

Data parallelism has proven to be an effective technique for high-level programming of a certain class of parallel applications, but it is not well suited to irregular parallel computations. Blelloch and others proposed nested data parallelism (NDP) as a language mechanism for programming irregular parallel applications in a declarative data-parallel style. The key to this approach is a compiler transformation that flattens the NDP computation and data structures into a form that can be executed efficiently on a wide-vector SIMD architecture. Unfortunately, this technique is ill suited to execution on today's multicore machines. We present a new technique, called data-only flattening, for the compilation of NDP, which is suitable for multicore architectures. Data-only flattening transforms nested data structures in order to expose programs to various optimizations while leaving control structures intact. We present a formal semantics of data-only flattening, given as a rewriting system over a core language. We demonstrate the effectiveness of this technique in the Parallel ML implementation and report encouraging experimental results across various benchmark applications.
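
For intuition, the data side of flattening can be sketched in a few lines of OCaml (our illustration; the paper's transformation operates on Manticore's intermediate representation): a nested array becomes a flat data array plus a segment descriptor recording the length of each inner array.

    type 'a flat = {
      data : 'a array;   (* all elements, concatenated *)
      segs : int array;  (* length of each inner array *)
    }

    let flatten (xss : 'a array array) : 'a flat =
      { data = Array.concat (Array.to_list xss);
        segs = Array.map Array.length xss }

    (* Bulk operations touch only the data; the shape passes through. *)
    let map_flat f { data; segs } = { data = Array.map f data; segs }

    let () =
      let f = flatten [| [|1; 2|]; [|3|]; [|4; 5; 6|] |] in
      let g = map_flat (fun x -> 10 * x) f in
      Array.iter (Printf.printf "%d ") g.data;   (* 10 20 30 40 50 60 *)
      print_newline ()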

Lazy Tree Splitting.
Lars Bergstrom, Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Journal of Functional Programming, 22(4-5):382--438, September 2012.
[ bib ]

Nested data-parallelism (NDP) is a language mechanism that supports programming irregular parallel applications in a declarative style. In this paper, we describe the implementation of NDP in Parallel ML (PML), which is part of the Manticore system. One of the main challenges of implementing NDP is managing the parallel decomposition of work. If we have too many small chunks of work, the overhead will be too high, but if we do not have enough chunks of work, processors will be idle. Recently, the technique of Lazy Binary Splitting was proposed to address this problem for nested parallel loops over flat arrays. We have adapted this technique to our implementation of NDP, which uses binary trees to represent parallel arrays. This new technique, which we call Lazy Tree Splitting (LTS), has the key advantage of performance robustness, i.e., it does not require tuning to get the best performance for each program. We describe the implementation of the standard NDP operations using LTS and we present experimental data that demonstrates the scalability of LTS across a range of benchmarks.
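
The shape of the technique can be conveyed by a short OCaml sketch (ours, not Manticore's implementation; idle_workers stands in for the real load-estimation check): a worker maps over its rope sequentially and only exposes the unprocessed half as a separate task when some other processor appears to be idle.

    type 'a rope =
      | Leaf of 'a array                  (* a small chunk of elements *)
      | Cat of int * 'a rope * 'a rope    (* cached length, left, right *)

    let rec map_seq f = function
      | Leaf a -> Leaf (Array.map f a)
      | Cat (n, l, r) -> Cat (n, map_seq f l, map_seq f r)

    let rec map_lts ~idle_workers f rp =
      match rp with
      | Leaf a -> Leaf (Array.map f a)
      | Cat (n, l, r) ->
        if idle_workers () then
          (* split: in the real scheduler the right half would be pushed
             onto a work-stealing deque for another processor to take *)
          Cat (n, map_lts ~idle_workers f l, map_lts ~idle_workers f r)
        else
          (* everyone is busy: continue with zero decomposition overhead *)
          map_seq f rp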

Garbage Collection for Multicore NUMA Machines.
Sven Auhagen, Lars Bergstrom, Matthew Fluet, and John Reppy. In Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC 2011), pages 51--57, New York, NY, June 2011. ACM.
[ bib | .pdf ]

Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. This bandwidth constraint has traditionally limited the scalability of modern functional language implementations, which seldom scale well past 8 cores, even on small benchmarks.

This work presents a garbage collector integrated with our strict, parallel functional language implementation, Manticore, and shows that it scales effectively on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.

Implicitly threaded parallelism in Manticore.
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Journal of Functional Programming, 20(5-6):537--576, 2011.
[ bib | DOI ]

The increasing availability of commodity multicore processors is making parallel computing ever more widespread. In order to exploit its potential, programmers need languages that make the benefits of parallelism accessible and understandable. Previous parallel languages have traditionally been intended for large-scale scientific computing, and they tend not to be well suited to programming the applications one typically finds on a desktop system. Thus, we need new parallel-language designs that address a broader spectrum of applications. The Manticore project is our effort to address this need. At its core is Parallel ML, a high-level functional language for programming parallel applications on commodity multicore hardware. Parallel ML provides a diverse collection of parallel constructs for different granularities of work. In this paper, we focus on the implicitly threaded parallel constructs of the language, which support fine-grained parallelism. We concentrate on those elements that distinguish our design from related ones, namely, a novel parallel binding form, a nondeterministic parallel case form, and the treatment of exceptions in the presence of data parallelism. These features differentiate the present work from related work on functional data-parallel language designs, which have focused largely on parallel problems with regular structure and the compiler transformations --- most notably, flattening --- that make such designs feasible. We present detailed examples utilizing various mechanisms of the language and give a formal description of our implementation.
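
As a rough analogue of the parallel binding form, consider this OCaml 5 sketch using domains (ours; unlike Manticore's parallel binding it neither load-balances nor cancels computations whose results turn out to be unneeded):

    (* pval e: begin evaluating e on another core immediately. *)
    let pval e = Domain.spawn e

    (* demand d: block until the parallel computation's value is available. *)
    let demand d = Domain.join d

    let () =
      let x = pval (fun () -> 6 * 7 (* some expensive computation *)) in
      (* ... other work runs here, in parallel with x ... *)
      Printf.printf "x = %d\n" (demand x)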

Lazy Tree Splitting.
Lars Bergstrom, Mike Rainey, John Reppy, Adam Shaw, and Matthew Fluet. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming (ICFP 2010), pages 93--104, New York, NY, September 2010. ACM.
[ bib | .pdf ]

Nested data-parallelism (NDP) is a declarative style for programming irregular parallel applications. NDP languages provide language features favoring the NDP style, efficient compilation of NDP programs, and various common NDP operations like parallel maps, filters, and sum-like reductions. In this paper, we describe the implementation of NDP in Parallel ML (PML), part of the Manticore project. Managing the parallel decomposition of work is one of the main challenges of implementing NDP. If the decomposition creates too many small chunks of work, performance will be eroded by too much parallel overhead. If, on the other hand, there are too few large chunks of work, there will be too much sequential processing and processors will sit idle.

Recently, the technique of Lazy Binary Splitting was proposed for dynamic parallel decomposition of work on flat arrays, with promising results. We adapt Lazy Binary Splitting to parallel processing of binary trees, which we use to represent parallel arrays in PML. We call our technique Lazy Tree Splitting (LTS). One of its main advantages is its performance robustness: per-program tuning is not required to achieve good performance across varying platforms. We describe LTS-based implementations of standard NDP operations, and we present experimental data demonstrating the scalability of LTS across a range of benchmarks.

Programming in Manticore, a Heterogeneous Parallel Functional Language.
Matthew Fluet, Lars Bergstrom, Nic Ford, Mike Rainey, John Reppy, Adam Shaw, and Yingqi Xiao. In Zoltán Horváth, editor, Central European Functional Programming School, Volume 6299 of Lecture Notes in Computer Science, pages 94--145, New York, NY, 2010. Springer-Verlag.
[ bib ]

Arity Raising in Manticore.
Lars Bergstrom and John Reppy. In 21st International Symposium on Implementation and Application of Functional Languages (IFL 2009), Volume 6041 of Lecture Notes in Computer Science, pages 90--106, New York, NY, September 2009. Springer-Verlag.
[ bib | .pdf ]

Compilers for polymorphic languages are required to treat values in programs in an abstract and generic way at the source level. The challenges of optimizing the boxing of raw values, flattening of argument tuples, and raising the arity of functions that handle complex structures to reduce memory usage are old ones, but take on newfound import with processors that have twice as many registers. We present a novel strategy that uses both control-flow and type information to provide an arity raising implementation addressing these problems. This strategy is conservative --- no matter the execution path, the transformed program will not perform extra operations.
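
A source-level before/after picture of the transformation (our OCaml illustration; the actual analysis and rewrite operate on the compiler's intermediate representation):

    (* Before: every caller heap-allocates a pair just to pass two floats. *)
    let dist2_boxed (p : float * float) =
      let (x, y) = p in
      (x *. x) +. (y *. y)

    (* After arity raising: the tuple is flattened into two arguments that
       can be passed in registers. Safe only if the analysis shows that all
       call sites can be rewritten to match. *)
    let dist2_raised (x : float) (y : float) =
      (x *. x) +. (y *. y)

    let () = assert (dist2_boxed (3.0, 4.0) = dist2_raised 3.0 4.0)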

Parallel Concurrent ML.
John Reppy, Claudio Russo, and Yingqi Xiao. In Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming (ICFP 2009), pages 257--268, New York, NY, September 2009. ACM.
[ bib | DOI | .pdf ]

Concurrent ML (CML) is a high-level message-passing language that supports the construction of first-class synchronous abstractions called events. This mechanism has proven quite effective over the years and has been incorporated in a number of other languages. While CML provides a concurrent programming model, its implementation has always been limited to uniprocessors. This limitation is exploited in the implementation of the synchronization protocol that underlies the event mechanism, but with the advent of cheap parallel processing on the desktop (and laptop), it is time for Parallel CML.

Parallel implementations of CML-like primitives for Java and Haskell exist, but build on high-level synchronization constructs that are unlikely to perform well. This paper presents a novel, parallel implementation of CML that exploits a purpose-built optimistic concurrency protocol designed for both correctness and performance on shared-memory multiprocessors. This work extends and completes an earlier protocol that supported just a strict subset of CML with synchronization on input events, but not output events. Our main contributions are a model-checked reference implementation of the protocol and two concrete implementations. This paper focuses on Manticore's functional, continuation-based implementation but briefly discusses an independent, thread-based implementation written in C# and running on Microsoft's stock parallel runtime. Although very different in detail, both derive from the same design. Experimental evaluation of the Manticore implementation reveals good performance, despite the extra overhead of multiprocessor synchronization.
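
For readers unfamiliar with CML, the event interface at the heart of this work can be summarized by a signature like the following (our OCaml rendering of the standard CML operations):

    module type EVENT = sig
      type 'a event   (* a first-class synchronous operation *)
      type 'a chan    (* a synchronous message-passing channel *)

      val send_evt : 'a chan -> 'a -> unit event
      val recv_evt : 'a chan -> 'a event
      val choose   : 'a event list -> 'a event       (* pick one alternative *)
      val wrap     : 'a event -> ('a -> 'b) -> 'b event
      val sync     : 'a event -> 'a                  (* commit and block *)
    end

A protocol can then be packaged behind data and type abstraction; e.g., a swap channel that exchanges values between two threads exposes only a function of type 'a swap_chan -> 'a -> 'a event.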

Implicitly-threaded parallelism in Manticore.
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. In Proceedings of the 13th ACM SIGPLAN International Conference on Functional Programming (ICFP 2008), pages 119--130, September 2008.
[ bib | .pdf ]

The increasing availability of commodity multicore processors is making parallel computing available to the masses. Traditional parallel languages are largely intended for large-scale scientific computing and tend not to be well-suited to programming the applications one typically finds on a desktop system. Thus we need new parallel-language designs that address a broader spectrum of applications. In this paper, we present Manticore, a language for building parallel applications on commodity multicore hardware including a diverse collection of parallel constructs for different granularities of work. We focus on the implicitly-threaded parallel constructs in our high-level functional language. We concentrate on those elements that distinguish our design from related ones, namely, a novel parallel binding form, a nondeterministic parallel case form, and exceptions in the presence of data parallelism. These features differentiate the present work from related work on functional data parallel language designs, which has focused largely on parallel problems with regular structure and the compiler transformations --- most notably, flattening --- that make such designs feasible. We describe our implementation strategies and present some detailed examples utilizing various mechanisms of our language.

A scheduling framework for general-purpose parallel languages.
Matthew Fluet, Mike Rainey, and John Reppy. In Proceedings of the 13th ACM SIGPLAN International Conference on Functional Programming (ICFP 2008), pages 241--252, September 2008.
[ bib | .pdf ]

The trend in microprocessor design toward multicore and manycore processors means that future performance gains in software will largely come from harnessing parallelism. To realize such gains, we need languages and implementations that can enable parallelism at many different levels. For example, an application might use both explicit threads to implement coarse-grain parallelism for independent tasks and implicit threads for fine-grain data-parallel computation over a large array. An important aspect of this requirement is supporting a wide range of different scheduling mechanisms for parallel computation.

In this paper, we describe the scheduling framework that we have designed and implemented for Manticore, a strict parallel functional language. We take a micro-kernel approach in our design: the compiler and runtime support a small collection of scheduling primitives upon which complex scheduling policies can be implemented. This framework is extremely flexible and can support a wide range of different scheduling policies. It also supports the nesting of schedulers, which is key to both supporting multiple scheduling policies in the same application and to hierarchies of speculative parallel computations.
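
The flavor of the micro-kernel approach can be conveyed by a small sketch (ours, in OCaml; the real primitives are built from first-class continuations and per-processor state): the runtime knows only about fibers and scheduler actions, and a policy such as round-robin is ordinary code layered on top.

    type fiber = unit -> unit      (* a suspended computation *)

    type signal =
      | Stop                       (* the current fiber has finished *)
      | Preempt of fiber           (* timer interrupt: the fiber to resume *)

    (* A scheduler action decides what to run next in response to a signal. *)
    type action = signal -> unit

    (* Round-robin policy over a FIFO queue of ready fibers. *)
    let round_robin (q : fiber Queue.t) : action =
      let rec act = function
        | Stop ->
          (match Queue.take_opt q with
           | Some f -> f ()        (* dispatch the next ready fiber *)
           | None -> ())           (* nothing left to run *)
        | Preempt f ->
          Queue.add f q;           (* requeue the preempted fiber *)
          act Stop
      in
      act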

In addition to describing our framework, we also illustrate its expressiveness with several popular scheduling techniques. We present a (mostly) modular approach to extending our schedulers to support cancellation. This mechanism is essential for implementing eager and speculative parallelism. Finally, we evaluate our framework with a series of benchmarks and an analysis.

Toward a parallel implementation of Concurrent ML.
John Reppy and Yingqi Xiao. In Proceedings of the Workshop on Declarative Aspects of Multicore Programming (DAMP 2008), January 2008.
[ bib | .pdf ]

Concurrent ML (CML) is a high-level message-passing language that supports the construction of first-class synchronous abstractions called events. This mechanism has proven quite effective over the years and has been incorporated in a number of other languages. While CML provides a concurrent programming model, its implementation has always been limited to uniprocessors. This limitation is exploited in the implementation of the synchronization protocol that underlies the event mechanism, but with the advent of cheap parallel processing on the desktop (and laptop), it is time for Parallel CML.

We are pursuing such an implementation as part of the Manticore project. In this paper, we describe a parallel implementation of Asymmetric CML (ACML), which is a subset of CML that does not support output guards. We describe an optimistic concurrency protocol for implementing CML synchronization. This protocol has been implemented as part of the Manticore system.
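
The core move of such an optimistic protocol can be sketched with an atomic state word (our OCaml illustration, using the stdlib Atomic module; the real protocol manages whole event-synchronization records rather than a single flag):

    type state = Waiting | Claimed | Synched

    (* A thread that finds a matching communication tries to claim it.
       Losing the race just means another thread synchronized first. *)
    let try_claim (st : state Atomic.t) : bool =
      Atomic.compare_and_set st Waiting Claimed

    let complete (st : state Atomic.t) = Atomic.set st Synched

    let () =
      let st = Atomic.make Waiting in
      if try_claim st then
        complete st      (* we won: perform the communication, mark done *)
      (* else: another thread claimed this sync; pick another candidate *)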

Specialization of CML message-passing primitives.
John Reppy and Yingqi Xiao. In Proceedings of the 34th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2007), pages 315--326, January 2007.
[ bib | DOI | .pdf ]

Concurrent ML (CML) is a statically-typed higher-order concurrent language that is embedded in Standard ML. Its most notable feature is its support for first-class synchronous operations. This mechanism allows programmers to encapsulate complicated communication and synchronization protocols as first-class abstractions, which encourages a modular style of programming where the underlying channels used to communicate with a given thread are hidden behind data and type abstraction.

While CML has been in active use for well over a decade, little attention has been paid to optimizing CML programs. In this paper, we present a new program analysis for statically-typed higher-order concurrent languages that enables the compile-time specialization of communication operations. This specialization is particularly important in a multiprocessor or multicore setting, where the synchronization overhead for general-purpose operations is high. Preliminary results from a prototype that we have built demonstrate that specialized channel operations are much faster than the general-purpose operations.

Our analysis technique is modular (i.e., it analyzes and optimizes a single unit of abstraction at a time), which plays to the modular style of many CML programs. The analysis consists of three steps: the first is a type-sensitive control-flow analysis that uses the program's type abstractions to compute more precise results. The second is the construction of an extended control-flow graph using the results of the CFA. The last step is an iterative analysis over the graph that approximates the usage patterns of known channels. Our analysis is designed to detect special patterns of use, such as one-shot channels, fan-in channels, and fan-out channels. We have proven the safety of our analysis and state those results.
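
As an illustration of the payoff (ours, not from the paper): a channel that the analysis proves carries exactly one communication can shed the queues and generality of a full CML channel, reducing to little more than a slot and a flag. A sketch using Mutex and Condition (in the standard library as of OCaml 5):

    type 'a one_shot = {
      mutable value : 'a option;
      m : Mutex.t;
      c : Condition.t;
    }

    let make () = { value = None; m = Mutex.create (); c = Condition.create () }

    let send ch v =
      Mutex.lock ch.m;
      ch.value <- Some v;          (* exactly one send, so no queue *)
      Condition.signal ch.c;
      Mutex.unlock ch.m

    let recv ch =
      Mutex.lock ch.m;
      while Option.is_none ch.value do Condition.wait ch.c ch.m done;
      let v = match ch.value with Some v -> v | None -> assert false in
      Mutex.unlock ch.m;
      v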

Manticore: A heterogeneous parallel language.
Matthew Fluet, Mike Rainey, John Reppy, Adam Shaw, and Yingqi Xiao. In Proceedings of the Workshop on Declarative Aspects of Multicore Programming (DAMP 2007), pages 37--44, January 2007.
[ bib | .pdf ]

The Manticore project is an effort to design and implement a new functional language for parallel programming. Unlike many earlier parallel languages, Manticore is a heterogeneous language that supports parallelism at multiple levels. Specifically, we combine CML-style explicit concurrency with NESL/Nepal-style data-parallelism. In this paper, we describe and motivate the design of the Manticore language. We also describe a flexible runtime model that supports multiple scheduling disciplines (e.g., for both fine-grain and coarse-grain parallelism) in a uniform framework. Work on a prototype implementation is ongoing and we give a status report.


This file was generated by bibtex2html 1.98.


Last updated on February 22, 2022