Falkon: Data Diffusion
Scientific and data-intensive applications often require exploratory analysis of large datasets. This analysis is typically carried out on large-scale distributed resources, where data locality is crucial to achieving high system throughput and performance. We propose a “data diffusion” approach that acquires resources for data analysis dynamically, schedules computations as close to the data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and “cached” to allow faster responses to subsequent requests; resources are released when demand drops. Depending on the application workloads and the performance characteristics of the underlying infrastructure, this approach can provide the benefits of dedicated hardware without the associated high costs. The data diffusion concept is reminiscent of cooperative Web caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The main challenge for our approach is that storage resources must be co-allocated with computation resources to enable the efficient analysis of possibly terabytes of data, without prior knowledge of the characteristics of application workloads. To explore data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and dispatches analysis tasks to those resources. We have extended Falkon so that compute resources can cache data on local disks and so that tasks are dispatched by a data-aware scheduler. The integration of Falkon with the Swift parallel programming system gives us access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.
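The three ideas described above (acquire resources as demand grows, dispatch tasks to the node that already caches their data, release idle resources) can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not Falkon's implementation or API: the names `Executor`, `DataAwareScheduler`, `max_executors`, and `max_queue`, the LRU eviction policy, and the "least loaded" fallback are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class Executor:
    """Hypothetical compute node with a small LRU cache of file names."""

    def __init__(self, name, capacity=2):
        self.name = name
        self.capacity = capacity      # max number of cached files (assumed)
        self.cache = OrderedDict()    # file name -> True, in LRU order
        self.queued = 0               # tasks currently assigned to this node

    def access(self, filename):
        """Record an access to filename; return True on a cache hit."""
        if filename in self.cache:
            self.cache.move_to_end(filename)
            return True
        self.cache[filename] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        return False


class DataAwareScheduler:
    """Toy scheduler: prefer cached data, grow under load, shrink when idle."""

    def __init__(self, max_executors, max_queue=1):
        if max_executors < 1:
            raise ValueError("max_executors must be at least 1")
        self.max_executors = max_executors
        self.max_queue = max_queue    # per-node load that counts as "busy"
        self.executors = []
        self._next_id = 0

    def _acquire(self):
        executor = Executor(f"exec-{self._next_id}")
        self._next_id += 1
        self.executors.append(executor)
        return executor

    def dispatch(self, filename):
        """Assign a task that reads filename; return (executor name, cache hit)."""
        holders = [e for e in self.executors
                   if filename in e.cache and e.queued < self.max_queue]
        all_busy = all(e.queued >= self.max_queue for e in self.executors)
        if holders:
            # Data-aware choice: a non-busy node that already caches the file.
            target = min(holders, key=lambda e: e.queued)
        elif all_busy and len(self.executors) < self.max_executors:
            # Demand exceeds capacity: acquire another resource.
            target = self._acquire()
        else:
            # Fall back to the least loaded node; it caches the file on access.
            target = min(self.executors, key=lambda e: e.queued)
        hit = target.access(filename)
        target.queued += 1
        return target.name, hit

    def complete(self, name):
        """Mark one task on the named executor as finished."""
        for executor in self.executors:
            if executor.name == name and executor.queued > 0:
                executor.queued -= 1
                return

    def release_idle(self):
        """Release every executor with no assigned tasks; return how many."""
        before = len(self.executors)
        self.executors = [e for e in self.executors if e.queued > 0]
        return before - len(self.executors)
```

In this toy model, a second request for the same file is routed to the executor that served the first one and counts as a cache hit, while a request arriving when every executor is busy triggers the acquisition of a new one, up to `max_executors`. Calling `release_idle()` once tasks have completed models the release of resources when demand drops.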
Ioan Raicu. “Harnessing Grid Resources with Data-Centric Task Farms”, University of Chicago, Computer Science Department, PhD Proposal, December 2007, Chicago, Illinois.
Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay. “A Data Diffusion Approach to Large Scale Scientific Exploration”, to appear in the Microsoft Research eScience Workshop 2007.
Alex Szalay, Julian Bunn, Jim Gray, Ian Foster, Ioan Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006.