Topic Day Required Reading Optional Reading Notes
Data Lakes and Spaces Tue Data warehousing and analytics infrastructure at Facebook
Thur From databases to dataspaces: a new abstraction for information management
Principles of dataspace systems
Webtables: exploring the power of tables on the web
Collaborative Analytics & Metadata Management Tue Pay-as-you-go user feedback for dataspace systems
Goods: Organizing Google’s Datasets
A dataspace odyssey: The iMeMex personal dataspace management system
Thur Ground: A Data Context Service
The Data Civilizer System
Datahub: Collaborative data science & dataset version management at scale
Metadata-driven social collaborative data analysis
Towards Large-Scale Data Discovery
Data Ingestion & Representation Tue Invisible loading: access-driven data transfer from raw files into database systems
Don't Hold My Data Hostage - A Case For Client Protocol Redesign
Thur A Partitioning Framework for Aggressive Data Skipping
A Robust Partitioning Scheme for Ad-Hoc Query Workloads
Instant loading for main memory databases
Data Integration Tue CLAMS: Bringing Quality to Data Lakes
Web-scale Data Integration: You can only afford to Pay As You Go
Data Integration for the Relational Web
Thur SLiMFast: Guaranteed Results for Data Fusion and Source Reliability
Data Tamer

Data Cleaning Tue CrowdER: Crowdsourcing Entity Resolution
Potter's wheel: An interactive data cleaning system


Schema & Structure Extraction Tue From dirt to shovels: fully automatic tool generation from ad hoc data
Tupni: automatic reverse engineering of input formats
Navigating the Data Lake with Datamaran: Automatically Extracting Structure from Log Datasets

The PADS project: an overview
Thur DeepDive: declarative knowledge base construction (CACM)
FlashExtract: a framework for data extraction by examples

DEByE – Data Extraction By Example
Ad-Hoc Exploration & In-Situ Analytics
Parallel data analysis directly on scientific file formats
Distributed and Interactive Cube Exploration
Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing
FluxQuery: An Execution Framework for Highly Interactive Query Workloads
Thur Alpine: Efficient In-Situ Data Exploration in the Presence of Updates
DiNoDB: An Interactive-Speed Query Engine for Ad-Hoc Queries on Temporary Data

Fast Queries Over Heterogeneous Data Through Engine Customization
Approximation & Estimation Tue Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters
Histograms as a side effect of data movement for big data
Approximate Query Processing: No Silver Bullet

Interfaces Tue Data Wrangling: The Challenging Journey from the Wild to the Lake
Wrangler: Interactive visual specification of data transformation scripts

Proactive wrangling: mixed-initiative end-user programming of data transformation scripts
Thur Staging User Feedback toward Rapid Conflict Resolution in Data Fusion
AVA-Chatbot for Data Science

Desiderata Tue