Towards Understanding Data Analysis Workflows using a Large Notebook Corpus


The advent of big data analysis as a profession as well as a hobby has brought an increase in novel forms of data exploration and analysis, particularly ad-hoc analysis. Analysis of raw datasets using frameworks such as pandas and R has become very popular [8]. Typically, these workflows are geared towards ingesting and transforming data in an exploratory fashion in order to derive knowledge while minimizing time-to-insight. However, very little work has studied the usability and performance concerns of such unstructured workflows.

Proceedings of the 2019 International Conference on Management of Data - SIGMOD '19