End-to-end Provenance


Technology continues to change the way that scientists work. Nearly all scientific data are analyzed with computers, and data are increasingly collected directly in electronic form. Sensor networks are a good example: they may use electronic sensors and wireless networks to collect vast quantities of data at very fast rates. Scientific programs, ranging from Excel spreadsheets to supercomputer applications, manipulate the collected data to produce scientific results. Scientists can then disseminate both the raw and the processed data quickly, and to a broad, unknown audience, by publishing them on their websites.

Good science requires more than results. It requires reproducibility, verifiability, and authentication. Reproducibility is necessary to ensure that the results are not an accidental outcome but the product of genuine, carefully performed experimentation and analysis. Verifiability is necessary to ensure that the results really did derive from the data, even when reproducing the experiment is not a viable option. Finally, authentication is necessary to trust that the raw data used in the scientific work are themselves valid. Without confidence on these three points, data posted on the Internet are no more credible than a typical Wikipedia article.

For example, data may be collected by sensors, downloaded to a computer, perhaps run through some scripts that perform calibration and cleaning, and then posted on a website for public use, all without a scientist checking their validity. What can go wrong? An anemometer might freeze in an ice storm and incorrectly report a wind speed of 0. A sensor might slip out of calibration over time, but the amount of slippage will remain unknown until the sensor is shipped back to the manufacturer for calibration tests, most likely long after the data have been made publicly available. And so on. Given the pace at which sensors produce data and programs manipulate them, it is clear that documentation of the data's provenance must itself be automated, so that there can be some hope of understanding the data and correcting for errors that arise in their collection or handling.
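
To make the idea of automated provenance concrete, here is a minimal Python sketch of a processing pipeline that records what each step did to the data. It is an illustration only, not this project's implementation. Every name in it (`run_step`, `drop_frozen_runs`, `calibrate`, the record layout, the 0.5 m/s offset, and the three-reading threshold) is a hypothetical choice made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(records):
    """Return a stable SHA-256 digest of a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, records, log):
    """Apply one processing step and append a provenance entry to `log`."""
    result = func(records)
    log.append({
        "step": name,
        "input_sha256": sha256_of(records),
        "output_sha256": sha256_of(result),
        "records_in": len(records),
        "records_out": len(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


def drop_frozen_runs(records, run_length=3):
    """Hypothetical cleaning rule: drop runs of `run_length` or more
    consecutive zero readings, which may indicate a frozen anemometer."""
    kept, run = [], []
    for r in records:
        if r["wind_speed"] == 0:
            run.append(r)
            continue
        if len(run) < run_length:
            kept.extend(run)  # a short run of zeros is kept as plausible calm
        run = []
        kept.append(r)
    if len(run) < run_length:
        kept.extend(run)
    return kept


def calibrate(records, offset=0.5):
    """Hypothetical calibration: add a fixed offset (m/s) to each reading."""
    return [{**r, "wind_speed": r["wind_speed"] + offset} for r in records]


# Example run on made-up readings (wind speed in m/s).
raw = [{"t": i, "wind_speed": v}
       for i, v in enumerate([2.1, 0, 0, 0, 0, 3.4, 0, 1.8])]
provenance = []
cleaned = run_step("drop_frozen_runs", drop_frozen_runs, raw, provenance)
final = run_step("calibrate", calibrate, cleaned, provenance)
print(json.dumps(provenance, indent=2))
```

Because each entry stores a digest of its input and output, a reader of the published data can check that the output of one step is exactly the input of the next, and can see how many readings a cleaning step removed. If the calibration offset later turns out to be wrong, the log shows which published values passed through that step.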


Authors and Contributors