End-To-End-Provenance
Provenance (also known as pedigree or lineage) refers to the complete history of a document. In the scientific community, provenance refers to the information that describes data in sufficient detail to facilitate reproduction and enable validation of results. In the archival community, provenance refers to the chain of ownership and the transformations a document has undergone. However, in most computer systems today, provenance is an after-thought, implemented as an auxiliary indexing structure parallel to the actual data.
Our goal in this project is to design, build, and study an end-to-end system that extends all the way from original data analyses by real scientists to management and analysis of the resulting provenance in a common framework with common tools.
- R Tools: Provenance tools for the R statistical language
- Extended PROV-JSON: An extension of the W3C PROV-JSON standard for fine-grained provenance
- CamFlow: A practical whole-system provenance capture mechanism for the Linux OS
- Unicorn: An intrusion detection system
- Xanthus: An automated reproducibility platform
Active Contributors
Past Contributors
- Shay Adams, Mount Holyoke College
- Orenna Brand, Columbia University
- Vasco Carinhas, Universidad de Puerto Rico en Arecibo
- Marios Dardas, Holy Cross College
- Elizabeth Fong. Mount Holyoke College
- Connor Gregorich-Trevor, Grinnell College
- Nikki Hoffler, Mount Holyoke College
- Jen Johnson, Middlebury College
- Andy Kaldunski, Ripon College
- Alex Liu, Amherst College
- Khanh Ngo, Mount Holyoke College
- Erick Oduniyi, University of Kansas
- Miruna Oprescu, Harvard University
- Huiyun Peng, Mount Holyoke College
- Luis Perez, Harvard University
- Lia Poulos, Mount Holyoke College
- Moe Pwint Phyu, Mount Holyoke College
- Garrett Rosenblatt, University of Rochester
- Rose Sheehan, Mount Holyoke College
- Sofiya Taskova, Mount Holyoke College
- Cory Teshera-Sterne, Mount Holyoke Colege
- Morgan Vigil, Westmont College
- Mohammed Rahman, Indian Institutes of Technology
- Yujia Zhou, Dickinson College
Publications
2022
- Barbara Lerner, Emery Boose, Orenna Brand, Aaron M. Ellison, Elizabeth Fong, Matthew Lau, Khanh Ngo, Thomas Pasquier, Luis A. Perez, Margo Seltzer, Rose Sheehan and Joseph Wonsil , Making Provenance Work for You, The R Journal, Volume 14, Number 4, December 2022, pages 141-159.
2020
- Xueyuan Han, James Mickens, Ashish Gehani, Margo Seltzer, and Thomas Pasquier. 2020. Xanthus: Push-button Orchestration of Host Provenance Data Collection Proceedings of the Third Annual Workshop on Practical Reproducible Evaluation of Computer Systems (Co-located with HPDC 2020).
- Aaron Ellison, Emery Boose, Barbara Lerner, Elizabeth Fong, and Margo Seltzer. 2020. People of Data: The End-to-End Provenance Project.
- Matthew Lau, Thomas Pasquier, and Margo Seltzer. 2020. Rclean: A Tool for Writing Cleaner, More Transparent Code. JOSS: The Journal of Open Source Software.
- Xueyuan Han, Thomas Pasquier, Adam Bates, James Mickens, and Margo Seltzer. 2020. Unicorn: Runtime Provennace-Based Detector for Advanced Persistent Threats. Network and Distributed System Security Symposium (NDSS).
- Jingmei Hu, J., Jiwon Joung, Maia Jacobs, Krzysztof Gajos, and Margo Seltzer. 2020. Improving Data Scientist Efficiency with Provenance. International Conference on Software Engineering.
2019
- Sheung Chi Chan, James Cheney, Pramod Bhatotia, Thomas Pasquier, Ashish Gehani, Hassaan Irshad, Lucian Carata, and Margo Seltzer. 2019. ProvMark: A provenance expressiveness benchmarking system. International Middleware Conference.
- Thomas Pasquier, David Eyers, and Margo Seltzer. 2019. Visonpaper – From Here to Provtopia. Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data (POLY, co-located with VLDB 2019).
2018
- Xueyuan Han, Thomas Pasquier, and Margo Seltzer. 2018. Provenance-based Intrusion Detection: Opportunities and Challenges. In Workshop on the Theory and Practice of Provenance (TaPP’18). USENIX.
- Barbara Lerner, Emery Boose, and Luis Perez. 2018. Using Introspection to Collect Provenance in R. Informatics, 5, 12.
- Thomas Pasquier, Xueyuan Han, Thomas Moyer, Adam Bates, Olivier Hermant, David Eyers, Jean Bacon, and Margo Seltzer. 2018. Runtime Analysis of Whole-System Provenance. In Conference on Computer and Communications Security (CCS’18). ACM.
- Thomas Pasquier, Matthew Lau, Xueyuan Han, Elizabeth Fong, Barbara Lerner, Emery Boose, Merce Crosas, Aaron Ellison, and Margo Seltzer. 2018. Sharing and Preserving Computational Analyses for Posterity with encapsulator. IEEE Computing in Science and Engineering (CiSE).
2017
- Emery Boose and Barbara Lerner. 2017. Replication of Data Analyses: Provenance in R. In Stepping in the Same River Twice: Replication in Biological Research, edited by A. Shavit and A. M. Ellison. Yale University Press.
- Xueyuan Han, Thomas Pasquier, Tanvi Ranjan, Mark Goldstein, and Margo Seltzer. 2017. FRAPpuccino: Fault-detection through Runtime Analysis of Provenance. In Workshop on Hot Topics in Cloud Computing (HotCloud’17). USENIX.
- Thomas Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David Eyers, Margo Seltzer, and Jean Bacon. 2017. Practical Whole-System Provenance Capture. In Symposium on Cloud Computing (SoCC’17). ACM.
- Thomas Pasquier, Matthew K. Lau, Ana Trisovic, Emery R. Boose, Ben Couturier, Mercè Crosas, Aaron M. Ellison, Valerie Gibson, Chris R. Jones, and Margo Seltzer. 2017. If these data could talk. Nature Scientific Data, 4.
- Thomas Pasquier, Jatinder Singh, Julia Powles, David Eyers, Margo Seltzer, and Jean Bacon. 2017. Data provenance to audit compliance with privacy policy in the Internet of Things. Springer Personal and Ubiquitous Computing.
2015
2014
- Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Seltzer, and Andy Hopper. 2014. A primer on provenance Communications of the ACM, 57, 5, Pp. 52–60.
2013
- Michelle A Borkin, Chelsea S Yeh, Madelaine Boyd, Peter Macko, Krzysztof Z Gajos, Margo Seltzer, and Hanspeter Pfister. 2013. Evaluation of filesystem provenance visualization tools IEEE Transactions on Visualization and Computer Graphics, 19, 12, Pp. 2476–2485.
- Peter Macko, Daniel Margo, and Margo Seltzer. 2013. Local clustering in provenance graphs In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, Pp. 835–840. ACM.
2011
2010
- Kiran-Kumar Muniswamy-Reddy and Margo Seltzer. 2010. Provenance as first class cloud data ACM SIGOPS Operating Systems Review, 43, 4, Pp. 11–16.
- Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo I Seltzer. 2010. Provenance for the Cloud In FAST, 10: Pp. 15–14.
- Aaron M Ellison. 2010. Repeatability and transparency in ecological research Ecology, 91, 9, Pp. 2536–2539. Publisher’s Version
- Uri Braun, Margo I Seltzer, Adriane Chapman, Barbara T Blaustein, David M Allen, and Len Seligman. 2010. Towards Query Interoperability: PASSing PLUS In TaPP, Pp. 1–10.
Pages
2009
- Daniel W Margo and Margo I Seltzer. 2009. The Case for Browser Provenance In Workshop on the Theory and Practice of Provenance.
- Kiran-Kumar Muniswamy-Reddy, Uri Braun, David A Holland, Peter Macko, Diana L MacLean, Daniel W Margo, Margo I Seltzer, and Robin Smogor. 2009. Layering in Provenance Systems In USENIX Annual technical conference.
- Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo I Seltzer. 2009. Making a Cloud Provenance-Aware In Workshop on the Theory and Practice of Provenance.
- James Cheney, Stephen Chong, Nate Foster, Margo Seltzer, and Stijn Vansummeren. 2009. Provenance: a future history In Proceedings of the 24th ACM SIGPLAN conference companion on Object oriented programming systems languages and applications, Pp. 957–964. ACM.
2008
- David A Holland, Uri Jacob Braun, Diana Maclean, Kiran-Kumar Muniswamy-Reddy, and Margo I Seltzer. 2008. Choosing a data model and query language for provenance.
- David A Holland, Margo I Seltzer, Uri Braun, and Kiran-Kumar Muniswamy-Reddy. 2008. PASSing the provenance challenge Concurrency and Computation: Practice and Experience, 20, 5, Pp. 531–540.
- Uri Braun, Avraham Shinnar, and Margo I Seltzer. 2008. Securing Provenance In HotSec.
2007
2006
- Uri Braun, Simson Garfinkel, David A Holland, Kiran-Kumar Muniswamy-Reddy, and Margo I Seltzer. 2006. Issues in automatic provenance collection In International Provenance and Annotation Workshop, Pp. 171–183. Springer.
- Kiran-Kumar Muniswamy-Reddy, David A Holland, Uri Braun, and Margo I Seltzer. 2006. Provenance-aware storage systems In USENIX Annual Technical Conference, General Track, Pp. 43–56.
2005
- Jonathan Ledlie, Chaki Ng, and David A Holland. 2005. Provenance-aware sensor data storage In Data Engineering Workshops, 2005. 21st International Conference on, Pp. 1189–1189. IEEE.
Acknowledgments
R tool development started in 2013 at the Harvard Forest and was supported by NSF grants DEB-1237491, DBI-1459519, and SSI-1450277, a Charles Bullard Fellowship at Harvard University, and a faculty fellowship from Mount Holyoke College, and is a contribution from the Harvard Forest Long-Term Ecological Research (LTER) program. This project benefited greatly from earlier work by Lee Osterweil, Lori Clarke, and Alexander Wise at the University of Massachusetts Amherst.
CamFlow development started in 2014 at the University of Cambridge’s Opera Research Group under grant EPSRC EP/K011510/1. Later development was supported by Harvard University’s Center for Research on Computation and Society under NSF grant SSI-1450277. We also acknowledge the support of the University of Cambridge’s Digital Technology Group and the University of Bristol Cyber Security Group.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).
Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG).