Facilitating Trustworthy Data Science

Published by Zack Ives on February 18, 2018February 18, 2018

When a data analysis result is published or shared, how do we know whether it is trustworthy? Conceptually, we must show that each step of the analysis (and each set of parameters used in the step) was justified, and that the original data was itself trustworthy.

A team of Penn CIS and biomedical researchers are developing solutions for arbitrary data science computations: we instrumenting the analysis code to capture all of the computational stages, input records, and output values used during analysis — this is termed data provenance. We then learn how much to trust each computational stage and input file, in part by getting feedback from human experts, and in part by looking at whether the same stages and input files were used to produce scientifically valid answers for other computations. When mistakes are made, we provide debugging techniques and tools that use provenance to “trace back” faulty answers to the responsible inputs and code. The work is being deployed in high-throughput gene sequencing and neuroscience data analysis tasks here at Penn. More on the project is available at PennProvenance.NET, and the tools are being deployed in IEEG.org, a Penn-developed neuroscience data portal highlighted by NIH Director Francis Collins in a blog post.

For more information, see PennProvenance.NET and the Penn Database Group.

Faculty

Zachary G. Ives
Susan B. Davidson
Val Tannen
Sampath Kannan
Junhyong Kim (Biology, Secondary in CIS)
Brian Litt (Bioengineering and Neurology)

Sponsorship

Funded by grants from the NIH and NSF.

Students

Nan Zheng
Soonbo Han
Yi Zhang
Leshang Chen
Rishabh Gupta (MSE)
Prateek Singhal (MSE)

Facilitating Trustworthy Data Science

Faculty

Sponsorship

Students

Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text

Research and Resources: A Peak at CIS Highlights

C.I.S. Strong: Meet Edo Roth

Facilitating Trustworthy Data Science

Faculty

Sponsorship

Students

Related Posts

Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text

Research and Resources: A Peak at CIS Highlights

C.I.S. Strong: Meet Edo Roth