CHICAGO — Reproducibility is the cornerstone of science. In order for scientists to make advancements, they must be able to validate and build on each other’s work. Now that so much science relies on computations and data, many researchers are struggling to share their computational artifacts in ways that are usable for others, said Tanu Malik, an assistant professor in DePaul University’s College of Computing and Digital Media.
“We have results that are generated through computational artifacts but are being presented on PDF papers. As a researcher, there are no easy means for verifying the results being presented,” said Malik. “Emailing and sharing through websites are old methods. We need more efficient and usable methods to verify results from complex scientific experiments.”
The National Science Foundation has awarded Malik a Faculty Early Career Development (CAREER) grant to support her work laying the foundation for reproducibility in real-world computational and data science. Malik’s project will also raise awareness of the need for computational reproducibility tools through a research and education plan involving scientists, students and instructors. The $498,889, five-year research grant is NSF’s most prestigious award in support of early-career faculty.
Hitting on an idea
In 2013, while working with a group of geoscientists as a research associate scientist at the University of Chicago, Malik knew she was onto something. Spread across seven universities, the geoscientists were trying to collect and run their computations together, but it wasn’t working. Malik and her colleagues created a tool, called the Sciunit container (http://sciunit.run), that could align not just the data but also the programs and environments in which the data had been created. The geoscientists had been trying to share data and computation for several years.
Malik’s system gave them results in 30 minutes.
“They were able to run this tool, and it gathered everything from different machines and made it portable. It became a huge thing,” Malik said. She had discovered that sharing program code and data wasn’t enough: researchers also need what’s called the “compute environment” to ensure that a computation runs the same way and produces largely the same outputs. Malik likened the problem to downloading a new program onto a personal computer, only to find that it won’t run. “That’s the kind of situation we’re trying to avoid.”
The solution, said Malik, is to make everything portable, including the data, the program and the operating system, so that others can reproduce the research faster and move ahead. At that time, NSF recognized the importance of the work with a $1.3 million grant, and Malik moved her research to DePaul in 2016.
“DePaul gave me the bandwidth to actually go deeper into this problem and really think from a computational aspect. I am looking at how containers should be designed to make them really robust for different kinds of computations,” said Malik, who co-directs the Data Systems and Optimization Lab in DePaul’s School of Computing.
Reproducibility as a spectrum
Malik’s work will also make it easier for researchers to judge whether their own experiments are reproducible. Her research aims to define the phases of reproducibility in computational research.
“You may want to do verification with different data sets, with different input parameters. So how do we make that verification fast? The underlying technology that we use in all of this is what is known as data provenance. It’s capturing the provenance of the entire compute, or the history of how exactly it happened. And this time, this is what you have changed,” Malik explained.
The term data provenance is derived from the art world, said Malik, and it refers to how data was created.
“Data always interests me,” said Malik. “And the provenance of data seemed like a cool thing to study. You always look at your files — and I think, ‘how did I generate this file?’ These are questions that come very naturally when I’m working, and I felt that provenance is important and wanted to explore it more.”
Recognition and work ahead
The CAREER grant is awarded to scientists who have the potential to serve as academic role models in research and education, and who can lead advances in the mission of their department or organization.
At the heart of this effort is Malik’s work with students at DePaul. This spring, she created an advanced graduate course in the School of Computing about containers and reproducibility, and she said students were enjoying it. The CAREER grant will allow Malik to involve more students in her research, especially those in DePaul’s data science program. She also hopes to draw more women into the work, as the representation of women in computer science still lags, said Malik.
“The number of women who get funded in this area is abysmally low — so I think it’s a big deal,” said Malik. “I just feel honored to have that opportunity. If I could share somehow that would be fantastic.”
Malik added that coming to DePaul has helped give her the time and space to do the work she “always wanted to do.”
“I have been doing this work for some time now, and the fact that this work is being recognized, that we did make an impact in a few lives by making it simpler, it feels good,” said Malik. “NSF has recognized my work, and is helping us to expand this further to make a greater impact. That’s the ultimate fun, to make a dent in this hard problem.”