By Tara Laskowski
When Disneyland was first constructed, Walt Disney told the builders not to pave the sidewalks. Let’s first see where the people walk, he said.
This logic is behind the concept of data science—looking for the hidden, unexpected patterns in the information first instead of trying to guess what the outcome might be.
Data science, or “big data,” is fundamentally changing the way we conduct science. And that is what Kirk Borne finds so exciting about his job.
“In traditional science, you create a hypothesis—you speculate what you think might happen, and then you set off to prove yourself right or wrong,” says Borne, professor of astrophysics and computational sciences at Mason. “Now, in this age where we have so much data, we don’t have to guess. We just have to sift through all the information and find the patterns.”
These patterns are possibilities and connections that would be impossible to guess without big data. Though the algorithms and mathematics behind data mining are very complex, the concept itself is quite simple. If you collect enough information about something—in the same way a forensic investigator can collect enough information about a crime scene—then you can put a complete picture together.
“There are two things that are great about big data,” Borne says. “You can have the best statistical analysis ever of normal things and also the ability to discover the unusual.”
Borne first became intrigued with data science more than a dozen years ago when he was working for NASA. Back then, as now, he was considered one of the leading experts on the subject and was asked to brief President George W. Bush on the possibilities, although other circumstances prevented that briefing from happening. “After 9/11, the government was very interested in learning how to look for red flags to try to prevent something like that from happening again,” Borne says.
Since then, big data have become more and more essential to business, government, and science. Despite privacy concerns and worries about what all this information means, the pros outweigh the cons in many people’s minds. And in a world of information sharing through social networks, it is almost impossible to hide from data tracking and still function normally in society.
Log in to your Netflix account and your recommended movies are based on a clustering algorithm that analyzes the rental patterns of other people who liked the movie you watched last week. Visit a doctor while traveling and electronic medical records will enable you to receive better treatment. Check out at the grocery store and the coupons you receive are targeted to your past transactions.
And for scientists like Borne, data science may lead to the discovery of a new physical law or process. It means revealing new planets, new galaxies, and entirely new kinds of objects.
“My favorite part of data science is discovering the outlier—the thing that doesn’t fit,” he says. Borne calls this “surprise discovery.” Think of the Sesame Street song “One of These Things Is Not Like the Other”—but instead of four Muppets from the neighborhood, Borne is looking at billions of objects in the universe.
“In the era of big data, we’re going to start to see the one-in-quintillion thing that you could never have imagined to discover—the needle in the haystack (if the haystack was the size of Earth). And before, if you did find it, you might think it’s an anomaly—but if you find a bunch of them you can begin to see its reality, begin to see the pattern emerge,” he says.
Borne is working on the Large Synoptic Survey Telescope (LSST), a powerful telescope that researchers hope will be built and online by the end of the decade. The LSST will create a 10-year movie of the section of sky visible from its perch atop a mountain in South America.
Borne will help design the data mining techniques that will sift through the massive amounts of data gathered by the telescope. He is also leading a nationwide scientific collaboration group that will conduct data science research with the LSST data repository, which will be one of the largest scientific databases ever assembled. The LSST data archive will consist of nearly 100 petabytes of data, roughly equivalent to 100 times all the words printed in all the books in all the libraries in the world.
“There are not enough graduate students in the world to look at all those,” he jokes. “So the science part is finding the best algorithms to help make the discovery. The LSST project team will provide open public access to all of these data. It will be the telescope for everyone. The scientific knowledge discovery potential of the LSST database is staggering. And I cannot wait.”