A Magazine for the George Mason University Community

Using Algorithms to Conduct Large-Scale Metagenome Analysis

March 12, 2013

By Robin Herron

Huzefa Rangwala is a problem solver; no, make that a big-problem solver.

As a computer scientist, Rangwala is hungry for data on which to test his algorithms. And metagenomics, the study of the collective genomes of communities of microbes, provides lots of data.

“If you ask a computer scientist what a genome is, he’ll say it’s a long, long string of characters—it’s a big sequence. Computer scientists get very excited about these kinds of structures,” he explains.

In 2007, the National Institutes of Health launched the Human Microbiome Project to study the microbial communities (bacteria, fungi, and viruses) that live within the body and even on a person’s skin. More recently, the Earth Microbiome Project was established to analyze the microbial communities that live across the planet. Understanding the functions of these communities and their relationships to one another and to their hosts begins with sequencing their DNA, that is, reading its structure. But the communities are so numerous that innovative, scalable computational algorithms must be developed to do it. This is where Rangwala comes in.

As an undergraduate in his native India, Rangwala was trained strictly in computer science. But when he took a bioinformatics class at the University of Minnesota while studying for his PhD, he says, “I got more and more interested in understanding the biology and developing algorithms or methods that might be applicable to biologists.”

Shortly after joining Mason in 2008, Rangwala connected with Mason environmental biologist Patrick Gillevet, director of the Microbiome Analysis Center. A mutually fruitful collaboration developed, and the two have since worked together on several joint projects.

Rangwala says, “Pat has all these biological questions with big challenges, and I’m here to solve those for him. Metagenome association is one of the projects he’s working on related to the field of data mining—finding patterns in data.”

“Huzefa has been a very valuable colleague and collaborator,” Gillevet says. “He has helped develop novel algorithms and tools to analyze the microbiome data.”

In developing algorithms to analyze such data, Rangwala aims for speed, efficiency, and accuracy.

“If your algorithm is faster than someone else’s algorithm, then you’ll be able to process the data much faster. We want to split the algorithm and run this algorithm on several machines. Doing that can be challenging because different parts of the algorithm may need to finish at the same time or they might have some sort of dependencies on each other. So, when I’m designing my algorithms, I’m thinking, what is the best way to find concurrency?”
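To make the idea of splitting work concrete, here is a minimal sketch in Python. It is not Rangwala's code: the task (counting short substrings, or k-mers, in sequence reads), the function names `count_kmers` and `parallel_kmer_counts`, and the use of threads on one machine in place of several machines are all assumptions made for illustration. The chunks are processed independently, and the only dependency is the final merge, which has to wait until every chunk has finished.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k=2):
    """Count every length-k substring across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def parallel_kmer_counts(reads, k=2, workers=4):
    """Split reads into chunks, count each chunk concurrently, then merge.

    Threads stand in for separate machines here (an assumption for the
    sketch); the merge step is the one point where all workers must be done.
    """
    # Round-robin split so the chunks are roughly the same size.
    chunks = [reads[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in chunk order; list() waits for all.
        partials = list(pool.map(lambda chunk: count_kmers(chunk, k), chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts together
    return total


reads = ["ACGT", "ACGA", "TTTT"]
total = parallel_kmer_counts(reads, k=2, workers=2)
```

Because the per-chunk counts simply add up, the merged result matches what a single worker would compute over all the reads. Problems whose pieces depend on one another in more complicated ways are harder to divide, which is the challenge Rangwala describes.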

Beyond that, Rangwala wants to develop algorithms that are user friendly.

When he first began working with Gillevet, Rangwala developed a laboratory information management system for him that included a web interface.

Gillevet’s lab has a sequencing machine that produces about 100,000 sequences of data per run. “How do you store this data efficiently, how do you back it up, how do you transfer it over your Internet or cable? All these factors become core issues, and he was facing these kinds of problems,” Rangwala says.

With the management system in place, the two were able to collaborate more efficiently. “He’s providing me some biological expertise and data, and I’m providing him these tools within a web interface that are really helpful to him. He can proceed with his analysis on his own and repeat them as often as he likes,” Rangwala says.

“He also wanted to analyze these data sets individually on a single machine,” Rangwala continues. “But that takes too long. So we came up with some ideas on approximations. Instead of analyzing the entire set, could we cluster them?” A recent paper Rangwala published with colleagues explained how they developed a process to do that: putting similar examples or similar objects in the same groups and analyzing the representatives for each group.
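The general idea of grouping similar sequences and keeping one representative per group can be sketched in a few lines of Python. This is only an illustration, not the method from the paper: the similarity measure (Jaccard overlap of k-mer sets), the greedy assignment, the threshold of 0.5, and the names `kmers`, `jaccard`, and `greedy_cluster` are all assumptions chosen to keep the example short.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(sequences, k=3, threshold=0.5):
    """Assign each sequence to the first representative it resembles.

    Returns a dict mapping each representative to the list of its members.
    A sequence that matches no existing representative starts a new group.
    """
    clusters = {}  # representative -> members
    profiles = {}  # representative -> its k-mer set
    for seq in sequences:
        profile = kmers(seq, k)
        for rep, rep_profile in profiles.items():
            if jaccard(profile, rep_profile) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: seq founds a new group.
            clusters[seq] = [seq]
            profiles[seq] = profile
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(reads)
representatives = list(clusters)
```

With these four toy reads, the first and third become representatives and the other two join their groups, so any downstream analysis would run on two sequences instead of four. That saving is the point of the approximation, at the cost of treating each group's members as interchangeable with its representative.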

With the technology available, Rangwala can sequence all the bacteria on a slide sample. But since all the bacteria are mixed together, the sequence doesn’t tell biologists what they need to know: what kinds of bacteria there are, how abundant they are, which are the dominant species, and what the bacteria do.

“That’s a huge problem,” Rangwala says. He’s now working on approaches that will extract the underlying relationship between these different problems—combine them and produce a better annotation of the bacterial species.

In addition to working with Gillevet, Rangwala is collaborating with other researchers at Mason, Rush University Medical Center in Chicago, the University of Minnesota, and the U.S. Department of Agriculture (USDA). His research is supported by grants from Mason, the National Institutes of Health, the National Science Foundation, the Defense Advanced Research Projects Agency, and the USDA.

“My job is to create new algorithms, but I’m excited that [my work] has an impact in the fields of biology and environmental sciences right now,” Rangwala says.
