If you think your last medical bill was expensive, consider this: The estimated cost to develop one drug today is $2.6 billion. And that number is only going up.
The research and development cost of new pharmaceuticals is tied largely to human and capital expenses over a period of years (typically a decade or longer). Discovery, clinical trials and the failed drugs that never make it to market are also folded into this estimate.
But Clemson’s Ilya Safro and his research team are working to reduce the time and cost of biomedical discovery with a text-mining technology that blends medical research and computer science. Created in partnership with the University of South Carolina College of Pharmacology, Safro’s scientific literature-based discovery is positioned to solve some of the most computationally difficult problems of our time. It runs natural language processing, a branch of artificial intelligence that helps computers understand, interpret and manipulate human language, on the Palmetto Cluster to discern answers, or at least possible hypotheses, from millions of previously unconnected biomedical research papers and notions.
Natural language processing, or NLP, draws from many disciplines including computer science and computational linguistics with the goal of bridging human communication and computer understanding.
Safro’s technology, called MOLIERE, is a biomedical hypothesis generator that is emerging as a crucial time-saving technique. It sorts through 27 million public biomedical articles written over the last century; identifies concepts such as genes, diseases and side effects in the texts; then discovers previously unknown connections between those concepts, each of which is already documented somewhere in the vast medical literature.
“Scientists are trying to connect disconnected things,” Safro explains. “They’re trying to come up with new notions and connect them with known information.” Those connections could make the future better, healthier and safer.
At the heart of his approach is a network of biomedical objects extracted from datasets of the National Center for Biotechnology Information. Widely and publicly available, these datasets include scientific papers, keywords, genes, proteins, diseases and much more. Safro’s group takes that historical data and writes algorithms that generate scientific hypotheses, homing in on the most significant information. That network, its implementation and all the resulting data are then made publicly available for the broader scientific community.
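To make the idea of a network of biomedical objects concrete, here is a minimal sketch in Python. It is not MOLIERE's actual algorithm, which the article does not detail. It assumes a toy setup in which each paper is reduced to a list of terms, terms that appear in the same paper are linked, and a breadth-first search looks for a chain connecting two terms that never appear together. The term names, the `build_network` and `shortest_path` helpers, and the four toy documents are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(documents):
    """Link every pair of terms that appear in the same document."""
    graph = defaultdict(set)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the chain of terms, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the chain.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical toy corpus: each inner list stands for the terms found in one paper.
toy_documents = [
    ["gene_A", "protein_B"],
    ["protein_B", "pathway_C"],
    ["pathway_C", "disease_D"],
    ["gene_A", "side_effect_E"],
]

network = build_network(toy_documents)
path = shortest_path(network, "gene_A", "disease_D")
print(" -> ".join(path))  # gene_A -> protein_B -> pathway_C -> disease_D
```

In this toy corpus no single paper mentions both `gene_A` and `disease_D`, yet the network links them through two intermediate terms. A chain like that is the kind of distant connection a researcher might then examine as a candidate hypothesis.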
“Part of the problem is that the information we use for biology is much younger than [the information] we use for many other sciences,” Safro says. “However, ironically, this allows algorithms to operate less blindly to discover new information.” And speeding up the sharing and understanding of that information is key to new discovery. Twenty years ago, Safro was already doing this work for a drug design company: searching medical texts, extracting words and phrases, putting them in a network and looking for distant connections.
“The problem was, I could take a very limited number of papers in a certain domain and then process those papers on my desktop,” Safro says, “and that’s all that I could do.”
Using the Palmetto Cluster, he can now build a query in 10 to 20 minutes that pulls relevant studies and articles spanning more than 100 years. Better yet, his students have the opportunity to learn this method and go on to do this work for pharmaceutical and biomedical companies.