The plan to mine the world's research papers
He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Now, the 60-year-old American technologist is setting his sights on a new objective: freeing paywalled scientific literature.

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. "This is not every journal article ever written, but it's a lot," Malamud says.

"They get all excited and they say, 'Oh gosh, this is wonderful'," says Malamud. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit.

The JNU data store could sweep aside barriers that still deter scientists from using software to analyse research, says Max Häussler, a bioinformatics researcher at the University of California, Santa Cruz (UCSC). "Text mining of academic papers is close to impossible right now," he says — even for someone like him who already has institutional access to paywalled articles. In the past, he has found his access blocked by publishers who have spotted his software crawling over their sites. "I spend 90% of my time just contacting publishers or writing software to download papers," says Häussler.

Such limits are a big problem, says John McNaught, deputy director of the National Centre for Text Mining at the University of Manchester, UK. "A limit of, say, one article every five seconds, which sounds fast for a human, is painfully slow for a machine."