collecting articles and mining the content without permission.


Carl Malamud in front of the data store of 73 million articles that he 
plans to let scientists text mine. Credit: Smita Sharma for Nature


Carl Malamud is on a crusade to liberate information locked up behind 
paywalls — and his campaigns have scored many victories. He has spent 
decades publishing copyrighted legal documents, from building codes to 
court records, and then arguing that such texts represent public-domain 
law that ought to be available to any citizen online. Sometimes, he has 
won those arguments in court. Now, the 60-year-old American technologist 
is turning his sights on a new objective: freeing paywalled scientific 
literature. And he thinks he has a legal way to do it.

Over the past year, Malamud has — without asking publishers — teamed up 
with Indian researchers to build a gigantic store of text and images 
extracted from 73 million journal articles dating from 1847 up to the 
present day. The cache, which is still being created, will be kept on a 
576-terabyte storage facility at Jawaharlal Nehru University (JNU) in 
New Delhi. “This is not every journal article ever written, but it’s a 
lot,” Malamud says. It’s comparable to the size of the core collection 
in the Web of Science database, for instance. Malamud and his JNU 
collaborator, bioinformatician Andrew Lynn, call their facility the JNU 
data depot.

No one will be allowed to read or download work from the repository, 
because that would breach publishers’ copyright. Instead, Malamud 
envisages, researchers could crawl over its text and data with computer 
software, scanning through the world’s scientific literature to pull out 
insights without actually reading the text.

The unprecedented project is generating much excitement because it 
could, for the first time, open up vast swathes of the paywalled 
literature for easy computerized analysis. Dozens of research groups 
already mine papers to build databases of genes and chemicals, map 
associations between proteins and diseases, and generate useful 
scientific hypotheses. But publishers 
control — and often limit — the speed and scope of such projects, which 
typically confine themselves to abstracts, not full text. Researchers in 
India, the United States and the United Kingdom are already making plans 
to use the JNU store instead. Malamud and Lynn have held workshops at 
Indian government laboratories and universities to explain the idea. “We 
bring in professors and explain what we are doing. They get all excited 
and they say, ‘Oh gosh, this is wonderful’,” says Malamud.

But the depot’s legal status isn’t yet clear. Malamud, who contacted 
several intellectual-property (IP) lawyers before starting work on the 
depot, hopes to avoid a lawsuit. “Our position is that what we are doing 
is perfectly legal,” he says. For the moment, he is proceeding with 
caution: the JNU data depot is air-gapped, meaning that no one can 
access it from the Internet. Users have to physically visit the 
facility, and only researchers who want to mine for non-commercial 
purposes are currently allowed in. Malamud says his team does plan to 
allow remote access in the future. “The hope is to do this slowly and 
deliberately. We are not throwing this open right away,” he says.

    The power of data mining

The JNU data store could sweep aside barriers that still deter 
scientists from using software to analyse research, says Max Häussler, a 
bioinformatics researcher at the University of California, Santa Cruz 
(UCSC). “Text mining of academic papers is close to impossible right 
now,” he says — even for someone like him who already has institutional 
access to paywalled articles.

Since 2009, Häussler and his colleagues have been building the online 
UCSC Genome Browser, 
which links DNA sequences in the human genome to parts of research 
papers that mention the same sequences. To do that, the researchers have 
contacted more than 40 publishers to ask permission to use software to 
rifle through research to find mentions of DNA. But 15 publishers have 
not responded or have denied permission. 
Häussler is unsure whether he can legally mine papers without 
permission, so he isn’t trying. In the past, he has found his access 
blocked 
by publishers who have spotted his software crawling over their sites. 
“I spend 90% of my time just contacting publishers or writing software 
to download papers,” says Häussler.

Chris Hartgerink, a statistician who works part-time at Berlin’s QUEST 
Center for Transforming Biomedical Research, says he now restricts 
himself to text-mining work from open-access publishers only, because 
“the hassles of dealing with these closed publishers are too much”. A 
few years ago, when Hartgerink was pursuing his PhD in the Netherlands, 
three publishers blocked his access to their journals after he tried to 
download articles in bulk for mining.

"Some countries have changed their laws to affirm that researchers on 
non-commercial projects don’t need a copyright-holder’s permission to 
mine whatever they can legally access. The United Kingdom passed such a 
law in 2014, and the European Union voted through a similar provision 
this year."

Paul Jackson