Carl Malamud in front of the data store of 73 million articles that he
plans to let scientists text mine. Credit: Smita Sharma for Nature
Carl Malamud is on a crusade to liberate information locked up behind
paywalls — and his campaigns have scored many victories. He has spent
decades publishing copyrighted legal documents, from building codes to
court records, and then arguing that such texts represent public-domain
law that ought to be available to any citizen online. Sometimes, he has
won those arguments in court. Now, the 60-year-old American technologist
is turning his sights on a new objective: freeing paywalled scientific
literature. And he thinks he has a legal way to do it.
Over the past year, Malamud has — without asking publishers — teamed up
with Indian researchers to build a gigantic store of text and images
extracted from 73 million journal articles dating from 1847 up to the
present day. The cache, which is still being created, will be kept on a
576-terabyte storage facility at Jawaharlal Nehru University (JNU) in
New Delhi. “This is not every journal article ever written, but it’s a
lot,” Malamud says. It’s comparable to the size of the core collection
in the Web of Science database, for instance. Malamud and his JNU
collaborator, bioinformatician Andrew Lynn, call their facility the JNU
data depot.
No one will be allowed to read or download work from the repository,
because that would breach publishers’ copyright. Instead, Malamud
envisages, researchers could crawl over its text and data with computer
software, scanning through the world’s scientific literature to pull out
insights without actually reading the text.
The unprecedented project is generating much excitement because it
could, for the first time, open up vast swathes of the paywalled
literature for easy computerized analysis. Dozens of research groups
already mine papers to build databases of genes and chemicals, map
associations between proteins and diseases, and generate useful
scientific hypotheses
<https://www.nature.com/articles/d41586-019-01978-x>. But publishers
control — and often limit — the speed and scope of such projects, which
typically confine themselves to abstracts, not full text. Researchers in
India, the United States and the United Kingdom are already making plans
to use the JNU store instead. Malamud and Lynn have held workshops at
Indian government laboratories and universities to explain the idea. “We
bring in professors and explain what we are doing. They get all excited
and they say, ‘Oh gosh, this is wonderful’,” says Malamud.
But the depot’s legal status isn’t yet clear. Malamud, who contacted
several intellectual-property (IP) lawyers before starting work on the
depot, hopes to avoid a lawsuit. “Our position is that what we are doing
is perfectly legal,” he says. For the moment, he is proceeding with
caution: the JNU data depot is air-gapped, meaning that no one can
access it from the Internet. Users have to physically visit the
facility, and only researchers who want to mine for non-commercial
purposes are currently allowed in. Malamud says his team does plan to
allow remote access in the future. “The hope is to do this slowly and
deliberately. We are not throwing this open right away,” he says.
The power of data mining
The JNU data store could sweep aside barriers that still deter
scientists from using software to analyse research, says Max Häussler, a
bioinformatics researcher at the University of California, Santa Cruz
(UCSC). “Text mining of academic papers is close to impossible right
now,” he says — even for someone like him who already has institutional
access to paywalled articles.
Since 2009, Häussler and his colleagues have been building the online
UCSC Genome Browser,
which links DNA sequences in the human genome to parts of research
papers that mention the same sequences. To do that, the researchers have
contacted more than 40 publishers to ask permission to use software to
rifle through research to find mentions of DNA. But 15 publishers have
not responded or have denied permission <http://text.soe.ucsc.edu>.
Häussler is unsure whether he can legally mine papers without
permission, so he isn’t trying. In the past, he has found his access
blocked by publishers who have spotted his software crawling over their
sites.
“I spend 90% of my time just contacting publishers or writing software
to download papers,” says Häussler.
Chris Hartgerink, a statistician who works part-time at Berlin’s QUEST
Center for Transforming Biomedical Research, says he now restricts
himself to text-mining work from open-access publishers only, because
“the hassles of dealing with these closed publishers are too much”. A
few years ago, when Hartgerink was pursuing his PhD in the Netherlands,
three publishers blocked his access to their journals after he tried to
download articles in bulk for mining.
"Some countries have changed their laws to affirm that researchers on
non-commercial projects don’t need a copyright-holder’s permission to
mine whatever they can legally access. The United Kingdom passed such a
law in 2014, and the European Union voted through a similar provision
<https://www.nature.com/articles/d41586-019-00614-y> this year"