I am part of a project that is looking to use machine learning to enhance
the bibliographic records of language resources (materials about languages,
materials in languages, or materials by ethnolinguistic minority
communities). Our project is looking particularly at minority languages
(the language communities can be from anywhere in the world, but we are
primed for African language communities).
What do we mean by "enhance"? We want to increase the specificity of the
language identified in the recording; in print media this might be a
subject term. Some of you might be familiar with MARC records, which have
long had a special field for notating languages. For example, consider
MARC21 field 041. The code in this field is often a three-letter code from
ISO 639-2. We are trying to align bibliographic records to ISO 639-3, which
has several thousand codes rather than the several hundred in ISO 639-2.
For several years now, various standards such as Dublin Core have pointed
to ISO 639-3 rather than 639-2.
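To make the field 041 discussion concrete, here is a minimal sketch in Python of reading the language codes out of a MARC XML record and looking up more specific ISO 639-3 candidates. The sample record and the tiny expansion table are made up for illustration (they are assumptions, not data from our project); a real alignment would draw on the full ISO 639-3 code tables.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC XML ("MARC21 slim") records.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, invented for this sketch.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="041" ind1="0" ind2=" ">
    <subfield code="a">swa</subfield>
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Hypothetical title</subfield>
  </datafield>
</record>"""

# Illustrative, hand-made table: a broad code mapped to more specific
# ISO 639-3 codes. Only one entry is shown; this is not a complete mapping.
ILLUSTRATIVE_EXPANSIONS = {
    "swa": ["swh", "swc"],  # Swahili (macrolanguage) -> individual languages
}


def language_codes(record_xml):
    """Return the codes found in field 041, subfield a, of one record."""
    record = ET.fromstring(record_xml)
    codes = []
    for field in record.findall("marc:datafield[@tag='041']", MARC_NS):
        for sub in field.findall("marc:subfield[@code='a']", MARC_NS):
            if sub.text:
                codes.append(sub.text.strip())
    return codes


def candidate_639_3(code):
    """Return more specific candidates for a code, or the code itself."""
    return ILLUSTRATIVE_EXPANSIONS.get(code, [code])


for code in language_codes(SAMPLE_RECORD):
    print(code, "->", candidate_639_3(code))
```

The lookup only lists candidates; deciding which candidate a given resource is actually in or about is the part that needs further evidence from the rest of the record.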
Many institutions don't use field 041, but might have a note field, or the
resource might be labeled with the name of an ethnolinguistic community in
the title or use Library of Congress Subject Headings instead to identify
Updating catalogue records which are often manually produced is likely
never to happen without a high degree of accuracy and a high degree of
As a linguist, I previously worked at an archive that specialized in
minority language holdings, both print and audio. I know that special
collections can vary widely in their description level and consistency. My
partner in this project is a data scientist who specializes in machine
learning.
Our project was initially looking to work with bibliographic records for
print media. Several university libraries turned down our offer because it
would mean sharing their holdings records, which they either have a policy
of not doing or have committed to an OCLC record at a library consortium
level (we are willing to work with a consortium too). For those of you in
libraries and archives, please help me understand the politics here: I
would have thought that sharing bibliographic records is what library
search engines and web sites were for, and that better visibility would
increase the social value of holdings. At any rate, we are now open to
working with records of both print media and audio media.
Academically, we are not the first in this field. Our work builds upon the
groundbreaking work of Bird & Simons (2003) and Hirt et al. (2009), and
upon the metadata standard that a community of archivists and linguists
agreed upon, which has been adopted by at least 30 archives with language
resource collections. These institutions aggregate their open metadata, and
internet users can search it by language online. By focusing on print media
collections we were hoping to extend the kinds of materials in the
aggregator. I know there is still room for more records of audio materials
to be included, as I have identified at least 30 archives with audio
language resources which are not participating in OLAC.
We are looking at machine learning and named entity extraction to automate
the identification of subject languages for published works. This is in
contrast to the approach of Hirt et al., who extract certain fields from
the MARC records. We are hoping to collect a few catalogs in MARC XML
format. To that end, we are interested in finding a
collaborating researcher at an institution which is willing to share their
We are aware that this is a pretty big ask and maybe a bit unusual, though
Harvard does freely share its holdings records.
We would like to get several large library catalogs to train and test our
machine learning models.
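For readers wondering what identifying a subject language from a title could look like at its simplest, here is a rough sketch in Python. It is a plain name-list (gazetteer) lookup, not the machine learning or named entity extraction models we intend to train; the name list, the function name, and the example title are all made up for illustration.

```python
import re

# Hypothetical mini-gazetteer: language name -> ISO 639-3 code.
# A real list would need thousands of names plus alternate spellings.
LANGUAGE_NAMES = {
    "Hausa": "hau",
    "Yoruba": "yor",
    "Wolof": "wol",
}


def match_language_names(text, gazetteer=LANGUAGE_NAMES):
    """Return (name, code) pairs for gazetteer names found as whole words."""
    hits = []
    for name, code in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((name, code))
    return hits


# Invented title string, standing in for a MARC title field.
title = "A grammar of the Hausa language, with notes on Yoruba loanwords"
print(match_language_names(title))
```

A lookup like this cannot tell whether a name refers to a language, a people, or a place, which is one reason we want real catalogs to train and test on.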
Interested parties can contact me directly.
all the best,
Bird, Steven & Gary Simons. 2003. Extending Dublin Core Metadata to
Support the Description and Discovery of Language Resources. Computers and
the Humanities Vol. 37, No. 4: 375-388.
Hirt, Christopher, Gary Simons & Joan Spanne. 2009. Building a
MARC-to-OLAC Crosswalk: Repurposing Library Catalog Data for the Language
Resources Community. Proceedings of the Joint Conference on Digital
Libraries, page 393.