Is your project affiliated with an institution, or is this a private
venture? I noticed that your email is not institutional. "Big data"
certainly has applications in the archival and AV fields, but there are
also some very invasive and scary aspects to "computer learning" too,
even for dusty old archives. I'd certainly need to know more about the
nature of your project before I would consider responding directly to
your query. That's the nature, unfortunately, I think, of how things
work nowadays, in large measure because of computers and public/private
partnerships.

Sincerely  --greg schmitz


On 01/24/2019 05:34 PM, Hugh Paterson III wrote:
> Greetings,
>
> I am in a project where we are looking to use Machine Learning to enhance
> the bibliographic records of language resources (materials about languages,
> or materials in languages, or materials by ethnolinguistic minority
> language communities).
>
> Our project particularly is looking at minority languages (language
> communities can be from anywhere in the world, but we are primed for
> African language communities).
>
> What do we mean by "enhance"? We want to increase the specificity of the
> language identified in the recording (in print media this might be a
> subject term). Some of you might be familiar with MARC records. MARC records
> have long had a special field for notating languages. For example, consider
> MARC21 field 041 [1]. The code in this field is often a three-letter code
> from ISO 639-2 [2]. We are trying to align bibliographic records to ISO
> 639-3 [3], which has several thousand codes rather than the several hundred
> in ISO 639-2. For several years now, various standards like Dublin Core have
> pointed to ISO 639-3 rather than 639-2 [4].
>
> Many institutions don't use field 041, but might have a note field, or the
> resource might be labeled with the name of an ethnolinguistic community in
> the title or use Library of Congress Subject Headings instead to identify
> the language.
>
> Updating catalogue records, which are often manually produced, is unlikely
> ever to happen without a high degree of accuracy and a high degree of
> automation.
>
> As a linguist I have previously worked at an archive which specialized in
> minority language holdings both print and audio. I know that special
> collections can vary widely in their description level and consistency. My
> partner in this project is a Data Scientist who specializes in Machine
> Learning.
>
> Our project was initially looking to work with bibliographic records for
> print media. Several university library institutions turned down our offer
> because it would mean sharing their holdings records, which they either
> have a policy of not doing, or they have a commitment to an OCLC [5] record
> at a library consortium level [9] — we are willing to work with a
> consortium too. (For those of you in libraries and archives please help me
> understand the politics here, because I would have thought that sharing
> bibliographic records is what library search engines/web sites were for,
> and that better visibility would increase the social value of holdings). At
> any rate, we are now open to the option of working with records of both
> print media and audio media.
>
> Academically, we are not the first in this field. Our work builds upon the
> groundbreaking work by Bird & Simons [6] and Hirt et al. [10], and on the
> OLAC metadata standard [7] that a community of archivists and linguists
> agreed upon, which has been adopted by at least 30 archives with language
> resource collections. These institutions aggregate their open metadata, and
> internet users can search it by language online [8]. By focusing on print
> media collections, we were hoping to extend the kinds of materials in the
> aggregator. I know there is still room for more audio materials records to
> be included, as I have identified at least 30 archives with audio language
> resources which are not participating in OLAC.
>
> We are looking at machine learning and named entity extraction for
> automating identification of subject languages for published works. This is
> in contrast to what Hirt et al. [10] attempt, where they extract certain
> fields from the MARC records. We are hoping to collect a few catalogs in
> MARC XML format. To that end, we are interested in finding a
> collaborating researcher at an institution which is willing to share its
> holdings records.
>
> We are aware that this is a pretty big ask and maybe a bit unusual, though
> Harvard does freely share their holdings records [11].
>
> We would like to get several large library catalogs to train and test our
> Machine Learning Models.
>
> Interested parties can contact me directly.
>
> thank you,
> all the best,
>
> [1]: https://www.loc.gov/marc/bibliographic/bd041.html
> [2]: https://www.loc.gov/standards/iso639-2/php/code_list.php
> [3]: https://iso639-3.sil.org
> [4]: https://en.wikipedia.org/wiki/ISO_639-3#Usage
> [5]: https://www.oclc.org/en/home.html
> [6]: Bird, Steven & Gary Simons. 2003. Extending Dublin Core Metadata to
> Support the Description and Discovery of Language Resources. Computers and
> the Humanities Vol. 37, No. 4: 375-388.
> [7]: http://www.language-archives.org/OLAC/metadata.html
> [8]: http://search.language-archives.org/index.html
> [9]: https://www.orbiscascade.org
> [10]: Hirt, Christopher, Gary Simons & Joan Spanne. 2009. Building a
> MARC-to-OLAC Crosswalk: Repurposing Library Catalog Data for the Language
> Resources Community. Proceedings of the Joint Conference on Digital
> Libraries, page 393.
> https://scholars.sil.org/sites/scholars/files/gary_f_simons/poster/marc-to-olac.pdf
> [11]: https://library.harvard.edu/services-tools/harvard-library-apis-datasets