On January 25, 2019 3:24:46 AM Greg Schmitz <[log in to unmask]> wrote:

> Is your project affiliated with an institution or is this a private
> venture?   I noticed that your email is not institutional.   "Big data"
> certainly has applications in the archival and AV fields, but there are
> also some very invasive and scary aspects to "computer learning" too,
> even for dusty old archives.  I'd certainly need to know more about the
> nature of your project before I would consider responding directly to
> your query.  That, unfortunately, is how things work nowadays, I think,
> in large measure because of computers and public/private
> partnerships.
>
> Sincerely  --greg schmitz
>
>
> On 01/24/2019 05:34 PM, Hugh Paterson III wrote:
>> Greetings,
>>
>> I am part of a project where we are looking to use Machine Learning to enhance
>> the bibliographic records of language resources (materials about languages,
>> or materials in languages, or materials by ethnolinguistic minority
>> language communities).
>>
>> Our project is looking particularly at minority languages (the language
>> communities can be from anywhere in the world, but we are primed for
>> African language communities).
>>
>> What do we mean by "enhance"? We want to increase the specificity of the
>> language identified in the recording (in print media this might be a
>> subject term). Some of you might be familiar with MARC records, which
>> have long had a special field for recording languages. For example, consider
>> MARC21 field 041 [1]. The code in this field is often a three-letter code
>> from ISO 639-2 [2]. We are trying to align bibliographic records to ISO
>> 639-3 [3], which has several thousand codes rather than the several hundred
>> in ISO 639-2. For several years now, various standards such as Dublin Core
>> have pointed to ISO 639-3 rather than 639-2 [4].
>>
>> Many institutions don't use field 041, but might have a note field, or the
>> resource might be labeled with the name of an ethnolinguistic community in
>> the title or use Library of Congress Subject Headings instead to identify
>> the language.
>>
>> Updating catalogue records, which are often manually produced, is unlikely
>> ever to happen without a high degree of accuracy and a high degree of
>> automation.
>>
>> As a linguist I have previously worked at an archive which specialized in
>> minority language holdings, both print and audio. I know that special
>> collections can vary widely in their description level and consistency. My
>> partner in this project is a Data Scientist who specializes in Machine
>> Learning.
>>
>> Our project was initially looking to work with bibliographic records for
>> print media. Several university library institutions turned down our offer
>> because it would mean sharing their holdings records, which they either
>> have a policy of not doing, or they have a commitment to an OCLC [5] record
>> at a library consortium level [9]. We are willing to work with a
>> consortium too. (For those of you in libraries and archives, please help me
>> understand the politics here, because I would have thought that sharing
>> bibliographic records is what library search engines/web sites were for,
>> and that better visibility would increase the social value of holdings). At
>> any rate, we are now open to the option of working with records of both
>> print media and audio media.
>>
>> Academically, we are not the first in this field. Our work builds upon the
>> groundbreaking work by Bird & Simons [6] and Hirt et al. [10], and on the
>> OLAC metadata standard [7], which a community of archivists and linguists
>> agreed upon and which has been adopted by at least 30 archives with language
>> resource collections. These institutions aggregate their open metadata, and
>> internet users can search by language online [8]. By focusing on print media
>> collections we were hoping to extend the kinds of materials in the
>> aggregator. I know there is still room for more audio materials records to
>> be included as I have identified at least 30 archives with audio language
>> resources which are not participating in OLAC.
>>
>> We are looking at machine learning and named entity extraction for
>> automating identification of subject languages for published works. This is
>> in contrast to what Hirt et al. [10] attempt, where they extract certain
>> fields from the MARC records. We are hoping to collect a few catalogs in
>> MARC XML format. To that end, we are interested in finding a
>> collaborating researcher at an institution which is willing to share its
>> holdings records.
>>
>> We are aware that this is a pretty big ask and maybe a bit unusual, though
>> Harvard does freely share their holdings records [11].
>>
>> We would like to get several large library catalogs to train and test our
>> Machine Learning Models.
>>
>> Interested parties can contact me directly.
>>
>> thank you,
>> all the best,
>>
>> [1]: https://www.loc.gov/marc/bibliographic/bd041.html
>> [2]: https://www.loc.gov/standards/iso639-2/php/code_list.php
>> [3]: https://iso639-3.sil.org
>> [4]: https://en.wikipedia.org/wiki/ISO_639-3#Usage
>> [5]: https://www.oclc.org/en/home.html
>> [6]: Bird, Steven & Gary Simons. 2003. Extending Dublin Core Metadata to
>> Support the Description and Discovery of Language Resources. Computers and
>> the Humanities Vol. 37, No. 4: 375-388.
>> [7]: http://www.language-archives.org/OLAC/metadata.html
>> [8]: http://search.language-archives.org/index.html
>> [9]: https://www.orbiscascade.org
>> [10]: Hirt, Christopher, Gary Simons & Joan Spanne. 2009. Building a
>> MARC-to-OLAC Crosswalk: Repurposing Library Catalog Data for the Language
>> Resources Community. Proceedings of the Joint Conference on Digital
>> Libraries, page 393.
>> https://scholars.sil.org/sites/scholars/files/gary_f_simons/poster/marc-to-olac.pdf
>> [11]: https://library.harvard.edu/services-tools/harvard-library-apis-datasets