Greg Schmitz,
I work with an NGO registered in the USA as a 501(c)(3) which does language
development work: SIL International (https://www.sil.org). I'm subscribed to
ARSCLIST with my personal email address, which is why I sent the email from
that address (I've been on ARSCLIST for close to 10 years). You can also
reach me at [log in to unmask]. This project is being done as part of my
work with the NGO.
OLAC records are open data. Working in partnership or collaboration with any
institution would certainly mean that we would need to figure out and comply
with any IP agreement formed as part of the basis of the collaboration,
though I would hope there would be sympathy for the cause of open data. I
completely understand that not all data is open, and we are willing to work
with and respect agreed-upon limitations.
As an NGO we think there are several important reasons to enrich archive
holdings records:
1. We have our own archive and would benefit from enrichment and the output
of our work, but our approach is novel, and to get the best results we need
multiple datasets for training and testing.
2. We more broadly want to see the literature about, and the creative works
demonstrating, the languages of the world made "less hidden".
3. We want to see more humanitarian-oriented NGOs pushing for social change
using mother-tongue languages rather than English or the national language.
We therefore anticipate that exposing these resources makes them more
readily accessible not only to the communities that created them and to
academics studying their linguistics, but also to humanitarian
organizations.
I hope this clarifies our aims and helps calm some fears related to the
abuses we have seen with big data and machine learning. It is important to
realize that tools in themselves don't hurt people; people hurt people. Some
tools make it very easy to hurt lots of people very quickly, or to
systematically hurt only some types of people, so all tools need to be
handled with responsibility and respect for people.
- Hugh Paterson
On Fri, Jan 25, 2019 at 12:24 AM Greg Schmitz <[log in to unmask]> wrote:
> Is your project affiliated with an institution or is this a private
> venture? I noticed that your email is not institutional. "Big data"
> certainly has applications in the archival and AV fields, but there are
> also some very invasive and scary aspects to "computer learning" too,
> even for dusty old archives. I'd certainly need to know more about the
> nature of your project before I would consider responding directly to
> your query. That's the nature, unfortunately I think, of how things
> work nowadays, in large measure because of computers and public/private
> partnerships.
>
> Sincerely --greg schmitz
>
>
> On 01/24/2019 05:34 PM, Hugh Paterson III wrote:
> > Greetings,
> >
> > I am in a project where we are looking to use Machine Learning to enhance
> > the bibliographic records of language resources (materials about
> > languages, or materials in languages, or materials by ethnolinguistic
> > minority language communities).
> >
> > Our project is particularly looking at minority languages (the language
> > communities can be from anywhere in the world, but we are primed for
> > African language communities).
> >
> > "What do we mean by 'enhance'?" we want to increase the specificity of
> the
> > language identified in the recording — in print media this might be a
> > subject term. Some of you might be familiar with MARC records. MARC
> records
> > have long had a special field for notating languages. For example,
> consider
> > MARC21 field 041 [1]. The code in this field is often a three letter code
> > from ISO 639-2 [2]. We are trying to align bibliography records to ISO
> > 639-3 [3] which has several thousand codes rather than the several
> hundred
> > in ISO 639-3. For several years now various standards like Dublin Core
> have
> > pointed to ISO 639-3 rather than 639-2 [4].
> >
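> > As a rough illustration of that alignment step (a minimal sketch only:
> > the file name "iso-639-3.tab" and the column headers below are my
> > reading of the code tables downloadable from https://iso639-3.sil.org
> > and may need adjusting), a 639-2 code taken from field 041 can be
> > resolved to its 639-3 equivalent in Python roughly like this:
> >
> >     import csv
> >
> >     # Map ISO 639-2 codes (both bibliographic and terminological
> >     # variants) to their ISO 639-3 equivalents, using the tab-delimited
> >     # code table published at https://iso639-3.sil.org.
> >     part2_to_part3 = {}
> >     with open("iso-639-3.tab", encoding="utf-8") as fh:
> >         for row in csv.DictReader(fh, delimiter="\t"):
> >             for part2 in (row["Part2B"], row["Part2T"]):
> >                 if part2:
> >                     part2_to_part3[part2] = row["Id"]
> >
> >     # e.g. the MARC 041 $a value "fre" (639-2 bibliographic) resolves
> >     # to the 639-3 code "fra"
> >     print(part2_to_part3.get("fre"))
> >
> > The interesting cases are exactly the ones this table cannot settle,
> > such as collective 639-2 codes and macrolanguages, where the finer
> > identification has to come from elsewhere in the catalogue record.
> >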
> > Many institutions don't use field 041, but might have a note field, or
> > the resource might be labeled with the name of an ethnolinguistic
> > community in the title, or use Library of Congress Subject Headings
> > instead to identify the language.
> >
> > Updating catalogue records, which are often manually produced, is
> > unlikely ever to happen without a high degree of accuracy and a high
> > degree of automation.
> >
> > As a linguist I have previously worked at an archive which specialized
> > in minority language holdings, both print and audio. I know that special
> > collections can vary widely in their description level and consistency.
> > My partner in this project is a Data Scientist who specializes in
> > Machine Learning.
> >
> > Our project was initially looking to work with bibliographic records for
> > print media. Several university library institutions turned down our
> > offer because it would mean sharing their holdings records, which they
> > either have a policy of not doing, or they have a commitment to an OCLC
> > [5] record at a library consortium level [9] (we are willing to work
> > with a consortium too). (For those of you in libraries and archives,
> > please help me understand the politics here, because I would have
> > thought that sharing bibliographic records is what library search
> > engines/websites were for, and that better visibility would increase
> > the social value of holdings.) At any rate, we are now open to the
> > option of working with records of both print media and audio media.
> >
> > Academically, we are not the first in this field. Our work builds upon
> > the groundbreaking work by Bird & Simons [6] and Hirt et al. [10], and
> > on the metadata standard that a community of archivists and linguists
> > agreed upon [7], which has been adopted by at least 30 archives with
> > language resource collections. These institutions aggregate their open
> > metadata, and internet users can search it by language online [8]. By
> > focusing on print media collections we were hoping to extend the kinds
> > of materials in the aggregator. I know there is still room for more
> > audio material records to be included, as I have identified at least 30
> > archives with audio language resources which are not participating in
> > OLAC.
> >
> > We are looking at machine learning and named entity extraction for
> > automating the identification of subject languages for published works.
> > This is in contrast to what Hirt et al. [10] attempt, where they extract
> > certain fields from the MARC records. We are hoping to collect a few
> > catalogs in MARC XML format. To that end we are interested in finding a
> > collaborating researcher at an institution which is willing to share
> > their holdings records.
> >
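> > To give a concrete sense of the front end of that pipeline (a minimal
> > sketch only: the choice of fields, the file name "catalog.xml", and the
> > helper name are mine, and the actual language-name matching is left as
> > a stub), the candidate free text could be pulled from MARC XML with
> > nothing more than the Python standard library:
> >
> >     import xml.etree.ElementTree as ET
> >
> >     MARC_NS = "{http://www.loc.gov/MARC21/slim}"
> >
> >     # Fields whose free text is most likely to name a language: title
> >     # statement (245), general note (500), language note (546), and
> >     # topical subject headings (650). This list is a guess, not a
> >     # finding.
> >     CANDIDATE_TAGS = {"245", "500", "546", "650"}
> >
> >     def candidate_text(marcxml_path):
> >         """Yield (record id, text) pairs to feed a language-name matcher."""
> >         tree = ET.parse(marcxml_path)
> >         for record in tree.iter(MARC_NS + "record"):
> >             ids = [cf.text for cf in record.iter(MARC_NS + "controlfield")
> >                    if cf.get("tag") == "001"]
> >             pieces = []
> >             for df in record.iter(MARC_NS + "datafield"):
> >                 if df.get("tag") in CANDIDATE_TAGS:
> >                     for sf in df.iter(MARC_NS + "subfield"):
> >                         pieces.append(sf.text or "")
> >             yield (ids[0] if ids else None, " ".join(pieces))
> >
> >     # A real run would hand each text blob to an entity extractor trained
> >     # on language names (for example, the reference names published
> >     # alongside the ISO 639-3 code tables).
> >     for rec_id, text in candidate_text("catalog.xml"):
> >         print(rec_id, text[:80])
> >
> > The names matched in that text are then what we would align to ISO 639-3
> > codes, rather than trusting a coarse 041 value alone.
> >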
> > We are aware that this is a pretty big ask and maybe a bit unusual,
> > though Harvard does freely share their holdings records [11].
> >
> > We would like to get several large library catalogs to train and test our
> > Machine Learning Models.
> >
> > Interested parties can contact me directly.
> >
> > thank you,
> > all the best,
> >
> > [1]: https://www.loc.gov/marc/bibliographic/bd041.html
> > [2]: https://www.loc.gov/standards/iso639-2/php/code_list.php
> > [3]: https://iso639-3.sil.org
> > [4]: https://en.wikipedia.org/wiki/ISO_639-3#Usage
> > [5]: https://www.oclc.org/en/home.html
> > [6]: Bird, Steven & Gary Simons. 2003. Extending Dublin Core Metadata to
> > Support the Description and Discovery of Language Resources. Computers
> > and the Humanities Vol. 37, No. 4: 375-388.
> > [7]: http://www.language-archives.org/OLAC/metadata.html
> > [8]: http://search.language-archives.org/index.html
> > [9]: https://www.orbiscascade.org
> > [10]: Hirt, Christopher, Gary Simons & Joan Spanne. 2009. Building a
> > MARC-to-OLAC Crosswalk: Repurposing Library Catalog Data for the
> > Language Resources Community. Proceedings of the Joint Conference on
> > Digital Libraries, p. 393.
> > https://scholars.sil.org/sites/scholars/files/gary_f_simons/poster/marc-to-olac.pdf
> > [11]: https://library.harvard.edu/services-tools/harvard-library-apis-datasets
>