Although I also have regularly encountered two-character tags in RDF statements, the RDF concepts document [1] clearly does not preclude the use of 3-character tags or even complex tags like "zh-yue" or "tlh-Kore-AQ-fonipa" (phonetic transcription of Klingon using Korean script :-)).

In BCP-47 terms it should be "yue" rather than "zh-yue".

As for tlh-Kore-AQ-fonipa, you could not have a document that simultaneously uses the -Kore and -fonipa subtags:

tlh-Latn-AQ-fonipa or tlh-Kore-AQ, but not tlh-Kore-AQ-fonipa
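To make the pieces of such a tag easier to see, here is a rough Python sketch that splits a tag into language, script, region and variant subtags. The name split_tag and the pattern are my own simplification for illustration: it ignores extlang, extension, private-use and grandfathered tags, and it checks shape only, not the subtag registry. It will therefore accept tlh-Kore-AQ-fonipa as well-formed; whether that combination makes sense is a separate question.

```python
import re

# Simplified pattern: language[-script][-region][-variant...] only.
# Assumption: extlang, extensions, private use and grandfathered tags are out of scope.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)

def split_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the simplified shape."""
    m = TAG_RE.match(tag)
    if m is None:
        return None
    parts = m.groupdict()
    # Turn "-fonipa-..." into a list of variant subtags.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts

print(split_tag("tlh-Kore-AQ-fonipa"))
print(split_tag("ar-Latn-alalc97"))
```

A real application would check each subtag against the IANA Language Subtag Registry rather than rely on shape alone.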

The biggest problem with library data is actually romanisations and the inability to tag romanisation data according to the romanisation scheme being used. For most cases that is

The RDF document states that any valid language tag (referring to the relevant IETF doc, BCP47 [2]) can be used. That IETF document instructs one to tag languages at the level at which the information is useful, but not beyond. That obviously makes good sense. The fact is that there are languages (MANY!) that have no 2-letter code, at which point a three-letter code, or a tag and subtag, must be used. I suspect that the prevalence of two-letter codes has to do with who is providing linked data. Stats, however, show that some three-letter codes are being used. [3]

The key is "valid language tag" by BCP47 definition.

And BCP47 gives a preference for the two-letter code, rather than one of the three-letter codes.

The tags, as you indicate, should be short and indicate only what is needed. E.g. the language tag for Arabic would be "ar" (three-letter codes would only be needed to distinguish between colloquial varieties of Arabic; the "ar" tag is a sufficient identifier for Modern Standard Arabic written in the Arabic script).

A language tag for romanised Arabic based on the ALA-LC romanisation tables as published in 1997 would be ar-Latn-alalc97.

It is not possible to construct a language tag for the current ALA-LC Arabic romanisation scheme, since there is no appropriate subtag registered. A language tag ar-Latn ... is insufficient, since there are many widely different romanisation schemes for Arabic and the language tag does not have enough specificity.


This falls under the general problem of the use of strings instead of IRIs; the different forms of code that are associated with the same "language" could all be associated with one IRI referring to that "language".

Alternatively, two identifiers could be declared and asserted to be sameAs, but that approach is more complicated.
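A rough Python sketch of the two options, for illustration only. The example.org IRIs, the tiny in-memory triple set and the helper same_language are all hypothetical; only the owl:sameAs IRI is a real identifier.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Option 1: every code form points at one (hypothetical) IRI for the language.
ARABIC = "http://example.org/language/arabic"
CODE_TO_IRI = {"ar": ARABIC, "ara": ARABIC}

# Option 2: one (hypothetical) identifier per code form, linked by owl:sameAs.
AR_2 = "http://example.org/iso639-1/ar"
AR_3 = "http://example.org/iso639-2/ara"
TRIPLES = {(AR_2, OWL_SAME_AS, AR_3)}

def same_language(a, b, triples):
    """True if a and b are identical or directly linked by owl:sameAs in either direction."""
    return (
        a == b
        or (a, OWL_SAME_AS, b) in triples
        or (b, OWL_SAME_AS, a) in triples
    )

# Option 1 needs only a dictionary lookup and an equality test.
print(CODE_TO_IRI["ar"] == CODE_TO_IRI["ara"])
# Option 2 needs the sameAs statements to be consulted on every comparison.
print(same_language(AR_3, AR_2, TRIPLES))
```

The helper only follows direct links; chains of sameAs statements would need a transitive closure, which is part of why the second approach is more complicated.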

"Language" left unpacked to avoid issues of extended language tags

Rob Sanderson is concerned about the ways in which Bibframe does NOT
work in the linked data environment, and is trying to effectively
communicate the issues.  He's asking for feedback:


My biggest issue (that's not covered in the doc, but which I've already fed to the doc's authors) is that BIBFRAME mandates three-letter language codes, where available, while core RDA mandates two-letter language codes, where available.

This requires every app that wants to interoperate BIBFRAME with anything else (and indeed any app that wants to compare BIBFRAME language codes with the language codes on RDF plain-text labels) to have extensive lookup tables.
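To give a feel for what such a lookup involves, here is a minimal Python sketch. The table is deliberately tiny and the names THREE_TO_TWO, shortest_code and same_language_code are hypothetical; a real table would have to cover every ISO 639-2 code that has a two-letter equivalent, in both its bibliographic and terminologic forms.

```python
# Hypothetical, deliberately incomplete table: three-letter code -> two-letter code.
# Bibliographic and terminologic forms are both listed where they differ.
THREE_TO_TWO = {
    "ara": "ar",
    "eng": "en",
    "fre": "fr", "fra": "fr",
    "ger": "de", "deu": "de",
    "chi": "zh", "zho": "zh",
}

def shortest_code(code):
    """Map a three-letter code to its two-letter form if the table has one."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)

def same_language_code(bibframe_code, label_tag):
    """Compare a three-letter code with the primary subtag of a label's language tag."""
    primary = label_tag.split("-")[0]
    return shortest_code(bibframe_code) == shortest_code(primary)

print(same_language_code("ara", "ar-Latn-alalc97"))
print(shortest_code("tlh"))  # no two-letter code exists, so it is returned unchanged
```

Codes with no two-letter equivalent pass through unchanged, so the comparison still works for them, but every code that does have one has to appear in the table.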


