Thanks, Ralph. That confirms what I was seeing. So essentially only the part about codes is to be looked at.


On 5/29/13 9:53 AM, LeVan,Ralph wrote:

The fully spelled out forms were not in the bib records themselves.  I provided translations of the LC codes as part of my indexing process.


Karen, are you saying that “Author of Introduction” and “Author of Screenplay” are incorrect renderings of the LC codes?


Wow, I just looked up the relator codes and can see that there are two tables that don’t quite match.  I used this one:


And you probably used this one:


As you will see, “aus” translates to “Author of screenplay” in the first table, not “Author of screenplay, etc.”.  Either way, the differences should not be considered significant.


Roy, another source of inconsistency to add to your list!




From: Bibliographic Framework Transition Initiative Forum [mailto:[log in to unmask]] On Behalf Of Karen Coyle
Sent: Wednesday, May 29, 2013 12:05 PM
To: [log in to unmask]
Subject: [BIBFRAME] OCLC role stats


I thank Ralph for providing the data from the OCLC database. It is interesting; however, I suspect that it does not actually answer our question about how often roles are "bad". In any case, here is my analysis of the data. (p.s. I do not claim perfection in these numbers! Errors may have been introduced! Do not read this as any kind of truth!)

First, caveats:
1. I have no idea what this set of 38 million names represents. It may or may not tell us about data in general outside of or even inside OCLC.
2. I do not know if these records have been through any quality control applied by OCLC, so I do not know if the OCLC data is the same as the data in local catalogs.
3. The OCLC file contains both codes and spelled out forms. Some of these were obvious pairs, based on the fact that they had the same exact number of occurrences. I did not de-dup these in my data gathering.







38 million names produced 14,825,668 codes and full forms. That would mean about 40% of the names have a role. However, since many were name/code pairs that probably were in the same name field (and I didn't calculate how many), the share of names in this list having roles is somewhere between 20% and 40%, probably toward the lower end.
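
As a rough check on the 20% to 40% range, here is a minimal Python sketch. It assumes exactly 38,000,000 names (the message only says "38 million") and takes as the lower bound the extreme case in which every role occurrence is half of a name/code pair on the same name. Both are assumptions for illustration, not measured values.

```python
# Rough bounds on the share of names carrying a role.
# Assumption: "38 million" is taken as exactly 38,000,000.
total_names = 38_000_000
role_occurrences = 14_825_668  # codes plus full forms, as reported

# Upper bound: every code or full form sits on a different name.
upper = role_occurrences / total_names

# Lower bound (assumed extreme): every name with a role carries both
# a code and its spelled-out form, so each name is counted twice.
lower = upper / 2

print(f"upper bound: {upper:.1%}")
print(f"lower bound: {lower:.1%}")
```

The true share depends on how many occurrences really are pairs, which was not calculated.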

Number of name/code pairs on LC list: 228

OCLC Names:

Total number of distinct names: 191
Total number of names matching LC list: 175
Total number of names not matching LC list: 16
Occurrences of names matching LC list: 7,202,345
Occurrences of non-matching names: 199,413

OCLC Codes:

Total number of distinct codes: 872
Total number of codes matching LC list: 210
Total number of codes not matching: 664 [1]
Occurrences of matching codes: 7,404,728
Occurrences of non-matching codes: 19,503

TOTAL: about 1.5% of instances are "bad". Removing the paired instances (which

[1] Yes, I seem to be 2 off here - possibly due to headers in XSL files or something. Not worth tracking down. If anyone wants my data to compare to their own, I can send it.
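
The "about 1.5%" figure can be re-derived from the occurrence counts listed above. This is only a sketch of the arithmetic, with variable names chosen for illustration. Note that the four counts sum to 14,825,989, slightly more than the 14,825,668 quoted earlier; the sketch uses the sum of the four counts.

```python
# Occurrence counts as reported in the message.
name_match, name_nonmatch = 7_202_345, 199_413
code_match, code_nonmatch = 7_404_728, 19_503

# "Bad" occurrences are the non-matching names plus non-matching codes.
bad = name_nonmatch + code_nonmatch
total = name_match + name_nonmatch + code_match + code_nonmatch
bad_share = bad / total

print(f"bad occurrences: {bad:,} of {total:,} ({bad_share:.2%})")

# Distinct-code check behind footnote [1]: 872 - 210 gives 662, not 664.
distinct_codes, matching_codes = 872, 210
expected_nonmatching = distinct_codes - matching_codes
print(f"expected non-matching distinct codes: {expected_nonmatching}")
```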


1) There were very few non-matching names, and in those, I did not see an expected mix of languages. The full set of non-matching names, including the nonsense ones, is:

Author of introduction    167949  [2]
Author of screenplay    27680
Curator    3753
Plaintiff -appellee    9
Contestant -appellee    4
éd    4                                   [3]
Graphic technician    3
ré    2
sé    2
Respondent -appellee    1
kè    1
mé    1
öv    1
př    1
røm    1
øga    1

[2] Most of these "bad" names are close variations on the LC list, missing ", etc.". This could either be a problem with how the data was produced or how I manipulated it.

[3] Yes, some of these are probably actually bad codes - I separated codes from names by length (> 3 chars), so bad codes with Unicode characters and other garbage ended up here. Their occurrence numbers are low, however.
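
To illustrate footnote [2], here is a hedged Python sketch of how much of the non-matching name total would disappear if a trailing ", etc." were ignored when comparing against the LC list. The two-entry `lc_names` set is a hypothetical stand-in for the real list, and the "Author of introduction, etc." form is assumed by analogy with "Author of screenplay, etc."; the counts are copied from the list above.

```python
# Counts copied from the list of non-matching names above.
non_matching = {
    "Author of introduction": 167_949,
    "Author of screenplay": 27_680,
    "Curator": 3_753,
}
total_non_matching = 199_413  # reported occurrences of non-matching names

# Assumption: a tiny stand-in for the LC list, not the real 228 entries.
lc_names = {"Author of introduction, etc.", "Author of screenplay, etc."}


def normalize(label: str) -> str:
    """Lower-case a label and drop a trailing ', etc.' if present."""
    label = label.strip().lower()
    suffix = ", etc."
    if label.endswith(suffix):
        label = label[: -len(suffix)]
    return label


lc_normalized = {normalize(name) for name in lc_names}

# Names that match the LC list once ", etc." is ignored.
recovered = {
    name: count
    for name, count in non_matching.items()
    if normalize(name) in lc_normalized
}
recovered_total = sum(recovered.values())
remaining = total_non_matching - recovered_total

print(f"recovered by normalization: {recovered_total:,}")
print(f"still non-matching: {remaining:,}")
```

Under these assumptions, the two "Author of ..." forms account for nearly all of the non-matching name occurrences.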

2) There were many more bad codes than bad names. The top ten "bad" codes were:





















3) Like the stats on MARC fields and subfields, both names and codes have a long tail of nonsense. This long tail, however, accounts for very few occurrences, which is why, although there are many more distinct "bad" codes than "bad" names, the actual number of bad-code occurrences is low -- they are mostly one-offs.


Role names seem to have been pretty carefully quality-controlled in OCLC; codes less so, or there was a problem in the creation of the dataset.

Non-English roles do not seem to be represented here.


Karen Coyle
[log in to unmask]
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
