I thank Ralph for providing the data from the OCLC database. It is interesting; however, I suspect that it does not actually answer our question about how often roles are "bad". In any case, here is my analysis of the data. (p.s. I do not claim perfection in these numbers! Errors may have been introduced! Do not read this as any kind of truth!)

First, caveats:
1. I have no idea what this set of 38 million names represents. It may or may not tell us about data in general outside of or even inside OCLC.
2. I do not know if these records have been through any quality control applied by OCLC, so I do not know if the OCLC data is the same as the data in local catalogs.
3. The OCLC file contains both codes and spelled-out forms. Some of these were obvious pairs, based on the fact that they had the exact same number of occurrences. I did not de-dup these in my data gathering.

e.g.:
124620 act
124620 Actor
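
As an illustration of how such pairs could be flagged after the fact, here is a minimal Python sketch. It assumes the tallies are available as (count, label) rows. The function name find_probable_pairs is hypothetical, and matching on equal counts is only a heuristic, since two unrelated labels can share a count by coincidence.

```python
from collections import defaultdict

def find_probable_pairs(rows):
    """Return (code, name, count) for every count that is shared by
    exactly one short label (a code) and one longer label (a name)."""
    by_count = defaultdict(list)
    for count, label in rows:
        by_count[count].append(label)

    pairs = []
    for count, labels in by_count.items():
        codes = [s for s in labels if len(s) <= 3]
        names = [s for s in labels if len(s) > 3]
        if len(codes) == 1 and len(names) == 1:
            pairs.append((codes[0], names[0], count))
    return pairs

# Sample rows: the act/Actor pair plus two unpaired tallies from the data.
rows = [
    (124620, "act"),
    (124620, "Actor"),
    (3678, "csn"),
    (3753, "Curator"),
]
print(find_probable_pairs(rows))
```

On these sample rows only the act/Actor pair is reported; the other two labels have no partner with the same count.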

Data:

38 million names produced 14,825,668 codes and full forms. That would mean about 40% of the names have a role. However, many of these were name/code pairs that were probably in the same name field (I didn't calculate how many), so the share of names in this list having roles is somewhere between 20% and 40%, probably toward the lower end.
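
The 20% to 40% range follows from two extreme assumptions, sketched here in Python. The upper bound assumes no role occurrence shares a name with another; the lower bound assumes every occurrence is one half of a code/name pair on a single name. Both are simplifications, and "38 million" is the rounded figure given above.

```python
TOTAL_NAMES = 38_000_000        # rounded, as stated above
ROLE_OCCURRENCES = 14_825_668   # codes plus full forms

# Upper bound: each occurrence belongs to a different name.
upper = ROLE_OCCURRENCES / TOTAL_NAMES

# Lower bound: each name with a role carries both a code and a full form,
# so two occurrences correspond to one name.
lower = (ROLE_OCCURRENCES / 2) / TOTAL_NAMES

print(f"between {lower:.1%} and {upper:.1%}")
```

Names carrying more than one role would push the real figure lower still, which this sketch does not model.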

Number of name/code pairs on LC list: 228

OCLC Names:

Total number of distinct names: 191
Total number of names matching LC list: 175
Total number of names not matching LC list: 16
Occurrences of names matching LC list: 7,202,345
Occurrences of non-matching names: 199,413

OCLC Codes:

Total number of distinct codes: 872
Total number of codes matching LC list: 210
Total number of codes not matching: 664 [1]
Occurrences of matching codes: 7,404,728
Occurrences of non-matching codes: 19,503

TOTAL: about 1.5% of instances are "bad". Removing the paired instances (which

[1] Yes, I seem to be 2 off here - possibly due to headers in XSL files or something. Not worth tracking down. If anyone wants my data to compare to their own, I can send it.
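
For anyone who wants to re-derive the "about 1.5%" figure, this short Python sketch uses only the numbers listed above. The variable names are illustrative, and the calculation does not remove paired instances.

```python
# Occurrence counts as listed above.
bad_name_occurrences = 199_413
bad_code_occurrences = 19_503
reported_total = 14_825_668

bad = bad_name_occurrences + bad_code_occurrences
bad_share = bad / reported_total
print(bad, f"{bad_share:.2%}")

# The distinct-code counts from note [1]: matching plus non-matching
# codes versus the stated number of distinct codes.
distinct_gap = (210 + 664) - 872
print(distinct_gap)
```

The share comes out just under 1.5%, and the distinct-code check reproduces the gap of 2 mentioned in note [1].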

SOME OBSERVATIONS

1) There were very few non-matching names, and among those I did not see the mix of languages I expected. The full set of non-matching names, including the nonsense ones, is:

Author of introduction   167949  [2]
Author of screenplay      27680
Curator                    3753
Plaintiff -appellee           9
Contestant -appellee          4
éd                            4  [3]
Graphic technician            3
ré                            2
sé                            2
Respondent -appellee          1
kè                            1
mé                            1
öv                            1
př                            1
røm                           1
øga                           1

[2] Most of these "bad" names are close variations on the LC list, missing ", etc.". This could either be a problem with how the data was produced or how I manipulated it.

[3] Yes, some of these are probably actually bad codes. I separated codes from names by length (anything over 3 characters was treated as a name), so bad codes with Unicode characters and other garbage ended up here. Their occurrence numbers are low, however.
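
One way a short code with a non-ASCII letter can land on the "name" side of a length cutoff is if length is measured in bytes rather than characters. The Python sketch below shows that effect only; it is an assumption about the mechanism, not a description of how the data was actually split, and split_by_length and utf8_len are hypothetical helpers.

```python
def split_by_length(labels, measure=len, max_code_len=3):
    """Split labels into (codes, names) using a length cutoff."""
    codes = [s for s in labels if measure(s) <= max_code_len]
    names = [s for s in labels if measure(s) > max_code_len]
    return codes, names

def utf8_len(s):
    """Length of the label in UTF-8 bytes rather than characters."""
    return len(s.encode("utf-8"))

labels = ["act", "Actor", "røm"]

# Counting characters: "røm" has 3, so it is treated as a code.
print(split_by_length(labels))

# Counting UTF-8 bytes: "ø" takes two bytes, so "røm" has 4 and
# is treated as a name.
print(split_by_length(labels, measure=utf8_len))
```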

2) There were many more bad codes than bad names. The top ten "bad" codes were:

csn 3678
cnm 2423
9pu 1104
fme 990
ens 899
for 735
orc 735
sde 715
dir 405
730 387


3) Like the stats on MARC fields and subfields, both names and codes have a long tail of nonsense. This long tail, however, accounts for very few occurrences. That is why, although there are many more distinct "bad" codes than "bad" names, the total number of occurrences of bad codes is low: they are mostly one-offs.
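
To put rough numbers on the long tail, this Python sketch combines the top-ten list with the totals for non-matching codes given earlier. It assumes the count of 664 distinct non-matching codes is accurate (note [1] says it may be off by 2), and the mean it prints says nothing about how many codes occur exactly once; that would need the full distribution.

```python
# Top ten non-matching codes and their occurrence counts, as listed above.
top_ten = {
    "csn": 3678, "cnm": 2423, "9pu": 1104, "fme": 990, "ens": 899,
    "for": 735, "orc": 735, "sde": 715, "dir": 405, "730": 387,
}
bad_code_occurrences = 19_503   # all non-matching code occurrences
distinct_bad_codes = 664        # possibly off by 2, see note [1]

top_total = sum(top_ten.values())
tail_total = bad_code_occurrences - top_total
tail_codes = distinct_bad_codes - len(top_ten)

print(f"top ten share of bad-code occurrences: {top_total / bad_code_occurrences:.0%}")
print(f"remaining codes: {tail_codes}, mean occurrences each: {tail_total / tail_codes:.1f}")
```

On these figures the top ten codes account for roughly three fifths of all non-matching code occurrences, with the rest spread over several hundred other codes.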

MY CONCLUSIONS

Role names seem to have been pretty carefully quality-controlled in OCLC; codes less so, or there was a problem in the creation of the dataset.

Non-English roles do not seem to be represented here.

kc
-- 
Karen Coyle
http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet