I thank Ralph for providing the data from the OCLC database. It is 
interesting; however, I suspect that it does not actually answer our 
question about how often roles are "bad". In any case, here is my 
analysis of the data. (p.s. I do not claim perfection in these numbers! 
Errors may have been introduced! Do not read this as any kind of truth!)

First, caveats:
1. I have no idea what this set of 38 million names represents. It may 
or may not tell us about data in general outside of or even inside OCLC.
2. I do not know if these records have been through any quality control 
applied by OCLC, so I do not know if the OCLC data is the same as the 
data in local catalogs.
3. The OCLC file contains both codes and spelled-out forms. Some of 
these were obvious pairs, based on the fact that they had exactly the 
same number of occurrences. I did not de-dup these in my data gathering. 
For example:

124620 	act
124620 	Actor


38 million names produced 14,825,668 codes and full forms. This would 
mean that about 40% of the names have a role. However, since many were 
code/name pairs that were probably in the same name field (and I didn't 
calculate how many), the share of names in this list having roles is 
somewhere between 20% and 40%, probably toward the lower end.
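
As a rough check on those bounds, here is a minimal Python sketch. It assumes the two extremes are "no entries are paired" and "every entry is one half of a code/name pair in a single name field"; that framing and the variable names are illustrative assumptions, not something taken from the data.

```python
# Counts reported above.
total_names = 38_000_000      # "38 million" is a rounded figure
role_entries = 14_825_668     # codes and full forms combined

# Upper bound: every code or full form sits on a different name.
upper = role_entries / total_names

# Lower bound (assumption): every entry is one half of a code/name pair
# in the same name field, so only half as many names carry a role.
lower = (role_entries / 2) / total_names

print(f"between {lower:.1%} and {upper:.1%} of names have a role")
```

Under those assumptions the range comes out at roughly 20% to 39%, which is consistent with the estimate above.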

Number of name/code pairs on LC list: 228

OCLC Names:

Total number of distinct names: 191
Total number of names matching LC list: 175
Total number of names not matching LC list: 16
Occurrences of names matching LC list: 7,202,345
Occurrences of non-matching names: 199,413

OCLC Codes:

Total number of distinct codes: 872
Total number of codes matching LC list: 210
Total number of codes not matching: 664 [1]
Occurrences of matching codes: 7,404,728
Occurrences of non-matching codes: 19,503

TOTAL: about 1.5% of instances are "bad". Removing the paired instances 

[1] Yes, I seem to be 2 off here - possibly due to headers in XSL files 
or something. Not worth tracking down. If anyone wants my data to 
compare to their own, I can send it.
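
The "about 1.5%" figure can be reproduced from the four occurrence counts listed above. This is only a sketch of the arithmetic; the variable names are illustrative, and it does not attempt the de-duplication of paired instances.

```python
# Occurrence counts reported above.
matching_names = 7_202_345
nonmatching_names = 199_413
matching_codes = 7_404_728
nonmatching_codes = 19_503

bad = nonmatching_names + nonmatching_codes
total = matching_names + nonmatching_names + matching_codes + nonmatching_codes
bad_share = bad / total

print(f"{bad} of {total} occurrences ({bad_share:.1%}) are 'bad'")
```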


1) There were very few non-matching names, and among those I did not 
see the expected mix of languages. The full set of non-matching names, 
including the nonsense ones, is:

Author of introduction    167949  [2]
Author of screenplay    27680
Curator    3753
Plaintiff -appellee    9
Contestant -appellee    4
e?d    4                                   [3]
Graphic technician    3
re?    2
se?    2
Respondent -appellee    1
keEUR    1
me?    1
o^v    1
prOE    1
røm    1
øga    1

[2] Most of these "bad" names are close variations on the LC list, 
missing ", etc.". This could either be a problem with how the data was 
produced or how I manipulated it.

[3] Yes, some of these are probably actually bad codes - I separated 
codes from names by length (anything over 3 chars counted as a name), so 
bad codes with Unicode characters and other garbage ended up here. Their 
occurrence numbers are low, however.
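
To illustrate how a length-based split can misfile entries, here is a minimal Python sketch. It assumes the length test was applied to bytes rather than characters, which is a guess and not something confirmed here; the function `classify` and its `by_bytes` flag are hypothetical names for illustration only.

```python
def classify(value: str, by_bytes: bool = False) -> str:
    """Hypothetical splitter: anything longer than 3 units is a 'name'."""
    length = len(value.encode("utf-8")) if by_bytes else len(value)
    return "name" if length > 3 else "code"

# Two real entries from the file, plus two of the stray values listed above.
for value in ["act", "Actor", "røm", "øga"]:
    print(value, classify(value), classify(value, by_bytes=True))
```

Counted in characters, "røm" and "øga" are three long and would be treated as codes; counted in UTF-8 bytes they are four long and would land among the names.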

2) There were many more bad codes than bad names. The top ten "bad" 
codes were:

csn 	3678
cnm 	2423
9pu 	1104
fme 	990
ens 	899
for 	735
orc 	735
sde 	715
dir 	405
730 	387

3) Like the stats on MARC fields and subfields, both names and codes 
have a long tail of nonsense. This long tail, however, accounts for very 
few occurrences, which is why, although there are many more "bad" codes 
than "bad" names, the number of occurrences of bad codes is low -- they 
are mostly one-offs.
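
The shape of that tail can be sketched from the numbers already given: the top ten non-matching codes listed above, the 664 distinct non-matching codes, and their 19,503 occurrences. This is only illustrative arithmetic in Python, with made-up variable names, and it takes the 664 figure at face value despite the off-by-two noted in [1].

```python
# Top ten non-matching codes and their occurrence counts, as listed above.
top_ten = {"csn": 3678, "cnm": 2423, "9pu": 1104, "fme": 990, "ens": 899,
           "for": 735, "orc": 735, "sde": 715, "dir": 405, "730": 387}

nonmatching_codes = 664          # distinct non-matching codes, as reported
nonmatching_occurrences = 19_503

top_total = sum(top_ten.values())
tail_codes = nonmatching_codes - len(top_ten)
tail_total = nonmatching_occurrences - top_total

print(f"top ten: {top_total} occurrences "
      f"({top_total / nonmatching_occurrences:.0%} of non-matching)")
print(f"remaining {tail_codes} codes: {tail_total} occurrences, "
      f"about {tail_total / tail_codes:.1f} each on average")
```

The average is only a summary of the remainder; it does not say how many of those codes occur exactly once.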


Role names seem to have been pretty carefully quality-controlled in 
OCLC; codes less so, or there was a problem in the creation of the dataset.

Non-English roles do not seem to be represented here.


Karen Coyle
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet