I thank Ralph for providing the data from the OCLC database. It is
interesting; however, I suspect that it does not actually answer our
question about how often roles are "bad". In any case, here is my
analysis of the data. (p.s. I do not claim perfection in these
numbers! Errors may have been introduced! Do not read this as any
kind of truth!)
1. I have no idea what this set of 38 million names represents. It
may or may not tell us about data in general outside of or even
2. I do not know if these records have been through any quality
control applied by OCLC, so I do not know if the OCLC data is the
same as the data in local catalogs.
3. The OCLC file contains both codes and spelled-out forms. Some of
these were obvious pairs, based on the fact that they had the same
exact number of occurrences. I did not de-dup these in my data
38 million names produced 14,825,668 codes and full forms. This
would mean that about 40% of the names have a role. However, since many
were name/code pairs that were probably in the same name field (and
I didn't calculate how many), the number of names in this list having
roles is somewhere between 20% and 40%, probably toward the lower
Number of name/code pairs on LC list: 228
Total number of distinct names: 191
Total number of names matching LC list: 175
Total number of names not matching LC list: 16
Occurrences of names matching LC list: 7,202,345
Occurrences of non-matching names: 199,413
Total number of distinct codes: 872
Total number of codes matching LC list: 210
Total number of codes not matching: 664 
Occurrences of matching codes: 7,404,728
Occurrences of non-matching codes: 19,503
TOTAL: about 1.5% of instances are "bad". Removing the paired
 Yes, I seem to be 2 off here - possibly due to headers in XSL
files or something. Not worth tracking down. If anyone wants my data
to compare to their own, I can send it.
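For anyone who wants to check the arithmetic, here is a rough sketch in
Python that recomputes the figures from the counts listed above. The
variable names are mine, 38 million is only the approximate total, and
the lower bound assumes (my assumption, not something measured) that
every role string is one half of a name/code pair in the same field.

```python
# Occurrence counts copied from the figures reported above.
names_matching_occ = 7_202_345
names_nonmatching_occ = 199_413
codes_matching_occ = 7_404_728
codes_nonmatching_occ = 19_503

# Share of role occurrences that do not match the LC list.
bad = names_nonmatching_occ + codes_nonmatching_occ
total = (names_matching_occ + names_nonmatching_occ
         + codes_matching_occ + codes_nonmatching_occ)
bad_share = bad / total

# Distinct-code check: matching + non-matching vs. reported distinct codes.
distinct_codes = 872
codes_discrepancy = (210 + 664) - distinct_codes

# Rough bounds on the share of names carrying a role.
total_names = 38_000_000   # approximate, as stated above
role_forms = 14_825_668    # codes and full forms combined
upper = role_forms / total_names        # no role string is part of a pair
lower = (role_forms / 2) / total_names  # ASSUMPTION: every one is paired

print(f"bad share: {bad_share:.2%}")
print(f"distinct-code discrepancy: {codes_discrepancy}")
print(f"names with a role: between {lower:.1%} and {upper:.1%}")
```

The bad share comes out just under 1.5%, and the bounds land at roughly
20% and 39%, in line with the estimates given above.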
1) There were very few non-matching names, and in those, I did not
see an expected mix of languages. The full set of non-matching
names, including the nonsense ones, is:
 Most of these "bad" names are close variations on the LC list,
missing ", etc.". This could be a problem either with how the data
was produced or with how I manipulated it.
 Yes, some of these are probably actually bad codes - I separated
codes from names by length (anything over 3 characters was treated
as a name), so bad codes with Unicode characters and other garbage
ended up here. Their occurrence numbers are low, however.
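To make the length rule concrete, here is a minimal sketch in Python of
that kind of split. It is not the script I actually ran: the function
name is made up for illustration, the three-entry LC table is only a
tiny excerpt of the real list, and exact string matching is an
assumption.

```python
# Tiny illustrative excerpt of the LC relator list (code -> name).
# The real list is much longer; this table is a placeholder.
LC_RELATORS = {"aut": "author", "edt": "editor", "ill": "illustrator"}
LC_NAMES = set(LC_RELATORS.values())


def classify(role):
    """Return (kind, matches_lc) for one role string.

    Anything longer than 3 characters is treated as a name, everything
    else as a code. Matching is exact after trimming whitespace
    (an assumption made for this sketch).
    """
    value = role.strip()
    if len(value) > 3:
        return "name", value in LC_NAMES
    return "code", value in LC_RELATORS


# A mangled code with an extra character is 4 characters long, so it is
# counted as a non-matching name rather than as a bad code.
for sample in ["aut", "author", "aut\u00e9", "xx"]:
    print(repr(sample), classify(sample))
```

This shows why a garbled code can surface in the list of "bad" names:
one stray character is enough to push it past the 3-character cutoff.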
2) There were many more bad codes than bad names. The top ten "bad"
3) Like the stats on MARC fields and subfields, both names and codes
have a long tail of nonsense. This long tail, however, accounts for
very few occurrences, which is why, although there are many more
distinct "bad" codes than "bad" names, the total number of bad code
occurrences is low: they are mostly one-offs.
Role names seem to have been pretty carefully quality-controlled in
OCLC; codes less so, or there was a problem in the creation of the
Non-English roles do not seem to be represented here.