Thanks, Ralph. That confirms what I was seeing. So essentially only the part about codes is to be looked at.

kc

On 5/29/13 9:53 AM, LeVan,Ralph wrote:
>
> The fully spelled out forms were not in the bib records themselves. I
> provided translations of the LC codes as part of my indexing process.
>
> Karen, are you saying that “Author of Introduction” and “Author of
> Screenplay” are incorrect renderings of the LC codes?
>
> Wow, I just looked up the relator codes and can see that there are two
> tables that don’t quite match. I used this one:
>
> http://www.loc.gov/marc/relators/relacode.html
>
> And you probably used this one:
>
> http://www.loc.gov/marc/relators/relaterm.html
>
> As you will see, “aus” translates to “Author of screenplay” in the
> first table, not “*Author of screenplay, etc.*”. Either way, the
> differences should not be considered significant.
>
> Roy, another source of inconsistency to add to your list!
>
> Ralph
>
> *From:* Bibliographic Framework Transition Initiative Forum
> [mailto:[log in to unmask]] *On Behalf Of* Karen Coyle
> *Sent:* Wednesday, May 29, 2013 12:05 PM
> *To:* [log in to unmask]
> *Subject:* [BIBFRAME] OCLC role stats
>
> I thank Ralph for providing the data from the OCLC database. It is
> interesting, however I suspect that it does not actually answer our
> question about how often roles are "bad". In any case, here is my
> analysis of the data. (p.s. I do not claim perfection in these
> numbers! Errors may have been introduced! Do not read this as any kind
> of truth!)
>
> First, caveats:
> 1. I have no idea what this set of 38 million names represents. It may
> or may not tell us about data in general outside of or even inside OCLC.
> 2. I do not know if these records have been through any quality
> control applied by OCLC, so I do not know if the OCLC data is the same
> as the data in local catalogs.
> 3. The OCLC file contains both codes and spelled out forms.
> Some of these were obvious pairs, based on the fact that they had the
> same exact number of occurrences. I did not de-dup these in my data
> gathering.
>
> e.g.:
>
> 124620  act
> 124620  Actor
>
> Data:
>
> 38 million names produced 14,825,668 codes and full forms. This would
> put about 40% of the names having a role. However, since many were
> name/code pairs that probably were in the same name field (and I
> didn't calculate how many) the number of names in this list having
> roles is somewhere between 20% and 40%, probably toward the lower end.
>
> Number of name/code pairs on LC list: 228
>
> OCLC Names:
>
> Total number of distinct names: 191
> Total number of names matching LC list: 175
> Total number of names not matching LC list: 16
> Occurrences of names matching LC list: 7,202,345
> Occurrences of non-matching names: 199,413
>
> OCLC Codes:
>
> Total number of distinct codes: 872
> Total number of codes matching LC list: 210
> Total number of codes not matching: 664 [1]
> Occurrences of matching codes: 7,404,728
> Occurrences of non-matching codes: 19,503
>
> TOTAL: about 1.5% of instances are "bad". Removing the paired
> instances (which
>
> [1] Yes, I seem to be 2 off here - possibly due to headers in XSL
> files or something. Not worth tracking down. If anyone wants my data
> to compare to their own, I can send it.
>
> SOME OBSERVATIONS
>
> 1) There were very few non-matching names, and in those, I did not see
> an expected mix of languages. The full set of non-matching names,
> including the nonsense ones, is:
>
> Author of introduction  167949 [2]
> Author of screenplay    27680
> Curator                 3753
> Plaintiff -appellee     9
> Contestant -appellee    4
> eÌd                     4 [3]
> Graphic technician      3
> reÌ                     2
> seÌ                     2
> Respondent -appellee    1
> keÌ€                    1
> meÌ                     1
> öv                      1
> prÌŒ                    1
> røm                     1
> øga                     1
>
> [2] Most of these "bad" names are close variations on the LC list,
> missing ", etc.".
> This could either be a problem with how the data was produced or how I
> manipulated it.
>
> [3] Yes, some of these are probably actually bad codes - I separated
> codes from names by > 3 chars, so bad codes with Unicode characters
> and other garbage ended up here. Their occurrence numbers are low,
> however.
>
> 2) There were many more bad codes than bad names. The top ten "bad"
> codes were:
>
> csn  3678
> cnm  2423
> 9pu  1104
> fme  990
> ens  899
> for  735
> orc  735
> sde  715
> dir  405
> 730  387
>
> 3) Like the stats on MARC fields and subfields, both names and codes
> have a long tail of nonsense. This long tail, however, accounts for
> very few occurrences. Which is why, although there are many more "bad"
> codes than "bad" names, the actual number of bad codes is low -- they
> are mostly one-offs.
>
> MY CONCLUSIONS
>
> Role names seem to have been pretty carefully quality-controlled in
> OCLC; codes less so, or there was a problem in the creation of the
> dataset.
>
> Non-English roles do not seem to be represented here.
>
> kc
>
> --
> Karen Coyle
> http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet

--
Karen Coyle
http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
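
For anyone who wants to re-check the "about 1.5%" and "about 40%" figures in the quoted analysis, here is a minimal Python sketch of the arithmetic. The counts are copied from the "OCLC Names" and "OCLC Codes" tallies in the message; the variable names and the `percent` helper are illustrative only, and using the sum of the four occurrence counts as the denominator is an assumption about how the 1.5% was derived.

```python
# Counts copied from the quoted message; names below are illustrative only.
TOTAL_NAME_FIELDS = 38_000_000      # "38 million names" (rounded in the message)
CODES_AND_FULL_FORMS = 14_825_668   # codes plus spelled-out forms, as stated

matching_names = 7_202_345
non_matching_names = 199_413
matching_codes = 7_404_728
non_matching_codes = 19_503


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumption: "bad" means any occurrence whose name or code is not on the LC list.
bad = non_matching_names + non_matching_codes
counted = matching_names + non_matching_names + matching_codes + non_matching_codes

bad_share = percent(bad, counted)

# Upper bound only: name/code pairs in the same field are counted twice here.
role_upper_bound = percent(CODES_AND_FULL_FORMS, TOTAL_NAME_FIELDS)

print(f"bad occurrences: {bad:,} of {counted:,} ({bad_share:.2f}%)")
print(f"upper bound on names with a role: {role_upper_bound:.1f}%")
```

The four occurrence counts add up to a few hundred more than the stated 14,825,668, so the sketch divides by their sum rather than by the stated total; either denominator gives a share close to the quoted 1.5%.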