Thanks, Ralph. That confirms what I was seeing. So essentially only the 
part about codes needs to be looked at.

kc

On 5/29/13 9:53 AM, LeVan,Ralph wrote:
>
> The fully spelled out forms were not in the bib records themselves.  I 
> provided translations of the LC codes as part of my indexing process.
>
> Karen, are you saying that “Author of Introduction” and “Author of 
> Screenplay” are incorrect renderings of the LC codes?
>
> Wow, I just looked up the relator codes and can see that there are two 
> tables that don’t quite match.  I used this one:
>
> http://www.loc.gov/marc/relators/relacode.html
>
> And you probably used this one:
>
> http://www.loc.gov/marc/relators/relaterm.html
>
> As you will see, “aus” translates to “Author of screenplay” in the 
> first table, not “*Author of screenplay, etc.*”. Either way, the 
> differences should not be considered significant.
>
> Roy, another source of inconsistency to add to your list!
>
> Ralph
>
> *From:*Bibliographic Framework Transition Initiative Forum 
> [mailto:[log in to unmask]] *On Behalf Of *Karen Coyle
> *Sent:* Wednesday, May 29, 2013 12:05 PM
> *To:* [log in to unmask]
> *Subject:* [BIBFRAME] OCLC role stats
>
> I thank Ralph for providing the data from the OCLC database. It is 
> interesting; however, I suspect that it does not actually answer our 
> question about how often roles are "bad". In any case, here is my 
> analysis of the data. (p.s. I do not claim perfection in these 
> numbers! Errors may have been introduced! Do not read this as any kind 
> of truth!)
>
> First, caveats:
> 1. I have no idea what this set of 38 million names represents. It may 
> or may not tell us about data in general outside of or even inside OCLC.
> 2. I do not know if these records have been through any quality 
> control applied by OCLC, so I do not know if the OCLC data is the same 
> as the data in local catalogs.
> 3. The OCLC file contains both codes and spelled out forms. Some of 
> these were obvious pairs, based on the fact that they had the same 
> exact number of occurrences. I did not de-dup these in my data gathering.
>
> e.g.:
>
> 124620    act
> 124620    Actor
>
>
> Data:
>
> 38 million names produced 14,825,668 codes and full forms. That would 
> mean about 40% of the names have a role. However, since many were 
> name/code pairs that were probably in the same name field (and I 
> didn't calculate how many), the number of names in this list having 
> roles is somewhere between 20% and 40%, probably toward the lower end.
>
> Number of name/code pairs on LC list: 228
>
> OCLC Names:
>
> Total number of distinct names: 191
> Total number of names matching LC list: 175
> Total number of names not matching LC list: 16
> Occurrences of names matching LC list: 7,202,345
> Occurrences of non-matching names: 199,413
>
> OCLC Codes:
>
> Total number of distinct codes: 872
> Total number of codes matching LC list: 210
> Total number of codes not matching: 664 [1]
> Occurrences of matching codes: 7,404,728
> Occurrences of non-matching codes: 19,503
>
> TOTAL: about 1.5% of instances are "bad". Removing the paired 
> instances (which
>
> [1] Yes, I seem to be 2 off here - possibly due to headers in XSL 
> files or something. Not worth tracking down. If anyone wants my data 
> to compare to their own, I can send it.
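
The percentages quoted above can be re-derived from the posted counts. The following is only a minimal sketch of that arithmetic: the variable names are made up for illustration, the "38 million" figure is rounded in the source, and the lower bound assumes every role string is one half of a code/name pair, which the message itself does not establish.

```python
# Figures copied from the message above; names are illustrative only.
TOTAL_NAMES = 38_000_000    # "38 million names" (rounded in the source)
ROLE_STRINGS = 14_825_668   # codes plus spelled-out forms

bad_names = 199_413         # occurrences of non-matching names
bad_codes = 19_503          # occurrences of non-matching codes


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Upper bound: every code or spelled-out form sits on a different name.
upper_bound = share(ROLE_STRINGS, TOTAL_NAMES)

# Lower bound (assumption): every role string is half of a code/name pair.
lower_bound = upper_bound / 2

# Share of all role strings that did not match the LC list.
bad_share = share(bad_names + bad_codes, ROLE_STRINGS)

print(f"names with a role: between {lower_bound:.1f}% and {upper_bound:.1f}%")
print(f"non-matching role strings: {bad_share:.2f}%")
```

The results land close to the rounded "20% to 40%" and "about 1.5%" figures given in the message.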
>
> SOME OBSERVATIONS
>
> 1) There were very few non-matching names, and in those, I did not see 
> an expected mix of languages. The full set of non-matching names, 
> including the nonsense ones, is:
>
> Author of introduction    167949  [2]
> Author of screenplay    27680
> Curator    3753
> Plaintiff -appellee    9
> Contestant -appellee    4
> éd    4                                   [3]
> Graphic technician    3
> ré    2
> sé    2
> Respondent -appellee    1
> kè    1
> mé    1
> öv    1
> př    1
> røm    1
> øga    1
>
> [2] Most of these "bad" names are close variations on the LC list, 
> missing ", etc.". This could either be a problem with how the data was 
> produced or how I manipulated it.
>
> [3] Yes, some of these are probably actually bad codes - I separated 
> codes from names by > 3 chars, so bad codes with Unicode characters 
> and other garbage ended up here. Their occurrence numbers are low, 
> however.
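
The separation described in [3] (strings longer than 3 characters treated as names, everything else as codes) followed by a lookup against the LC list can be sketched roughly as below. This is not the script actually used for the analysis: `classify`, `tally`, and the two small sets are hypothetical, and the sets hold only a couple of entries mentioned in this thread, not the full 228-pair LC list.

```python
from collections import Counter

# Tiny illustrative subset of the LC relator list (assumption: the real
# list has 228 code/name pairs and would be loaded from the LC tables).
LC_CODES = {"act", "aus"}
LC_NAMES = {"Actor", "Author of screenplay, etc."}


def classify(value):
    """Treat strings longer than 3 characters as names, the rest as codes."""
    return "name" if len(value) > 3 else "code"


def tally(rows):
    """Sum occurrences per (kind, matches-LC-list) bucket."""
    totals = Counter()
    for value, count in rows:
        kind = classify(value)
        known = LC_NAMES if kind == "name" else LC_CODES
        totals[(kind, value in known)] += count
    return totals


# Sample rows taken from the counts quoted in this message.
rows = [
    ("act", 124620),
    ("Actor", 124620),
    ("Author of screenplay", 27680),
    ("csn", 3678),
    ("røm", 1),
]
totals = tally(rows)
for (kind, matched), count in sorted(totals.items()):
    print(kind, "matching" if matched else "non-matching", count)
```

Python's `len` counts characters, so "røm" is classified as a code here. A length test that counts bytes instead would see 4 bytes for "røm" in UTF-8 and file it under names. That is one possible reason, not a confirmed one, for some of the stray entries in the list above, and it would not explain the two-letter ones.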
>
> 2) There were many more bad codes than bad names. The top ten "bad" 
> codes were:
>
> csn    3678
> cnm    2423
> 9pu    1104
> fme    990
> ens    899
> for    735
> orc    735
> sde    715
> dir    405
> 730    387
>
>
>
> 3) Like the stats on MARC fields and subfields, both names and codes 
> have a long tail of nonsense. This long tail, however, accounts for 
> very few occurrences. Which is why, although there are many more "bad" 
> codes than "bad" names, the actual number of bad codes is low -- they 
> are mostly one-offs.
>
> MY CONCLUSIONS
>
> Role names seem to have been pretty carefully quality-controlled in 
> OCLC; codes less so, or there was a problem in the creation of the 
> dataset.
>
> Non-English roles do not seem to be represented here.
>
> kc
>
> -- 
> Karen Coyle
> [log in to unmask]  <mailto:[log in to unmask]>  http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet

-- 
Karen Coyle
[log in to unmask] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet