While we're asking questions first and shooting later...
Can we also get another dump of the NAF that is formatted in the same
way as the SKOS dumps?
As it stands, because of the blank nodes, you need to load the entire
dataset into something (memory, a database, whatever) to get all of the
variant labels (and whatnot).
Ok, so it doesn't need to be SKOS, necessarily, but can we get
something that we can sort on subject URI and stream?
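To make the "sort and stream" idea concrete, here is a rough sketch of
what I have in mind. It assumes a line-based N-Triples dump with no
blank nodes that has already been sorted on subject (e.g. with plain
sort); the function names and the example.org URIs are made up for
illustration and are not anything id.loc.gov provides today.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def subject_of(line: str) -> str:
    """Return the subject term: everything before the first whitespace."""
    return line.split(None, 1)[0]


def records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (subject, triples) groups from subject-sorted N-Triples lines.

    Only one subject's triples are held in memory at a time, so the
    input can be an open file handle on a dump of any size.
    """
    stripped = (ln.strip() for ln in lines)
    # Skip blank lines and comment lines.
    triples = (ln for ln in stripped if ln and not ln.startswith("#"))
    for subject, group in groupby(triples, key=subject_of):
        yield subject, list(group)


# Hypothetical sample data; real subjects would be id.loc.gov name URIs.
sample = [
    '<http://example.org/names/n1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Name One" .',
    '<http://example.org/names/n1> <http://www.w3.org/2004/02/skos/core#altLabel> "One, Name" .',
    '<http://example.org/names/n2> <http://www.w3.org/2004/02/skos/core#prefLabel> "Name Two" .',
]

for subject, triples in records(sample):
    print(subject, len(triples))
```

With the labels hanging directly off the subject URI, every variant
label for a name arrives in the same group, which is exactly what the
blank nodes prevent right now.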
On Mon, Dec 12, 2011 at 5:40 PM, Ford, Kevin <[log in to unmask]> wrote:
> The story here is you need to inquire first and code second.
> You saw a deficiency in the bulk downloads. That's a good thing, and something that had been missed on our end. But, instead of inquiring about this, you unleashed an irresponsible amount of traffic at ID. And, quizzically, based on my reading of your email, you believe one of the better solutions is to slow your crawl versus engaging us and passing on the very valuable information that blocking your heedless crawl unintentionally elicited. You note the "slow" nature of crawling; you should be mindful of the numbers: it would take 23 days of retrieving 4 names per second to crawl the entire Names file at ID. That strikes me as inefficient, even if a computer is doing all the work.
> This is in contrast to the three days, or so, that it takes to generate the bulk downloads, which it is high time we did. And, I assure you, learning this *before* we do that work means that it'll get fixed in a timely manner. That is why communicating a problem with the ID service and/or data should be the first course of action, and is often the best course. I'm confident other users of the bulk download files will benefit from our addressing this issue also.
> So, thanks for drawing our attention to this problem, even if it was in a rather circuitous manner requiring gigabytes of network traffic and far, far more effort than exchanging a couple of emails would have. You'll be unblocked at some point in the next couple of days.
> Kevin Ford
> Network Development and MARC Standards Office
> Library of Congress
>> -----Original Message-----
>> From: Ford, Kevin
>> Sent: Monday, December 12, 2011 4:05 PM
>> To: Ford, Kevin
>> Subject: RE: [ID.LOC.GOV] LCNAF & HTTP requests to id.loc.gov
>> From: Authorities and Vocabularies Service Discussion List
>> [mailto:[log in to unmask]] On Behalf Of Trevor Thornton
>> Sent: Monday, December 12, 2011 2:59 PM
>> To: [log in to unmask]
>> Subject: [ID.LOC.GOV] LCNAF & HTTP requests to id.loc.gov
>> To Whom It May Concern,
>> (Apologies in advance if you receive 2 versions of this message - I
>> submitted it via your web form also)
>> I am an applications developer at the New York Public Library. We are
>> creating a tool to assist our metadata catalogers in using terms from
>> authorized sources (currently just LC authorities and Getty thesauri).
>> The first step is to get all of the terms into a centralized database.
>> I've been working from your RDF downloads, and have been able to get
>> all of the information I need for LCSH and LCGFT from those. I had a
>> problem, however, with the data included with the LCNAF downloads. The
>> MADS/RDF file does not include the type of name (e.g. personal,
>> corporate, conference, title, etc.). This is specified in the
>> individual records with a distinct wrapper element (e.g.
>> It's important for us to be able to easily differentiate between types
>> of names, therefore I need to record this in our database. Since it is
>> not included in the download, I've been using the LCNAF to VIAF RDF
>> file as a sort of manifest, and sending an HTTP request for each LCNAF
>> URI to retrieve the full record, then extracting the name type by
>> evaluating the name of the wrapper element.
>> This was working pretty well, though it was slow. Today I tried to
>> multithread this task, effectively doubling the number of hits to your
>> server. This resulted in my being blocked. I was afraid that would
>> happen, and I'm sorry that I did not notify you in advance.
>> So I have 2 questions:
>> 1. Is there a better way to retrieve name types for the records in
>> LCNAF, one that doesn't involve fetching each individual record from
>> id.loc.gov? Perhaps another extract of the data that isn't listed on
>> the site?
>> 2. If there is no better way to get this data, can you unblock me, on
>> the condition that I go back to my slow, single-thread procedure? My IP
>> address is 184.108.40.206.
>> Please let me know if you need any more info from me, and thanks in
>> advance for your help.
>> Trevor Thornton
>> Trevor Thornton
>> Applications Developer, Information Technology Group
>> The New York Public Library
>> phone: 212-621-0607
>> email: [log in to unmask]