Steven Barr wrote and asked:

> 1) In some cases, data can be obtained from still-extant recording ledgers.
> Note that these ledgers (except for many Victor items) generally do NOT
> provide session personnel data...often including vocalists. However,
> they DO provide date and location of the recordings they document.
> In part, this answers your question immediately above; the relevant
> Brunswick ledgers DO exist, so matrix numbers can be entered by looking
> for the sheet listing the title in question (except that if more than
> one take was recorded, there is no reliable way of knowing which
> take was issued...!). Where ledgers no longer exist (virtually all
> minor/"indie" labels of the twenties) there is no way of knowing
> (accurately, anyway) matrix numbers, dates or other session-
> related data...and "educated best guesses," presumably prefixed
> with "c." or "est." or equivalents...and/or data from common
> sources (ADBD/CED/usw.) will have to suffice by default...!

Well, my present view is that we gather session-related data from anywhere we can find it, and provide for including duplicate data which may conflict with each other. The "metadata" for the source of every "data set" will be recorded, and an estimate of the level of "authority" could also be noted by another authority (e.g., ARSC). Users of the discographical information can filter out sources they don't believe are reliable, or will have several sources which they can view and weigh against each other.

In many ways this is thinking outside the box, since the long-held paradigm is that someone just has to pick the information which is "most correct" and run with that and exclude the other information. We should chuck that paradigm and simply record it all, so long as we properly identify the source of information, and let the end-user decide what to use based on the authority metadata.

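A minimal sketch of the "record it all, tag the source, let the user filter" idea above. Everything named here (the `Claim` record, its fields, the `candidates` function, the numeric authority rating, and the sample values) is an assumption made up for illustration, not a proposed schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One session-related statement, kept together with where it came from."""
    master_id: str                   # identifier of the recording master
    field: str                       # e.g. "date" or "location"
    value: str                       # the value exactly as the source gives it
    source: str                      # who or what supplied the value
    authority: Optional[int] = None  # rating assigned by a third party, if any


def candidates(claims, master_id, field, trusted=None, min_authority=None):
    """Group the surviving values for one field, listing the sources of each."""
    by_value = defaultdict(list)
    for claim in claims:
        if claim.master_id != master_id or claim.field != field:
            continue
        if trusted is not None and claim.source not in trusted:
            continue  # the user does not believe this source
        if min_authority is not None and (
            claim.authority is None or claim.authority < min_authority
        ):
            continue  # unrated or rated too low for this user
        by_value[claim.value].append(claim.source)
    return dict(by_value)


# Invented sample data: three sources disagree about one recording date.
CLAIMS = [
    Claim("M-0001", "date", "1927-03-01", "company ledger", authority=5),
    Claim("M-0001", "date", "c. 1927-03", "published discography", authority=3),
    Claim("M-0001", "date", "1926", "collector note"),
]
```

Calling `candidates(CLAIMS, "M-0001", "date")` returns all three conflicting values side by side, each with its sources, while passing `min_authority` or `trusted` narrows the view without deleting anything from the stored data.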
*****

At this point, let me note again that in my view the "discographical database" will comprise (at least) two parts, which will be independent of each other, other than identifier linkage at the recording master level [note]:

1) The recording artifact information, and
2) Session information.

[note, when the artifact does not provide its own master recording identifier, commonly referred to as "matrix number", we simply apply our own unique identifier to that field, so that it can point over to the session information if any is known. UUID is one candidate unique identifier that I would consider using, although its length may scare some people.]

Regarding 1, we record exactly what the recording artifact tells us, both written and physical. I'm not certain we even want to include "normalized" data of any kind in the artifact data, but if we do, that data would not replace what the record tells us, but would be added as "normalized" data, essentially in parallel to the artifact data. (If there are misspellings or mistakes in the textual information the artifact gives, we transcribe that text exactly "as is". We care about nothing except getting the text transcribed exactly as it is shown on the artifact. If the label is given as "Colombia", rather than "Columbia", and it is a Columbia, we record the label name as "Colombia".)

#2 is the real meat of discography since that's where we include session data, such as location, date, musicians, the known mastered recordings, etc. This is where interpolation is allowed, and we'd probably even see attempts at normalization, authority assignment, etc.

(I think that normalized "bios" of musicians, and a normalized song/composition database, should be separate, again with identifier linkages from the session data. For example, session data from ledgers may list Benny Goodman as "Benny Goodman", "Benjamin Goodman", "Bennie Goodman" and "Shoeless Joe Jackson".
And if that's all the information we have from the ledgers, that's what we use. So in order to tie these different variations to the same person, some authority in the future may connect them with a common and unique identifier (and here we even allow multiple authorities). Once we have a common identifier, then that can be used to create a biographical sketch for the performer in a separate XML document designed for that purpose.)

Anyway, just some of my thoughts...

> 2) Any discographic entity of whatever sort MUST list both the
> extant and actual information in cases where both exist (and are
> known to the compiler[s]). In some cases, what would appear to be
> an error actually is not; for example, the initial Brunswick
> recordings of "My Blue Heaven" are labelled as "Blue Heaven"...
> and play very slightly different lyrics (..."When the whippoorwills
> ARE calling...") which suggests they are actually the original
> versions of the tune...! I have always used two separate fields
> ("ARTCRED" and "ACTART") to track recordings issued under
> pseudonyms or those with credit errors. This, in turn, means
> I can query the database both for "Recordings on which Arthur
> Fields sings" and "Recordings on which the vocalist is credited
> as 'Mr. X'"...two entirely different questions! In fact, I can
> even query for "All recordings on which 'Arthur Fields' is
> credited as 'Mr. X'" should I need that specific data...!

The song/composition aspect of sound recordings can get complicated due to derivatives/variants as Steve noted. One problem is that I'm not sure the Session information should provide "normalized" song title stuff, since that should be done in a database for that purpose. Another problem is determining the canonical or normalized title of a song composition. Oftentimes a recorded composition and its lyrics will vary from the "canonical" version, yet nothing in the title or composer credits indicates that they vary.
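The identifier linkage described above, for performer names and song titles alike, can be sketched as a table of statements of the form "authority X says this transcribed string refers to entity Y". The `Link` record, the identifier strings such as `person:0001`, and the authority names are all hypothetical. The transcribed text is never altered, and several authorities may disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """An authority's statement that a transcribed string names an entity."""
    kind: str       # "person" or "composition"
    as_given: str   # text exactly as the ledger or label shows it
    entity_id: str  # identifier the authority assigned (made up here)
    authority: str  # who made the connection


def variants(links, entity_id, authority=None):
    """Every transcribed form tied to one identifier, optionally per authority."""
    return {
        link.as_given
        for link in links
        if link.entity_id == entity_id
        and (authority is None or link.authority == authority)
    }


def identify(links, kind, as_given):
    """Map each authority to the identifier it assigns to a transcribed string."""
    return {
        link.authority: link.entity_id
        for link in links
        if link.kind == kind and link.as_given == as_given
    }


# Illustrative links only; the authorities and identifiers are invented.
LINKS = [
    Link("person", "Benny Goodman", "person:0001", "authority-A"),
    Link("person", "Bennie Goodman", "person:0001", "authority-A"),
    Link("person", "Benjamin Goodman", "person:0001", "authority-B"),
    Link("composition", "Blue Heaven", "work:0001", "authority-A"),
    Link("composition", "My Blue Heaven", "work:0001", "authority-A"),
]
```

A string that no authority has yet connected, such as "Shoeless Joe Jackson" here, simply returns an empty result from `identify` and stays in the session data exactly as transcribed.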
The ledgers themselves, if they exist, may get the song title wrong. If we have no ledgers which list the song title, then the song title is pretty much given in the artifacts that still exist -- and even here we can have variants on the song title from label to label when the recording is issued on several labels. Geez, it gets messy...

And of course we have the wonderful complication of medleys. No doubt the seasoned discographers here can think up several more exceptions we have to deal with when it comes to song titles. It is one of the messier aspects of discography. (The next is musician info, but that's nowhere near as messy as song compositions and lyrics.)

> 3) IMO, the "wiki-db" should provide either (A) ALL available
> discographic data relevant to a phonorecord (or side thereof)
> with actual verified data items noted as such and "best guess"
> entries likewise identified...OR (B) enough information to
> identify a given phonorecord, along with (hyper?)links to
> other relevant data thereon. It should also be possible to
> query the database on any of its fields (including related
> data tables in the database) and receive a list of all
> phonorecords (including "None" if that is the case) which
> fit the query's declared criteria. Regardless of how the
> tables are set up, the results will be the same...the only
> difference being in how many different tables the data is
> stored! Note that my first discographic catalog database
> was NOT relational, which often resulted in a large number
> of empty data fields (which, in xBase, use as much space
> as completed fields...!); however, in these days of 1TB
> (and larger?) consumer hard drives, this is no longer a
> consideration...or so I am told...?!

We have to separate the database from the application. This is why the data must be stored (in a source sense) in XML. XML is portable, standardized, UTF-8 text encoded, and both human and machine readable.
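To make the separation of source XML from application concrete, here is a small sketch, assuming an invented vocabulary (`discography`, `artifact`, `side`, `session`, `master`, `label-name`) and placeholder titles and locations. It reads the XML source and builds an in-memory index linking artifact sides to session data through the master identifier, with a UUID standing in where no matrix number exists.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the element and attribute names are assumptions.
SOURCE = """\
<discography>
  <artifact>
    <label-name>Colombia</label-name>
    <side master="urn:uuid:00000000-0000-0000-0000-000000000001">
      <title>Example Title A</title>
    </side>
    <side master="urn:uuid:00000000-0000-0000-0000-000000000002">
      <title>Example Title B</title>
    </side>
  </artifact>
  <session>
    <master id="urn:uuid:00000000-0000-0000-0000-000000000001"/>
    <location>Example City</location>
  </session>
</discography>
"""


def load(xml_text):
    """Convert the XML source into simple structures an application can query."""
    root = ET.fromstring(xml_text)

    # Session data, indexed by the master identifier it documents.
    sessions = {}
    for session in root.iter("session"):
        location = session.findtext("location")
        for master in session.iter("master"):
            sessions[master.get("id")] = {"location": location}

    # Artifact data, transcribed as-is (note the label spelling).
    sides = []
    for artifact in root.iter("artifact"):
        label = artifact.findtext("label-name")
        for side in artifact.iter("side"):
            sides.append({
                "label": label,
                "title": side.findtext("title"),
                "master": side.get("master"),
            })
    return sides, sessions


def session_for(side, sessions):
    """Follow the identifier linkage; None when no session data is known."""
    return sessions.get(side["master"])
```

The internal dictionaries are disposable: another application could load the same source into a relational database instead, and the XML would remain the master copy.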
Certainly a specific application may import the XML and convert it to some internal form for fast processing/access, but we must NOT get into the mode where our discographical data is archived and transported in some proprietary, machine-readable-only database format. BAD.

[Now to really show my XML markup wonk side: ARSC should set a strict policy that the master discographical information be contained in an XML document (or documents) which is valid to the DTD controlled and maintained by ARSC. Furthermore, "internal subsets" are not allowed or ARSC will get very pissed. This is one step to make it much harder for someone to "proprietize" the XML documents for proprietary advantage -- if someone just has to have something new, they come to the ARSC committee overseeing the DTD and nicely ask for the DTD to be expanded. There also have to be controls on the use of other namespaces. And if it ends up that there are requirements which cannot be completely enforced by a DTD or Schema, then ARSC will write a script to verify that the XML documents conform to the other requirements.

I speak from first-hand experience having co-authored open standard XML-based e-book formats since 1999 for IDPF, where we had to take seriously the possibility of a company hijacking the spec by adding proprietary stuff for their advantage. Likewise, ARSC has to take firm control of the whole spec or it will get away and we'll end up again with a Tower of eBabel.

And it should be clear by now that ARSC should not agree to bless a proprietary database format for mastering the discographical information -- in my opinion it must be mastered as UTF-8 XML document(s) valid to the published, open standard ARSC-maintained DTD/Schema. This assures true internationalization and repurposeability of the discographical information into the very distant future...]

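As a rough idea of what such a verification script might look like, the sketch below uses Python's standard `xml.parsers.expat` module to reject documents that carry an internal DTD subset or declare an encoding other than UTF-8. The function name `policy_problems`, the DTD file name `arsc-discography.dtd`, and the sample documents are assumptions. The sketch does not validate against the DTD itself, which would need a validating parser.

```python
import xml.parsers.expat


def policy_problems(xml_bytes):
    """Return a list of policy violations found in one XML document."""
    problems = []

    def on_xml_decl(version, encoding, standalone):
        # encoding is None when the declaration omits it (XML then defaults
        # to UTF-8 unless a byte-order mark says otherwise).
        if encoding is not None and encoding.lower() != "utf-8":
            problems.append(f"declared encoding is {encoding}, not UTF-8")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if has_internal_subset:
            problems.append("internal DTD subset present")

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as exc:
        problems.append(f"not well-formed: {exc}")
    return problems


# Hypothetical sample documents for the checks above.
CLEAN = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE discography SYSTEM "arsc-discography.dtd">\n'
    b"<discography/>"
)
EXTENDED = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE discography SYSTEM "arsc-discography.dtd" '
    b'[<!ENTITY vendor "proprietary">]>\n'
    b"<discography/>"
)
LATIN1 = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<discography/>'
```

Running `policy_problems` over each submitted document gives an empty list for `CLEAN` and one reported problem each for `EXTENDED` and `LATIN1`. Controls on foreign namespaces could be added as further handlers in the same style.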
Anyway, so long as our ontology expressed in our XML DTD/Schema is complete, then that will enable applications which access the data to do whatever they want. It's simply a matter of developer time to get all the data visualization bells and whistles users desire. If a particular application is insufficient, that's the "fault" of the developer, not our database. Let a thousand flowers bloom! (As Mao said -- here I refer to applications using the XML discographical database. Let them compete with each other.)

> 4) Are you suggesting that "songs" and "compositions" be
> kept in separate (but relationally connected) tables?
> Likewise, what are you referring to as "normalization?"
> (the word has a specific meaning in the database "industry")

To answer your first question, yes, I'm leaning this way. Part of the reason is that song melodies/compositions/lyrics, like people, are really standalone entities that exist apart from the Session and the Artifact, and have their own richness best expressed in a separate ontology.

And about "normalization," you are right that I probably did not use the term properly. Among librarian catalogers there is a term used to describe "normalizing" or "standardizing" values, such as author names -- but for the life of me I can't remember what that term is. I'm sure several here will be able to provide the more accurate terminology from the cataloging world, and I wait for my memory to be jogged. <laugh/>

Jon Noring