Steven Barr wrote and asked:

> 1) In some cases, data can be obtained from still-extant recording ledgers.
> Note that these ledgers (except for many Victor items) generally do NOT
> provide session personnel data...often including vocalists. However,
> they DO provide date and location of the recordings they document.
> In part, this answers your question immediately above; the relevant
> Brunswick ledgers DO exist, so matrix numbers can be entered by looking
> for the sheet listing the title in question (except that if more than
> one take was recorded, there is no reliable way of knowing which
> take was issued...!). Where ledgers no longer exist (virtually all
> minor/"indie" labels of the twenties) there is no way of knowing
> (accurately, anyway) matrix numbers, dates or other session-
> related data...and "educated best guesses," presumably prefixed
> with "c." or "est." or equivalents...and/or data from common
> sources (ADBD/CED/usw.) will have to suffice by default...!

Well, my present view is that we gather session-related data from
anywhere we can find it, and allow for recording overlapping data sets
which may conflict with each other. The "metadata" for the source of
every "data set" will be recorded, and an estimate of each source's
level of "authority" could also be noted by an outside body (e.g., ARSC).
Users of the discographical information can filter out sources
they don't believe are reliable, or will have several sources which
they can view and weigh against each other.
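
To make the idea concrete, here is a minimal Python sketch, not a
proposed design. The class name, the field names, and the 0-to-1
authority scale are all my assumptions for illustration, and the sample
values are placeholders rather than real discographical data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionClaim:
    """One source's statement about one field of a session (hypothetical model)."""
    master_id: str    # identifier of the recording master the claim refers to
    field: str        # e.g. "date" or "location"
    value: str        # the value exactly as this source gives it
    source: str       # metadata: where the claim comes from
    authority: float  # assumed 0.0-1.0 rating assigned by some rating body


def filter_claims(claims, field, min_authority=0.0, exclude_sources=()):
    """Return every claim for `field` the user is willing to trust, best first."""
    kept = [c for c in claims
            if c.field == field
            and c.authority >= min_authority
            and c.source not in exclude_sources]
    return sorted(kept, key=lambda c: c.authority, reverse=True)


# Placeholder data: two sources that disagree about the same session date.
claims = [
    SessionClaim("M-0001", "date", "1927-09-14", "company ledger", 0.9),
    SessionClaim("M-0001", "date", "c. 1927-09", "published discography", 0.6),
    SessionClaim("M-0001", "location", "New York", "company ledger", 0.9),
]

for claim in filter_claims(claims, "date", min_authority=0.5):
    print(f"{claim.value!r} per {claim.source} (authority {claim.authority})")
```

Nothing is thrown away here: conflicting claims sit side by side, and
the filtering happens only when a user asks a question.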

In many ways this is thinking outside the box, since the long-held
paradigm is that someone just has to pick the information which is
"most correct" and run with that and exclude the other information.
We should chuck that paradigm and simply record it all, so long as
we properly identify the source of information, and let the end-user
decide what to use based on the authority metadata.


At this point, let me note again that in my view the "discographical
database" will comprise (at least) two parts, which will be independent
of each other, other than identifier linkage at the recording master
level [note]:

1) The recording artifact information, and

2) Session information.

[note: when the artifact does not provide its own master recording
identifier, commonly referred to as "matrix number", we simply apply
our own unique identifier to that field, so that it can point over to
the session information if any is known. UUID is one candidate unique
identifier that I would consider using, although its length may scare
some people.]
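
A rough Python sketch of that linkage, assuming a UUID is used only when
the artifact shows no matrix number. The function name, the dictionary
keys, and the "urn:uuid:" prefix are illustrative assumptions, not part
of any agreed format.

```python
import uuid


def master_identifier(matrix_number=None):
    """Return the artifact's own matrix number if it has one, else a fresh UUID."""
    if matrix_number:
        return matrix_number
    return f"urn:uuid:{uuid.uuid4()}"


# Hypothetical records: the two parts share nothing but the master identifier.
master_id = master_identifier()  # this artifact shows no matrix number
artifact = {"master_id": master_id, "label_text": "as printed on the label"}
session = {"master_id": master_id, "date": "unknown", "location": "unknown"}

print(master_id)
```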

Regarding 1, we record exactly what the recording artifact tells us,
both written and physical. I'm not certain we even want to include
"normalized" data of any kind in the artifact data, but if we do, that
data would not replace what the record tells us, but would be added as
"normalized" data, essentially in parallel to the artifact data. (If
there are misspellings or mistakes in the textual information the
artifact gives, we transcribe that text exactly "as is". We care about
nothing except transcribing the text exactly as it is shown on the
artifact. If the label is given as "Colombia", rather than "Columbia",
and it is a Columbia, we record the labelname as "Colombia".)
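
One way to picture "normalized data in parallel" is the small Python
sketch below. The class and attribute names are hypothetical; the only
point is that the transcription is stored untouched and any normalized
value sits beside it, with a note of who supplied it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscribedField:
    """A piece of artifact text plus an optional parallel normalized value."""
    as_shown: str                        # exactly what the artifact says, errors included
    normalized: Optional[str] = None     # added alongside; never replaces as_shown
    normalized_by: Optional[str] = None  # who supplied the normalized value


label_name = TranscribedField(as_shown="Colombia")
label_name.normalized = "Columbia"
label_name.normalized_by = "hypothetical cataloger"

print(label_name.as_shown, "->", label_name.normalized)
```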

#2 is the real meat of discography since that's where we include
session data, such as location, date, musicians, the known mastered
recordings, etc. This is where interpolation is allowed, and we'd
probably even see attempts at normalization, authority assignment,
and the like.

(I think that normalized "bios" of musicians, and a normalized
song/composition database, should be separate, again with identifier
linkages from the session data. For example, session data from ledgers may list
Benny Goodman as "Benny Goodman", "Benjamin Goodman", "Bennie Goodman"
and "Shoeless Joe Jackson". And if that's all the information we have
from the ledgers, that's what we use. So in order to tie these different
variations to the same person, some authority in the future may connect
them with a common and unique identifier (and here we even allow
multiple authorities.) Once we have a common identifier, then that can
be used to create a biographical sketch for the performer in a
separate XML document designed for that purpose.)
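
As a sketch only, the linking could look like this in Python. The
authority names and the "person:0001" identifier are placeholders I made
up; the structure simply records which authority tied which name
variant to which person identifier.

```python
from collections import defaultdict

# Each entry: (asserting authority, name as given in session data, person identifier).
# All authority names and identifiers are placeholders.
assertions = [
    ("authority-A", "Benny Goodman", "person:0001"),
    ("authority-A", "Benjamin Goodman", "person:0001"),
    ("authority-A", "Bennie Goodman", "person:0001"),
    ("authority-B", "Shoeless Joe Jackson", "person:0001"),
]


def variants_for(person_id, assertions, trusted=None):
    """Collect name variants linked to person_id, grouped by asserting authority.

    If `trusted` is given, only assertions from those authorities are used.
    """
    result = defaultdict(list)
    for authority, name, pid in assertions:
        if pid == person_id and (trusted is None or authority in trusted):
            result[authority].append(name)
    return dict(result)


print(variants_for("person:0001", assertions))
```

The session data itself is never rewritten; a user who distrusts one
authority just leaves it out of `trusted`.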

Anyway, just some of my thoughts...

> 2) Any discographic entity of whatever sort MUST list both the
> extant and actual information in cases where both exist (and are
> known to the compiler[s]). In some cases, what would appear to be
> an error actually is not; for example, the initial Brunswick
> recordings of "My Blue Heaven" are labelled as "Blue Heaven"...
> and play very slightly different lyrics (..."When the whippoorwills
> ARE calling...") which suggests they are actually the original
> versions of the tune...! I have always used two separate fields
> ("ARTCRED" and "ACTART") to track recordings issued under 
> pseudonyms or those with credit errors. This, in turn, means
> I can query the database both for "Recordings on which Arthur
> Fields sings" and "Recordings on which the vocalist is credited
> as 'Mr. X'"...two entirely different questions! In fact, I can
> even query for "All recordings on which 'Arthur Fields' is
> credited as 'Mr. X'" should I need that specific data...!

The song/composition aspect of sound recordings can get complicated
due to derivatives/variants as Steve noted. I'm not sure the Session
information should provide "normalized" song title stuff, since that
should be done in a database for that purpose. The harder problem is
determining the canonical or normalized title of a song composition.
Often the music and lyrics on a recording will vary from the
"canonical" version, yet the title and composer credits give no
indication of the variation. The ledgers themselves, if they
exist, may get the song title wrong. If we have no ledgers which list
the song title, then the song title is pretty much given in the
artifacts that still exist -- and even here we can have variants on
the song title from label to label when the recording is issued on
several labels. Geez, it gets messy...

And of course we have the wonderful complication of medleys. No doubt
the seasoned discographers here can think up several more exceptions
we have to deal with when it comes to song titles. It is one of the
messier aspects of discography. (The next is musician info, but that's
nowhere near as messy as song compositions and lyrics.)

> 3) IMO, the "wiki-db" should provide either (A) ALL available
> discographic data relevant to a phonorecord (or side thereof)
> with actual verified data items noted as such and "best guess"
> entries likewise identified...OR (B) enough information to
> identify a given phonorecord, along with (hyper?)links to
> other relevant data thereon. It should also be possible to
> query the database on any of its fields (including related
> data tables in the database) and receive a list of all
> phonorecords (including "None" if that is the case) which
> fit the query's declared criteria. Regardless of how the
> tables are set up, the results will be the same...the only
> difference being in how many different tables the data is
> stored! Note that my first discographic catalog database
> was NOT relational, which often resulted in a large number
> of empty data fields (which, in xBase, use as much space
> as completed fields...!); however, in these days of 1TB
> (and larger?) consumer hard drives, this is no longer a
> consideration...or so I am told...?!

We have to separate the database from the application. This is why the
data must be stored (in a source sense) in XML. XML is portable,
standardized, UTF-8 text encoded, and both human and machine readable.
Certainly a specific application may import the XML and convert it to
some internal form for fast processing/access, but we must NOT get into
the mode where our discographical data is archived and transported in
some proprietary, machine-readable-only database format. BAD.
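
Here is a minimal Python sketch of that separation, using only the
standard library. The element and attribute names (discography, session,
master, date, source) are invented for the example and the values are
placeholders; no actual schema is implied.

```python
import xml.etree.ElementTree as ET

# Placeholder XML source document; element names are hypothetical.
SOURCE = """<?xml version="1.0" encoding="UTF-8"?>
<discography>
  <session master="M-0001">
    <date source="company ledger">1927-09-14</date>
    <date source="published discography">c. 1927-09</date>
  </session>
</discography>
"""


def build_index(xml_bytes):
    """Convert the XML source into an in-memory dict keyed by master identifier."""
    root = ET.fromstring(xml_bytes)
    index = {}
    for session in root.iter("session"):
        dates = [(d.get("source"), d.text) for d in session.findall("date")]
        index[session.get("master")] = dates
    return index


index = build_index(SOURCE.encode("utf-8"))
print(index["M-0001"])
```

The dict is the application's private working form. The XML stays the
master copy, and a different application is free to load it into
something else entirely.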

[Now to really show my XML markup wonk side: ARSC should set a strict
policy that the master discographical information be contained in an
XML document (or documents) which is valid to the DTD controlled and
maintained by ARSC. Furthermore, "internal subsets" are not allowed or
ARSC will get very pissed. This is one step to make it much harder for
someone to "proprietize" the XML documents for proprietary advantage --
if someone just gotta have something new, they come to the ARSC committee
overseeing the DTD and nicely ask for the DTD to be expanded. There
also have to be controls on the use of other namespaces. And if it ends
up that there are requirements which cannot be completely enforced by a
DTD or Schema, then ARSC will write a script to verify the XML
documents conform to the other requirements. I speak from first-hand
experience having co-authored open standard XML-based e-book formats
since 1999 for IDPF, where we had to take seriously the possibility of
a company hijacking the spec by adding proprietary stuff for their
advantage. Likewise, ARSC has to take firm control of the whole spec
or it will get away and we'll end up again with a Tower of eBabel. And
it should be clear by now that ARSC should not agree to bless a
proprietary database format for mastering the discographical information
-- in my opinion it must be mastered as UTF-8 XML document(s) valid to
the published, open standard ARSC-maintained DTD/Schema. This assures
true internationalization and repurposeability of the discographical
information into the very distant future...]
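
As an example of the kind of extra check such a script could make, here
is a short Python sketch that reports whether a document carries an
internal DTD subset. It uses the standard library's expat bindings, which
do not validate against the DTD, so real validation would need a
separate validating parser. The function name, document names, and DTD
filename are assumptions for illustration.

```python
import xml.parsers.expat


def has_internal_subset(xml_text):
    """Return True if the document's DOCTYPE declaration has an internal subset."""
    found = {"internal": False}

    def on_doctype(name, system_id, public_id, has_internal):
        # expat passes a true value when an internal subset is present.
        found["internal"] = bool(has_internal)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found["internal"]


# Hypothetical documents: one plain, one extended through an internal subset.
CLEAN = '<!DOCTYPE discography SYSTEM "discography.dtd"><discography/>'
EXTENDED = ('<!DOCTYPE discography SYSTEM "discography.dtd" '
            '[<!ELEMENT vendor-extra (#PCDATA)>]><discography/>')

for doc_name, text in (("CLEAN", CLEAN), ("EXTENDED", EXTENDED)):
    verdict = "rejected" if has_internal_subset(text) else "accepted"
    print(doc_name, verdict)
```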

Anyway, so long as the ontology expressed in our XML DTD/Schema is
complete, that will enable applications which access the data to
do whatever they want. It's simply a matter of developer time to get
all the data visualization bells and whistles users desire. If a
particular application is insufficient, that's the "fault" of the
developer, not our database. Let a hundred flowers bloom! (As Mao
said -- here I refer to applications using the XML discographical
database. Let them compete with each other.)

> 4) Are you suggesting that "songs" and "compositions" be
> kept in separate (but relationally connected) tables?
> Likewise, what are you referring to as "normalization?"
> (the word has a specific meaning in the database "industry")

To answer your first question, yes, I'm leaning this way. Part of the
reason is that, like people,
song melodies/compositions/lyrics are really standalone entities that
exist apart from the Session and the Artifact, and have their own
richness best expressed in a separate ontology.

And about "normalization," you are right that I probably did not use
the term properly. Among librarian catalogers there is a term used to
describe "normalizing" or "standardizing" values, such as author names
-- but for the life of me I can't remember what that term is. I'm sure
several here will be able to provide the more accurate terminology from
the cataloging world, and I wait for my memory to be jogged. <laugh/>

Jon Noring