I have a few comments on the recent thread about ABX testing. I am
not a statistician, but I had some assistance from a statistician in
performing the tests. This was also several years ago, so I may have
forgotten some of the details.
This was a video ABX test, not an audio one, but it was modeled on
many of the audio tests that had been done, with some enhancements.
I am under NDA about the client/project.
Basically, we obtained excellent-quality original, uncompressed (D5)
video clips for the tests. These clips were run through the
equipment/system suspected of causing the degradation. We had
multiple clips. The clips were randomized, and each was run a number
of times before moving on to the next clip. We had a box with an
A/B/X switcher and two vote buttons: "X is A" and "X is B".
We ran up to about 15 tries (votes) on each clip, although there were
some statistical "tricks" that could shorten the test on a clip
(something like 5 out of 6 correct); a rough sketch of the trial loop
follows below. The clips would keep looping, and the operator had
control over the ABX switch. The loops were short (10-30 seconds).
After each vote, X was randomized (only the computer knew its
identity). I forget whether A was always the unprocessed clip or
whether that was randomized, too.
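For the curious, here is a minimal sketch in Python of what a trial
loop like ours could look like. The stopping rule, trial count, and
the get_vote hook are illustrative assumptions on my part, not the
exact parameters our statistician chose.

import random
from math import comb

def tail_p(correct, trials):
    # One-sided binomial tail: the chance of guessing at least
    # `correct` of `trials` votes right when each guess is 50/50.
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def run_clip(get_vote, max_trials=15):
    """Run ABX votes on one clip, re-randomizing X after every vote.

    `get_vote(x)` stands in for looping the clip, letting the viewer
    work the A/B/X switch, and collecting the button press ('A' or
    'B'); `x` is the true identity of X, known only to the computer.
    """
    correct = 0
    for trial in range(1, max_trials + 1):
        x = random.choice("AB")          # only the computer knows X
        if get_vote(x) == x:
            correct += 1
        # Illustrative early-stop "trick": 5 of the first 6 correct
        # ends the clip early. (The real rule was tuned by the
        # statistician; note that tail_p(5, 6) is about 0.11, so a
        # rule this loose would only be a screening shortcut.)
        if trial == 6 and correct >= 5:
            break
    return correct, trial, tail_p(correct, trial)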
We had two systems, so we had two sets of processed clips. The clips
were played back from one or two uncompressed DOREMI Labs digital
servers (as I said, it was a while ago and the details are becoming a
blur).
We did show that one system's degradation was more visible than the
other's. We also showed that some people who said it "was easy" ended
up getting only about 50% (chance) right.
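To put a number on "50% (chance)": a viewer who is purely guessing
gets each vote right with probability 1/2, so over 15 votes the most
likely scores are 7 or 8 correct, and a score has to climb well above
that before guessing becomes an implausible explanation. A quick
check, using the same binomial tail as above (repeated here so it
runs on its own):

from math import comb

def tail_p(correct, trials):
    # Chance of guessing at least `correct` of `trials` (p = 0.5).
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

for correct in range(8, 16):
    print(f"{correct} of 15: p = {tail_p(correct, 15):.3f}")

# 8 of 15 gives p = 0.500 -- indistinguishable from guessing -- and
# 12 of 15 (p ~ 0.018) is the first score under the usual 0.05 line.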
We found that "golden eyes" didn't necessarily do better than clerks,
although most of the people who did see the differences were people
with significant video shading experience.
One comment I received was "I'd never give the viewers the ABX box,
it's too critical--it allows too much training".
Since I am not a statistician, I was (essentially) booed out of a
room when I tried to describe the technique to another group that was
considering some tests and could not answer their statistical
questions. My clients and I were confident that the statistical basis
of the tests was sound, and that if any testing would have caught
perceptual artifacts, this would have.
Our test sessions were long, so each subject was given breaks and
food during their session, and I think some were allowed to come back
another day to finish if they wished.
The test sessions were run one subject at a time, for as long as it
took. I think we saw some fatigue factors. We ran multiple tests on
multiple clips of the same system together, and I'm pretty sure we
randomized which system we started with but kept each system's clips
together so the training would not be lost.
I think I've told this story before, but I was presenting to an
archivists' workshop and someone asked about the quality of MP3s, so
I took out my Palm and played the MP3 version of a song I had in the
original-quality demo. It sounded awful. I said I thought there was
something wrong with the test.
I then took the song in MP3 and 44.1/16 WAV formats and inter-cut
them on line boundaries of the song, so one line would be the MP3 and
the next line would be the WAV. For later workshops, I've always
played this cut-up version rather than switching between the
different players. I had the organizer of the first workshop send an
email to all the participants explaining that the demo was really
about the poor analog circuitry in the Palm rather than the MP3
process.
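If anyone wants to build a similar inter-cut demo, here is a rough
sketch, assuming the Python pydub library (which needs ffmpeg for MP3
decoding) and hypothetical file names and cut points. In practice you
would find the line boundaries by ear, and you may need to trim the
small decoder delay that MP3 files typically pick up so the two
versions stay time-aligned.

from pydub import AudioSegment  # pip install pydub; ffmpeg for MP3

# Hypothetical inputs: the same song in both formats, time-aligned.
wav = AudioSegment.from_wav("song.wav")   # the 44.1/16 original
mp3 = AudioSegment.from_mp3("song.mp3")   # the 256 or 320 kb/s encode

# Hypothetical line boundaries, in milliseconds, found by ear.
cuts = [0, 4200, 8900, 13400, 18100, len(wav)]

# Alternate sources line by line: MP3, WAV, MP3, WAV, ...
demo = AudioSegment.empty()
for i, (start, end) in enumerate(zip(cuts, cuts[1:])):
    source = mp3 if i % 2 == 0 else wav
    demo += source[start:end]

demo.export("intercut_demo.wav", format="wav")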
While the MP3/WAV differences (I think it was a 256 or 320 kb/s MP3)
are slightly noticeable to some people, some of the time, on some
systems, they are very, very close. Oddly, I hear a lack of
distinction in the bass as well as a loss of detail in the highs in
the MP3 version, but it's very subtle.
Richard L. Hess email: [log in to unmask]
Aurora, Ontario, Canada (905) 713 6733 1-877-TAPE-FIX
Detailed contact information: http://www.richardhess.com/tape/contact.htm
Quality tape transfers -- even from hard-to-play tapes.