Hi Matthew,

> On May 19, 2015, at 11:52 AM, Matthew Snyder wrote:
> Hi Dave -
> Thank you for providing a more complete picture of FLAC's capabilities,
> especially regarding checksums. However, I still must disagree that this
> gives FLAC an advantage over WAV or other uncompressed formats when it
> comes to preservation.
> It is not a big deal to do batch processing of checksums stored separately
> from the source files. This can be done by
> various applications, such as the open-source Sleuth Kit. Computing and
> comparing checksums for separately-located files is simply
> a fact of life in digital repositories.

True. I think this point is equal.

> Furthermore, what happens if a FLAC
> file is corrupted and the embedded checksum is
> corrupted as well?

Yes, despite the odds, the embedded checksum within a FLAC header could become corrupted (the same is true of an external checksum, whether it lives in a sidecar file or a database). Consider a valid FLAC file in which one bit is flipped within the MD5 signature in the header. A validator would show that while the MD5 is invalid, the CRC of every audio frame is still valid (this would be a rare occurrence). That leads to one of three conclusions: the FLAC was improperly written (a broken muxer/encoder), the MD5 was corrupted while the audio data and CRCs were not, or (far, far less likely) the MD5, CRCs, and audio data were all corrupted in a combination where the CRCs happen to still validate through a minuscule coincidence.
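
To make the cases above concrete, here is a minimal sketch in Python. It is a toy model and not the FLAC format: the dictionary layout and the function names are hypothetical, zlib's CRC-32 stands in for FLAC's per-frame CRC, and the MD5 is taken over the raw frame bytes.

```python
import hashlib
import zlib

def build_stream(frames):
    """Toy container: one MD5 over all audio plus one CRC per frame."""
    return {
        "md5": hashlib.md5(b"".join(frames)).digest(),
        "frames": [(f, zlib.crc32(f)) for f in frames],
    }

def validate(stream):
    """Return (md5_ok, indexes of frames whose CRC fails)."""
    audio = b"".join(f for f, _ in stream["frames"])
    md5_ok = hashlib.md5(audio).digest() == stream["md5"]
    bad = [i for i, (f, crc) in enumerate(stream["frames"])
           if zlib.crc32(f) != crc]
    return md5_ok, bad

def flip_bit(data, byte_index=0):
    """Return a copy of data with the lowest bit of one byte flipped."""
    out = bytearray(data)
    out[byte_index] ^= 0x01
    return bytes(out)

frames = [bytes([n]) * 16 for n in range(4)]

# Case 1: one bit flipped in the stored MD5 only.
s = build_stream(frames)
s["md5"] = flip_bit(s["md5"])
print(validate(s))  # MD5 fails, no frame CRC fails

# Case 2: one bit flipped in the audio of frame 2.
s = build_stream(frames)
payload, crc = s["frames"][2]
s["frames"][2] = (flip_bit(payload), crc)
print(validate(s))  # MD5 fails and frame 2 is singled out
```

In the first case the frame CRCs point to a damaged MD5 rather than damaged audio; in the second they show where on the timeline the damage sits.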

> You also concede that the original file as received by
> the archives should be preserved. Why go further and create a FLAC file
> that will itself take up hard drive space, as the Australians seem to have
> decided to do?

It's true. If the acquisition is WAV, there is less motivation to use FLAC, provided the other preservation services are recreated through other means. The majority of my work experience with FLAC has been in digitization work where there was no WAV file.

> If a FLAC file turns out to be corrupted, the
> first line of defense would, it seems to me, be to use a backup to find an
> earlier version of the file that produces a correct
> checksum, which would be much simpler than attempting to fix the file, even
> when you know where the flipped bit is.

Certainly, having a backup to restore from is ideal, but there are many points in the life of the audio data where one is not (yet) available. I'll give an analogy from video digitization. A preservationist digitized several videotapes to temporary QuickTime files on a network (call each temp file "file A"). The QuickTime files were then opened in an editor to trim parts off the beginning and end, and saved out as new files (file B) to serve as master preservation copies. Ideally file B should contain only frames that exist in file A. The whole-file checksums of file A and file B will naturally differ, yet there is still a need to verify the integrity of the frames shared between them. In this case external framemd5s (checksums per frame), rather than FLAC audio frame CRCs, were used to fill this role. Eventually a change in the network caused an error of a few dozen bytes every 40GB or so of video data. The result would have been very hard to catch in a typical QC skim of the video, but it is very clear with frame-based fixity. Even if there were backups of file A and backups of file B, there wouldn't be a way to address this scenario with external whole-file checksums (the corruption here happened as data was transmitted from a temp file to a new master file).
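
The frame-by-frame comparison can be sketched as follows. This is a simplified illustration in Python, not the tooling used in that project: it assumes fixed-size frames cut straight from the bytes, and the names `frame_md5s`, `damaged_frames`, and `first_kept` are hypothetical. Real per-frame checksums would come from a tool that decodes the actual frames.

```python
import hashlib

def frame_md5s(data, frame_size):
    """One MD5 per fixed-size frame (a stand-in for real decoded frames)."""
    return [hashlib.md5(data[i:i + frame_size]).hexdigest()
            for i in range(0, len(data), frame_size)]

def damaged_frames(a_sums, b_sums, first_kept):
    """Indexes (within file B) of frames that differ from file A.

    first_kept is the index in file A of the first frame kept in file B.
    """
    if first_kept + len(b_sums) > len(a_sums):
        raise ValueError("file B has frames that file A does not")
    return [i for i, s in enumerate(b_sums) if a_sums[first_kept + i] != s]

FRAME = 8
file_a = b"".join(bytes([n]) * FRAME for n in range(10))

# File B: trim two frames off the head and two off the tail of file A.
file_b = bytearray(file_a[2 * FRAME:8 * FRAME])
# Simulate a transfer error inside frame 3 of file B.
file_b[3 * FRAME + 1] ^= 0xFF

a_sums = frame_md5s(file_a, FRAME)
b_sums = frame_md5s(bytes(file_b), FRAME)
print(damaged_frames(a_sums, b_sums, first_kept=2))
```

A whole-file checksum of file B differs from that of file A whether or not the transfer went wrong, so only the per-frame comparison separates the intended trim from the unintended damage.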

> If the original file as received at the archives is indeed in FLAC, that
> would be an interesting question. I could ask our
> digital archivist about how they would handle it.

IMHO, for an archive to receive a FLAC rather than a WAV is an advantage. If receiving a WAV, the archive could ask the supplier for a checksum (which may or may not exist), but FLAC already mandates the inclusion of two types of checksums by design. If the archive receives a WAV without a checksum, it is very difficult to reliably show whether it is authentic or not, whereas any damage to a FLAC is clearly known, and the extent of the damage (in relation to the timeline) is clear. Ideally the application of fixity should occur as early in the life of the audio data as possible. FLAC mandates it at creation, whereas with WAV it can only come later and is optional and external.

Best Regards,
Dave Rice

> -- 
> Matt Snyder
> Archivist
> Special Collections Unit
> The New York Public Library