I actually compiled the whole shebang myself: the initial HDs and the current NAS server, so I know the origin of everything... But there’s always some fine-tuning re: duplicates as I continue to add new files to a DB that’s not yet in one easily searchable location.
I very much appreciate Richard’s suggestion and the elegant simplicity of looking at duplicate folders, not files. Once I’ve done that within the 30ish folders that represent the initial 30ish drives, I can ponder the issue of newer loose/duplicate _files_, if needed.
So thank you very much indeed! I am sure your notes will help other people.
D
PS: the tool I had used thus far was http://www.jthink.net/songkong/ - which I used for pulling metadata into empty fields and files. The thing is that odd East European stuff is more likely to be listed on Discogs than on MusicBrainz, and SongKong searches Discogs as one part of its song recognition.
Sent from my iPhone
> On Jan 14, 2021, at 11:51 AM, Lou Judson <[log in to unmask]> wrote:
>
> Great idea, Richard, unless they copied the folders independent of which drive they came from…
>
> Just a thought.
>
> Lou Judson
> Intuitive Audio
> 415-883-2689
>
>> On Jan 14, 2021, at 11:00 AM, Richard L. Hess <[log in to unmask]> wrote:
>>
>> Hello, David,
>>
>> As I understand it, you may have duplicate folders...i.e. copies of the same album that came from different original hard drives. Perhaps as one HDD was getting full, they added an album to the next HDD without deleting it from the first, or it was an album that needed referencing.
>>
>> If you can search on the folders rather than the files, that would avoid the problem of deleting the songs from the compilation albums.
>>
>> Perhaps the easiest way to do this is semi-manually.
>>
>> Create a spreadsheet with three columns: Dupe, HDD, and Folder and then label them thusly.
>>
>> Put "HDD-01" in the first cell of the HDD column, below the header row.
>>
>> Go into the HDD-01 folder on the NAS, highlight all the folders within it, and then use a utility like
>> https://www.extrabit.com/copyfilenames
>>
>> This will copy the NAMES of all the top-level folders (and any files) in the HDD-01 folder.
>>
>> Paste this into the Folder column of the spreadsheet.
>>
>> Copy the HDD ID down into every row you just pasted a folder name into.
>>
>> Repeat for all HDD folders on the NAS.
>>
>> You now have a spreadsheet listing every folder from every HDD, along with the HDD folder each one is in.
>>
>> Save it.
>>
>> Sort all the data on column C, "Folder", excluding the header row.
>>
>> Save it under a new name just in case something got messed up.
>>
>> Then try something like this formula in cell A3 (assuming cell A1 is the label and Row 2 holds the first folder).
>>
>> =IF(C2=C3,"DUPLICATE","")
>>
>> That will leave cell A3 blank if C2 and C3 are different. If they are the same, it will display the word "DUPLICATE" in cell A3.
>>
>> The duplicate pair is the row flagged with the word and the row directly above it.
>>
>> Now go down and search for duplicates and act on them accordingly.
>>
>> Once you've seen one, you can look at the contents and check whether they are the same (compare filenames, sizes, and dates/times).
>>
>> Or you could use a program like ViceVersa Pro to compare the two folders and copy files if necessary.
>> https://www.tgrmn.com/
>>
>> OPTION TWO
>>
>> ViceVersa Pro will also compare trees. You could make a copy on a new NAS by copying all the files from each HDD folder to one folder on the new NAS using VVP, and it will give you the latest version of each file when it encounters duplicates (avoiding the spreadsheet approach).
>>
>> I would also consider using VVP's checksum ability which I haven't tried.
>>
>> OPTION THREE
>>
>> I don't know how to automate this exactly, but you could use
>> http://fastsum.com/
>> to generate one file of MD5 checksums in each HDD folder. This will generate checksums for every file in the HDD folder. The checksum file will be in the root of the HDD folder (if you select the correct option). It is a text file.
>>
>> Load all those file checksums into a spreadsheet, parsing the text files to a checksum column and a filename column. Sort on the checksum column, make a duplicates column and proceed as above. Letting the spreadsheet find the duplicates (once you sort on the checksums) is much easier than comparing millions of checksums visually.
>>
>> This will identify actual duplicate files (barring the very, very small possibility of two different files producing the same checksum). You can then manually decide whether or not to delete -- again, the deletion is not automatic, but the identification of what to consider deleting is.
>>
>> OPTION FOUR
>>
>> I haven't tried this, but TreeSizePro is designed for managing large disks and has a de-duping function. I haven't studied that function, so I'm not certain I'm comfortable with it, but it might be useful for you. I do use TreeSizePro to monitor my hard drives and NASes.
>> https://www.jam-software.com/treesize
>>
>> Showing the size of about 13 TB and about 1M files took about 15 minutes over Gigabit Ethernet, on a 6-year-old QNAP NAS.
>>
>> Anyway, just some thoughts -- I hope you find some of this useful.
>>
>> I don't envy you your project.
>>
>> Cheers,
>>
>> Richard
>>
>>
>>
>>> On 2021-01-13 5:17 p.m., David MacFadyen wrote:
>>> I have a very large and multilingual collection of East European music (numerous alphabets, diacritics, etc). My problem concerns the finding and deletion of duplicates. The music is archived as follows.
>>> The overwhelming number of recordings are EPs, LPs, etc—not individual “singles” or lone MP3s lacking artwork. This means that almost every release is in a folder, containing both the MP3s/FLACs/WAVs, whatever - and the artwork as a JPG or PNG file. Those multiple folders were once collected on individual hard drives, which in turn were then transferred to a single multi-rack NAS server (offline), which is where they currently sit.
>>> So the entire collection (2M+ files) consists of maybe 30 folders, each representing an older external HD, and each of those HD folders contains very many sub folders (each = an LP, EP, etc,...).
>>> Perhaps not surprisingly, we have two problems. (1) DUPLICATE files and LPs, which used to be on separate external HDs but are now in one location. (2) The SPEED of searching for duplicates across multiple racks (on an aging NAS server) is painful. Time for a major backup, I know.
>>> As I decide whether to upload everything to a cloud-based DB or invest in a new server, my question to the forum is this:
>>> **Which software and tactic do you trust most to find/delete duplicate digital tracks? I’ve had partial luck with some. (1) Gemini2 is interesting but freezes when dealing with even mid-size tasks. The online support is awful, too. (2) SongKong is good and well supported, if not old-fashioned, but I worry greatly about any program, for example, confusing LPs and compilations, say—and merely punching permanent holes in compilations, deleting what it considers to be “copies” of LP tracks.
>>> Given that my collection is so big, I’d love to rely on some form of automation, if possible.
>>> Any suggestions, please?
>>> Thanks! David
>>> https://www.davidmacfadyen.com/
>>
>> --
>> Richard L. Hess email: [log in to unmask]
>> Aurora, Ontario, Canada 647 479 2800
>> http://www.richardhess.com/tape/contact.htm
>> Track Format - Speed - Equalization - Azimuth - Noise Reduction
>> Quality tape transfers -- even from hard-to-play tapes.
>>