One method I happen to be testing just now is quickhash-gui.
I can use it to select a drive or folder, then scan and hash every file. When it has finished, you can right-click on the list and choose "Show only duplicates", then right-click again to export that selection to a CSV file. The export shows name, path, file size, and file hash.
It can take a while to scan but it seems very good so far from my testing.
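For anyone who would rather script the same idea, here is a minimal Python sketch of the hash-everything-then-list-duplicates approach. It is not how quickhash-gui works internally; the function names and the choice of SHA-256 are my own assumptions, and the CSV columns simply mirror the ones mentioned above.

```python
import csv
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash; keep only groups of 2 or more."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[hash_file(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def export_csv(duplicates, csv_path):
    """Write name, path, size and hash for every file in a duplicate group."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "path", "size_bytes", "sha256"])
        for file_hash, paths in duplicates.items():
            for path in paths:
                writer.writerow(
                    [path.name, str(path), path.stat().st_size, file_hash]
                )


# Example use (the folder name is hypothetical):
# export_csv(find_duplicates("/mnt/nas/HDD-01"), "duplicates.csv")
```

Nothing is deleted here; the CSV is only a list to review by hand.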
School of Scottish Studies,
University of Edinburgh,
29 George Square,
0131 651 5001
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
From: Association for Recorded Sound Discussion List <[log in to unmask]> On Behalf Of Richard L. Hess
Sent: 14 January 2021 19:01
To: [log in to unmask]
Subject: Re: [ARSCLIST] Finding and Deleting Duplicate Digital Tracks in Large Collections
As I understand it, you may have duplicate folders...i.e. copies of the same album that came from different original hard drives. Perhaps as one HDD was getting full, they added an album to the next HDD without deleting it from the first, or it was an album that needed referencing.
If you can search on the folders rather than the files, that would avoid the problem of deleting the songs from the compilation albums.
Perhaps the easiest way to do this is semi-manually.
Create a spreadsheet with three columns: Dupe, HDD, and Folder and then label them thusly.
Put "HDD-01" in the first cell below the header row.
Go into the HDD-01 folder on the NAS, highlight all folders within that folder, and then use a utility like https://www.extrabit.com/copyfilenames
This will copy the NAMES of all the top level folders (and any files) in the HDD-01 folder.
Paste this into the Folder column of the spreadsheet.
Copy down the HDD ID for every row that you just put a filename in.
Repeat for all HDD folders on the NAS.
You now have a spreadsheet listing all the folders from all the HDDs, and for each folder you know which HDD folder it's in.
Sort all the data based on column C, "Folder" ignoring the header row.
Save it under a new name just in case something got messed up.
Then try something like this formula in cell A3 (assuming cell A1 is the label and Row 2 is the first folder).
That will place a null in cell A3 if C2 and C3 are different. If they are the same, it will display the word "DUPLICATE" in cell A3.
The duplicate will be the line above the word and the one with the word.
Now go down and search for duplicates and act on them accordingly.
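The spreadsheet steps above could also be sketched in Python, for anyone who prefers a script. This is only an illustration: it assumes the NAS root contains one folder per old HDD (HDD-01, HDD-02, ...), the function names are hypothetical, and comparing names case-insensitively is my own choice. It mirrors the behaviour described above: a blank when a folder name differs from the row above, and the word "DUPLICATE" when it matches.

```python
from pathlib import Path


def list_album_folders(nas_root):
    """Return (hdd_name, entry_name) for each top-level entry in every HDD folder."""
    rows = []
    for hdd in sorted(Path(nas_root).iterdir()):
        if hdd.is_dir():
            for entry in sorted(hdd.iterdir()):
                rows.append((hdd.name, entry.name))
    return rows


def flag_duplicates(rows):
    """Sort by folder name and flag rows whose name equals the previous row's.

    Returns (dupe, hdd, folder) tuples, matching the three spreadsheet columns.
    """
    ordered = sorted(rows, key=lambda row: (row[1].casefold(), row[0]))
    flagged = []
    previous = None
    for hdd, folder in ordered:
        key = folder.casefold()  # assumption: ignore upper/lower case differences
        flagged.append(("DUPLICATE" if key == previous else "", hdd, folder))
        previous = key
    return flagged


# Example use (the path is hypothetical):
# for dupe, hdd, folder in flag_duplicates(list_album_folders("/mnt/nas")):
#     print(dupe, hdd, folder, sep="\t")
```

As with the spreadsheet, a flagged row and the row above it are the pair to inspect; matching names alone do not prove the contents are the same.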
Once you've spotted one, you can look at the contents of the two folders and see if they are the same (check filenames, sizes, and dates/times).
Or you could use a program like ViceVersa Pro to compare the two folders and copy files if necessary.
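If you want to script the filename/size/date check for a suspected pair, something like the following Python sketch might do. It is not what ViceVersa Pro does; the function names are hypothetical, and comparing modification times to the whole second is an assumption that may need loosening for copies made across different file systems.

```python
from pathlib import Path


def folder_listing(folder):
    """Map each file's relative path to (size in bytes, mtime in whole seconds)."""
    folder = Path(folder)
    listing = {}
    for path in folder.rglob("*"):
        if path.is_file():
            info = path.stat()
            name = path.relative_to(folder).as_posix()
            listing[name] = (info.st_size, int(info.st_mtime))
    return listing


def compare_folders(left, right):
    """Report which files are unique to each side, which differ, and which match."""
    a, b = folder_listing(left), folder_listing(right)
    shared = a.keys() & b.keys()
    return {
        "only_in_left": sorted(a.keys() - b.keys()),
        "only_in_right": sorted(b.keys() - a.keys()),
        "differ": sorted(name for name in shared if a[name] != b[name]),
        "same": sorted(name for name in shared if a[name] == b[name]),
    }
```

Two folders are only candidates for de-duping when the first three lists come back empty; even then, sizes and dates matching is a strong hint rather than proof.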
ViceVersa Pro will also compare trees. You could make a copy on a new NAS by copying all the files from each HDD folder to one folder on the new NAS using VVP, and it will give you the latest version of each file on the HDD when it encounters duplicates (avoiding the spreadsheet approach).
I would also consider using VVP's checksum ability which I haven't tried.
I don't know how to automate this exactly, but you could use http://fastsum.com/ to generate one file of MD5 checksums in each HDD folder. This will generate checksums for every file in the HDD folder. The checksum file will be in the root of the HDD folder (if you select the correct option). It is a text file.
Load all those file checksums into a spreadsheet, parsing the text files to a checksum column and a filename column. Sort on the checksum column, make a duplicates column and proceed as above. Letting the spreadsheet find the duplicates (once you sort on the checksums) is much easier than comparing millions of checksums visually.
This will provide information on actual duplicate files, with the very, very small possibility of two different files creating the same checksum. You can then manually decide whether or not to delete -- again, the deletion is not automatic, but telling you what to consider deleting is.
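The load, sort, and flag steps for the checksum lists could also be done with a short script instead of a spreadsheet. The Python sketch below is only an illustration, and it assumes each checksum file uses md5sum-style lines (a 32-character hex checksum, whitespace, an optional "*", then the filename). I have not checked that this is the layout FastSum writes, so the pattern may need adjusting. The function names are hypothetical.

```python
import filecmp
import re
from collections import defaultdict

# Assumed line layout: "<32 hex digits> [*]<filename>" (md5sum style).
LINE = re.compile(r"^([0-9a-fA-F]{32})\s+\*?(.+)$")


def parse_checksums(text, hdd):
    """Yield (checksum, hdd, filename) for each md5sum-style line; skip the rest."""
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            yield match.group(1).lower(), hdd, match.group(2)


def group_by_checksum(entries):
    """Return only the checksums that appear more than once, with their files."""
    groups = defaultdict(list)
    for checksum, hdd, filename in entries:
        groups[checksum].append((hdd, filename))
    return {c: files for c, files in groups.items() if len(files) > 1}


def confirm_identical(path_a, path_b):
    """Byte-for-byte comparison, to rule out a checksum collision before deleting."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Grouping by checksum replaces the sort and duplicates column, and `confirm_identical` covers the small chance of two different files sharing a checksum. The decision to delete still stays manual.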
TreeSizePro is designed for managing large disks and has a de-duping function, but I haven't tried it and am not certain I feel comfortable with it, mainly because I haven't studied it.
It might be useful for you. I do use this to monitor my hard drives and NASes.
To show the size of about 13 TB and about 1M files took about 15 minutes over Gigabit Ethernet. It's a 6-year-old QNAP NAS.
Anyway, just some thoughts -- I hope you find some of this useful.
I don't envy you your project.
On 2021-01-13 5:17 p.m., David MacFadyen wrote:
> I have a very large and multilingual collection of East European music (numerous alphabets, diacritics, etc). My problem concerns the finding and deletion of duplicates. The music is archived as follows.
> The overwhelming number of recordings are EPs, LPs, etc—not individual “singles” or lone MP3s lacking artwork. This means that almost every release is in a folder, containing both the MP3s/FLACs/WAVs, whatever - and the artwork as a JPG or PNG file. Those multiple folders were once collected on individual hard drives, which in turn were then transferred to a single multi-rack NAS server (offline), which is where they currently sit.
> So the entire collection (2M+ files) consists of maybe 30 folders, each representing an older external HD, and each of those HD folders contains very many sub folders (each = an LP, EP, etc,...).
> Perhaps not surprisingly, we have two problems. (1) DUPLICATE files and LPs, which used to be on separate external HDs but are now in one location. (2) The SPEED searching for duplicates across multiple racks (on an aging NAS server) is painful. Time for a major backup, I know.
> As I decide whether to upload everything to a cloud-based DB or invest in a new server, my question to the forum is this:
> Which software and tactic do you trust most to find/delete duplicate digital tracks? I’ve had partial luck with some. (1) Gemini2 is interesting but freezes when dealing with even mid-size tasks. The online support is awful, too. (2) SongKong is good and well supported, if somewhat old-fashioned, but I worry greatly about any program confusing LPs and compilations, say, and merely punching permanent holes in compilations by deleting what it considers to be “copies” of LP tracks.
> Given that my collection is so big, I’d love to rely on some form of automation, if possible.
> Any suggestions, please?
> Thanks! David
Richard L. Hess email: [log in to unmask]
Aurora, Ontario, Canada 647 479 2800
Track Format - Speed - Equalization - Azimuth - Noise Reduction
Quality tape transfers -- even from hard-to-play tapes.