Hello, David,
As I understand it, you may have duplicate folders...i.e. copies of the
same album that came from different original hard drives. Perhaps as one
HDD was getting full, they added an album to the next HDD without
deleting it from the first, or it was an album that needed referencing.
If you can search on the folders rather than the files, that would avoid
the problem of deleting the songs from the compilation albums.
Perhaps the easiest way to do this is semi-manually.
Create a spreadsheet with three columns and label them Dupe, HDD, and
Folder.
Put "HDD-01" in the HDD column, in the first cell below the header row.
Go into the HDD-01 folder on the NAS, highlight all folders within that
folder, and then copy their names using a utility like
https://www.extrabit.com/copyfilenames
This will copy the NAMES of all the top level folders (and any files) in
the HDD-01 folder.
Paste this into the Folder column of the spreadsheet.
Copy down the HDD ID for every row that you just put a filename in.
Repeat for all HDD folders on the NAS.
You now have a spreadsheet listing all the folders from all the HDDs,
each identified by the HDD folder it is in.
Save it.
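If the copy-and-paste routine gets tedious across 30 HDD folders, the
same inventory could be built with a short script. This is only a sketch
under my own assumptions: the function names and the CSV layout are
hypothetical, it assumes each HDD folder sits directly under one NAS
root, and it lists folders only (the utility above also copies file
names). The file is written with a byte-order mark so that spreadsheet
programs are more likely to read the non-Latin folder names correctly.

```python
import csv
from pathlib import Path


def list_album_folders(nas_root):
    """Return (hdd, folder) pairs for each top-level folder in each HDD folder."""
    rows = []
    for hdd in sorted(Path(nas_root).iterdir()):
        if not hdd.is_dir():
            continue  # skip loose files sitting next to the HDD folders
        for entry in sorted(hdd.iterdir()):
            if entry.is_dir():
                rows.append((hdd.name, entry.name))
    return rows


def write_inventory(rows, csv_path):
    """Write the rows to a CSV with the Dupe, HDD, Folder columns."""
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["Dupe", "HDD", "Folder"])
        for hdd, folder in rows:
            writer.writerow(["", hdd, folder])  # Dupe column left blank
```

Something like write_inventory(list_album_folders(nas_root), "inventory.csv")
would then give you the file to open and sort in the spreadsheet.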
Sort all the data based on column C, "Folder" ignoring the header row.
Save it under a new name just in case something got messed up.
Then try something like this formula in cell A3 (assuming cell A1 is the
label and row 2 is the first folder):
=IF(C2=C3,"DUPLICATE","")
That will leave cell A3 blank if C2 and C3 are different. If they are
the same, it will display the word "DUPLICATE" in cell A3. Copy the
formula down column A for every row of data.
Each duplicate pair is the row with the word and the row above it.
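The same sort-and-compare idea can be sketched outside the spreadsheet.
This is an illustration only: the function name is hypothetical, and it
takes (HDD, folder) pairs such as the rows of the spreadsheet. Instead
of comparing neighbouring rows, it groups rows by folder name, which
gives the same result without sorting first. It ignores upper/lower case
when matching names, which is my assumption about what you would want.

```python
from collections import defaultdict


def find_duplicate_folders(rows):
    """Group (hdd, folder) rows by folder name; keep names seen more than once."""
    by_name = defaultdict(list)
    for hdd, folder in rows:
        # casefold() makes the match case-insensitive across alphabets
        by_name[folder.casefold()].append((hdd, folder))
    return {name: hits for name, hits in by_name.items() if len(hits) > 1}


rows = [
    ("HDD-01", "Album A"),
    ("HDD-02", "album a"),
    ("HDD-02", "Album B"),
]
for name, hits in find_duplicate_folders(rows).items():
    print(name, "->", [hdd for hdd, _ in hits])
```

Each entry in the result is a candidate only. The folders still need to
be opened and compared before anything is deleted.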
Now go down and search for duplicates and act on them accordingly.
Once you've found one, you can then look at the contents and see if they
are the same (check filenames, sizes, and dates/times).
Or you could use a program like ViceVersa Pro to compare the two folders
and copy files if necessary.
https://www.tgrmn.com/
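For the check of filenames and sizes on a suspected duplicate pair, a
small script is another possibility. This is a rough sketch, not how
ViceVersa Pro works: the function names are hypothetical, it compares
relative filenames and byte sizes only, and it does not look at
dates/times or file contents.

```python
from pathlib import Path


def folder_listing(folder):
    """Map each file's path (relative to the folder) to its size in bytes."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_folders(folder_a, folder_b):
    """Report files present in only one folder, or present in both with different sizes."""
    a, b = folder_listing(folder_a), folder_listing(folder_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "size_differs": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }
```

If all three lists come back empty, the two folders hold the same
filenames with the same sizes, which is a good hint (not proof) that
one of them is a true copy.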
OPTION TWO
ViceVersa Pro will also compare trees. You could make a copy on a new
NAS by copying all the files from each HDD folder to one folder on the
new NAS using VVP, and it will give you the latest version of each file
when it encounters duplicates (avoiding the spreadsheet approach).
I would also consider using VVP's checksum ability, which I haven't tried.
OPTION THREE
I don't know how to automate this exactly, but you could use
http://fastsum.com/
to generate one file of MD5 checksums in each HDD folder. This will
generate checksums for every file in the HDD folder. The checksum file
will be in the root of the HDD folder (if you select the correct
option). It is a text file.
Load all those file checksums into a spreadsheet, parsing the text files
to a checksum column and a filename column. Sort on the checksum column,
make a duplicates column and proceed as above. Letting the spreadsheet
find the duplicates (once you sort on the checksums) is much easier than
comparing millions of checksums visually.
This will identify actual duplicate files, apart from the very, very
small possibility of two different files producing the same checksum.
You can then manually decide whether or not to delete. Again, the
deletion is not automatic, but telling you what to consider deleting is.
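As for automating the checksum approach, one possibility is a script
that does the hashing and the grouping in one pass. This sketch uses
Python's standard hashlib in place of FastSum, so it does not read
FastSum's checksum files. The function names are hypothetical, and it
only reports candidates; it deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file, reading in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(root):
    """Return {checksum: [paths]} for checksums shared by two or more files."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            groups[md5_of_file(p)].append(str(p))
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Hashing has to read every byte of every file, so on a slow NAS it may
make sense to run this one HDD folder at a time. Also note that a track
appearing on both an LP and a compilation will show up as a match here,
so the decision about what to delete still has to be made by hand.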
OPTION FOUR
I haven't tried this, but TreeSizePro is designed for managing large
disks and does have a de-duping function. I'm not certain I feel
comfortable with it, mainly because I haven't studied it.
It might be useful for you. I do use this to monitor my hard drives and
NASes.
https://www.jam-software.com/treesize
Showing the size of about 13 TB and about 1M files took about 15 minutes
over Gigabit Ethernet. It's a 6-year-old QNAP NAS.
Anyway, just some thoughts -- I hope you find some of this useful.
I don't envy you your project.
Cheers,
Richard
On 2021-01-13 5:17 p.m., David MacFadyen wrote:
> I have a very large and multilingual collection of East European music (numerous alphabets, diacritics, etc). My problem concerns the finding and deletion of duplicates. The music is archived as follows.
>
> The overwhelming number of recordings are EPs, LPs, etc—not individual “singles” or lone MP3s lacking artwork. This means that almost every release is in a folder, containing both the MP3s/FLACs/WAVs, whatever - and the artwork as a JPG or PNG file. Those multiple folders were once collected on individual hard drives, which in turn were then transferred to a single multi-rack NAS server (offline), which is where they currently sit.
>
> So the entire collection (2M+ files) consists of maybe 30 folders, each representing an older external HD, and each of those HD folders contains very many sub folders (each = an LP, EP, etc,...).
>
> Perhaps not surprisingly, we have two problems. (1) DUPLICATE files and LPs, which used to be on separate external HDs but are now in one location. (2) The SPEED searching for duplicates across multiple racks (on an aging NAS sever) is painful. Time for a major backup, I know.
>
> As I decide whether to upload everything to a cloud-based DB or invest in a new server, my question to the forum is this:
>
> **Which software and tactic do you trust most to find/delete duplicate digital tracks? I’ve had partial luck with some. (1) Gemini2 is interesting but freezes when dealing with even mid-size tasks. The online support is awful, too. (2) SongKong is good and well supported, if not old-fashioned, but I worry greatly about any program, for example, confusing LPs and compilations, say—and merely punching permanent holes in compilations, deleting what it considers to be “copies” of LP tracks.
>
> Given that my collection is so big, I’d love to rely on some form of automation, if possible.
>
> Any suggestions, please?
> Thanks! David
> https://www.davidmacfadyen.com/
>
--
Richard L. Hess email: [log in to unmask]
Aurora, Ontario, Canada 647 479 2800
http://www.richardhess.com/tape/contact.htm
Track Format - Speed - Equalization - Azimuth - Noise Reduction
Quality tape transfers -- even from hard-to-play tapes.