ARSCLIST@LISTSERV.LOC.GOV


ARSCLIST  January 2021

Subject: Re: Finding and Deleting Duplicate Digital Tracks in Large Collections
From: ROBINSON Stuart <[log in to unmask]>
Reply-To: Association for Recorded Sound Discussion List <[log in to unmask]>
Date: Fri, 15 Jan 2021 12:09:20 +0000
Content-Type: text/plain

One method I am coincidentally testing just now is using quickhash-gui

 https://www.quickhash-gui.org/download/quickhash-v1-5-6-for-windows/

 I can use this to select a drive or folder, then scan and hash every file. When it has finished, you can right-click on the list to "Show only duplicates", then right-click again to export that selection to a CSV file. The export shows name, path, file size, and file hash.

 It can take a while to scan but it seems very good so far from my testing.
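
For anyone who would rather script the same idea, here is a minimal sketch of a hash-and-export pass. It is not how quickhash-gui works internally; the function names and the choice of SHA-256 are assumptions made for illustration only.

```python
import csv
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash; keep groups of two or more."""
    by_hash = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_hash[hash_file(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}


def export_duplicates(duplicates, csv_path):
    """Write name, path, size and hash for every duplicate file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "path", "size", "hash"])
        for file_hash, paths in sorted(duplicates.items()):
            for path in paths:
                writer.writerow([os.path.basename(path), path,
                                 os.path.getsize(path), file_hash])
```

Calling export_duplicates(find_duplicates(folder), "duplicates.csv") on a folder of your choosing produces a report only. Nothing is deleted, so the decision about what to remove stays manual.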

 Best, stuart

-- 
Thanks,
        Stuart Robinson,
        AV Technician,
        Sound Lab,
        School of Scottish Studies,
        University of Edinburgh,
        29 George Square,
        Edinburgh,
        EH8 9LD

        0131 651 5001


The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

-----Original Message-----
From: Association for Recorded Sound Discussion List <[log in to unmask]> On Behalf Of Richard L. Hess
Sent: 14 January 2021 19:01
To: [log in to unmask]
Subject: Re: [ARSCLIST] Finding and Deleting Duplicate Digital Tracks in Large Collections

Hello, David,

As I understand it, you may have duplicate folders...i.e. copies of the same album that came from different original hard drives. Perhaps as one HDD was getting full, they added an album to the next HDD without deleting it from the first, or it was an album that needed referencing.

If you can search on the folders rather than the files, that would avoid the problem of deleting the songs from the compilation albums.

Perhaps the easiest way to do this is semi-manually.

Create a spreadsheet with three columns, labelled Dupe, HDD, and Folder.

Put "HDD-01" in the first cell below the header row of the HDD column.

Go into the HDD-01 folder on the NAS, highlight all folders within that folder, and then copy their names using a utility like https://www.extrabit.com/copyfilenames

This will copy the NAMES of all the top level folders (and any files) in the HDD-01 folder.

Paste this into the Folder column of the spreadsheet.

Copy down the HDD ID for every row that you just put a filename in.

Repeat for all HDD folders on the NAS.

You now have a spreadsheet listing every folder from every HDD, along with the HDD folder each one is in.

Save it.

Sort all the data based on column C, "Folder" ignoring the header row.

Save it under a new name just in case something got messed up.

Then try something like this formula in cell A3 (assuming cell A1 is the label and Row 2 is the first folder):

=IF(C2=C3,"DUPLICATE","")

That will leave cell A3 empty if C2 and C3 are different. If they are the same, it will display the word "DUPLICATE" in cell A3. Copy the formula down the rest of column A.

The duplicate pair is the row containing the word and the row above it.

Now go down and search for duplicates and act on them accordingly.
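
The spreadsheet steps above can also be sketched as a short script. This is only an illustration under assumptions: the function name is hypothetical, the layout is taken to be one NAS root holding the HDD folders with album folders directly inside each, and names are compared exactly after Unicode normalisation (which may matter for a collection with many diacritics).

```python
import os
import unicodedata
from collections import defaultdict


def duplicate_folders(nas_root):
    """Map each album folder name to the HDD folders that contain it.

    Only names found in more than one HDD folder are returned.
    """
    locations = defaultdict(list)
    for hdd in sorted(os.listdir(nas_root)):
        hdd_path = os.path.join(nas_root, hdd)
        if not os.path.isdir(hdd_path):
            continue
        for entry in sorted(os.listdir(hdd_path)):
            if os.path.isdir(os.path.join(hdd_path, entry)):
                # Normalise so that composed and decomposed accents compare equal.
                locations[unicodedata.normalize("NFC", entry)].append(hdd)
    return {name: hdds for name, hdds in locations.items() if len(hdds) > 1}
```

The result plays the role of the "DUPLICATE" column: each key is a folder name that appears on more than one HDD, and the value lists where. A matching name is only a candidate, so the contents still need checking.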

Once you've found one, look at the contents of both folders and see whether they are the same (check filenames, sizes, and dates/times).

Or you could use a program like ViceVersa Pro to compare the two folders and copy files if necessary.
https://www.tgrmn.com/
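
The manual check on filenames, sizes, and dates/times could be scripted roughly as follows. This is a sketch with hypothetical function names, not a description of what ViceVersa Pro does. Matching size and time do not prove identical content; a checksum comparison is the stricter test.

```python
import os


def folder_listing(folder):
    """Return {relative path: (size in bytes, modification time)} for a folder tree."""
    listing = {}
    for current, _subfolders, names in os.walk(folder):
        for name in names:
            path = os.path.join(current, name)
            info = os.stat(path)
            # Whole seconds only, since file systems store times at different precision.
            listing[os.path.relpath(path, folder)] = (info.st_size, int(info.st_mtime))
    return listing


def compare_folders(left, right):
    """Report files found on one side only and files whose size or time differ."""
    a, b = folder_listing(left), folder_listing(right)
    return {
        "only_left": sorted(set(a) - set(b)),
        "only_right": sorted(set(b) - set(a)),
        "different": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }
```

If all three lists come back empty, the two folders agree on names, sizes, and times.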

OPTION TWO

ViceVersa Pro will also compare trees. You could make a copy on a new NAS by copying all the files from each HDD folder to one folder on the new NAS using VVP, and it will give you the latest version of each file on the HDD when it encounters duplicates (avoiding the spreadsheet approach).

I would also consider using VVP's checksum ability which I haven't tried.
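
To make the "latest version wins" idea concrete, here is a dry-run sketch that only plans such a merge. It is an assumption-laden illustration, not VVP's actual behaviour: the function name is hypothetical, "latest" is taken to mean the newest modification time, and two files count as the same file when their paths relative to the HDD folder match.

```python
import os


def plan_merge(hdd_folders):
    """Pick, for each relative path, the most recently modified copy.

    Returns {relative path: source path}. Nothing is copied or deleted.
    """
    newest = {}
    for hdd in hdd_folders:
        for current, _subfolders, names in os.walk(hdd):
            for name in names:
                source = os.path.join(current, name)
                relative = os.path.relpath(source, hdd)
                mtime = os.stat(source).st_mtime
                # Keep this copy only if it is newer than the one seen so far.
                if relative not in newest or mtime > newest[relative][0]:
                    newest[relative] = (mtime, source)
    return {relative: source for relative, (_mtime, source) in newest.items()}
```

Reviewing the plan before copying anything keeps the process non-destructive.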

OPTION THREE

I don't know how to automate this exactly, but you could use http://fastsum.com/ to generate one file of MD5 checksums in each HDD folder. This will generate checksums for every file in the HDD folder. The checksum file will be in the root of the HDD folder (if you select the correct option). It is a text file.

Load all those file checksums into a spreadsheet, parsing the text files to a checksum column and a filename column. Sort on the checksum column, make a duplicates column and proceed as above. Letting the spreadsheet find the duplicates (once you sort on the checksums) is much easier than comparing millions of checksums visually.

This will identify actual duplicate files, barring the very, very small possibility of two different files producing the same checksum. You can then manually decide whether or not to delete -- again, the deletion is not automatic, but telling you what to consider deleting is.
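
The parse-and-sort step could also be scripted rather than done in a spreadsheet. The sketch below assumes each checksum file uses the common md5sum-style layout (32 hex digits, whitespace, an optional "*", then the path). Whether FastSum writes exactly that layout has not been checked here, so adjust the pattern to the real files. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line layout: 32 hex digits, whitespace, optional "*", then the file path.
LINE = re.compile(r"^([0-9a-fA-F]{32})\s+\*?(.+)$")


def parse_checksums(text, hdd):
    """Yield (checksum, hdd, path) for each line that matches the assumed layout."""
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            yield match.group(1).lower(), hdd, match.group(2)


def group_duplicates(records):
    """Group records by checksum and keep checksums seen more than once."""
    groups = defaultdict(list)
    for checksum, hdd, path in records:
        groups[checksum].append((hdd, path))
    return {c: files for c, files in groups.items() if len(files) > 1}
```

Feed it the text of each HDD folder's checksum file along with that folder's label. Every group it returns is a set of files with the same checksum, to be reviewed before anything is deleted.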

OPTION FOUR

TreeSizePro is designed for managing large disks and has a de-duping function. I haven't tried that function and am not certain I feel comfortable with it, mainly because I haven't studied it.
It might be useful for you. I do use TreeSizePro to monitor my hard drives and NASes.
https://www.jam-software.com/treesize

Showing the sizes for about 13 TB and about 1M files took about 15 minutes over Gigabit Ethernet. It's a 6-year-old QNAP NAS.

Anyway, just some thoughts -- I hope you find some of this useful.

I don't envy you your project.

Cheers,

Richard



On 2021-01-13 5:17 p.m., David MacFadyen wrote:
> I have a very large and multilingual collection of East European music (numerous alphabets, diacritics, etc). My problem concerns the finding and deletion of duplicates. The music is archived as follows.
>
> The overwhelming number of recordings are EPs, LPs, etc—not individual “singles” or lone MP3s lacking artwork. This means that almost every release is in a folder, containing both the MP3s/FLACs/WAVs, whatever - and the artwork as a JPG or PNG file. Those multiple folders were once collected on individual hard drives, which in turn were then transferred to a single multi-rack NAS server (offline), which is where they currently sit.
>
> So the entire collection (2M+ files) consists of maybe 30 folders, each representing an older external HD, and each of those HD folders contains very many sub folders (each = an LP, EP, etc,...).
>
> Perhaps not surprisingly, we have two problems. (1) DUPLICATE files and LPs, which used to be on separate external HDs but are now in one location. (2) The SPEED of searching for duplicates across multiple racks (on an aging NAS server) is painful. Time for a major backup, I know.
>
> As I decide whether to upload everything to a cloud-based DB or invest in a new server, my question to the forum is this:
>
> Which software and tactic do you trust most to find/delete duplicate digital tracks? I’ve had partial luck with some. (1) Gemini2 is interesting but freezes when dealing with even mid-size tasks. The online support is awful, too. (2) SongKong is good and well supported, if somewhat old-fashioned, but I worry greatly about any program confusing LPs and compilations, for example, and punching permanent holes in compilations by deleting what it considers to be “copies” of LP tracks.
>
> Given that my collection is so big, I’d love to rely on some form of automation, if possible.
>
> Any suggestions, please?
> Thanks! David
> https://www.davidmacfadyen.com/
>

--
Richard L. Hess                   email: [log in to unmask]
Aurora, Ontario, Canada                             647 479 2800
http://www.richardhess.com/tape/contact.htm
Track Format - Speed - Equalization - Azimuth - Noise Reduction
Quality tape transfers -- even from hard-to-play tapes.
