Check 981 files ending in " 2.pdf" for duplicates

After I started using iCloud as the repository for my Documents folder, I found that 981 PDF files had appeared in various subfolders as apparent duplicates, with " 2" (a space and a 2) added to the end of the file name.
Example:
filename.pdf
filename 2.pdf

Randomly checking some of them, the files with the added " 2" appear to be duplicates with the same content as the originals.
I identified them using HoudahSpot.
I am able to move or copy and tag them to a different folder for safekeeping, but I would lose the path information. I don't know how to save them together with their paths.

Is it feasible to have KM check each file against a possible duplicate (without the " 2") in the same folder and, if a duplicate is found, delete the file ending in " 2"?
I guess comparing the PDF file contents to confirm duplication would be too demanding and time-consuming.
Identifying a duplicate through file name comparison would be sufficient for me.

I am new to KM. My understanding so far is that a macro could do the job; my knowledge is just not sufficient yet.
I imagine the following (roughly sketched below the list):

  • get the file information for the files ending in " 2" into a text file as a listing (how?)
  • fetch the file name and path information sequentially from that file
  • check whether the same file name exists in that folder without the " 2"
  • if so, delete the file ending in " 2"; otherwise, rename the file to drop the " 2"
  • fetch the next file name from the list
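
Expressed as a very rough shell sketch, just to illustrate the logic I have in mind (the Documents path is only a placeholder, and echo is used so nothing is actually deleted or renamed):

# placeholder folder; walk all files ending in " 2.pdf"
find "$HOME/Documents" -type f -name '* 2.pdf' | while IFS= read -r dup; do
    orig="${dup% 2.pdf}.pdf"      # same path without the trailing " 2"
    if [ -e "$orig" ]; then
        echo rm "$dup"            # original exists, so the " 2" copy is redundant
    else
        echo mv "$dup" "$orig"    # no original, so rename to drop the " 2"
    fi
done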

I am grateful for any suggestions.

Model Identifier: Mac14,3
System Version: macOS 14.7 (23H124)

Are the files with "2" at the end going to have a more recent timestamp (creation time or modification time) on the file than the original? If so, we don't really need to worry about the filename at all. It should be fairly easy to see if two files are identical, and if they are, delete the older one.

Verifying....

It seems that the Date Created and Date Modified are identical.
However, the files ending in " 2" seem to have swapped Date Added values with the original files, so the ones ending in " 2" look older.

By the way, I have found a way to export file names and paths into a csv file.

That seems odd (the newest file has the older date!?)

In any case, the approach you seem to want to use is to determine whether two files are the same by comparing their filenames (i.e., checking if a filename has a " 2" at the end). This might work, but I find it rather complex and dangerous. My approach would be to check whether the files are actually identical by comparing their contents, not their names. This would be fairly easy to do because there's a macOS utility called md5 which converts a file's contents into a hash value that is, for all practical purposes (at least on the scale of a human lifespan), guaranteed to be unique. In that case, all you have to do is sort the files by their hash value, remove the lines with unique hash values, and the remaining lines will be all the duplicated files. This would probably be a couple of lines of code. Let me work on that for a few minutes and I'll post it here.

For example, if you run this:

find . -type f -exec md5 {} \;

...you will see a list of hash values for all the files in your current folder and its subfolders. Any two files with the same hash value will have the same contents. If we sort the output by hash value and remove the lines with unique hash values, we will be left with only the duplicated files.
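
To make the later steps easier to follow: each line of md5 output on macOS has the form "MD5 (path) = hash". The paths and hash values below are made-up examples; note that the two identical files share the same hash:

MD5 (./Invoices/filename.pdf) = 3a7bd3e2360a3d29eea436fcfb7e44c8
MD5 (./Invoices/filename 2.pdf) = 3a7bd3e2360a3d29eea436fcfb7e44c8
MD5 (./Letters/other.pdf) = 9e107d9d372bb6826bd81d3542a419d6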


Thanks for the example.
I am a rookie Terminal user, but I got it to work for one subfolder.
I will read more about the find command and its syntax to fully understand what all the flags and arguments you used mean.
Next steps?

  • create a txt file with the output and import it into Numbers
  • remove all unique files from the Numbers file so that only the true duplicates and their path information remain

How would you use that information? Automator, scripting?
I am not familiar with that, but certainly interested in your ideas.

By the way - I understand that it seems odd that the newest file has the older date.
When the duplication problems started, folders were affected as well. However, some of the "new" folders (ending in 2) contained the files from the original folder, while the subfolder(s) remained in the original folder. When comparing files and folders, I found that the "new" folders had the original date while the original folders showed the new date. All this happened after I moved my Documents folder to iCloud. Very weird.

That would be very hard to do. There are much easier ways to solve the next step. So you succeeded in getting my "find" command to generate the MD5 values? Good. Approximately how many results were there? 10? 100? 1,000? 10,000?

Since you liked the first step in my solution, I will work on the next step (which will be getting a list of all duplicated files.) And if you like the second step, we can proceed to the third step (which will be deciding which of those files to delete.) The final step will be deleting the files. I expect the solution to be a lot simpler than you expected.

Just to show you my progress, if you save the results of the command above into a file, you can then reorganize and sort the file by the MD5 hash using the following command:

awk -F'[()=]' '{print $4 " = " $2}' | sort -k 1
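
For example, assuming the saved results are in a file named md5-list.txt (the file names here are only placeholders):

find . -type f -exec md5 {} \; > md5-list.txt
awk -F'[()=]' '{print $4 " = " $2}' md5-list.txt | sort -k 1 > sorted-by-hash.txt

Each line of sorted-by-hash.txt then starts with the hash followed by the file path, so identical files end up on neighbouring lines.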

There is still a little more work to be done, but we're getting there. I need to make something to eat now, however.

If you don't care which duplicates get deleted, then the next step is easy. We can extract duplicates using the following command on the results of the previous command:

awk '$1 == prev { print } { prev = $1 }'

However, the above command picks which duplicate files to list more or less arbitrarily, rather than picking the one with the earliest or latest date. That might take more work. But if you are satisfied with this, we can easily implement the final step of deleting all the files that the command above generates. If you are picky about which duplicate files you want to delete, however, there will be some additional work.
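
For instance, the whole chain could be strung together roughly like this (the file name duplicates.txt is just a placeholder, and echo keeps the delete step as a dry run so nothing is actually removed; one copy of each group survives because the first line of every hash group is skipped):

find . -type f -exec md5 {} \; | awk -F'[()=]' '{print $4 " = " $2}' | sort -k 1 | awk '$1 == prev { print } { prev = $1 }' > duplicates.txt
# dry run: print the rm commands instead of executing them
awk -F' = ' '{print $2}' duplicates.txt | while IFS= read -r f; do echo rm "$f"; done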

I think I have a simple, delightful way to pick the file with the oldest or newest date as the selection criterion for deleting. However, it may depend on the maximum number of files you get from the search of your folders. If there are more than a few thousand duplicates, my method may not work (I would have to test it to find the maximum).
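
One possible way to bring the dates into it (not necessarily the method hinted at above, just a sketch that builds on the placeholder duplicates.txt from the earlier dry run) would be to prefix each duplicate's modification time and sort on that:

# stat -f '%m' prints the modification time as epoch seconds on macOS
awk -F' = ' '{print $2}' duplicates.txt | while IFS= read -r f; do
    printf '%s\t%s\n' "$(stat -f '%m' "$f")" "$f"
done | sort -n > duplicates-by-age.txt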