Python Forum
Comparing 2 100GB Drives/directories to see if they are identical
#1
Greetings to those who work on Saturdays!
I’m tinkering with a small Python project and would love to hear your ideas or suggestions.
I'm going to write a script to compare 2 Drives or Directories (with tons of subdirs) to see if they have the same files/folders.
Some of those hold 100 GB, some 2 TB of files (pictures and video files) and so on...
I thought I could do this:
For Disk 1: get a list of files and the directory name for each subdir.
For Disk 2: get a list of files and the directory name for each subdir.
Compare the lists (rough sketch below).
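Something like this is what I had in mind (the paths are just made up):

import os

def list_tree(top):
    # for each subdir, keep its name (relative to the root) and the file names it holds
    tree = {}
    for dirpath, dirnames, filenames in os.walk(top):
        tree[os.path.relpath(dirpath, top)] = sorted(filenames)
    return tree

tree1 = list_tree(r'D:\photos')   # made-up paths
tree2 = list_tree(r'E:\photos')
print(tree1 == tree2)             # True only if both trees have the same dirs and file names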
I learned it is better to ask, before I start writing, if there is a better way to do it. Usually there is a better way... Wink
Thank you in advance.
Tester_V
#2
Just a tip: search "handling large datasets in Python" and you will find many webpages with good advice.

This is a good start.

You will have 2 large sets of data to compare.

Later on, you can get a job with Homeland Security, spying on all Americans! (No, we don't do that!! HS)
#3
Big Grin For a Saturday night? Yes, that is funny! Dance
#4
To understand the problem, start by finding duplicates in e.g. one folder first.
Rather than identifying file content by name, extension, size or some other designation, which is not guaranteed to work,
you have to use a file hash and compare hashes to find duplicates.
Here is something I wrote before as an answer to this kind of task, which is an OK start to understand the problem.
It's not yet optimized for speed (e.g. blake3 for hashing would be a lot faster).
It iterates over all files in a folder, including sub-folders, and finds all duplicate files.
import hashlib
from pathlib import Path

def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
    return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'F:\images')
    find_duplicate(root_path)
#5
Hi,

generally speaking: the total file size on a drive is not necessarily related to the number of files. If your drive holds 1 TB of data, but all files are 50 GB 4k video files, you have 20 files only. If all files are 5 kB text files, you have a hell of a lot of files.

However, a pretty straightforward approach is:

* Iterate recursively over the files on each drive. Python's pathlib module is your friend for that.
* Store the _full_ path with the filename and file extension in a Python set, so you have one set for each drive.
* Let Python calculate the difference between the sets to see which files are not on the other drive (sketch below).
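A minimal sketch of that approach (example paths; I store the paths relative to each root so that the different drive letters do not get in the way):

from pathlib import Path

def relative_files(root: Path) -> set[str]:
    # every file below root: relative path + filename + extension
    return {str(p.relative_to(root)) for p in root.rglob('*') if p.is_file()}

files1 = relative_files(Path(r'D:\photos'))   # example paths, adjust them
files2 = relative_files(Path(r'E:\photos'))

print('Missing on drive 2:', sorted(files1 - files2))
print('Missing on drive 1:', sorted(files2 - files1))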

The time-consuming step will most likely be iterating over the files, as it creates plenty of I/O. However, this step is mandatory anyway, no matter how you process the data afterwards.

Please remember that this approach _only_ checks for files with _exactly_ identical full path, filename and extension. If you look for duplicate file names only or duplicate dirs only, you need to alter the approach by deciding what to write to the set or have additional sets for the directories etc.
Please remember that this approach is purely name-based, so you won't find any duplicate files with different names but identical content. In case the latter is required, you need to calculate a hash for each file and compare hash values.

Regards, noisefloor

EDIT: writing this post overlapped with @snippsat post...
#6
That’s why I come here: the advice is always straightforward and thoughtful, offered without attitude, and occasionally even with a touch of humor. Love you guys!
#7
I modified the code so it can compare two folders or drives; this version uses blake3 for hashing.
It uses sets for comparing.
It should be reasonably fast now if the folders are not too big, but there are still improvements that could help a lot, e.g. adding concurrent.futures, or maybe other improvements after more testing.
Also, it now compares all files; if you only need images, videos etc., add a filter so it only compares those file extensions.
from pathlib import Path
# pip install blake3
import blake3

CHUNK = 8 * 1024 * 1024  # 8MB

def compute_hash(p: Path) -> str:
    h = blake3.blake3()
    with open(p, "rb") as f:
        for b in iter(lambda: f.read(CHUNK), b""):
            h.update(b)
    return h.hexdigest()

def index_hashes(root: Path) -> dict[str, list[Path]]:
    hashes: dict[str, list[Path]] = {}
    for p in root.rglob("*"):
        if p.is_file():
            try:
                hv = compute_hash(p)
            except Exception as e:
                print(f"Skip (error) {p}: {e}")
                continue
            hashes.setdefault(hv, []).append(p)
    return hashes

def compare_folders(left: Path, right: Path) -> None:
    left_hashes = index_hashes(left)
    right_hashes = index_hashes(right)
    common = set(left_hashes) & set(right_hashes)
    only_left = set(left_hashes) - set(right_hashes)
    only_right = set(right_hashes) - set(left_hashes)
    print(f"\n=== Identical content across {left} and {right} ===")
    for hv in sorted(common):
        for lp in left_hashes[hv]:
            for rp in right_hashes[hv]:
                print(f"[SAME CONTENT] {lp} == {rp}")
    '''print(f"\n=== Content only in {left} ===")
    for hv in sorted(only_left):
        for lp in left_hashes[hv]:
            print(lp)
    print(f"\n=== Content only in {right} ===")
    for hv in sorted(only_right):
        for rp in right_hashes[hv]:
            print(rp)'''

if __name__ == "__main__":
    # Edit these two folders to compare
    compare_folders(Path(r"E:\stuff"), Path(r"C:\stuff"))
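E.g. a simple extension filter (the extension set is just an example, adjust to what you need) that could be used in index_hashes:

from pathlib import Path

# example set of media extensions, adjust as needed
MEDIA_EXTS = {'.jpg', '.jpeg', '.png', '.mp4', '.mov', '.avi', '.mkv'}

def is_media_file(p: Path) -> bool:
    return p.is_file() and p.suffix.lower() in MEDIA_EXTS

# then in index_hashes() use `if is_media_file(p):` instead of `if p.is_file():`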
#8
@snippsat I don't understand why we should hash the file paths. I thought the file names are unique?

If you have time to explain, I, and maybe others, would be very grateful!

I have 2 folders below with my own Python files in them. Some files are common to both folders.

This gets the names of files with identical names found in both folders and / or their sub-directories:

from pathlib import Path

# a collection of old files I have used
path1 = Path('/home/peterr/myPython')
# path2 is newer, it is a Python virtual environment, all necessary modules are also stored in PVE
# so it is bigger than path1
path2 = Path('/home/peterr/PVE')

# this only gets file names from the top level directory /home/peterr/myPython
file_names_list1 = [item.name for item in path1.iterdir() if item.is_file()]
len(file_names_list1)  # returns 129

# recursively get all file names in path1
# path1.rglob('*') is a generator, so small size
file_names_list1_all = [item.name for item in path1.rglob('*') if item.is_file()]
len(file_names_list1_all)  # returns 8054

# recursively get all file names in path2
file_names_list2_all = [item.name for item in path2.rglob('*') if item.is_file()]
len(file_names_list2_all)  # returns 11195

common_files = []
for f in file_names_list2_all:
    if f in file_names_list1_all:
        common_files.append(f)
len(common_files)  # returns 1172

result = set(common_files)
len(result)  # returns 428
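I suppose the last loop could also be written as a set intersection (continuing from the code above), which should give the same result:

result = set(file_names_list1_all) & set(file_names_list2_all)
len(result)  # should also return 428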
#9
(Sep-29-2025, 06:50 AM)Pedroski55 Wrote: @snippsat I don't understand why we should hash the file paths. I thought the file names are unique?

If you have time to explain, I, and maybe others, would be very grateful!
The only way to be sure you find all duplicate files is to use a hash.
Filenames aren't always reliable; two different names can point to identical content.
That said, your approach of comparing filenames can in many cases work fine;
it depends on what the user has done and changed with the file content before.
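A quick demo with two throwaway files (made-up names, identical content) to show why the hash catches what a name compare misses:

import hashlib
from pathlib import Path

# two throwaway files: different names, identical content
Path('holiday_001.jpg').write_bytes(b'same bytes')
Path('copy_of_photo.jpeg').write_bytes(b'same bytes')

h1 = hashlib.md5(Path('holiday_001.jpg').read_bytes()).hexdigest()
h2 = hashlib.md5(Path('copy_of_photo.jpeg').read_bytes()).hexdigest()
print(h1 == h2)  # True -> a name-based compare would have missed this duplicate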