Python Forum
Comparing 2 100GB Drives/directories to see if they are identical
#1
Greetings to those who work on Saturdays!
I’m tinkering with a small Python project and would love to hear your ideas or suggestions.
I'm going to write a script to compare 2 Drives or Directories (with tons of subdirs) to see if they have the same files/folders.
Some of those hold 100 GB, some 2 TB of files (pictures and video files) and so on...
I thought I could do this:
For Disk 1: get a list of files and the directory name for each subdir.
For Disk 2: get a list of files and the directory name for each subdir.
Compare the lists (rough sketch below).
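Something like this is what I had in mind (the paths are just made up):

import os

def list_tree(top):
    # for each subdir, keep its name (relative to the root) and the file names it holds
    tree = {}
    for dirpath, dirnames, filenames in os.walk(top):
        tree[os.path.relpath(dirpath, top)] = sorted(filenames)
    return tree

tree1 = list_tree(r'D:\photos')   # made-up paths
tree2 = list_tree(r'E:\photos')
print(tree1 == tree2)             # True only if both trees have the same dirs and file names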
I learned it is better to ask, before I start writing, if there is a better way to do it. Usually there is a better way... Wink
Thank you in advance.
Tester_V
#2
Just a tip: search "handling large datasets in Python" and you will find many webpages with good advice.

This is a good start.

You will have 2 large sets of data to compare.

Later on, you can get a job with Homeland Security, spying on all Americans! (No, we don't do that!! HS)
#3
Big Grin For a Saturday night? Yes, that is funny! Dance
#4
To understand the problem, start by finding duplicates in e.g. one folder first.
Rather than identifying file content by name, extension, size or some other designation, which is not guaranteed to work,
you have to use a file hash and compare hashes to find duplicates.
Here is something I wrote before as an answer to this kind of task, which is an OK start to understand the problem.
It's not yet optimized for speed (e.g. blake3 for hashing would be a lot faster).
It iterates over all files in a folder, including sub-folders, and finds all duplicate files.
import hashlib
from pathlib import Path

def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
    return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'F:\images')
    find_duplicate(root_path)
#5
Hi,

generally speaking: the total file size on a drive is not necessarily related to the number of files. If your drive holds 1 TB of data, but all files are 50 GB 4k video files, you have 20 files only. If all files are 5 kB text files, you have a hell of a lot of files.

However, a pretty straightforward approach is:

* Iterate recursively over the files on each drive. Python's pathlib module is your friend for that.
* Store the _full_ path with the filename and file extension in a Python set, so you have one set for each drive.
* Let Python calculate the difference between the sets to see which files are not on the other drive (sketch below).
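A minimal sketch of that approach (example paths; I store the paths relative to each root so that the different drive letters do not get in the way):

from pathlib import Path

def relative_files(root: Path) -> set[str]:
    # every file below root: relative path + filename + extension
    return {str(p.relative_to(root)) for p in root.rglob('*') if p.is_file()}

files1 = relative_files(Path(r'D:\photos'))   # example paths, adjust them
files2 = relative_files(Path(r'E:\photos'))

print('Missing on drive 2:', sorted(files1 - files2))
print('Missing on drive 1:', sorted(files2 - files1))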

The time-consuming step will most likely be iterating over the files, as it creates plenty of I/O. However, this step is mandatory anyway, no matter how you process the data afterwards.

Please remember that this approach _only_ checks for files with _exactly_ identical full path, filename and extension. If you look for duplicate file names only or duplicate dirs only, you need to alter the approach by deciding what to write to the set or have additional sets for the directories etc.
Please remember that this approach is purely name-based, so you won't find any duplicate files with different names but identical content. In case the latter is required, you need to calculate a hash for each file and compare hash values.

Regards, noisefloor

EDIT: writing this post overlapped with @snippsat post...
#6
That’s why I come here: the advice is always straightforward and thoughtful, offered without attitude, and occasionally even with a touch of humor. Love you guys!
#7
I modified the code so it can compare two folders or drives; this version uses blake3 for hashing.
It uses sets for comparing.
It should be reasonably fast now if the folders are not too big, but there are still improvements that could help a lot, e.g. adding concurrent.futures, or maybe other improvements after more testing.
Also, it now compares all files; if you only need images, videos etc., add a filter so it only compares those file extensions.
from pathlib import Path
# pip install blake3
import blake3

CHUNK = 8 * 1024 * 1024  # 8MB

def compute_hash(p: Path) -> str:
    h = blake3.blake3()
    with open(p, "rb") as f:
        for b in iter(lambda: f.read(CHUNK), b""):
            h.update(b)
    return h.hexdigest()

def index_hashes(root: Path) -> dict[str, list[Path]]:
    hashes: dict[str, list[Path]] = {}
    for p in root.rglob("*"):
        if p.is_file():
            try:
                hv = compute_hash(p)
            except Exception as e:
                print(f"Skip (error) {p}: {e}")
                continue
            hashes.setdefault(hv, []).append(p)
    return hashes

def compare_folders(left: Path, right: Path) -> None:
    left_hashes = index_hashes(left)
    right_hashes = index_hashes(right)
    common = set(left_hashes) & set(right_hashes)
    only_left = set(left_hashes) - set(right_hashes)
    only_right = set(right_hashes) - set(left_hashes)
    print(f"\n=== Identical content across {left} and {right} ===")
    for hv in sorted(common):
        for lp in left_hashes[hv]:
            for rp in right_hashes[hv]:
                print(f"[SAME CONTENT] {lp} == {rp}")
    '''print(f"\n=== Content only in {left} ===")
    for hv in sorted(only_left):
        for lp in left_hashes[hv]:
            print(lp)
    print(f"\n=== Content only in {right} ===")
    for hv in sorted(only_right):
        for rp in right_hashes[hv]:
            print(rp)'''

if __name__ == "__main__":
    # Edit these two folders to compare
    compare_folders(Path(r"E:\stuff"), Path(r"C:\stuff"))
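E.g. a simple extension filter (the extension set is just an example, adjust to what you need) that could be used in index_hashes:

from pathlib import Path

# example set of media extensions, adjust as needed
MEDIA_EXTS = {'.jpg', '.jpeg', '.png', '.mp4', '.mov', '.avi', '.mkv'}

def is_media_file(p: Path) -> bool:
    return p.is_file() and p.suffix.lower() in MEDIA_EXTS

# then in index_hashes() use `if is_media_file(p):` instead of `if p.is_file():`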
#8
@snippsat I don't understand why we should hash the file paths. I thought the file names are unique?

If you have time to explain, I, and maybe others, would be very grateful!

I have 2 folders below with my own Python files in them. Some files are common to both folders.

This gets the names of files with identical names found in both folders and / or their sub-directories:

from pathlib import Path

# a collection of old files I have used
path1 = Path('/home/peterr/myPython')
# path2 is newer, it is a Python virtual environment, all necessary modules are also stored in PVE
# so it is bigger than path1
path2 = Path('/home/peterr/PVE')

# this only gets file names from the top level directory /home/peterr/myPython
file_names_list1 = [item.name for item in path1.iterdir() if item.is_file()]
len(file_names_list1)  # returns 129

# recursively get all file names in path1
# path1.rglob('*') is a generator, so small size
file_names_list1_all = [item.name for item in path1.rglob('*') if item.is_file()]
len(file_names_list1_all)  # returns 8054

# recursively get all file names in path2
file_names_list2_all = [item.name for item in path2.rglob('*') if item.is_file()]
len(file_names_list2_all)  # returns 11195

common_files = []
for f in file_names_list2_all:
    if f in file_names_list1_all:
        common_files.append(f)
len(common_files)  # returns 1172

result = set(common_files)
len(result)  # returns 428
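I suppose the last loop could also be written as a set intersection (continuing from the code above), which should give the same result:

result = set(file_names_list1_all) & set(file_names_list2_all)
len(result)  # should also return 428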
#9
(Sep-29-2025, 06:50 AM)Pedroski55 Wrote: @snippsat I don't understand why we should hash the file paths. I thought the file names are unique?

If you have time to explain, I, and maybe others, would be very grateful!
The only way to be sure you find all duplicate files is to use a hash.
Filenames aren't always reliable; two different names can point to identical content.
That said, your approach of comparing filenames can in many cases work fine;
it depends on what the user has done and changed with the file content before.
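A quick demo with two throwaway files (made-up names, identical content) to show why the hash catches what a name compare misses:

import hashlib
from pathlib import Path

# two throwaway files: different names, identical content
Path('holiday_001.jpg').write_bytes(b'same bytes')
Path('copy_of_photo.jpeg').write_bytes(b'same bytes')

h1 = hashlib.md5(Path('holiday_001.jpg').read_bytes()).hexdigest()
h2 = hashlib.md5(Path('copy_of_photo.jpeg').read_bytes()).hexdigest()
print(h1 == h2)  # True -> a name-based compare would have missed this duplicate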