This article gives an overview of how perceptual image hashes can be used to detect duplicate and near-duplicate images.
Install dependencies
Install txtai and all dependencies.
pip install txtai[pipeline] textdistance
wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
tar -xvzf tests.tar.gz
Generate hashes
The example below generates perceptual image hashes for a list of images.
import glob

from PIL import Image

from txtai.pipeline import ImageHash

def show(image):
    # Scale the image down for display, preserving the aspect ratio
    width, height = image.size
    return image.resize((int(width / 2.25), int((width / 2.25) * height / width)))

# Load images
images = [Image.open(image) for image in glob.glob('txtai/*jpg')]

# Create image pipeline
ihash = ImageHash()

# Generate hashes
hashes = ihash(images)
hashes
['000000c0feffff00', '0859dd04ffbfbf00', '78f8f8d8f8f8f8f0', '0000446c6f2f2724', 'ffffdf0700010100', '00000006fefcfc30', 'ff9d8140c070ffff', 'ff9f010909010101', '63263c183ce66742', '60607072fe78cc00']
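To give a feel for where a 16-character hex string like the ones above can come from, here is a minimal sketch of one common perceptual hashing scheme, average hashing. It is an illustration only: this article doesn't go into which algorithm ImageHash runs internally, and the `average_hash` helper and the 8x8 grid are made up for this example. A real implementation would first shrink the image to a small grayscale grid; the sketch starts from that grid.

```python
def average_hash(pixels):
    """Return a hex hash for a grid of grayscale values (0-255)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)

    # One bit per pixel: 1 if brighter than the mean, else 0
    bits = "".join("1" if value > mean else "0" for value in flat)

    # Pack the bit string into hexadecimal, 4 bits per character
    return f"{int(bits, 2):0{len(flat) // 4}x}"


# Hypothetical 8x8 image: dark top half, bright bottom half
grid = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]
digest = average_hash(grid)

# A uniformly brighter copy keeps the same above/below-mean pattern
brighter = [[value + 20 for value in row] for row in grid]
same = average_hash(brighter) == digest
```

Because each bit only records whether a pixel is above or below the mean, small changes in brightness or compression tend to leave most bits untouched, which is what makes hashes of near-duplicates land close together.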
Hash search
Next, we'll generate a search hash to use to find near-duplicate images. This logic crops a section of an image and generates a hash for that section.
# Select portion of image
width, height = images[0].size

# Get dimensions for middle of image
left = (width - width/3)/2
top = (height - height/1.35)/2
right = (width + width/3)/2
bottom = (height + height/1.35)/2

# Crop image
search = images[0].crop((left, top, right, bottom))
show(search)
Now let's compare the hash to all the image hashes using Levenshtein distance. We'll use the textdistance library for that.
import textdistance

# Find closest image hash using textdistance
shash = ihash(search)

# Calculate distances for search hash
distances = [int(textdistance.levenshtein.distance(h, shash)) for h in hashes]

# Show closest image hash
low = min(distances)
show(images[distances.index(low)])
And as expected, the closest match is the original full image!
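For readers curious what the distance call is doing, here is a small pure-Python sketch of Levenshtein distance, applied to the hashes listed earlier. It is not textdistance's implementation, just the textbook dynamic-programming version, and the `query` value is a hypothetical search hash invented for the example (the real `shash` isn't shown in this article).

```python
def levenshtein(a, b):
    """Edit distance between two strings, keeping only one row of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


hashes = ['000000c0feffff00', '0859dd04ffbfbf00', '78f8f8d8f8f8f8f0',
          '0000446c6f2f2724', 'ffffdf0700010100', '00000006fefcfc30',
          'ff9d8140c070ffff', 'ff9f010909010101', '63263c183ce66742',
          '60607072fe78cc00']

# Hypothetical search hash: the first hash with its last character changed
query = '000000c0feffff01'
closest = min(hashes, key=lambda h: levenshtein(h, query))
```

The smaller the distance, the fewer single-character edits separate the two hashes, so the minimum picks out the most similar image.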
Generate hashes with Embeddings indexes
Next, we'll add a custom field with a perceptual image hash and a custom SQL function to calculate Levenshtein distance. An index of images is built, and then a search query is run that orders results by distance from the same search hash.
from txtai.embeddings import Embeddings

def distance(a, b):
    # Handle missing values: distance to an empty hash is the other hash's length
    if a and not b:
        return len(a)
    if not a and b:
        return len(b)
    if not a and not b:
        return 0

    return int(textdistance.levenshtein.distance(a, b))

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": "image", "functions": [distance]})

# Create an index for the list of images
embeddings.index([(uid, {"object": image, "text": ihash(image)}, None) for uid, image in enumerate(images)])

# Find image that is closest to hash
show(embeddings.search(f"select object from txtai order by distance(text, '{shash}')")[0]["object"])
And just like above, the best match is the original full image.
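The idea of a custom SQL function can also be tried outside txtai. The sketch below uses Python's built-in sqlite3 module to register a `distance` function and order rows by it. Treat it as an analogy rather than a description of txtai's internals: the `images` table, its columns, and the `query` hash are all hypothetical.

```python
import sqlite3


def distance(a, b):
    """Levenshtein distance; None or empty values count as empty strings."""
    a, b = a or "", b or ""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev_diag, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            prev_diag, row[j] = row[j], min(
                row[j] + 1,              # deletion
                row[j - 1] + 1,          # insertion
                prev_diag + (ca != cb),  # substitution or match
            )
    return row[-1]


connection = sqlite3.connect(":memory:")

# Expose the Python function to SQL as distance(a, b)
connection.create_function("distance", 2, distance)

# Hypothetical table holding one hash per image
connection.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, hash TEXT)")
hashes = ["000000c0feffff00", "0859dd04ffbfbf00", "78f8f8d8f8f8f8f0"]
connection.executemany("INSERT INTO images (id, hash) VALUES (?, ?)", enumerate(hashes))

# Hypothetical search hash, passed as a bound parameter
query = "000000c0feffff01"
best = connection.execute(
    "SELECT id, hash, distance(hash, ?) AS d FROM images ORDER BY d LIMIT 1",
    (query,),
).fetchone()
```

Here `best` holds the id, hash and distance of the closest row. Binding the search hash as a parameter, rather than formatting it into the SQL string, is a design choice of this sketch.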
Wrapping up
This article introduced perceptual image hashing. These hashes can be used to detect near-duplicate images. This method is not backed by machine learning models and is not intended to find conceptually similar images. But for tasks that need to find near-duplicate images, this method is fast and does the job!