I’ve released a Go library on GitHub: Duplo. The library can be used to detect duplicate images or to find similar images in a set. It is an implementation of Fast Multiresolution Image Querying by Jacobs et al. It works surprisingly well if you want to find duplicate images which may have been modified slightly (e.g. colour correction, different compression or file format, watermarks). Applications may include:
- Recognize copyright violations
- Save disk space by detecting and removing duplicate images
- Search for images by similarity
The images themselves are not stored in the data structure. Instead, they are first reduced in size. A Haar wavelet transform is then applied to the downsized image. To save even more space, only the top 40 Haar coefficients are kept in a data structure that allows for fast queries. These coefficients make up the “visual hash” of the image. (Binary hashes such as MD5 do not work here because images that are similar but not identical produce completely different hashes.)
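To make the idea more concrete, here is a minimal sketch of a Haar decomposition followed by picking the strongest coefficients. It is not Duplo's actual code: it works on a one-dimensional signal instead of a two-dimensional image, uses plain averaging and differencing without normalisation, and the names `haar1D` and `topCoefs` are made up for this illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// haar1D returns the Haar wavelet decomposition of data, whose length is
// assumed to be a power of two. Index 0 holds the overall average; the
// remaining entries hold detail coefficients from coarse to fine.
func haar1D(data []float64) []float64 {
	out := append([]float64(nil), data...)
	tmp := make([]float64, len(out))
	for n := len(out); n > 1; n /= 2 {
		half := n / 2
		for i := 0; i < half; i++ {
			a, b := out[2*i], out[2*i+1]
			tmp[i] = (a + b) / 2      // pairwise average
			tmp[half+i] = (a - b) / 2 // pairwise detail
		}
		copy(out[:n], tmp[:n])
	}
	return out
}

// topCoefs returns the indices of the k detail coefficients with the
// largest magnitude, skipping index 0 (the overall average).
func topCoefs(coefs []float64, k int) []int {
	idx := make([]int, 0, len(coefs))
	for i := 1; i < len(coefs); i++ {
		idx = append(idx, i)
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return math.Abs(coefs[idx[a]]) > math.Abs(coefs[idx[b]])
	})
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	coefs := haar1D([]float64{9, 7, 3, 5})
	fmt.Println(coefs)              // [6 2 1 -1]
	fmt.Println(topCoefs(coefs, 2)) // [1 2]
}
```

For an image, the same transform would be applied along the rows and then along the columns of the downsized picture. Keeping only the positions of the largest coefficients is what makes the hash so compact.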
To determine how similar two images are, a distance function takes their two hashes and calculates a distance value. The smaller that value, the more similar the two images are.
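As a rough illustration of such a distance function, here is a simplified sketch in the spirit of the scoring described by Jacobs et al. The `hash` type, the `distance` function and both weights are assumptions made for this example and are not Duplo's API. The score starts with the difference of the average values and is lowered for every retained coefficient that both hashes share with the same sign.

```go
package main

import (
	"fmt"
	"math"
)

// hash is a hypothetical, simplified visual hash: the average intensity
// plus the sign (+1 or -1) of each retained coefficient, keyed by position.
type hash struct {
	avg   float64
	signs map[int]int8
}

// distance returns a score that is lower for more similar hashes.
// The weights are placeholders chosen for the example.
func distance(a, b hash) float64 {
	const avgWeight, matchWeight = 1.0, 1.0
	score := avgWeight * math.Abs(a.avg-b.avg)
	for pos, s := range a.signs {
		if t, ok := b.signs[pos]; ok && t == s {
			score -= matchWeight // shared coefficient with the same sign
		}
	}
	return score
}

func main() {
	original := hash{avg: 0.5, signs: map[int]int8{1: 1, 2: -1, 5: 1}}
	other := hash{avg: 0.75, signs: map[int]int8{1: 1, 2: 1, 7: -1}}
	fmt.Println(distance(original, original)) // -3
	fmt.Println(distance(original, other))    // -0.75
}
```

Note that a score like this can be negative. Only the ordering matters: the candidate with the lowest value is the best match.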
It works very well on photos, but it also has limits. The algorithm may fail especially when a set contains many images with only very slight differences. Here is an example:
The method is also fundamentally different from other approaches that can find partial or distorted images. There is also no “intelligence” built in that could detect faces, text, or other patterns (e.g. QR codes). For those applications, you may want to look at projects such as OpenCV.
The same algorithm has previously been used by the imgSeek software as well as the retrievr tool. I’ve decided to implement a modern version of it in Go because it’s fast and works well as a server-side process. At Stock Performer, we have it in production and our users are very happy with the results.