Over the years, various approaches have been proposed to address the problem of matching images based on similarity. These approaches mainly differ in how they form the feature vector for an image. The most widely used sources for feature extraction are color, shape and texture.
One of the first algorithms to use color was proposed by Swain et al. . It computes a color histogram for each image and then intersects the histograms to measure similarity. Improvements that add spatial information and correlation to color histograms were proposed in  and . Color histograms are very sensitive to noise, and they work best when the input image and the database images are taken by the same device and therefore share a similar color representation. Unfortunately, this is not the case in our domain: phone cameras tend to take relatively noisy pictures, and our database consists of digital versions of cover art, not photos of them.
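The histogram-intersection idea can be illustrated with a minimal sketch. It assumes pixels are given as (r, g, b) tuples with 8-bit channels and normalizes the overlap by the size of the database (model) histogram; the function names and the choice of 4 bins per channel are illustrative assumptions, not taken from the cited work.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Quantize (r, g, b) pixels (0-255 per channel) into a flat joint histogram."""
    hist = [0] * (bins_per_channel ** 3)
    step = 256 / bins_per_channel
    for r, g, b in pixels:
        # Clamp so that a value of 255 always falls into the last bin.
        ri = min(int(r / step), bins_per_channel - 1)
        gi = min(int(g / step), bins_per_channel - 1)
        bi = min(int(b / step), bins_per_channel - 1)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
    return hist


def histogram_intersection(query_hist, model_hist):
    """Sum of bin-wise minima, normalized by the model histogram's total count."""
    overlap = sum(min(q, m) for q, m in zip(query_hist, model_hist))
    total = sum(model_hist)
    return overlap / total if total else 0.0
```

A score close to 1 means the query covers most of the model's color mass, and a score of 0 means the two histograms share no occupied bin. Because the score depends only on bin counts, a noisy or color-shifted photo can move pixels into neighboring bins and lower it, which is the sensitivity noted above.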
In addition to color histograms, various methods look for textures in an image and their spatial placement, and construct a feature vector from this information. One important work on textures is that of Tamura et al. . In this paper, the authors propose approximations to the following texture features: coarseness, contrast, roughness, regularity, directionality and linelikeness. These properties are based on how humans actually perceive textures. While these methods are usually good at detecting uniform regions such as sky and sea in an image, they do not give good results in our domain, because such regions are not distinctive enough to identify a single match for an input image.
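As an illustration of one of these texture features, the sketch below computes contrast in the form it is commonly stated for Tamura features: the standard deviation of the gray levels divided by the fourth root of their kurtosis. It assumes a grayscale image given as a list of rows; the function name and the handling of flat images are assumptions made for this example.

```python
def tamura_contrast(gray):
    """Contrast = sigma / kurtosis**(1/4), with kurtosis = mu4 / sigma**4."""
    values = [float(v) for row in gray for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        # A perfectly flat image has no contrast (assumed convention).
        return 0.0
    mu4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = mu4 / variance ** 2
    return variance ** 0.5 / kurtosis ** 0.25
```

This is a single global number per image or region, which helps explain why such features separate broad uniform areas well but carry too little information to single out one image among many.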
Shape can also be used to build a feature vector from an image. Methods relying on shape usually compare the edges, corners and shapes of the objects in the image. Shape-based feature extraction usually works at a local level: it is concerned with locating points of interest in the image, rather than considering the global distribution of a feature as color-based methods do. One of the most commonly used shape detectors is the corner detector by Harris . Another descriptor that makes use of local points of interest is SIFT (Scale-Invariant Feature Transform) . SIFT is a popular method that detects key points in an image using the Difference of Gaussians method. The resulting features are invariant to scaling, differences in illumination and rotation, and work well even for 3D objects seen from different points of view. These properties make SIFT ideal for our architecture.
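The Difference of Gaussians step can be sketched as follows. This is a simplified, single-scale illustration in plain Python, not the full SIFT pipeline: it blurs a grayscale image at two scales, subtracts the results, and reports the location of the strongest response. The function names, the 3-sigma kernel radius, the edge clamping and the scale ratio k = 1.6 are assumptions chosen for this example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with a radius of about 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur; borders are handled by clamping coordinates."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    horizontal = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            horizontal[y][x] = sum(
                kernel[k + radius] * image[y][min(max(x + k, 0), width - 1)]
                for k in range(-radius, radius + 1))
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            result[y][x] = sum(
                kernel[k + radius] * horizontal[min(max(y + k, 0), height - 1)][x]
                for k in range(-radius, radius + 1))
    return result


def difference_of_gaussians(image, sigma=1.0, k=1.6):
    """Coarser blur minus finer blur, computed pixel by pixel."""
    fine = blur(image, sigma)
    coarse = blur(image, k * sigma)
    return [[c - f for c, f in zip(coarse_row, fine_row)]
            for coarse_row, fine_row in zip(coarse, fine)]


def strongest_response(dog):
    """Return (row, column) of the largest absolute DoG value."""
    best, best_pos = -1.0, (0, 0)
    for y, row in enumerate(dog):
        for x, value in enumerate(row):
            if abs(value) > best:
                best, best_pos = abs(value), (y, x)
    return best_pos
```

For a small bright blob on a dark background, the response is negative and strongest at the blob center. The full method repeats this over a pyramid of scales and keeps points that are extrema across both space and scale, which is where the scale invariance comes from.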