The current state of image search

A picture is worth a thousand words, especially to search engines trying to match a brief search query to a set of appropriate visual results. How can a web search engine collect enough data about a particular image to provide a user with relevant results? In this post I will outline image search concepts, survey the current state of the art, and describe some of the challenges of still image search.

An image on your website

Yoda statue

You might recognize the depiction above as Yoda, a popular character from the Star Wars movie series. More specifically, this is a picture of a Yoda statue perched on top of a fountain at Lucasfilm’s headquarters in San Francisco. Here’s what Yoda might look like expressed on a web page.

<img src="yoda.jpg"
 alt="Yoda statue"
 longdesc="yodainfo.html"
 width="195" height="240"
 xml:lang="en-US" />

The above markup uses the img element of (X)HTML to communicate a few attributes of the image the publisher would like to display. A publisher will specify the location of the file, but the other attributes are often not used to add further information about the image.

I’ve provided a few extra pieces of data in my example. The alt attribute provides a brief description and is used by browsers as a placeholder while the image is retrieved, if it can be retrieved at all. The longdesc attribute links to a URL with a longer description of the image. The width and height of the image are described in pixels, and all values are provided in English. This extra data is rarely used, although XHTML requires both the location of the file (src) and a brief description (alt).
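Here’s a rough sketch of how an indexer might pull these attributes out of a page using Python’s built-in html.parser module. The class name and output format are my own illustration, not any search engine’s actual pipeline.

from html.parser import HTMLParser

class ImgAttributeExtractor(HTMLParser):
    """Collect the attributes of every img element on a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            self.images.append(dict(attrs))

parser = ImgAttributeExtractor()
parser.feed('<img src="yoda.jpg" alt="Yoda statue" '
            'longdesc="yodainfo.html" width="195" height="240" />')
print(parser.images)
# [{'src': 'yoda.jpg', 'alt': 'Yoda statue', 'longdesc': 'yodainfo.html',
#   'width': '195', 'height': '240'}]

From a single pass over the markup an indexer could weigh the alt text, follow the longdesc URL, and note the image dimensions.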

Most search engines utilize the file name as an approximate descriptor of the image. A digital still camera creates serial file names such as DSC0001.jpg, making things much worse!
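To illustrate the difference, here’s a hypothetical tokenizer that turns a descriptive file name into keywords and throws away serial camera names; the filtering rule is my own assumption about how such a pass might work.

import re

def filename_keywords(filename):
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    words = re.split(r"[-_\s]+", stem)  # split on common separators
    # Discard serial tokens like "DSC0001" that carry no descriptive value
    return [w.lower() for w in words
            if not re.fullmatch(r"[A-Za-z]{2,4}\d+", w)]

print(filename_keywords("yoda-statue-lucasfilm.jpg"))  # ['yoda', 'statue', 'lucasfilm']
print(filename_keywords("DSC0001.jpg"))                # []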

Hosted image libraries

How do image hosting sites provided by major search engines change the ability to search your latest still image? Yahoo!’s Flickr and Google’s Picasa encourage users to add extra descriptors to images to enable better discoverability and sharing. The description data is more visible than standard HTML markup, making that DSC0001.jpg image title look pretty ugly. Any Flickr-hosted photo displayed in another web page must also include a link to the full Flickr photo page, thereby creating a long description of the image for all search engines.

My Flickr photo page of Yoda contains a short description, a long description, a set of keywords provided by me and/or other site users, and metadata extracted from the image file, such as the date and time recorded by the capture device. Other data such as geographic coordinates may be extracted and displayed on this page, or I might take a few extra steps to manually add the metadata.

The popularity of a particular photo as measured by the hosting site complements other ranking factors such as page- and author-level link ranks. Data gathering possibilities are defined by each user’s manual input as well as the information present at the time of capture and editing. It’s very easy for a search service such as Yahoo! or Google to reach out and “touch” these images stored just a short fiber down the rack.

External citations

The image may also be described by links from other websites to the hosted page or to the image itself. In this case image search can apply the same citation analysis as traditional web search to note how other publishers reference a particular resource.

Touching the image

A few more pieces of data are available to indexers once they take a peek inside the actual image file. Date and time of capture, camera settings, location, and copyright data may be described in formats such as Exif or XMP, adding even more context.
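As a quick sketch, the Pillow imaging library for Python can surface these Exif fields in a few lines. The file name is hypothetical, and this assumes the JPEG actually carries Exif data.

from PIL import Image
from PIL.ExifTags import TAGS

image = Image.open("yoda.jpg")  # hypothetical file name
for tag_id, value in image.getexif().items():
    name = TAGS.get(tag_id, tag_id)  # translate numeric tag IDs to names
    print(f"{name}: {value}")
# Typical output includes DateTime, Make, Model, and Copyright fields.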

Date and time

Most digital image capture devices include a clock and timestamp their photos. The time on a mobile phone syncs over the air and is generally more reliable than that of a typical digital camera, which requires additional setup and menu navigation.

Geolocation

Where in the world did you take that photo? Mobile phones are delivering better location-aware services with each new release, fueled by government demand for better emergency services for mobile customers and the industry’s desire to capitalize on location-aware service offerings. Some phones include an actual GPS receiver while others rely on the same cell tower triangulation that helps deliver a call to your handset.

A standalone GPS receiver can synchronize its coordinates with a standalone digital camera based on the timestamp on each device. A bicyclist with a GPS receiver and a standalone digital point-and-shoot can combine data from the two gadgets and plot the entire bike ride, complete with pictures.
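Here’s a minimal sketch of that timestamp matching in Python. The track points, the tolerance, and the function name are illustrative assumptions, not any particular product’s method.

from datetime import datetime, timedelta

# (timestamp, latitude, longitude) points logged by the GPS receiver
track = [
    (datetime(2006, 9, 1, 10, 0, 0), 37.7999, -122.4501),
    (datetime(2006, 9, 1, 10, 5, 0), 37.8010, -122.4489),
]

def geotag(photo_time, track, tolerance=timedelta(minutes=5)):
    """Return coordinates of the trackpoint nearest the photo's timestamp."""
    nearest = min(track, key=lambda point: abs(point[0] - photo_time))
    if abs(nearest[0] - photo_time) <= tolerance:
        return nearest[1], nearest[2]
    return None  # the photo was taken outside the logged ride

print(geotag(datetime(2006, 9, 1, 10, 1, 30), track))
# (37.7999, -122.4501)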

A WiFi-enabled camera can ping nearby access points to approximate its current location using location data provided by the access point or by comparing the access point’s digital fingerprint against a mapped database such as Microsoft Virtual Earth.

Copyright data

A publisher may describe a photo’s copyright in plain text or by pointing to a URL with more information. If that URL points to a Creative Commons license, it’s pretty easy to parse the license terms.
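The license code and version are encoded right in the URL path, so even naive string parsing gets at the terms. A real indexer would more likely read the machine-readable metadata Creative Commons publishes, but this hypothetical parser shows the idea.

def parse_cc_url(url):
    # e.g. http://creativecommons.org/licenses/by-nc/2.0/
    parts = url.rstrip("/").split("/")
    code, version = parts[-2], parts[-1]
    terms = {
        "by": "attribution required",
        "nc": "noncommercial use only",
        "nd": "no derivative works",
        "sa": "share alike",
    }
    return version, [terms[part] for part in code.split("-")]

print(parse_cc_url("http://creativecommons.org/licenses/by-nc/2.0/"))
# ('2.0', ['attribution required', 'noncommercial use only'])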

Machine viewing

Text

Cafe menu

An indexer might take a look at the photo and try to analyze what it depicts. Pictured above is the drink menu from Stumptown Coffee Roasters in Portland, Oregon, with lots of text. If a machine could recognize the words “espresso” and “latte” in the picture, it could build a richer data set for this image. The same technology is useful for decoding image headers found in web pages and for testing CAPTCHA images designed to be read by humans, not machines.
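As a sketch, an open-source OCR engine such as Tesseract (via the pytesseract Python wrapper) can already attempt this. The file name below is hypothetical, and results depend heavily on photo quality.

from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed

menu = Image.open("stumptown-menu.jpg")  # hypothetical file name
text = pytesseract.image_to_string(menu).lower()
for keyword in ("espresso", "latte"):
    if keyword in text:
        print(f"found menu item: {keyword}")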

People, places, things

Alex B. giggles

Facial-recognition technology can identify the same subject across multiple photo captures by analyzing patterns across common facial attributes. The technology is used by security systems, such as comparing World Cup attendees against a list of known troublemakers. A photo publisher can install software on their desktop computer to analyze each photograph, looking for people, places, and things familiar to that person or to the software’s larger community. The software can then recognize a previously tagged person such as the boy pictured above, a landmark such as the Eiffel Tower already identified by other users of the software, or a Coke bottle present in a photo.
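For a concrete taste of this encode-and-compare workflow, here’s a sketch using the open-source face_recognition Python library; the file names are hypothetical.

import face_recognition

# Encode the face in a photo the user already labeled "Alex B."
labeled = face_recognition.load_image_file("alex-labeled.jpg")
known_encoding = face_recognition.face_encodings(labeled)[0]

# Compare against every face found in a new, unlabeled photo
unknown = face_recognition.load_image_file("new-photo.jpg")
for encoding in face_recognition.face_encodings(unknown):
    if face_recognition.compare_faces([known_encoding], encoding)[0]:
        print("This photo also shows Alex B.")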

Google acquired Neven Vision in August to boost its ability to extract information from image depictions. Riya is working on image recognition technology applied to image search.

Summary

A search engine has a variety of data available when trying to make sense of a particular image. The most reliable data comes from auto-configured machines, but humans can supplement and correct this data if they choose to involve themselves in the process. Advances in capture hardware and software will continue to add more valuable metadata surrounding the photo, allowing a search engine to better understand the image with less text from the publisher.

The biggest area for search advancement currently lies in image analysis for text, people, places, and things. National security budgets are currently funding advanced research in this area that will hopefully trickle down to the consumer sector to help us better identify our family photo collections without repetitive data input.