The current state of audio search

Online audio is definitely on an upswing, fueled by the iPod revolution, improved online playback, and broadband penetration. Audio search is keeping up with demand for new content, thanks in part to national security spending in the Cold War and beyond. In this post I will outline the current state of audio search and how machines make sense of the spoken word, progressing from easy to difficult.


First, let’s define the space. I’m interested in how a search engine might index content whose metadata was not professionally produced. The President’s weekly radio address comes with a full transcript. Music catalogs are available for purchase from Muze and others to provide structured data about Bob Dylan and what he’s saying. A voicemail message or a podcast might not be as thoroughly described.

Let’s take a look at audio files a search engine might discover during a web crawl and current methods of understanding the content.

Filetype identification

Audio files can usually be recognized by a handful of file extensions, each of which hints at the audio container or codec on the remote server.

.wav: The Waveform Audio File Format is a common form of uncompressed audio on Windows PCs.
.aiff: The Audio Interchange File Format is a common form of uncompressed audio on Apple computers.
.mp3: MPEG-1 Audio Layer 3 is a popular form of distribution for compressed audio files.
.wma: Windows Media Audio is a compressed format popular on Windows machines.
.asf: Advanced Systems Format is a container for streaming audio and video commonly used by Microsoft products.
.m4a: MPEG-4 audio files, most likely Advanced Audio Coding compressed audio created by Apple software.
.ra: RealAudio is a format by RealNetworks.
.ogg: Ogg Vorbis is an open source compression format.
.flac: The Free Lossless Audio Codec is a lossless compressed format used by audiophiles and for archival purposes.

A web search engine can look at all of the links in its index and identify possible audio files based on these file extensions without retrieving any file information from the host server. You can search Google for URLs containing “MP3” and referencing “Bob Dylan.” Audio files are not currently supported in Google’s file type operator. A bookmarking service can also expose bookmarked audio, for example through a system:media:audio tag.
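As a rough illustration of extension-based identification, here is a minimal Python sketch that checks a URL's path against a set of audio extensions without contacting the host server. The extension set, the function name looks_like_audio, and the example.com URLs are all assumptions made for this example, not part of any search engine's actual implementation.

```python
import os
from urllib.parse import urlparse

# Illustrative, non-exhaustive set of extensions for the formats listed above.
AUDIO_EXTENSIONS = {
    ".wav", ".aif", ".aiff", ".mp3", ".wma",
    ".asf", ".m4a", ".ra", ".ogg", ".flac",
}

def looks_like_audio(url):
    """Return True if the URL path ends in a known audio extension."""
    path = urlparse(url).path              # drops the query string and fragment
    extension = os.path.splitext(path)[1]  # e.g. ".MP3"
    return extension.lower() in AUDIO_EXTENSIONS

# Hypothetical URLs standing in for links already present in a crawl index.
urls = [
    "http://example.com/media/speech.MP3?session=42",
    "http://example.com/about.html",
    "http://example.com/shows/episode1.ogg",
]
audio_urls = [u for u in urls if looks_like_audio(u)]
print(audio_urls)
```

Parsing the URL first keeps query strings from hiding the extension, and lowercasing handles hosts that serve files named in upper case.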

HTML markup

Audio files found in the wild are often described and referenced from within HTML pages. Here’s an example of how an audio file might be described within a web page link:

<a href="speech.mp3"
 type="audio/mpeg"
 hreflang="en"
 title="A longer description of the target audio">
 A short description</a>

The href attribute points to the location of the audio file. The audio/mpeg type value provides a hint for user agents about the type of file on the other end of the link. The hreflang attribute communicates the base language of the linked content. The title attribute provides more information about the linked resource, and may be displayed as a tooltip in some browsers. The element value, “A short description,” is the linked text on the page.

Publishers are unlikely to supply more than the href attribute, which is the only one the link needs in order to function. The title attribute is semi-visible and therefore more likely to be filled in, but it is still uncommon. Audio can be identified by a MIME type such as audio/mpeg, but few sites provide the advisory type hint in their HTML markup. Collecting a file’s MIME type from the server requires “touching” the remote file, and the response will most likely carry the default values of popular hosting applications such as Apache or IIS, so a search engine is likely better off relying on a local list of mapped extensions and helper application behaviors.
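To make the link attributes concrete, here is a small Python sketch that collects audio links from a page using the standard library's html.parser. It prefers the publisher's type hint when present and otherwise falls back to a local extension mapping, which here is Python's mimetypes table standing in for a search engine's own list. The class name AudioLinkParser and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser
import mimetypes

class AudioLinkParser(HTMLParser):
    """Collect <a> elements that appear to point at audio files."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # attributes of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs)
            self._current["text"] = ""

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "a" or self._current is None:
            return
        link, self._current = self._current, None
        link["text"] = " ".join(link["text"].split())  # collapse whitespace
        # Prefer the advisory type attribute; otherwise guess from the extension.
        guessed, _ = mimetypes.guess_type(link.get("href") or "")
        link["mime"] = link.get("type") or guessed
        if link["mime"] and link["mime"].startswith("audio/"):
            self.links.append(link)

# Hypothetical page fragment: one audio link without a type hint, one HTML link.
SAMPLE_HTML = """
<a href="speech.mp3" title="A longer description of the target audio">
 A short description</a>
<a href="index.html">Home</a>
"""

parser = AudioLinkParser()
parser.feed(SAMPLE_HTML)
for link in parser.links:
    print(link["href"], link["mime"], link.get("title"), link["text"])
```

Nothing here touches the remote file, so the result is only as good as the markup and the local mapping.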

Syndication formats

It is possible for a publisher to include more information about a file using a syndication feed combined with a specialized namespace such as the iTunes podcasting spec or Yahoo! Media RSS. A search engine may parse these feeds to gather more information about a particular audio item such as title, description, and length, which often describes the audio more closely than a link on a web page does.
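Feed parsing of this kind can be sketched with Python's xml.etree.ElementTree. The feed below is a made-up RSS 2.0 document using the iTunes podcast namespace, and the function name audio_items and the fields it returns are assumptions for this example rather than anything a particular search engine is known to do.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by the iTunes podcasting extensions to RSS.
ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# Hypothetical feed with a single audio enclosure.
FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <description>A made-up episode used for illustration.</description>
      <enclosure url="http://example.com/ep1.mp3" length="1234567" type="audio/mpeg"/>
      <itunes:duration>12:34</itunes:duration>
    </item>
  </channel>
</rss>"""

def audio_items(feed_xml):
    """Return a list of dicts describing each item with an audio enclosure."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue
        if not enclosure.get("type", "").startswith("audio/"):
            continue
        items.append({
            "title": item.findtext("title"),
            "description": item.findtext("description"),
            "url": enclosure.get("url"),
            "bytes": int(enclosure.get("length", "0")),
            "duration": item.findtext("{%s}duration" % ITUNES),
        })
    return items

items = audio_items(FEED)
for entry in items:
    print(entry["title"], entry["url"], entry["duration"])
```

The enclosure element carries the file location, size, and MIME type, while the namespaced duration element supplies the running time.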

Hosted audio

Large search engines such as Google, Yahoo!, and Microsoft have not created the same sort of hosted audio community for user-generated content as is present in images or video. Sites such as the Internet Archive host audio such as a Grateful Dead concert complete with data such as artist, title, performance date, equipment used, and audio editors.

Apple’s GarageBand software is one example of integrated recording, compression, descriptive markup, and remote hosting.

Metadata containers

Once the search engine reaches out and “touches” the audio file, it can discover more descriptive information embedded within. An ID3 tag describes the track title, artist, album, genre, and other information provided by the publisher. The metadata container might hold additional information such as album art, lyrics, or descriptions of particular segments of the audio file, known as “chapters.” An audio metadata parser reads each frame it knows how to interpret and extracts the associated descriptive data.

ID3 tags often occur at the beginning of the file to assist streaming applications, so a metadata indexer might not grab the entire audio file, opting instead to look for data only in those first bytes.
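Reading a tag from the first bytes might look something like the following Python sketch. It handles only the text frames of ID3v2.3 and ID3v2.4 tags and ignores extended headers, unsynchronisation, and the older ID3v2.2 and ID3v1 layouts, so treat it as an outline under those assumptions rather than a full parser. The helper names and the synthetic sample tag are invented for the example.

```python
import struct

def int_to_synchsafe(n):
    """Encode an integer as 4 synchsafe bytes (7 useful bits per byte)."""
    return bytes([(n >> 21) & 0x7F, (n >> 14) & 0x7F, (n >> 7) & 0x7F, n & 0x7F])

def synchsafe_to_int(b):
    """Decode a 4-byte synchsafe integer."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]

TEXT_ENCODINGS = {0: "latin-1", 1: "utf-16", 2: "utf-16-be", 3: "utf-8"}

def read_id3v2_text_frames(data):
    """Return {frame_id: text} for text frames found in the leading ID3v2 tag."""
    if len(data) < 10 or data[:3] != b"ID3":
        return {}
    major = data[3]
    if major not in (3, 4):          # this sketch skips ID3v2.2 and unknown versions
        return {}
    end = min(10 + synchsafe_to_int(data[6:10]), len(data))
    frames = {}
    pos = 10
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":   # padding marks the end of the frames
            break
        raw_size = data[pos + 4:pos + 8]
        # v2.4 frame sizes are synchsafe; v2.3 sizes are plain big-endian.
        size = synchsafe_to_int(raw_size) if major == 4 else struct.unpack(">I", raw_size)[0]
        body = data[pos + 10:pos + 10 + size]
        if frame_id.startswith(b"T") and frame_id != b"TXXX" and body:
            encoding = TEXT_ENCODINGS.get(body[0], "latin-1")
            text = body[1:].decode(encoding, errors="replace").rstrip("\x00")
            frames[frame_id.decode("ascii")] = text
        pos += 10 + size
    return frames

def make_text_frame(frame_id, text):
    """Build an ID3v2.3 text frame with Latin-1 text (for the sample only)."""
    body = b"\x00" + text.encode("latin-1")
    return frame_id + struct.pack(">I", len(body)) + b"\x00\x00" + body

# Synthetic tag followed by two placeholder bytes standing in for audio data.
frames_blob = make_text_frame(b"TIT2", "Example Title") + make_text_frame(b"TPE1", "Example Artist")
sample = b"ID3\x03\x00\x00" + int_to_synchsafe(len(frames_blob)) + frames_blob + b"\xff\xfb"

tags = read_id3v2_text_frames(sample)
print(tags)
```

Because the tag header states its own length, an indexer can request just enough leading bytes to cover the tag and stop there.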

Parsing spoken word

Speech recognition has enjoyed rapid improvement over the last decade, thanks in part to large national security budgets for indexing spoken words captured through ECHELON and other methods. Similar technology is now being applied to medical and legal transcription and is creating more searchable content for each podcast.

AVOKE ATX Speech processing

Speech-to-text software such as AVOKE from BBN Technologies is used to create transcripts of phone calls to call centers, the nightly news, and government surveillance. The system applies known per-language vocabularies over a continuous-density hidden Markov model to analyze speech phonemes in various contexts. The system uses multiple passes to determine context and associative clustering of words and phrases.

The consumer search engine PodZinger uses spoken word analysis to find a search term and jump to the marker within the file where the term occurs. You can search for audio containing mentions of the Athletics and Tigers and view your results in the context of the file, with direct links to that segment of the audio program.
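The jump-to-marker behavior can be approximated with an inverted index over a time-stamped transcript. The Python sketch below assumes a speech recognizer has already produced (start time, text) segments; the transcript, the function names, and the whole-segment matching rule are assumptions for illustration and say nothing about how PodZinger itself is built.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(transcript):
    """Map each word to the start times (in seconds) of segments containing it."""
    index = defaultdict(list)
    for start_seconds, text in transcript:
        for word in set(WORD.findall(text.lower())):
            index[word].append(start_seconds)
    return index

def find_markers(index, query):
    """Return sorted start times of segments that contain every query term."""
    terms = WORD.findall(query.lower())
    if not terms:
        return []
    hits = set(index.get(terms[0], []))
    for term in terms[1:]:
        hits &= set(index.get(term, []))
    return sorted(hits)

def format_offset(seconds):
    """Render an offset as minutes:seconds for display next to a result."""
    minutes, secs = divmod(int(seconds), 60)
    return "%d:%02d" % (minutes, secs)

# Hypothetical recognizer output for a short sports podcast.
transcript = [
    (0.0, "Welcome to the show"),
    (12.5, "The Athletics face the Tigers tonight"),
    (47.0, "Tigers pitching has been strong"),
]

index = build_index(transcript)
for offset in find_markers(index, "Athletics Tigers"):
    print(format_offset(offset))
```

Each returned offset could be turned into a direct link that starts playback at that point in the program.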


Online audio content will only continue to get bigger, as more content makes its way online and into the ears of consumers on a PC, iPod, or other listening device. The maturity of online audio and its current business feasibility should consolidate audio format offerings into the formats understood by the dominant players in the desktop, portable, and home theater markets.

I expect even more speech-to-text work in the future as CPUs, memory, and disk space continue to get cheaper. Perhaps we might even see client-side analysis of content similar to analysis work being conducted on images. Windows Media Player and iTunes are just two examples of popular media players that connect to the Internet to retrieve more information about your media files, from album art to recorded year. In the future such applications might also query data services such as MusicBrainz or the Music Genome Project to apply more data to each file based on a purchased database, collective intelligence, or expert analysis.

Creating new sources of audio content is becoming easier. The popularity of VoIP will place new value on microphones connected to our PCs, gaming systems, and other connected electronics devices. Voice will become an integrated feature, allowing you to easily save a compressed audio file of a recent planning call or your Halo trash-talking session.

I think many search engines have looked past audio search due to the litigious nature of the RIAA and others evidenced by last year’s MGM vs. Grokster Supreme Court ruling. Google’s recent $1.65 billion purchase of YouTube is perhaps a sign that search technology will continue to advance, challenging any emergent legal roadblocks along the way.

As with most search sectors, audio search is still in very early stages. Expect known vocabularies and relationship mappings to increase over time, providing more insight not only into each word, but also speaker identification, tone, and possibly even relationships between events such as a power outage’s correlation to customer service calls. We’ll keep talking and publishing and search will attempt to keep up with our rate of speech, accents, and methods of describing our creations.