Google processes over 20 petabytes of data per day

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. In September 2007 the average MapReduce job ran across approximately 400 machines, and the month's jobs consumed approximately 11,000 machine-years of computation. These are just some of the facts about the search giant's computational infrastructure revealed in an ACM paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat.
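
The programming model itself is simple: a map function turns each input record into intermediate key/value pairs, and a reduce function merges all values that share a key. The paper's running example is counting word occurrences; below is a minimal single-machine sketch of that idea in Python (the function names and driver are illustrative only; Google's actual implementation is a C++ library that shards these functions across thousands of machines):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(doc_name, text):
        # Map: emit (word, 1) for every word in the document.
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: sum every count emitted for the same word.
        yield (word, sum(counts))

    def run_mapreduce(inputs, mapper, reducer):
        # Map phase: apply the mapper to every input record.
        intermediate = []
        for key, value in inputs:
            intermediate.extend(mapper(key, value))
        # Shuffle phase: group intermediate pairs by key.
        intermediate.sort(key=itemgetter(0))
        # Reduce phase: run the reducer once per distinct key.
        output = []
        for key, group in groupby(intermediate, key=itemgetter(0)):
            output.extend(reducer(key, (v for _, v in group)))
        return output

    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]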

Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google’s continued market dominance. Competing search storage and processing systems at Microsoft (Dryad) and Yahoo! (Hadoop) are still playing catch-up to Google’s suite of GFS, MapReduce, and BigTable.

MapReduce statistics for different months

                                   Aug. 2004    Mar. 2006    Sep. 2007
  Number of jobs (1,000s)                 29          171        2,217
  Avg. completion time (secs)            634          874          395
  Machine years used                     217        2,002       11,081
  Map input data (TB)                  3,288       52,254      403,152
  Map output data (TB)                   758        6,743       34,774
  Reduce output data (TB)                193        2,970       14,018
  Avg. machines per job                  157          268          394
  Unique map implementations             395        1,958        4,083
  Unique reduce implementations          269        1,208        2,418

Google processes its data on a standard machine cluster node consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link. This type of machine costs approximately $2,400 through providers such as Penguin Computing or Dell, or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).

The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.
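
That round number is simple multiplication of the figures above; a quick back-of-envelope in Python (my own arithmetic, not a quote from the paper):

    machines_per_job = 394    # avg. machines per job, Sep. 2007 (table above)
    cost_per_machine = 2_400  # approximate per-node price quoted above, USD
    print(f"${machines_per_job * cost_per_machine:,}")
    # $945,600 -- call it a $1 million cluster before bandwidth, power, and people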

Summary

The January 2008 MapReduce paper provides new insight into the hardware and software Google uses to crunch tens of petabytes of data per day. Google converted its search indexing systems to MapReduce in 2003, and the indexing system now takes in over 20 terabytes of raw web data. It’s fascinating large-scale processing data that makes your head spin and makes you appreciate the years of distributed computing fine-tuning applied to today’s largest problems.

12 comments

Commentary on "Google processes over 20 petabytes of data per day":

  1. Kevin Burton wrote:

    I have the paper hosted on my blog since the ACM wasn’t cool enough to provide a copy for free. I assume Jeff Dean, Sanjay Ghemawat, and Google hold the copyright on this…

    • Fred Oliveira wrote:

      Kevin: sweet stuff – I looked around and couldn’t find it either – thanks!

    • G wrote:

      Kevin Burton – thx for hosting the paper!

  2. Startled wrote:

    When I began buying tightly-targeted ads on Google recently, I was startled to discover that Google clutters its own index with the content of AdSense ads, and includes many pages which contain no other content whatever.

    One wonders how much this adds to their workload…

  3. Eric wrote:

    Great article. Your site is asking me if I will let it access my Google Gears info… I’m saying no for now since I can’t find a reference to what it’s asking for. Can you explain what is going on before I say yes?

    • Niall Kennedy wrote:

      Eric, My site utilizes the Google Gears LocalServer module API if it detects you have Gears installed. I am caching static resources on your machine for offline access and quick local retrieval.

    • Eric wrote:

      Thanks for the explanation… ok, I’ll do it :)

  4. Sam wrote:

    Does this pass the smell test? 20 PB per day on 400 machines? If my calculations are correct, that would be 621.38 MB per second. That’s faster than you can read from disks or networks with the commodity hardware they describe.

    • Niall Kennedy wrote:

      Sam, the paper references over 20 petabytes of data crunched per day and over 2.2 million MapReduce jobs executed in September 2007. The average job runs on about 400 machines, but Google has far more than 400 machines in its total fleet. Jeff Dean ran a test job on 1,800 machines over a weekend for an example cited in the paper.
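
      A rough back-of-envelope with the paper’s September 2007 figures (my own arithmetic; the fleet size is an estimate derived from the machine-years in the table above, since one month is roughly 1/12 of a year):

          machine_years = 11_081            # machine years consumed in Sep. 2007
          avg_machines = machine_years * 12 # ~133,000 machines busy on average
          bytes_per_day = 20 * 10**15       # 20 PB/day across the whole fleet
          mb_per_sec = bytes_per_day / avg_machines / 86_400 / 10**6
          print(f"{avg_machines:,} machines, {mb_per_sec:.2f} MB/s each")
          # 132,972 machines, 1.74 MB/s each -- well within one IDE disk's bandwidth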

  5. Manele wrote:

    Yeah, thanks for the paper!

  6. Dave Brondsema wrote:

    From the paper, the table of statistics covers only a subset of the MapReduce jobs run at Google.