Google provides a full suite of services for the entry-level blog spammer. There are plenty of legitimate uses for all of these Google services, but Google's market-leading position in search creates a spam ecosystem that inflates corporate revenues, index size, and user data. Google's blog hosting service, Blog*Spot, received a lot of attention this week as blogosphere neighbors threw up their arms in protest against the host, which is like the seedy motel at the edge of town that rents by the hour. It's cheap and inviting to those who know no better, but those in the know don't want anything to do with it.
I will describe the Google elements that contribute to a spam farm in an attempt to create more understanding about how your content ends up where you may not want it.
The host
Blogger's Blog*Spot hosting is a quick and easy way to create new blogs. It's free, you can post via e-mail, and many people think a Blog*Spot blog is the quickest way into Google's search index since the blog hosting servers might be only a few rows away from the Google crawler and of course Google knows how to find all of the content inside its own system.
The image above is a completely automated public Turing test to tell computers and humans apart, commonly referred to by its acronym: CAPTCHA. A CAPTCHA is supposed to be easy for a human to decipher, but difficult for computers using image recognition software.
Blogger requires users to solve the above CAPTCHA before creating a new blog. Yet the system is bypassed daily and thousands of new blogs are created.
A simple CAPTCHA can be broken using optical character recognition, the same technology that scans a printed page and converts the words to plain text.
A common way to bypass a CAPTCHA system is to offer humans a reward for successfully entering the scrambled word. Some sites trade free porn for CAPTCHA solutions; others hire people in low-income areas of the world to sit in front of a computer and solve CAPTCHAs all day.
The content
Google provides a lot of free content for someone to repurpose on their newly created Blog*Spot blog. Search Google's web, news, or blog results for the keyword of your choice and you will receive a list of content sources Google has determined are most relevant to the query. Copying from the top of these results is an easy way for spammers to obtain content already deemed relevant by Google for inclusion in its own pages.
You will often see spam blogs composed of a group of results including a title, link, and excerpt for targeted keywords. These pages are meant to attract search referrals for advertising or create more pages linking to a site the spammer would like to promote.
Google blog search is the newest Google search service with relevant content available for scraping. Many of the cries from bloggers over the past week were most likely a result of a spammer using a script to retrieve the top search results on Google's blog search ranked by relevance for inclusion on a newly created Blog*Spot blog.
The payout
Google AdWords places text advertisements across the web related to the textual content of a page. Every time someone clicks on a Google text ad for "refinance" it costs the advertiser over $35 and makes the site owner some money. "Vioxx" pays about $16.50 a click, "poker" pays about $2.50 a click, and "camcorder" pays about $2.60 a click on Google's advertising network. The newly created blog can make money from these advertisements based on how many people are searching for its targeted keyword, the likelihood that a visitor clicks on an ad, and the payout for such keywords.
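To make the arithmetic concrete, here is a rough sketch of how those three factors multiply together. The per-click prices are the figures quoted above ("refinance" is quoted as "over $35", so 35.00 is a floor). Everything else is an assumption for illustration only: the visitor count, the click-through rate, the publisher's share of each click, and the function name itself. Google does not describe its payout this way, so treat this as a back-of-the-envelope estimate rather than the actual formula.

```python
# Back-of-the-envelope estimate of ad revenue for a single page.
# This is an illustrative sketch, not Google's actual payout formula.

# Advertiser cost per click, as quoted in the post (US dollars).
KEYWORD_COST_PER_CLICK = {
    "refinance": 35.00,  # quoted as "over $35", so this is a lower bound
    "Vioxx": 16.50,
    "camcorder": 2.60,
    "poker": 2.50,
}

# Hypothetical inputs, chosen only to show the arithmetic.
ASSUMED_DAILY_VISITORS = 200      # search referrals per day
ASSUMED_CLICK_THROUGH_RATE = 0.02  # fraction of visitors who click an ad
ASSUMED_PUBLISHER_SHARE = 0.5      # fraction of the click price paid to the site owner


def estimated_daily_revenue(daily_visitors, click_through_rate,
                            cost_per_click, publisher_share):
    """Return expected daily revenue in dollars for the site owner.

    revenue = visitors * click-through rate * cost per click * publisher share
    """
    if daily_visitors < 0 or cost_per_click < 0:
        raise ValueError("visitors and cost per click must be non-negative")
    if not 0.0 <= click_through_rate <= 1.0:
        raise ValueError("click_through_rate must be between 0 and 1")
    if not 0.0 <= publisher_share <= 1.0:
        raise ValueError("publisher_share must be between 0 and 1")

    expected_clicks = daily_visitors * click_through_rate
    return expected_clicks * cost_per_click * publisher_share


if __name__ == "__main__":
    for keyword, cost_per_click in KEYWORD_COST_PER_CLICK.items():
        revenue = estimated_daily_revenue(
            ASSUMED_DAILY_VISITORS,
            ASSUMED_CLICK_THROUGH_RATE,
            cost_per_click,
            ASSUMED_PUBLISHER_SHARE,
        )
        print(f"{keyword}: ${revenue:.2f} per day")
```

Under these assumed inputs the keyword price dominates the result, which is one plausible reason a spammer would target high-priced terms and spread them across many automatically created blogs.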
Automation
The above process becomes even easier through the use of automated tools for blog creation, content retrieval, and advertising placement. More expensive tools include pre-configured Blog*Spot blogs for a quick start.
Conclusion
Free web hosts have hidden costs. You don't have friendly neighbors and it's possible that search engines will not want to help others discover your area of the web.
Google has taken more steps to protect its e-mail service, Gmail, from spammers than it has taken to keep them away from Blog*Spot. There is a lot more that Google can do to reduce spam, reduce click fraud, and improve its Blogger service, but it might involve losing some advertising revenue in the short term. I think no company in the business of content generation, indexing, or payment can afford to ignore the problem.

21 Comments
Commentary on "Google spam suite primer":
Jeremy Zawodny on October 24, 2005 at 8:18 AM wrote: #
Automation? See also: http://jeremy.zawodny.com/blog/archives/005549.html
:-)
Greg Linden on October 24, 2005 at 8:23 AM wrote: #
Hi, Niall. I thought you might be interested in knowing that there are some good algorithms for cracking captchas these days too, no humans necessary.
For the details, see captcha.net and the papers on that site.
Niall Kennedy on October 24, 2005 at 8:57 AM wrote: #
Niall Kennedy on October 24, 2005 at 11:03 AM wrote: #
TechCrunch on October 24, 2005 at 11:42 AM wrote: #
Ian Kennedy on October 24, 2005 at 11:42 AM wrote: #
Niall Kennedy on October 24, 2005 at 12:00 PM wrote: #
Jason,
Any examples I might provide of human-powered CAPTCHA solvers would be as a result of my employment at Technorati and I have purposely chosen to not share that information.
Think of it like a doctor telling you what he's seen in the operating room without giving you the names of the patients.
Niall, I get what you're saying, but I'm not too sure I buy the analogy. As a physician, I have to consider both the moral obligation not to disclose specific patients' illnesses and the very real legal obligations of the HIPAA laws. Of course, this doesn't mean that I can't publish what I have discovered -- and publish the (very real, very verifiable) information that goes along with it (e.g., blood counts, CT scans, pathology specimen images) -- so that others both know about the findings and have reason to believe that they're real; that's the only way that medicine progresses.
What's the similar restriction in the web world that prevents you from providing proof of the exploits you've described? Similarly, what has prevented anyone else -- literally, anyone at all -- from pointing to sites that use free porn or low-cost labor to contravene CAPTCHAs? The entirety of the glaring lack of evidence makes the whole story a little harder to take at face value. Again, I'm not saying that it doesn't happen, just that I've yet to see a single person demonstrate it to happen, an important distinction.
(To flip your own analogy: if an oncologist tells a patient that he's seen times when widely metastatic colon cancer just goes away without any treatment, that patient might believe the doctor, but more realistically would probably want some verifiable proof of the statement before accepting it.)
pwb on October 24, 2005 at 12:39 PM wrote: #
Pardon my ignorance, but could you explain what the purpose of blog spamming is? Is it to lift PageRank? Make money off AdSense?
If the latter, could the problem be attacked on the back-end by making sure Google only pays legitimate enterprises?
phil jones on October 24, 2005 at 1:46 PM wrote: #
So what's the advice to genuine bloggers who've been using blogspot for years?
Where should we go to leave the "sleazy" hotel? Or have all the rich bloggers now decided that the non-paying masses should no longer be allowed to have free bloghosting?
Niall Kennedy on October 24, 2005 at 2:38 PM wrote: #
PWB,
Two possible reasons a spam blog might be created are to increase the amount of targeted advertising inventory available or to create more links promoting a site.
Yes, Google and others can attack the problem at their checkbooks if they can successfully identify the bad actors.
Niall Kennedy on October 24, 2005 at 2:45 PM wrote: #
Phil,
If you would like to continue using Blogger's Blog*Spot hosting, I recommend you demand more of your host. Send in some feedback and let the team at Blogger know your concern that your blog may be cut off or deeply discounted in search results because of the rising problem of spam blogs on the system.
Create a neighborhood association to address the problem, or move to another free blog host such as a TypePad partner or WordPress.com. You might also have unused free hosting through a membership organization or an Internet service provider that you could consider.
One data point. I'm starting to see WordPress-based spam blogs using all the same techniques as above. I suspect that the barrier to entry has dropped to nothing and a WordPress farm is probably not that much harder to set up than a Blog*Spot farm.
I also see a conflict of interest here for Google. If the intention is to receive AdSense publisher revenue, it's also inflating Google's revenues. Or is that too cynical for you all? In their new post-IPO worldview are they balancing the effort to stop pollution of their search index against increased income from their advertising business? Nah, they wouldn't do that. After all, they do no evil, right?
Hal on October 24, 2005 at 3:36 PM wrote: #
aaron wall on October 24, 2005 at 5:47 PM wrote: #
AdSense spam does not hurt all parties the same.
If the content looks ugly enough the ads are the obvious thing to click on.
James Kew on October 24, 2005 at 7:34 PM wrote: #
Blogger themselves have noted human-powered CAPTCHA solving in the recent Blogger Buzz On Spam posting:
Tim Converse on October 24, 2005 at 11:31 PM wrote: #
Hi Niall ---
I believe the claim that there are offshore human captcha-solvers, working for very low wages. Like you I have independent evidence of that claim that I do not want to share.
But I do not believe the claim that there are systems that reward captcha-solvers with porn rather than with money. I was incautious enough to repeat this idea at a workshop on CAPTCHAs (that's me in the middle of the photo), and was immediately challenged --- no one's been able to substantiate any such report.
I would be happy to give any of the following to the first person that can persuade me (even while swearing me to secrecy) that a captchas-for-porn scheme has actually been implemented:
ObPlug: captchaservice.org
Robert on August 24, 2006 at 11:50 AM wrote: #