4 unusual tricks to not give SE-robots a chance to waste all your traffic
It’s no secret that in recent months the three top search engines (SE) have been fighting aggressively for the title of the SE with the most indexed sites. As a result, their crawlers have dramatically increased their activity. Yahoo makes thousands of hits per day on the same pages while waiting for new links. MSN occasionally hammers servers hard enough to put them out of action. More importantly, every hit uses up traffic that you pay for.
I’m not going to give you banal traffic-saving tips (like “protect images from hotlinking” or “ban bad robots”). Those won’t solve the problem. It is more serious than it seems: the three SE waste more traffic than ordinary surfers or other crawlers do, and you cannot ban them. So what can you do?
1. Limit excessive activity of the Yahoo and MSN search robots using the Crawl-delay parameter. In response to numerous complaints, these two SE added a robots.txt parameter that only their spiders understand ([1], [2]). Its value is set in seconds and specifies the delay between page requests. For example, say you want to set a 10-second delay for the Yahoo crawler and a 2-minute delay for MSN. Write in your robots.txt:
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 120
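To get a feel for what these delays mean, here is a rough Python sketch that reads the rules with the standard urllib.robotparser module and works out the most requests a crawler could make per day if it honours the delay. The function name max_requests_per_day is made up for this example, and Python’s parser is only a stand-in: the real Yahoo and MSN spiders may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 120
"""

SECONDS_PER_DAY = 24 * 60 * 60


def max_requests_per_day(robots_txt, user_agent):
    """Upper bound on daily requests for a crawler that obeys Crawl-delay.

    Returns None when no Crawl-delay applies to the given user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if not delay:
        return None
    return SECONDS_PER_DAY // delay


for agent in ("Slurp", "msnbot", "Googlebot"):
    print(agent, max_requests_per_day(ROBOTS_TXT, agent))
```

With a 10-second delay a polite crawler tops out at 8640 requests a day, and with 120 seconds at 720. Googlebot gets no limit from this file, since no rule names it.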
2. Ban the Google, MSN and Yahoo image crawlers if you don’t need image search. Most web surfers use image search only for the pictures and rarely visit your site. Do you need this? Do you want crawlers wasting your traffic by downloading your images? I guess not. You can forbid it in robots.txt.
To ban the Google Image Search robot, write:
User-agent: Googlebot-Image
Disallow: /
To ban the Yahoo! Image Search robot:
User-agent: Yahoo-MMCrawler
Disallow: /
To ban the MSN Image Search robot:
User-agent: msnbot-MM
Disallow: /
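Before uploading rules like these, it may be worth checking that they ban only the image robots and not the main crawlers. A small sketch using Python’s urllib.robotparser; the helper name allowed and the sample paths are made up, and Python’s matching of robot names is only an approximation of what each SE really does.

```python
from urllib.robotparser import RobotFileParser

# The three image-crawler bans from above, combined into one file.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: msnbot-MM
Disallow: /
"""


def allowed(robots_txt, user_agent, path):
    """True if the parsed rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Image robots should be refused, ordinary crawlers should still get in.
for agent in ("Googlebot-Image", "Yahoo-MMCrawler", "msnbot-MM"):
    print(agent, allowed(ROBOTS_TXT, agent, "/images/photo.jpg"))
for agent in ("Googlebot", "Slurp", "msnbot"):
    print(agent, allowed(ROBOTS_TXT, agent, "/index.html"))
```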
Updated: No doubt, it’s up to you to decide whether you need image search or not. In my opinion you need it in just one case: if your images are creative enough to make a surfer go to your site, for example beautiful art or a very detailed diagram. Anyway, as Jonathan advises, you can check your logs first to see where visitors come from. Image crawlers don’t visit very often: Google image search is always quite out of date, with lots of broken links (which isn’t good from the user’s point of view!).
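One way to check your logs is to count which sites your visitors arrive from. A rough Python sketch, assuming your server writes Apache’s combined log format; the sample lines and the function name referrer_hosts are invented for illustration, so feed it your own access log instead.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Combined log format: ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def referrer_hosts(lines):
    """Count hits per referring hostname, skipping lines with no referrer."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # not in the expected format
        host = urlparse(match.group("referer")).hostname
        if host:
            counts[host] += 1
    return counts


# Made-up sample lines; in practice use open("access.log") or similar.
SAMPLE_LOG = [
    '10.0.0.1 - - [03/Sep/2006:08:16:00 +0000] "GET /art/poster.html HTTP/1.1" '
    '200 5120 "http://images.google.com/imgres?imgurl=example" "Mozilla/5.0"',
    '10.0.0.2 - - [03/Sep/2006:08:17:00 +0000] "GET /index.html HTTP/1.1" '
    '200 2048 "http://www.google.com/search?q=example" "Mozilla/5.0"',
    '10.0.0.3 - - [03/Sep/2006:08:18:00 +0000] "GET /art/poster.html HTTP/1.1" '
    '200 5120 "http://images.google.com/imgres?imgurl=other" "Mozilla/5.0"',
    '10.0.0.4 - - [03/Sep/2006:08:19:00 +0000] "GET /index.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0"',
]

for host, hits in referrer_hosts(SAMPLE_LOG).most_common():
    print(host, hits)
```

If image search hosts show up near the top of your own list, banning the image crawlers is probably a bad trade.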
3. Enable caching support at the HTTP protocol level (HTTP 304 responses). After downloading a page once, the user’s browser or an SE crawler won’t download it again until the page changes! The server just answers the repeat request with a short 304 Not Modified instead of the whole page. Unfortunately, developers of website engines don’t care much about saving your traffic.
I recommend that WordPress users install the great WP-Cache plugin. It not only supports this kind of caching but also has other useful options.
MovableType users can enable HTTP caching support by adding “$mt->conditional = true;” to the mtview.php file. Check the documentation for details.
To those who use other engines, I advise googling for a way to enable HTTP caching support with the keywords “http 304”, “304 response” + the name of your engine. If that fails, you can still implement it yourself or ask a programmer. In that case, start learning the caching mechanism from section 13 of RFC 2616.
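If you end up writing it yourself, the core decision is small. Here is a minimal Python sketch of it, not taken from any particular engine: the function name conditional_response, the SHA-1 based ETag and the sample page are all my own assumptions, and it skips details such as weak ETags. It sends validators with every response and answers 304 with an empty body when the client’s If-None-Match or If-Modified-Since header shows its copy is still current.

```python
import hashlib
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def conditional_response(request_headers, body, mtime):
    """Return (status, headers, body) for a GET, honouring cache validators.

    request_headers: dict of request header names to values.
    body: the full page as bytes. mtime: last change as a Unix timestamp.
    """
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    headers = {"ETag": etag, "Last-Modified": formatdate(mtime, usegmt=True)}

    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    not_modified = False
    if if_none_match is not None:
        # When present, If-None-Match is checked instead of the date.
        tags = [tag.strip() for tag in if_none_match.split(",")]
        not_modified = "*" in tags or etag in tags
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            not_modified = int(mtime) <= since.timestamp()
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: send the full page

    if not_modified:
        return 304, headers, b""  # no body, so almost no traffic
    return 200, headers, body


PAGE = b"<html><body>Hello</body></html>"  # hypothetical page
PAGE_MTIME = 1167609600  # 2007-01-01 00:00:00 UTC, an arbitrary example

status, headers, _ = conditional_response({}, PAGE, PAGE_MTIME)
repeat_status, _, repeat_body = conditional_response(
    {"If-None-Match": headers["ETag"]}, PAGE, PAGE_MTIME
)
print(status, repeat_status, len(repeat_body))
```

The first request gets the full page with status 200. The repeat request, which carries the ETag back, gets a 304 with an empty body.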
Updated: 4. Limit excessive activity of Googlebot using Google’s Webmaster Central panel. Hooray! Now you can limit the activity of Google’s robot, too. Log in (or sign up, if you haven’t yet) to Google Webmaster Tools (formerly Sitemaps), then add and verify all of your sites. Choose the site whose Googlebot activity you want to limit, click ‘Crawl rate’ in the left sidebar, then choose the ‘Slower’ crawl speed at the bottom of the page. After that, you and Google can tune the crawl speed more precisely.
I hope these tips help you save money and traffic, and that we’ll never see the message “account has been suspended due to overuse of bandwidth limit” instead of your site!
The Web Design Blog said,
September 2, 2006 @ 9:28 pm
[...] If bandwidth is an issue for you or your clients, this technique will stop search engine spiders from over-crawling your website, and more importantly, stop the spiders from crawling and indexing your images, because images tend to use more bandwidth, and let’s face it, no-one really views your site when using image search. Big waste of bandwidth. « Turning visitors into users [...]
Jonathan said,
September 3, 2006 @ 8:16 am
Some good advice. However, I find people often visit my site through an image search, so I wouldn’t want to disable that. I suggest people check their logs first to see where visitors come from. The image crawlers don’t visit too often: Google image search is always quite out of date with lots of broken links (which isn’t good from the user’s point of view!)
Easy Webbers » Blog Archive » Speedlinking said,
September 4, 2006 @ 3:03 pm
[...] - An ajax enhancement for wordpress blogs. - 3 unusual tricks to not give se-robots a chance to waste all your traffic. - Generate a screenshot of your website in about 5 seconds, for free. - 10 things businesses should know before building a website. - Pageviews are obsolete? - If you must have a shoutbox, this is the one to use. - And finally some usability guidelines. [...]
Maurice said,
September 5, 2006 @ 12:04 pm
Yeah
Right, I want to stop search engines spidering my site - if they come back often, it’s because they think your site is worth spidering deeper.
Get a real host if you’re serious about the web, or leave it to the pros.
samlowry said,
September 5, 2006 @ 3:17 pm
>if they come back often, it’s because they think your site is worth spidering deeper.
You are absolutely wrong. I mean a huge number of requests received in a short time. Yahoo’s or MSN’s crazy robots can make several thousand hits per day on a small, well-indexed site. I experienced a real DDoS attack on one of my servers from Yahoo Slurp before I set Crawl-delay: 10 for all sites on that server.
>get a real host if you’re serious about the web
And I am very serious.
I have several
I made update of tricks with saving traffic and SE-robots said,
April 20, 2007 @ 7:59 am
[...] April 20, 2007 at 7:59 am · Tags: update tutorial For a long time I was going to update my tutorial ‘3 unusual tricks to not give SE-robots a chance to waste all your traffic’, and I’m finally done with it. Now it’s ‘4 unusual tricks to not give SE-robots a chance to waste all your traffic‘. [...]
John H. Gohde said,
June 8, 2008 @ 11:12 pm
I tend to agree that excessive hits from search engines are a waste of bandwidth. If half of your hits are coming from search engines then your site definitely has a problem.
I think that most sites really don’t want to have their images indexed. It is an issue of copyright violation. But they really have made no effort to prevent it.
In the past, I have experienced excessive indexing from Yahoo. With Google, it will index the same web pages over and over again while totally ignoring new content.
L. Stetz said,
November 2, 2008 @ 10:12 am
I’ve been hit by a bot attack from Yahoo. Over 30 hours this week and last week too. They’ve been getting more aggressive for weeks.
My question: shouldn’t I limit the Yahoo crawler (crawl.yahoo.net)? I mean, what’s with this “Slurp”?
Also MSN is doing it too with their inktomisearch.com search bot.
They all just send a different bot every time.
I’ve got lots of images and want them in the searches. Of course, as you say, Google doesn’t update, and a lot of the images they’ve plucked from my pages don’t match search queries.
I just wonder about the NAMES to limit the bots. Can you give any suggestions?