Wednesday, February 28, 2007

The Robots Exclusion Protocol: What is it, why you want it and how to do it

As a site owner, you have some control over how search engines access and index your website. You can exclude pages from Google's crawler using a robots.txt file. A robots.txt file lists the parts of your site that you don’t want search engines to access. A typical robots.txt file that lets the spider crawl your entire site (the empty Disallow line blocks nothing) looks like this:

User-agent: *
Disallow:

There are many reasons this can come in handy. By laying out a few rules in this text file, you can tell robots not to crawl entire directories within your site, single pages, or images, or you can block nothing at all. To create a robots.txt file, simply create a regular text file, name it "robots.txt" and place it in the root directory of your site. A robots.txt file that tells crawlers to exclude your feedback forms may look something like this:

User-Agent: *
Disallow: /feedback/
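
If you want to check what rules like these block before you upload them, one option is the robots.txt parser in Python's standard library. The following is only a minimal sketch: the crawler name "ExampleBot", the www.example.com host and the is_allowed helper are made up for illustration, and Python's parser is not necessarily how any particular search engine reads your file.

```python
from urllib.robotparser import RobotFileParser

# The feedback-form rules from above, one rule per list item.
RULES = [
    "User-Agent: *",
    "Disallow: /feedback/",
]


def is_allowed(rules, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules.

    `rules` is a list of robots.txt lines. This helper is hypothetical,
    written only for this illustration.
    """
    parser = RobotFileParser()
    parser.parse(rules)  # parse lines directly, no network access needed
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and example.com are placeholders, not real values.
    for path in ("/index.html", "/feedback/form.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(RULES, "ExampleBot", url))
```

With these rules, the check passes for /index.html and fails for anything under /feedback/.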

The User-Agent line specifies which crawler your instructions are for, while the Disallow line specifies which parts of your site that crawler may not crawl. The * matches all crawlers; you can also give different instructions to different robots by naming each one on its own User-Agent line.
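
To see how different specifications for different robots play out, here is a rough sketch that again uses the parser from Python's standard library. The /drafts/ directory, the "OtherBot" name, the www.example.com host and the may_crawl helper are all assumptions made up for this example. Note that a crawler named in its own group follows that group and ignores the * group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one group for Googlebot and one for
# every other crawler. The /drafts/ path is invented for illustration.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /feedback/
"""


def may_crawl(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may crawl `path` under `robots_txt`.

    Hypothetical helper; example.com stands in for your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for agent in ("Googlebot", "OtherBot"):
        for path in ("/drafts/post.html", "/feedback/form.html"):
            print(agent, path, may_crawl(agent, path))
```

Here Googlebot is kept out of /drafts/ but may crawl /feedback/, while every other crawler gets the opposite treatment. If you want a rule to apply to a named crawler as well, repeat it in that crawler's group.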
