Web Blog | Blog, Design, Create, cssOrigins.com

A Deeper Look At Robots.txt

Robots.txt syntax

  • User-agent: the robot that the rules following it apply to (e.g. “Googlebot”)
  • Disallow: the pages or directories you want to block that robot from accessing (use as many Disallow lines as needed)
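  • Noindex: the pages you want a search engine both to stay out of and to leave out of its index (or to de-index if previously indexed). This directive is not part of the robots.txt standard: Google honored it only unofficially and stopped doing so in September 2019, and Yahoo and Live Search do not support it.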
  • Each User-agent/Disallow group should be separated from the next by a blank line; however, no blank lines should appear within a group (between the User-agent line and the last Disallow line).
  • The hash symbol (#) may be used for comments within a robots.txt file: everything after # on that line is ignored. A comment may take up a whole line or follow a directive at the end of a line.
  • Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are three different paths to search engines.
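
The syntax rules above can be sketched in a few lines of Python. This is only an illustration, not how any search engine actually parses the file: the function name parse_groups and the sample rules (including the robot name “ExampleBot”) are made up for this example, and lowercasing the field names is an assumption about how tolerant a parser would be.

```python
def parse_groups(text):
    """Split robots.txt text into groups of (field, value) pairs.

    Hypothetical helper for illustration only.
    """
    groups, current = [], []
    for raw in text.splitlines():
        if not raw.strip():
            # A blank line closes the current User-agent/Disallow group.
            if current:
                groups.append(current)
                current = []
            continue
        # Everything after '#' on a line is a comment and is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line; it does not end a group
        field, _, value = line.partition(":")
        # Assumption: field names are matched case-insensitively.
        # Paths keep their case, because they are case-sensitive.
        current.append((field.strip().lower(), value.strip()))
    if current:
        groups.append(current)
    return groups


sample = """\
# whole-line comment
User-agent: ExampleBot
Disallow: /private/   # end-of-line comment
Disallow: /Private/

User-agent: *
Disallow: /tmp/
"""

for group in parse_groups(sample):
    print(group)
```

Note that “/private/” and “/Private/” come out as two separate rules, since the sketch never changes the case of a path.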

Let’s look at an example robots.txt file. The example below includes:

  • The robot called “Googlebot” has nothing disallowed and may go anywhere.
  • The entire site is closed off to the robot called “msnbot”.
  • All other robots should not visit the /tmp/ directory, or any directory or file whose path begins with /logs (e.g. /logs/ or /logs.php), as explained in the comments.

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
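
One way to check how these rules are read is to feed them to Python's standard-library urllib.robotparser module. The sketch below assumes that module's interpretation, which may differ in details from what a given search engine does; “SomeBot” and the test paths are made-up names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, one directive per line.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# "SomeBot" is a hypothetical robot that falls under the "*" group.
checks = [
    ("Googlebot", "/tmp/page.html"),
    ("msnbot", "/index.html"),
    ("SomeBot", "/tmp/page.html"),
    ("SomeBot", "/logs.php"),
    ("SomeBot", "/tmp.htm"),
    ("SomeBot", "/Logs"),
]
for agent, path in checks:
    verdict = "allowed" if rp.can_fetch(agent, path) else "blocked"
    print(f"{agent:10} {path:16} {verdict}")
```

Under this parser, /tmp.htm stays reachable because it does not begin with /tmp/, while /logs.php is blocked because it begins with /logs. The path /Logs is also allowed, since matching is case-sensitive.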

via A Deeper Look At Robots.txt.
