Robots.txt Revisited
After doing more in-depth research into robots.txt, I decided to go ahead and publish a new article on my findings.
Before
Completely unsure about how good bots scan and index files and sites, other bloggers and I had come up with a million different ways to present a robots.txt file, and a million more ways to write one for a WordPress site.
They ranged from
User-agent: *
Disallow: /
Disallow: *
to something similar to this.
User-agent: *
Disallow: /wp*
After
Some of what we have in our typical robots.txt is indeed valid, but bear with me for a second as I explain what I did to come up with some new results.
To gather statistics on what was being indexed, I cross-referenced the results from the analytics tools that I had available.
- Google Analytics (Google search stats)
- StatPress (WordPress stats)
- AWStats (built-in server-side stats)
- Webmaster Tools (by Google)
From the data I was able to pull from these resources, I looked specifically at what was being hit. Because of the special way that WordPress loads and imports its files, I could see that /index.php was being indexed as a file, but most other FILES on the site were not indexed. All other PAGES, or what a typical computer user would consider a page, were not indexed as an actual FILE but rather by their permalink location.
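For anyone who wants to repeat this kind of check against raw server logs rather than a stats package, here is a minimal sketch. It assumes the server writes the common "combined" access log format, and the sample lines, the bot_hits name, and the "Googlebot" token are illustrative assumptions of mine, not data from my own logs.

```python
import re
from collections import Counter

# Pulls the requested path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def bot_hits(lines, bot_token="Googlebot"):
    """Count requested paths on lines whose user agent contains bot_token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and bot_token.lower() in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits

# Made-up sample lines, for illustration only.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"
SAMPLE_LOG = [
    f'192.0.2.1 - - [01/Jan/2009:10:00:00 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'192.0.2.1 - - [01/Jan/2009:10:00:05 +0000] "GET /2009/01/some-post/ HTTP/1.1" 200 7301 "-" "{GOOGLEBOT}"',
    f'192.0.2.7 - - [01/Jan/2009:10:01:00 +0000] "GET /wp-content/style.css HTTP/1.1" 200 900 "-" "{BROWSER}"',
    f'192.0.2.1 - - [02/Jan/2009:09:00:00 +0000] "GET /2009/01/some-post/ HTTP/1.1" 200 7301 "-" "{GOOGLEBOT}"',
]

for path, count in bot_hits(SAMPLE_LOG).most_common():
    print(count, path)
```

A user-agent string can be faked, so treat a count like this as a rough picture of what is being crawled rather than proof of who crawled it.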
Hmm … interesting. Googlebot indexes files and paths. Interesting, but not unexpected.
By now most of us know this: you set up a permalink structure in your website and it dynamically changes the links throughout your site to reflect the change. This is great for dynamically driven websites looking for cleaner paths or some SEO benefit.
Googlebot can in most cases follow an average custom permalink structure. I will note that the more complicated permalink structures are either crawled more slowly or not crawled at all. I would shy away from lots of string names within your structure (=?etc …).
Back to robots…
So how does all this affect your robots.txt structure, you ask? Well, if you have gotten yourself a good permalink structure, you can essentially set up your robots.txt to block everything but the categories, pages, or posts you want crawled, which is perfect for helping eliminate duplicated content.
Here’s an example.
My own robots.txt file
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /category/archive/
Disallow: /feed/
I have disallowed my wp-admin and wp-includes directories, a category in my WordPress permalink structure (/category/archive), and finally my RSS feeds, accessed through http://www.cssorigins.com/feed. If my site is indexed properly by Google, then all the content in the feeds will already have been indexed, and thus the feeds would be duplicated content. The same goes for the archive category, since I post all my posts into that category as well as into their individual categories.
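If you want to sanity-check a file like mine before uploading it, Python's standard urllib.robotparser module can parse the rules and answer "may this path be fetched?". This is only a rough sketch: the allowed helper and the test paths (such as /category/css/) are made up for illustration, and the module does plain prefix matching, which may not be exactly how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /category/archive/
Disallow: /feed/
"""

def allowed(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if the given path may be fetched under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical paths to try; swap in URLs from your own site.
for path in ("/", "/category/css/", "/category/archive/",
             "/wp-admin/", "/feed/", "/feed"):
    print(path, "allowed" if allowed(path) else "blocked")
```

One thing this turns up: under prefix matching, /feed/ is blocked but /feed (no trailing slash) is not, so it may be worth checking which form your feed links actually use.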
But remember that the bots index files as well, so it would be a bad idea to do something similar to this.
Disallow: /*.php
as your index file now would not be indexed, along with the rest of your website. =(
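To see which URLs a wildcard rule like this catches, here is a small sketch of the wildcard matching that the big search engines describe for robots.txt, where * stands for any run of characters and a trailing $ anchors the end of the URL. The function names and test paths are my own inventions, it only handles Disallow lines (no Allow rules or longest-match precedence), and it is an approximation rather than any crawler's actual code.

```python
import re

def rule_to_regex(rule_path):
    """Translate a robots.txt path using * and an optional trailing $ to a regex."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(url_path, disallow_rules):
    """Return True if any non-empty Disallow rule matches from the start of the path."""
    return any(
        rule_to_regex(rule).match(url_path)
        for rule in disallow_rules
        if rule  # an empty Disallow line blocks nothing
    )

# Hypothetical paths, checked against two rules mentioned in this post.
for rule in ("/*.php", "/wp*"):
    for path in ("/index.php", "/2009/01/some-post/", "/wp/hello-world/", "/about/"):
        print(rule, path, "blocked" if is_blocked(path, [rule]) else "allowed")
```

In this sketch /*.php catches /index.php and anything else with .php in the URL, while a permalink such as /2009/01/some-post/ does not match the pattern itself. Likewise, /wp* catches everything under /wp/, which matters if that is where the whole blog lives.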
Don’t understand? Think there’s a better way? Please comment and help us build a better understanding of how the web works.
Tags: Css, Permalink, Robots.txt, SEO, Webmaster, Website, Wordpress










Not really sure how much this actually helps, since search engines won’t access those files in the first place unless they are linked to.
And be careful when doing that: some people may have WordPress installed at /wp/, in which case it would essentially block their site from the search engines.
At least in my research, many of those files proved that they can be indexed by bots, and you may not want them to be.
Search bots out there don’t only index links, but rather traverse the web through the various links allowed by the site. For instance, if your site included a CSS stylesheet but had no actual link to it, Google and the other search engines would eventually catch on to that and index the contents of your CSS file. On the other hand, if a user installed their blog into a directory other than their root (i.e. /), like I have, the robots.txt file would change accordingly.
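As a rough illustration of "change accordingly", here is a sketch that rebuilds the same rules for a blog installed under a subdirectory. The robots_for_install name and the /blog/ directory are hypothetical, and which paths actually gain the prefix depends on how a given install and its permalinks are set up.

```python
def robots_for_install(install_dir="/"):
    """Build robots.txt text with each blocked path prefixed by the install directory."""
    prefix = "/" + install_dir.strip("/")
    if prefix == "/":
        prefix = ""  # installed at the site root, so no prefix is needed
    blocked = ["/wp-admin/", "/wp-includes/", "/category/archive/", "/feed/"]
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}{path}" for path in blocked]
    return "\n".join(lines) + "\n"

# Hypothetical install location, for illustration only.
print(robots_for_install("/blog/"))
```

Note that the robots.txt file itself still has to sit at the root of the host (for example http://www.example.com/robots.txt), even when the blog lives in a subdirectory.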
My whole deal was trying to customize the robots.txt file to get the optimal content indexed by Google without repeating content and thus possibly losing potential natural rankings.
Thanks for the comment.