[csw-maintainers] robots.txt

Mon Aug 31 21:21:56 CEST 2009

On Mon, Aug 31, 2009 at 7:33 PM, Philip Brown <phil at bolthole.com> wrote:
> I am against putting in "limit" directives, unless there is a very
> clear, specific benefit.

I would say there is a specific benefit. You needn't be afraid of
limiting access, since you can closely monitor what the bot is doing,
via webmaster tools. If it excluded too much, you would know.

Google guidelines[1] say:

"""Google tries hard to index and show pages with distinct
information. This filtering means, for instance, that if your site has
a "regular" and "printer" version of each article, and neither of
these is blocked in robots.txt or with a noindex meta tag, we'll
choose one of them to list. In the rare cases in which Google
perceives that duplicate content may be shown with intent to
manipulate our rankings and deceive our users, we'll also make
appropriate adjustments in the indexing and ranking of the sites
involved. As a result, the ranking of the site may suffer, or the site
might be removed entirely from the Google index, in which case it will
no longer appear in search results."""

> Unless I missed something, you mentioned only general-case things, and
> the potential for broken robots.

I can't recall saying anything about broken robots. What did I say?

> Any robot that can't figure out how to properly prune /page vs
> /page.php, in this day and age, is a broken robot.

What makes you say that? Any references?

> We should not modify our configs to make dumb robots look better than they are.
>
> (That being said, if we have any references to page.php anywhere,
> instead of the native "page" name, we should update our links. But
> that is beside the point of the changes you are proposing.)

You can't say that such-and-such content is not or shouldn't be
indexed because you currently don't link to it. If it was ever linked
to, it's going to be indexed. What's more, it's enough that a page had
an outgoing link and someone clicked it: the referrer header tells the
target website where the browser came from. The referrer URL will
appear in statistics somewhere, and bots will find it.

If you serve content, it's understood, and rightly so, that it's
content you want other people or bots to see. Otherwise you wouldn't
serve it. You can't excuse it by saying that you don't link to it, so
it's fine.

Looking at the webmaster tools, I see that it's the .php links that
are currently indexed. You can check that even without the webmaster
tools: http://tinyurl.com/myfd9v

I would suggest using robots.txt as the easiest way of making the bots
aware of which content is canonical. As a second step, HTTP redirects
from the .php URLs to the canonical ones would be good.
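
To make the two steps concrete, here is a rough sketch in Python; it is
not our actual configuration. The rule set, the /page.php path and the
helper names build_parser and canonical_path are made up for
illustration. It checks that a robots.txt rule blocks the .php duplicate
while leaving the extensionless page crawlable, and computes the path a
permanent redirect could point at. The standard-library parser matches
plain path prefixes only, so the duplicate is listed explicitly rather
than with a wildcard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the .php duplicate, leave /page crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /page.php
"""


def build_parser(robots_txt):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def canonical_path(path):
    """Return the extensionless path a 301 redirect could target."""
    suffix = ".php"
    if path.endswith(suffix):
        return path[:-len(suffix)]
    return path


parser = build_parser(ROBOTS_TXT)
for path in ("/page", "/page.php"):
    allowed = parser.can_fetch("*", path)
    print(path, "allowed" if allowed else "blocked", "->", canonical_path(path))
```

A real setup would do the redirect in the web server configuration; the
function above only shows the mapping from the duplicate URL to the
canonical one.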

What do others think?

Maciej

[1] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66359