[csw-maintainers] robots.txt

Maciej (Matchek) Blizinski maciej at opencsw.org
Mon Aug 31 19:40:15 CEST 2009


On Mon, Aug 31, 2009 at 4:41 PM, Philip Brown <phil at bolthole.com> wrote:
> Why do you suggest we do this?

It's purely optional, so we don't have to add a robots.txt file. The
reason I suggest it is that it's one of the standard things to include
on a website: it can be used to tell bots not to visit certain parts
of an otherwise public site. That's useful when there is content which
isn't suitable for indexing, for example a URL which embeds a session
identifier. I'm not sure whether the mere presence of the file is used
as a signal for search ranking, but it might be, and it doesn't hurt
to serve this tiny static file.
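
To be concrete, robots.txt is just a plain text file served from the
site root (http://www.opencsw.org/robots.txt). A minimal sketch of the
format, with a placeholder path purely for illustration:

  # Applies to all crawlers; the path below is only an example.
  User-agent: *
  Disallow: /example-session-area/

An empty Disallow line (or an empty file) excludes nothing, so merely
adding the file commits us to nothing.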

Whether we should actually exclude any pages is another matter. I
think the /search/softwarename pages could be excluded, as they only
contain lists of files. Another candidate for exclusion would be any
duplicate content; for instance, these four URLs serve the same content:

http://www.opencsw.org/packages/analog
http://www.opencsw.org/packages/CSWanalog
http://www.opencsw.org/packages.php/analog
http://www.opencsw.org/packages.php/CSWanalog

Indexers generally don't like duplicate content, so it's better to
indicate which URL we consider canonical and tell bots to ignore the
other three. Serving HTTP redirects from those three to the canonical
one would also make sense, but let's just start with robots.txt for now.
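
As a sketch of what that robots.txt could look like — assuming we pick
/packages/analog (the short, non-CSW form) as the canonical URL of the
four, which is still open for discussion — the exclusions would be
prefix matches along these lines:

  User-agent: *
  # Search result pages only contain lists of files.
  Disallow: /search/
  # Drop the packages.php variants of the package pages.
  Disallow: /packages.php/
  # Rules are prefix matches, so this also covers /packages/CSWanalog.
  Disallow: /packages/CSW

And if we later add the redirects as well, assuming the site runs
Apache with mod_alias, a single line in the configuration would map
the .php form onto the canonical one:

  RedirectMatch 301 ^/packages\.php/(.*)$ /packages/$1

That's only an illustration of the idea, not a proposal for the exact
rules.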

Maciej
