There are a number of ways to get Google and other search engines to crawl only what you want them to crawl.
1) You can create a robots.txt file that lists the directories or URL patterns that robots should not crawl (there is a short illustrative example after this list)
2) You can add a tag to your HTML <head> that reads <meta name="robots" content="noindex"> (basically telling the robot that it cannot index this page)
3) Create an XML sitemap for your site. I believe this is the best option for what is being discussed here. It is basically a list of all the URLs you want Google to crawl; see http://www.sitemaps.org/protocol.php for the format. You would then create a Google Webmaster Tools account for your domain and submit your sitemap through it.
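To illustrate the first option, a minimal robots.txt might look something like this (the paths and the sitemap location here are made-up placeholders, not anyone's actual setup):

    User-agent: *
    Disallow: /search/
    Disallow: /cgi-bin/

    Sitemap: http://www.example.org/sitemap.xml

The Sitemap: line is optional, but it lets crawlers discover your sitemap on their own even before you submit it anywhere.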
Google can definitely crawl dynamic pages, and while you will usually do better for SEO with a static URL that doesn't have a zillion parameters hanging off the end, don't let that URL format make you think you can't be included well in the search indexes. Also, it is my understanding that Google will stagger the crawling of your site when it is first pulling content into the index. It is smart enough to watch the rhythm of your updates, so over time it figures out how often it needs to recrawl your pages. For most finding aids and other collection descriptions, your pages won't need to be recrawled more than once in a blue moon. Part of what you tell search engines in your sitemap file is when each URL was last updated.
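For the sitemap option, a single entry might look roughly like this (the URL and date are invented for illustration; the <lastmod> and <changefreq> elements are what tell the crawler how fresh the page needs to be kept):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.org/findingaids/ms0001</loc>
        <lastmod>2010-05-01</lastmod>
        <changefreq>yearly</changefreq>
      </url>
    </urlset>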
I will be chairing a session at DC 2010 on SEO for archives websites and I will make sure that I post back to this list with links to the presentations and any reference sheets we develop.
Hope this helps!
That is interesting. It sounds similar in scale to us - 18,000 descriptions, of which the majority are collection-level and several hundred are multi-level. Don't you have any problems with the bots following all the dynamically generated links within the interface - e.g. for us, the refine-search links, the hyperlinked index terms, and the browse links? My understanding was that this would mean they would effectively be crawling through hundreds of thousands of pages.
Chris Prom wrote:
At the University of Illinois our system has been open to Google and other bots for several years. Over 7,000 collection-level records and several hundred full finding aids are routinely harvested by Google and other bots. Our system is a PHP-driven database application, not static HTML.
We have never run into an issue with server overload. I suspect it would not be a problem for you either, since the load to serve up a page through PHP in our system is significantly higher than it would be to serve up an equivalent page in static HTML.
University of Dundee
Jane Stevenson wrote:
>>Basically what I'm trying to do is get away from creating static html pages to store on our server and just present the view and print options through xml and xsl.
This has prompted me to think about a rather different question - we're actually thinking of creating static HTML pages in addition to our XSL-generated pages because we want our descriptions to be exposed to Google. Alternatively, we could create pre-generated searches. We haven't simply opened up our system to robots because of concerns about overloading the system. Has anyone had any experience of this kind of thing? It would be useful to get your thoughts.
Archives Hub Co-ordinator
University of Manchester