Thank you for this comprehensive information, and thanks to all the others who have replied. I've
taken it all down and it will be helpful in discussing options with my colleagues.
Any further info from your SEO session would be very useful!
all the best,
Archives Hub Co-ordinator
Jeanne Kramer-Smyth wrote:
> There are a number of ways to get Google and other search engines to
> crawl only what you want them to crawl.
> 1) You can create a robots.txt file that dictates directories or URL
> patterns that the robots should not crawl
> 2) You can add a tag to your HTML header that reads <meta name="robots"
> content="noindex"> (basically telling the robot not to index this page)
> 3) Create an XML sitemap for your site. I believe this is the best
> option for what is being discussed here. It is basically a list of
> all the URLs you want Google to crawl. See the format described at
> http://www.sitemaps.org/protocol.php. You would then create a
> Google Webmaster Tools account for your domain and submit your sitemap
> to them.
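> For anyone who wants a concrete starting point, here are minimal
> sketches of options 1 and 3 (the paths, hostname, and dates below are
> just placeholders, not real Archives Hub URLs):
>
> ```
> # robots.txt - placed at the root of the site; keeps robots out of
> # the dynamic search and browse interfaces while allowing the rest
> User-agent: *
> Disallow: /search/
> Disallow: /browse/
> ```
>
> ```xml
> <?xml version="1.0" encoding="UTF-8"?>
> <!-- sitemap.xml - one <url> entry per finding aid you want crawled -->
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
>   <url>
>     <loc>http://www.example.org/findingaid/gb123-abc</loc>
>     <lastmod>2009-12-01</lastmod>
>     <changefreq>yearly</changefreq>
>   </url>
> </urlset>
> ```
>
> The <lastmod> and <changefreq> values are what let the search engine
> work out how rarely it needs to recrawl each description.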
> Google can definitely crawl dynamic pages, and while you will usually do
> better for SEO with a static URL that doesn't have a zillion parameters
> hanging off the end, don't let that URL format make you think you can't
> be included well in the search indexes. Also, it is my understanding
> that Google will stagger the crawling of your site when it is first
> pulling content into the index. It is smart enough to watch for the
> rhythm of your updates, so it figures out over time how often it needs to
> recrawl your pages. For most finding aids and other collection
> descriptions, your page won't need to be recrawled more than once in a
> blue moon. Part of what you tell search engines in your sitemap file is
> when a URL was last updated.
> I will be chairing a session at DC 2010 on SEO for archives websites and
> I will make sure that I post back to this list with links to the
> presentations and any reference sheets we develop.
> Hope this helps!
> Jeanne Kramer-Smyth
> On Thu, Dec 10, 2009 at 8:44 AM, Jane Stevenson
> <[log in to unmask]> wrote:
> Hi Chris,
> That is interesting. Sounds similar in scale to us - 18,000
> descriptions of which the majority are collection-level and several
> hundred are multi-level. You don't have any problems with the bots
> following all the dynamically generated links within the interface, e.g.
> for us the refine search links, the hyperlinked index terms, and the
> browse links? My understanding was that this would mean they would
> effectively be crawling through hundreds of thousands of pages.
> Chris Prom wrote:
> Hi Jane,
> At the University of Illinois our system has been open to Google
> and other bots for several years. Over 7,000 collection-level
> records and several hundred full finding aids are routinely
> harvested by Google and other bots. Our system is a PHP-driven
> database application, not static HTML.
> We have never run into an issue with server overload. I suspect
> it would not be a problem for you, since the load to serve up a
> PHP-generated page from our system is significantly higher than it
> would be to serve up an equivalent page in static HTML.
> Chris Prom
> Fulbright Scholar
> University of Dundee
> United Kingdom
> Jane Stevenson wrote:
> Hi all,
> >>Basically what I'm trying to do is get away from creating
> static html pages to store on our server and just present
> the view and print options through xml and xsl.
> This has prompted me to think about a rather different
> question - we're actually thinking of creating static html
> pages in addition to our XSL-generated pages because we want
> our descriptions to be exposed to Google. Alternatively we
> could create pre-generated searches. We don't simply open up
> our system to robots because of concerns about overloading
> it. Has anyone had any experience of this kind of thing?
> It would be useful to get your thoughts.
> Jane Stevenson
> Archives Hub Co-ordinator
> University of Manchester
> Email: [log in to unmask]