Web Design Thoughts

The journal of a Bradford web designer

I recently decided to take a look at the Sitemap.xml protocol, and see what all the fuss was about. I’d always assumed that it was one of those things which doesn’t apply to me, and stuck my head in the sand. What I wasn’t ready for was the huge can of worms that my investigations opened.

‘Sitemaps’ are basically just XML lists of web page URLs, which tell search engine robots what is available to crawl. Their purpose is to make life easy for the robots, and enable them to find everything on your web site, including stuff that isn’t linked properly.
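To give a feel for what such a file contains, here is a minimal sketch in Python that writes out a bare-bones sitemap. The `build_sitemap` function and the example.com URLs are my own illustration, not taken from Google's script or any other tool; only the `urlset`, `url` and `loc` elements and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # <loc> is the only child element the protocol requires.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs, for illustration only.
pages = [
    "http://www.example.com/",
    "http://www.example.com/about/",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

A real sitemap would normally start with an XML declaration and may carry optional elements such as `lastmod`; those are left out here to keep the sketch short.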

Firstly, creating a sitemap XML file isn’t straightforward. Google provides a Python script for this, but this was one learning curve I wasn’t prepared to climb. So, I looked into alternatives: Dreamweaver can produce Google sitemaps with the aid of an extension, but this is no use for some of the dynamically generated links on my web sites; similarly, WordPress plug-ins are available to generate sitemaps of permalinks, but these aren’t much use for other content. After a bit of detective work, I came across gSiteCrawler, a Windows application which behaves just like a finely configurable search robot.

Okay, now you’ve got your sitemap.xml file, you can upload it to your site root and job’s a good ‘un? Not quite. It was at this stage I realised just what search engines have been indexing all these months…

Close inspection of my web site’s statistics in Google’s Webmaster Tools microsite revealed scores of duplicate pages which I thought I’d removed a few weeks ago, using a WordPress theme modification (which generates meta robots noindex tags on the fly). This galled me, because I thought I knew all about WordPress’s duplicate content weaknesses and their solutions. The gSiteCrawler sitemap also showed further duplicates of which I was unaware.
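One way to confirm that a page really does carry the robots meta tag is to parse its HTML and look for it. The sketch below is only an illustration: `RobotsMetaParser`, `has_noindex` and the two sample pages are invented for this example, and it checks markup you hand to it rather than fetching anything from a live site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex,follow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html):
    """True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page sources, for illustration only.
archive_page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
post_page = "<html><head><title>A post</title></head></html>"

print(has_noindex(archive_page))
print(has_noindex(post_page))
```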

There then followed a few days of severe head-banging-on-wall with Webmaster Tools, before I realised four things: (1) WordPress’s 404 page actually returns a 200 (OK) status and needs fixing so that it returns a proper 404 (Not Found); (2) Google’s ‘Supplemental Results’ are untouchable, so don’t for a minute think you can change them; (3) you can’t remove anything from their index unless you’ve made sure your robots.txt file backs it up; and (4) ‘duplicate content’ extends to pages you couldn’t be bothered to give custom titles and meta description tags.
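The first of those points is easy to test for: request a URL that cannot possibly exist and see which status code comes back. This is a rough sketch using only Python's standard library; the function names and the made-up path are my own assumptions, and the status lookup is passed in as a parameter so the logic can be tried without touching a real server.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code the server sends for url."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx; the code is what we want.
        return error.code


def is_soft_404(base_url, get_status=fetch_status):
    """True if a URL that should not exist is answered with 200 (OK)."""
    # Hypothetical path that is very unlikely to exist on a real site.
    missing_url = base_url.rstrip("/") + "/this-page-should-not-exist-9f3a7c/"
    return get_status(missing_url) == 200
```

Calling `is_soft_404("http://www.example.com")` with your own domain in place of the placeholder would make a live request. Note that `urlopen` follows redirects, so a site that redirects missing pages to its home page would also be flagged.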

Don’t get me started on robots.txt. The syntax for this file is voodoo, and very poorly documented. I ended up changing my directory structure to be as simple as possible so that robots.txt could screen duplicate content without much effort. Trying to be clever with robots.txt was a waste of time and brain cells. But, used well, it can stop robots crawling pages that the search engines don’t need to see, saving them time and effort better spent spidering the rest of your web site.
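Rules can at least be tested before they go live, because Python's standard library includes a robots.txt parser. In the sketch below, the rules and paths are hypothetical examples of the kind of archive directories that tend to duplicate posts; they are not my actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only: block the archive
# directories that repeat content already available at post URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
Disallow: /tag/
Disallow: /feed/
"""


def allowed(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """True if the given robots.txt rules let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("/category/news/"))      # blocked by Disallow: /category/
print(allowed("/2008/10/sitemaps/"))   # no rule matches, so allowed
```

This parser treats each `Disallow` line as a plain path prefix, which is the lowest common denominator. Individual robots may interpret fancier patterns differently, so simple directory-level rules like these are the safest bet.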

So, the moral of the story is that using a CMS to generate pages, and/or being slapdash with title and meta tags, and/or lack of attention to ‘duplicate pages’ seen through alternate URLs can cause a hell of a lot of nightmares, which all need addressing before you go anywhere near a Sitemap and Google’s Webmaster Tools.

Comments

  1. Anthony Houghton said on 19th October at 11:52 pm:

    Hmm! Interesting! Confirms what I’ve always thought. Mash-ups can be harder work than bespoke coding if the application doesn’t exactly fit your needs. In my experience the Googlebot is very well behaved, provided you give it a 404 to work with (more than can be said for Yahoo!’s Slurp). However, if something like WordPress (which I had considered migrating to) has a bug, then perhaps bespoke coding has its advantages.

    Sitemaps work well for me, but then I can generate them directly off my database.

    For someone like me from a mainframe/Windows background, robots.txt comes as a bit of a culture shock. I don’t think it’s voodoo, it’s just functionally very limited, and it doesn’t support being clever.

    Ant

  2. Keith said on 20th October at 9:01 am:

    I’ve found that Google responds to 404s and robots.txt, but 404s aren’t much use when it comes to duplicate content.
    I spent a long time designing 301 permanent redirects for duplicate content and non-canonical URLs, but I feel that I’ve lost years of my life in the process, what with conflicts between various Apache modules, WordPress ‘permalinks’ and various .htaccess techniques.
    The KISS principle definitely rules here, and, after building increasing complexity, I’ve started to strip back the levels, to avoid things like double redirects and unexpected bad rewrites.
    I later discovered that the 404 bug wasn’t WordPress’s fault, but was down to the 3rd party WordPress theme I was using at the time.
    I think the main problems with robots.txt are (a) its lack of detailed documentation, and (b) its apparent differences in implementation among the different robots. Again, keeping things as simple as possible seems to be the best policy, and aim at the lowest common denominator.

©2008 Keith Nuttall, Bradford Web Designer. Powered by WordPress.