Erik Lenaerts

Do, or do not. There is no try.

Unreachable URLs in Google

For a while now we have been running into more and more problems with one of the websites I'm working on (www.morres.com). Somewhere in the first quarter of 2007, the site disappeared from Google's radar entirely.

The question is, what went wrong? We spent a great deal of the project budget on making the site search engine friendly, with things like:

  • URL rewriting
  • Inclusion of industry keywords
  • Meta tags
  • Correct use of proper HTML (like H1, H2, etc.)
  • Full use of alt and title attributes for hyperlinks and images
  • A Google-friendly sitemap
  • HTTP 301 redirection of alternative domain names like www.morres.nl, etc.
  •  ...

All of this effort resulted in no listing in Google at all... a frustrating period, I must say :S.

I started with Google Webmaster tools and verified the site by uploading a verification file. After this verification process I got lots of information from the tools.

On the Diagnostics page, I saw that the last "successful" crawl was about 6 months ago and that the stated reason was "We can't currently access your home page because of an unreachable error".

In the list of Unreachable URLs, the home page www.morres.com showed up with an HTTP 500 code from a crawl a few days earlier. Strangely enough, in my browser the page www.morres.com showed up just fine. I used Fiddler to check the HTTP result codes: no HTTP 500s there either.

So, I started digging around in newsgroups, blogs, etc.

 

Broken Links?

In my search I stumbled upon Xenu, a tool that reports broken links on your site, and a very useful one indeed. Although I found some broken links, which we fixed of course, none of the links resulted in an HTTP 500.

 

Canonical server name issues?

In this thread, I learned that Google might access our site without the www host name, so as http://morres.com. Luckily, we can straighten out this problem with an HTTP 301 (redirection) from http://morres.com to http://www.morres.com on our web servers. Nonetheless, this didn't work either.
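To make the canonical-host idea concrete, here is a minimal sketch of the redirect decision in Python. It is not how our web servers are configured; the function name `canonical_redirect` and the two constants are made up for illustration, and the alias list contains only the hosts mentioned in this post.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: www.morres.com is the canonical host; the aliases are the
# alternative names mentioned in the post, not a complete list.
CANONICAL_HOST = "www.morres.com"
ALIAS_HOSTS = {"morres.com", "www.morres.nl"}


def canonical_redirect(url):
    """Return (301, target_url) if the URL uses an alias host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALIAS_HOSTS:
        return None
    # Keep scheme, path, query and fragment; swap only the host.
    # Note: any explicit port in the original URL is dropped.
    target = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )
    return 301, target
```

Calling `canonical_redirect("http://morres.com/products?id=3")` gives a 301 pointing at the same path and query on www.morres.com, while a URL that is already on the canonical host returns `None` and is served normally.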

 

Validators

After some additional surfing, I got the idea to validate the HTML output of our website.

I used the following validators:

  • Markup validator: This is the W3C Markup Validation Service, a free service that checks Web documents in formats like HTML and XHTML for conformance to W3C Recommendations and other standards.
  • Link Checker: Checks anchors (hyperlinks) in an HTML/XHTML document. Useful to find broken links, etc.
  • CSS Validator: Validates CSS stylesheets or documents using CSS stylesheets.

We corrected some of the HTML and CSS issues (which were, in my opinion, very tiny details, but hey, after a while you'll try everything). Again, we saw no notable change in Google's problems.

The link checker is in fact similar to the Xenu tool. I ran it on the home page only, without following any second-level links, to keep the results limited. The only problem it reported was this link:

javascript:history.go(-1)

We provide this back button for the user in our navigation bar, to the left of our breadcrumb. I searched the internet to see whether this type of JavaScript in an HREF could cause any trouble, but no one seems to have complained about it. To be on the safe side, we removed this "feature" temporarily. (I actually wonder if people ever use it at all.)
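If you want to hunt for this kind of link yourself, without a full link checker, a rough sketch in Python could look like the following. It only uses the standard library; the names `ScriptLinkFinder` and `find_script_links` are made up for this example and it is not what the W3C tool does internally.

```python
from html.parser import HTMLParser


class ScriptLinkFinder(HTMLParser):
    """Collects href values of <a> tags that use the javascript: pseudo-scheme."""

    def __init__(self):
        super().__init__()
        self.script_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes without a value, so guard first.
            if name == "href" and value and value.strip().lower().startswith("javascript:"):
                self.script_links.append(value)


def find_script_links(html):
    """Return all javascript: hrefs found in the given HTML text."""
    finder = ScriptLinkFinder()
    finder.feed(html)
    return finder.script_links
```

Feeding it the HTML of a page returns the list of hrefs that a crawler cannot follow as ordinary URLs, which is a quick way to spot a back button like ours.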

 

Spider simulator

I wanted to know how Googlebot (Google's spider/crawler) sees our pages, so I ran this spider simulator. It is a very nice simulator and especially interesting for keyword analysis. However, I could not see any problems reported by the simulator.

 

ASP.Net 2.0 redirections

EUREKA! It seems that ASP.Net 2.0 contains an error in its URL rewriting technique based on RewritePath. This method seems to work well for certain User Agents but not for all of them. Guess what, Googlebot was one of the User Agents where things went wrong. That also explains why my browser and Fiddler never saw the HTTP 500.

You can read detailed information on this subject here.
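In hindsight, a check like the one below would have saved me a lot of time: request the same page once with a browser User-Agent and once with Googlebot's, and compare the status codes. This is only a sketch in Python using the standard library. The function names are my own, the browser string is just an illustrative value, and the Googlebot string is the one Google publishes for its crawler.

```python
import urllib.error
import urllib.request

# The browser string is only illustrative; the Googlebot string is the
# published crawler User-Agent.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is what we are after.
        return error.code


def compare_user_agents(url):
    """Map each label in USER_AGENTS to the status code returned for it."""
    return {label: status_for(url, agent) for label, agent in USER_AGENTS.items()}
```

Running `compare_user_agents("http://www.morres.com/")` returns one status code per User-Agent; if the two values differ, the server is treating the crawler differently from a browser. Keep in mind that urllib follows redirects automatically, so a 301 shows up as the status of the final page rather than as 301.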
