
Some search-engine-related concerns regarding Pinoy Top Blogs’ redirect method
Pinoy Top Blogs is a very nice project in that it adds an acceptable measure of objectivity as far as blog rankings go. Nothing could be sweeter than landing in the top 10. But to track hits to its partner websites, Pinoy Top Blogs may be using, albeit unknowingly, the HTTP 302 exploit in Google. This documented exploit has been used by some webmasters for “page hijacking”.
Page hijacking as a possible SEO technique was described by Claus Schmidt in his paper, Page Hijack: The 302 Exploit, Redirects and Google. Schmidt’s paper can get a little technical at times, but in essence, the exploit goes like this:
- Google follows the redirect to the original site but gives the redirecting site credit for the content.
- Google sees two sites with identical content and drops one of them from its index.
- Often the original site is the one dropped (read: banned from Google).
- Sometimes, a malicious webmaster may then redirect any visitor who clicks on the hijacked listing to any other page the hijacker chooses.
As of May 8, 2005, Schmidt reports that the exploit is still not fixed in Google.
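For readers who have never looked inside a hit-tracking redirect, here is a minimal sketch of what one might look like, written as a Python WSGI app. This is not Pinoy Top Blogs’ actual code; the `click_tracker` name, the `url` query parameter, and the in-memory `hits` counter are all assumptions made for illustration. The point is only that the script counts the click and then answers with a “302 Found” status and a Location header.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory hit counter; a real tracker would likely use a database.
hits = {}

def click_tracker(environ, start_response):
    """Count a click, then redirect the visitor to the target URL."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    target = query.get("url", [""])[0]  # "url" is an assumed parameter name
    if not target:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"missing url parameter"]

    hits[target] = hits.get(target, 0) + 1

    # "302 Found" tells the client that the target is only a temporary
    # location for the content. This status line is the part that matters.
    start_response("302 Found", [("Location", target)])
    return [b""]
```

A production script would also need to check the target against a list of known partner sites rather than redirecting to any URL it is handed; that is left out here to keep the sketch short.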
Disclaimer: I do not believe Pinoy Top Blogs has any intention of abusing the 302 exploit. I do not know Yugatech personally, but I’ve seen his contributions to the Pinoy blogging community and I know I speak for a lot of Pinoy bloggers in saying how appreciative we are of his contributions. As Schmidt says in his paper:
This is a flaw on the technical side of the search engines. Some webmasters do of course exploit this flaw, but almost all cases I’ve seen are not a deliberate attempt at hijacking. The hijacker and the target are equally innocent as this is something that happens “internally” in the search engines, and in almost all cases the hijacker does not even know that (s)he is hijacking another page.
How the Exploit Is Done
Like Schmidt, I am publishing this exploit “to make the problem understandable and visible to as many people as possible in order to force action to be taken to prevent further abuse of this exploit.” Use of the exploit is NOT encouraged or endorsed.
Schmidt outlines the steps necessary for carrying out a 302 redirect hijack:
- Googlebot (the “web spider” that Google uses to harvest pages) visits a page with a redirect script. In this example it is a link that redirects to another page using a click tracker script, but it need not be so. That page is the “hijacking” page, or “offending” page.
- This click tracker script issues a server response code “302 Found” when the link is clicked. This response code is the important part; it does not need to be caused by a click tracker script. Most webmaster tools use this response code per default, as it is standard in both ASP and PHP.
- Googlebot indexes the content and makes a list of the links on the hijacker page (including one or more links that are really a redirect script)
- All the links on the hijacker page are sent to a database for storage until another Googlebot is ready to spider them. At this point the connection breaks between your site and the hijacker page, so you (as webmaster) can do nothing about the following:
- Some other Googlebot tries one of these links - this one happens to be the redirect script (Google has thousands of spiders, all are called “Googlebot”)
- It receives a “302 Found” status code and goes “yummy, here’s a nice new page for me”
- It then receives a “Location: www.your-domain.tld” header and hurries to your page to get the content.
- It heads straight to your page without telling your server on what page it found the link it used to get there (as, obviously, it doesn’t know - another Googlebot fetched it)
- It has the URL of the redirect script (which is the link it was given, not the page that link was on), so now it indexes your content as belonging to that URL.
- It deliberately chooses to keep the redirect URL, as the redirect script has just told it that the new location (That is: The target URL, or your web page) is just a temporary location for the content. That’s what 302 means: Temporary location for content.
- Bingo, a brand new page is created (never mind that it does not exist IRL, to Googlebot it does).
- Some other Googlebot finds your page at your right URL and indexes it.
- When both pages arrive at the reception of the “index” they are spotted by the “duplicate filter” as it is discovered that they are identical.
- The “duplicate filter” doesn’t know that one of these pages is not a page but just a link (to a script). It has two URLs and identical content, so this is a piece of cake: Let the best page win. The other disappears.
- Optional: For mischievous webmasters only: For any other visitor than “Googlebot”, make the redirect script point to any other page free of choice.
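The steps above can be sketched as a toy model in Python. This is only an illustration of the behavior Schmidt describes, not Google’s actual code: the `WEB` table, the `hijacker.example` URL, and the `crawl` and `duplicate_candidates` functions are all hypothetical, and the model assumes a single redirect hop. The 301 branch reflects the assumption that a permanent redirect gets credited to the target URL.

```python
# Hypothetical miniature web: URL -> (status code, payload).
# For a 200 the payload is the page content; for a redirect it is the Location.
WEB = {
    "http://www.your-domain.tld/": (200, "your original content"),
    "http://hijacker.example/go?id=1": (302, "http://www.your-domain.tld/"),
}

def crawl(url, web):
    """Return (url_credited, content) the way the crawler in the steps would."""
    status, payload = web[url]
    if status == 302:
        # Temporary redirect: fetch the content from the target,
        # but keep the redirecting URL as the owner of that content.
        _, content = web[payload]
        return url, content
    if status == 301:
        # Permanent redirect (assumed behavior): credit the target URL.
        _, content = web[payload]
        return payload, content
    return url, payload

def duplicate_candidates(urls, web):
    """Group credited URLs by content, as a duplicate filter would see them."""
    groups = {}
    for url in urls:
        credited, content = crawl(url, web)
        bucket = groups.setdefault(content, [])
        if credited not in bucket:
            bucket.append(credited)
    return groups
```

Calling `duplicate_candidates(list(WEB), WEB)` leaves two URLs competing for the same content, which is the situation the “duplicate filter” then resolves by dropping one of them. Change the hijacker entry’s status to 301 and only your own URL remains in the group.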
How Can I Stop my Pages from Being Hijacked?
Aside from politely emailing webmasters to remove the redirect to your website, Tony Spencer thinks there aren’t that many ways to stop page hijacking.
…get the other site to remove the HTTP 302 redirect. As I said before most webmasters have no idea of the havoc they are wreaking. I have found that a polite yet firm email nearly always results in a swift removal of the redirect and its often followed by a puzzled reply “Whats the problem?”.
Schmidt doesn’t think there’s a single fix strong enough to prevent your pages from being hijacked. He believes that the error “is generated by the search engines, is only found within the search engines, and hence it must be fixed by the search engines”. He does give some pointers on how to make hijacking harder, but then again these are just things you can do to “slow” down (not stop) hijackers:
- Always redirect your “non-www” domain (example.com) to the www version (www.example.com) - or the other way round (I personally prefer non-www domains, but that’s just because it appeals to my personal sense of convenience). The direction is not important. It is important that you do it with a 301 redirect and not a 302, as the 302 is the one leading to duplicate pages.
- Include a bit of always updated content on your pages (e.g. a timestamp, a random quote, a page counter, or whatever)
- Use the meta tag on all your pages
- Just like redirecting the non-www version of your domain to the www version, you can make all your pages “confirm their URL artificially” by inserting a 301 redirect from any URL to the exact same URL, and then serve a “200 OK” status code, as usual. This is not trivial, as it will easily throw your server into a loop.
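The first pointer, redirecting the non-www domain to the www version with a 301, could be sketched like this in Python. It is only a rough illustration: `CANONICAL_HOST` and the `canonical_redirect` function are made-up names, the sketch ignores ports in the Host header, and in practice most people would do this in their web server configuration rather than in application code.

```python
# Assumed canonical hostname; swap in whichever version of your domain you prefer.
CANONICAL_HOST = "www.example.com"

def canonical_redirect(host, path, scheme="http"):
    """Return (status, headers) for a request to `host` + `path`.

    Requests for any other hostname get a 301 to the canonical one;
    requests that already use the canonical host are served as usual.
    """
    if host.lower() != CANONICAL_HOST:
        location = f"{scheme}://{CANONICAL_HOST}{path}"
        # 301, not 302: the move is permanent, so only one URL should survive.
        return "301 Moved Permanently", [("Location", location)]
    return "200 OK", []
```

Because the function only redirects when the hostname differs from the canonical one, a request that already arrived at the right host is never redirected again, which avoids the redirect loop mentioned in the last pointer.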
For those who want to read more about 302 hijacking, there’s a rather long thread on WebmasterWorld (WMW). It’s a good read, though.
