Recently I’ve been working with an agency that has its own simple PHP website framework. While working with them, a problem arose: pages were disappearing, apparently without human involvement.
With some detective work they had discovered that somehow the ‘secure’ CMS part of the site — where the client can log in to make changes — was being crawled by search-engine spiders. Part of the CMS is a list of all the site’s pages, each with links to the usual operations — edit, delete etc. When a spider indexed those pages, it also followed the delete links, thereby deleting the pages (much like this DailyWTF story).
Anyway I took a look and while the security wasn’t great — it was based around cookies with no server-side validation — it still seemed odd that the spiders were able to access the pages. I implemented a slightly more robust system using sessions, added an entry to robots.txt, and marked it as solved.
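The robots.txt entry would have been something along these lines (the /cms/ path is an assumption; the post doesn’t name the real admin directory):

```
User-agent: *
Disallow: /cms/
```

Worth noting that robots.txt is purely advisory — well-behaved crawlers respect it, but nothing enforces it, which is one reason it can’t be the only line of defence.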
And then it happened again.
I couldn’t work out what was going wrong, so to stop it happening I converted all the delete links to forms (well-behaved spiders follow links, which are GET requests, but don’t submit forms). But it was nagging at me: how were the search engines reaching the pages at all? Why weren’t they being rejected when the security script checked for a cookie and session?
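The change looked roughly like this — a sketch with an illustrative delete.php script and page ID, not the agency’s actual markup:

```html
<!-- Before: a plain link, which any crawler will happily follow -->
<a href="delete.php?id=42">Delete</a>

<!-- After: a POST form; spiders issue GET requests and don't submit forms -->
<form method="post" action="delete.php">
    <input type="hidden" name="id" value="42">
    <input type="submit" value="Delete">
</form>
```

For this to actually help, the server side should also refuse anything that isn’t a POST, e.g. by checking $_SERVER['REQUEST_METHOD']; this is good practice for any destructive operation regardless of the spider problem.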
Finally, the penny dropped…
The security check worked by looking for a valid session, checking it for a ‘user is logged in’ value, and if one wasn’t found then sending a redirect header pointing to the login page. Nothing unusual there. So what was going on?
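In code, the flawed check would have looked something like this — a minimal sketch, with illustrative names for the session key and login page:

```php
<?php
// Sketch of the broken security check (names are illustrative).
session_start();

if (empty($_SESSION['logged_in'])) {
    // Tells a *cooperating* client to go to the login page...
    header('Location: login.php');
    // ...but nothing here stops execution.
}

// The protected CMS content below is still generated and sent
// in the response body, redirect header or not.
echo 'Protected CMS page content';
```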
When PHP sends a redirect header, the browser says “OK, I’ll go to this other page” and the user’s none the wiser. But just because the browser is no longer listening doesn’t mean the script stops running. In fact, unless you tell it to stop, it carries on as if nothing had happened, so the rest of the page is still generated and sent in the response body. The spiders were simply ignoring the Location header and reading that body — receiving the page as if there were no security in place at all.
The solution? Add an ‘exit;’ after the redirect header. Simple!
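Assuming a check along the lines described above, the corrected version is just one line longer (names again illustrative):

```php
<?php
// Sketch of the corrected check: exit immediately after the redirect.
session_start();

if (empty($_SESSION['logged_in'])) {
    header('Location: login.php');
    exit; // stop here: nothing below is ever sent to an unauthenticated client
}

echo 'Protected CMS page content';
```

With the exit in place, a client that ignores the Location header gets an empty body instead of the protected page.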