Impact of search engines on opportunistic hacking

2013-05-29 00:00:00 +0100

Most website owners struggle to get their website well positioned in search engine results, and the rule of thumb seems to be "the more indexing, the better". How does this affect a website's susceptibility to hacking? In my penetration testing practice I've seen many websites with vulnerabilities visible at first glance that had run for years on the public internet with no signs of past compromise. They could have been hacked at any time — if anyone had ever found them and tried. But they were private websites whose URLs were passed directly to clients and never linked from any public site. Search engines simply never had a chance to reach them; if they had, the situation could have changed dramatically. How can we turn this unintended consequence into a conscious advantage?

The chart below shows the number of pages indexed by Bing for websites that were compromised and announced on Zone-H. The majority of the compromised websites were fairly well indexed by Bing:

Looking at the same data in more detail, we see that the hacked pages were fairly well indexed by search engines. And it was not just that Bing had touched the main page — the better a site was indexed, the more likely it was to get hacked:

It's not all that simple, however. The following charts show the distribution of Google Toolbar PageRank and Alexa Rank for the defaced websites:

Google Toolbar PageRank ranges from 0 to 10, where 0 is the lowest popularity and 10 the highest. Alexa classifies websites by their position in its popularity ranking, so 1 is the most popular (currently occupied by Facebook). Both charts thus show that it was mostly websites of low or medium popularity that were defaced. This can be explained by the fact that popular websites generate revenue, so their owners are more determined to invest in security. Their risk profile gradually shifts from opportunistic hacking to targeted hacking, which was not part of this research.

It is also important to note that hackers do not strictly need search engines to find victims. The most popular vulnerabilities in the wild are found by direct scanning of SSH, SMB, DNS and other ports. Known web application vulnerabilities are also frequently probed through direct connections to the HTTP port. This is, however, a time-consuming technique, and there are many competing scanners out there, so from an attacker's point of view it may be a less attractive method than using search engines.


Between February and April 2013 I scraped the Zone-H Archive for new entries. In total, I collected over 20,000 reports on recently defaced websites. From that set I excluded results marked as "mass defacements", i.e. cases where the compromise of a single server resulted in the defacement of hundreds or thousands of domains hosted there. I eventually ended up with a set of over 7,000 defaced websites.

For each defaced website newly posted on Zone-H, I immediately checked how many results Bing would return for its domain, using the paid Azure Datamarket Bing API. Checking quickly was important, since the very fact of a defacement could draw attention to the website and increase its search engine coverage.
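As a rough illustration, the per-domain check can be sketched as below. This is a hypothetical reconstruction, not the code I actually ran: the endpoint URL, the OData query parameters, and the `WebTotal` field reflect the Azure Datamarket Bing Search API as documented at the time, but all of them are assumptions here.

```python
import urllib.parse

# Assumed Azure Datamarket endpoint for Bing web search (OData style);
# treat both the URL and the response shape as illustrative only.
API_ROOT = "https://api.datamarket.azure.com/Bing/Search/Web"


def build_query_url(domain: str) -> str:
    """Build an OData query asking Bing for results within one domain."""
    params = urllib.parse.urlencode({
        "Query": f"'site:{domain}'",  # site: operator restricts to the domain
        "$format": "json",            # ask for a JSON rather than Atom response
    })
    return f"{API_ROOT}?{params}"


def extract_result_count(payload: dict) -> int:
    """Pull the total result count out of the (assumed) OData JSON envelope."""
    results = payload.get("d", {}).get("results", [])
    return int(results[0]["WebTotal"]) if results else 0
```

Fetching `build_query_url("example.com")` with the account key attached and feeding the decoded JSON to `extract_result_count` would yield one data point per defaced domain.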

Initially I tried to use Google, but there is an interesting inconsistency between the results returned by the consumer Google page and by Google Custom Search Engine (CSE), which is currently the only legitimate way to programmatically fetch search results from Google. Specifically, CSE frequently returns zero results for domains that show a non-zero (or even significantly larger) number of results when checked manually. Bing did not have this problem.

Google Toolbar PageRank was collected using the publicly known algorithm and the public Google Toolbar servers. Alexa rank was collected using the paid Amazon AWIS service.


It is well known that botnets actively use search engines to locate their victims. Botnets usually keep a low profile and do not announce their victims publicly. The data I looked at came instead from a defacement archive, a kind of hall of fame where hackers compete on the number of compromised websites.

The results suggest that proper search engine control is an important part of limiting a website's attack surface, just like closing ports on servers and other recognized hardening techniques. At the same time, search engine control is a safeguard that is frequently underestimated from both a security and a business point of view. I've seen many B2B services that revealed their entire customer base through permissive search engine controls.

At the same time, it's difficult to find a safeguard that is cheaper or easier to set up. robots.txt and the Google and Bing Webmaster Tools are trivial to configure and, most importantly, require minimal or no changes to application code or templates. This is often the easiest and fastest way to limit the exposure of ancient, legacy websites that are still needed but no longer maintained.
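For a private site that should not appear in search results at all, a minimal robots.txt along these lines is usually enough (the file goes in the web root; the rule below is the standard "disallow everything" form):

```
User-agent: *
Disallow: /
```

Note that robots.txt only keeps well-behaved crawlers out — it is an indexing control, not an access control, so it complements rather than replaces authentication on the site itself.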


These recommendations are based on the following assumptions: