Annoying Robots: A Solution for Google Analytics

Last month I posted about a surge of illegitimate traffic we’ve experienced on Grist.  Because these visits did things like load JavaScript, the impressions were difficult to distinguish from real traffic, except that they all came from IE and were all of very low quality.

A large number of people who run websites are experiencing the same thing, which is only really a problem because it can massively distort analytics (Google Analytics, for example) and skew AdSense to a destructive degree.  While many affected folks have simply removed AdSense from the affected pages, I’ve seen no report so far of anyone excluding the traffic from Google Analytics.

We’ve just begun testing a solution that does this, and I’d like to post about it sooner rather than later so that others may both try it out and potentially benefit from it.

The premise of this solution came from a suggestion by Darrin Ward in this thread:

1) For IE users only, serve the page with everything loaded in a JS variable and do a document.write of it only when some mouse cursor movement takes place (GA wouldn’t execute until the doc.write).
2) Use the same principle, but only load the GA code when a mouse movement takes place.

While we didn’t exactly do either of these things, we did take the idea of using DOM events that are indicative of a real human (mouse movement, keystroke) to differentiate the zombie traffic from the real.  The good news is that this seems — largely — to work.  Here’s how to do it:

1.  First of all, you must be using Google Analytics’ current (i.e., asynchronous) tracking method for this to make any sense.  If you’re not, you probably should be anyway, so it’s a good time to switch.  Your page loads will improve if you do.
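
(For reference, the standard asynchronous snippet looks roughly like this; UA-XXXXX-X stands in for your own web property ID, and the exact snippet you paste should come from your GA admin screen.)

	var _gaq = _gaq || [];
	_gaq.push(['_setAccount', 'UA-XXXXX-X']); // your web property ID goes here
	_gaq.push(['_trackPageview']);

	(function() {
		// Load ga.js asynchronously so it doesn't block page rendering
		var ga = document.createElement('script');
		ga.type = 'text/javascript';
		ga.async = true;
		ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
		var s = document.getElementsByTagName('script')[0];
		s.parentNode.insertBefore(ga, s);
	})();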

2.  We recommend as a first step that you implement some Google Analytics Events to differentiate good traffic from bad.  This will continue tracking impressions on all page loads, but will also fire off a special event that marks the traffic that looks human.  Later, once you are happy that the detection is working properly, you can actually exclude impression tracking for the bad traffic (see below).
To do so, insert this code in your site header after the code that loads Google Analytics:

	//Evil Robot Detection

	var category = 'trafficQuality';
	var dimension = 'botDetection';
	var human_events = ['onkeydown','onmousemove'];

	// The suspect traffic is all IE with no referrer; only those visits wait for a
	// human-looking event.  Everything else is marked good immediately.
	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		for(var i = 0; i < human_events.length; i++){
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	}else{
		_gaq.push( [ '_trackEvent', category, dimension, 'botExcluded', 1 ] );
	}

	// Fires on the first keystroke or mouse movement, then unbinds itself.
	function ourEventPushOnce(ev) {

		ev = ev || window.event; // older IE supplies the event via window.event

		_gaq.push( [ '_trackEvent', category, dimension, 'on' + ev.type, 1 ] );

		for(var i = 0; i < human_events.length; i++){
			document.detachEvent(human_events[i], ourEventPushOnce);
		}

	} // end ourEventPushOnce()

	//End Evil Robot Detection
This code pushes a GA event to Google Analytics with the category “trafficQuality”, the action “botDetection” (held in the dimension variable above), and a label that, whenever possible, contains the type of DOM event that triggered it.  It also pushes a “botExcluded” event with the same category and action for any non-IE browser or any page view with a referrer.  The only case in which no event is pushed at all is a direct IE impression with no mousemove or keydown, which is exactly what we want.

3.  So how does this help you?  Well, now in Google Analytics you’ll be able to tell the good traffic from the bad.  The good will have an event; the bad won’t.  The easiest way to verify this is to look at Content -> Events -> Overview in Google Analytics.  Within a few hours of pushing the above code you should see events begin to accumulate there.

4.  To restore more sanity to your Google Analytics, you can also define a goal (under Admin, go to Goals and define a new goal that matches the event above).

5.  Once you implement this goal, Google Analytics will know which traffic has achieved it and which hasn’t — in other words, you’ve defined a conversion.  This means that on any report in Google Analytics, you can restrict the view to only those visits that converted, via the Advanced Segments menu.

6.  Note that this affects only new data that enters Google Analytics — it does not scrub old data unfortunately.  In our case, it’s restored Google Analytics to its normal self after a couple of months of frustration.

7.  Eventually, you may want to stop Google Analytics from even recording an impression in the case of bad traffic.  To do that, just remove the

_gaq.push( [ '_trackEvent', ...

lines above and replace them with

_gaq.push(['_trackPageview']);

Of course, don’t forget to remove the call to _trackPageview from its normal place outside the conditional.
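
Put together, the modified block ends up looking roughly like this (just a sketch of the substitution described above; with the _trackEvent pushes gone, the category and dimension variables are no longer needed, so they are dropped here):

	//Evil Robot Detection (impression-exclusion version)

	var human_events = ['onkeydown','onmousemove'];

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Suspect traffic: hold the pageview until a human-looking event fires.
		for(var i = 0; i < human_events.length; i++){
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	}else{
		// Everything else gets its pageview recorded right away.
		_gaq.push(['_trackPageview']);
	}

	function ourEventPushOnce(ev) {
		_gaq.push(['_trackPageview']);
		// Unbind so the pageview is only pushed once.
		for(var i = 0; i < human_events.length; i++){
			document.detachEvent(human_events[i], ourEventPushOnce);
		}
	} // end ourEventPushOnce()

	//End Evil Robot Detection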

I’d love to hear any ideas for improving this.  We don’t use AdSense, but if you do, you could use the same technique to make the insertion of your ad code into the DOM conditional on a human-looking event.
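
If you wanted to try that, a minimal sketch might look like the following; the script URL, container id, and function names are purely illustrative placeholders rather than real AdSense code:

	// Sketch: only inject the ad code once a human-looking event fires.
	// AD_SCRIPT_URL and AD_CONTAINER_ID are placeholders for your own ad setup.
	var AD_SCRIPT_URL = 'http://example.com/your-ad-code.js';
	var AD_CONTAINER_ID = 'ad-slot';

	function insertAdCode() {
		var s = document.createElement('script');
		s.type = 'text/javascript';
		s.async = true;
		s.src = AD_SCRIPT_URL;
		document.getElementById(AD_CONTAINER_ID).appendChild(s);
	}

	function insertAdCodeOnce() {
		insertAdCode();
		document.detachEvent('onmousemove', insertAdCodeOnce);
		document.detachEvent('onkeydown', insertAdCodeOnce);
	}

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Suspect traffic: don't load ads until we see a human-looking event.
		document.attachEvent('onmousemove', insertAdCodeOnce);
		document.attachEvent('onkeydown', insertAdCodeOnce);
	} else {
		insertAdCode();
	}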

Good luck, bot killers!

What flavor of NoSQL are you?

… this is a pretty good summary (really good, actually) of several schemaless DB engines:

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison

Annoying cyborgs attack, distort analytics [UPDATED][SOLVED-ish]

Over the last couple of weeks, I’ve been dealing with a strange phenomenon: a substantial (but not crippling) amount of traffic suddenly came our way.  The characteristics of this traffic are:

  • it’s direct (i.e. — no referrer and not search traffic)
  • it’s all from IE browsers
  • it’s nearly all to the homepage
  • it’s widely distributed in terms of geography, network etc.
  • it’s of very poor quality — low time on site, very high bounce, very low engagement
  • it’s real — confirmed in multiple analytics packages
  • it flies under DDoS radar because it is less intense than a DDoS burst, and largely indistinguishable from real traffic.

This traffic just simply started one day, and has gone up or down a little bit since.  Here’s what I’ve been able to conclude:

  • it’s likely not bot traffic in the traditional sense.  Assets such as the JavaScript and ads for the page are getting loaded along with the DOM.
  • It’s likely not human either — the pattern is too uniform and the quality universally crappy.

This traffic has characteristics consistent with both bot and human behavior — I think we should call it cyborg traffic!  The pattern is consistent with a voluntary browser-net of some sort (people whoring out their OSes to a central service — see Roger Dooley’s proposition below) or some kind of malware that is involuntarily opening windows in users’ browsers (less likely).  If this behavior did not seem to include older IE browsers, I’d also speculate that it could be related to prerendering, but that seems unlikely given the facts.

Others have noticed it too, some positing causes:

  • This thread on WebmasterWorld contains lots of people reporting and reflecting on the problem
  • Roger Dooley (the fellow who started that thread) has proposed, with some good evidence, that the whole thing is due to a shady entity called Gomez from a company called Compuware.  Roger currently seems to be waiting to hear back from these guys — I hope he does soon, and posts the results of any conversations.
  • A post appeared on the Google Analytics product forums reporting the same behavior
  • A response to the WebmasterWorld thread by @incredibill seems to indicate that he’s found a way, via the request headers, to distinguish this sort of traffic from human traffic.  Any chance you could share, Bill?

For updates on this situation, see Roger’s Post, or check back here — I’ll update when more info comes to light.

[UPDATE March 5th, 2012]

More consensus that this is a botnet, but little specific additional clarity about the nature of the traffic involved.  Good additional discussion appears here.

Some affected soul even posted a rollup of their logs, with user agents:
https://analytics-a-googleproductforums-com.googlegroups.com/attach/5aade66b7c1d07b6/user_agents.csv?pli=1&view=1&part=4

[UPDATE March 7, 2012]
Here’s the first potentially reasonable mitigation I’ve come across (from the Google product group thread, above).

“BB_CCIT” Says:

We have been getting the same kind of traffic to our homepage now for 17 days. Slow enough that it doesn’t do anything but ruin our analytics and advertising impressions.

One way that we started filtering things out was…

1) If it is an internet explorer user
2) It has no referrer (direct traffic)

If so we mark the IP on our blacklist at the bottom of our fully loaded page. If we detect a mouse movement or click event using javascript, we then update our database and mark their IP address as a verified user via an ajax call. This filtering system basically allows the bot to visit our site once and after we blacklist them any re-visits to our site will receive a 404 page for them.

Even if a blacklist were not used, one could conditionally load analytics packages in this way … I think.
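
For what it’s worth, here is a rough, untested sketch of what conditionally loading Google Analytics itself might look like, using the same IE-with-no-referrer test and mouse/keyboard events described above (UA-XXXXX-X is a placeholder property ID):

	// Rough sketch: defer both the pageview and the ga.js loader until a
	// human-looking event fires, but only for the suspicious IE/no-referrer case.
	var _gaq = _gaq || [];
	_gaq.push(['_setAccount', 'UA-XXXXX-X']); // placeholder property ID

	function loadAnalytics() {
		_gaq.push(['_trackPageview']);
		var ga = document.createElement('script');
		ga.type = 'text/javascript';
		ga.async = true;
		ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
		var s = document.getElementsByTagName('script')[0];
		s.parentNode.insertBefore(ga, s);
	}

	function loadAnalyticsOnce() {
		loadAnalytics();
		document.detachEvent('onmousemove', loadAnalyticsOnce);
		document.detachEvent('onkeydown', loadAnalyticsOnce);
	}

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		document.attachEvent('onmousemove', loadAnalyticsOnce);
		document.attachEvent('onkeydown', loadAnalyticsOnce);
	} else {
		loadAnalytics();
	}

The obvious tradeoff is that a genuinely human visitor who never touches the mouse or keyboard would not be counted at all.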

Additional update:  Google seems to be investigating.  A Google staffer posted:

We’re still investigating this issue and I’ll keep you posted when there are further updates. We appreciate your patience.

[UPDATE April 27, 2012]  We’ve found a workable way to exclude this stuff from Analytics. Check it out here.

Will Facebook Connect Liberalize Data Retention Terms?

TechCrunch re-rumored the possibility of changes to Facebook Connect, including changes to FB’s data retention terms for developers, which (as the post points out) are a) onerous and b) unenforceable.

Facebook Connect’s current terms of service prevent any third-party applications from storing data obtained through the API for more than 24 hours. And what is available through the API? If you authorize a third-party application using FBC, then basically anything in your profile is available to that application or site, including things like your avatar, your friend network and lots of biographical data. Notably, your email address is not always directly available — developers only get access to a proxy unless you explicitly grant otherwise. In any case, all of this is cacheable for only 24 hours. After that you have to refresh the data from Facebook directly or delete it.

If this were to change, there would be much rejoicing among FB developers. Lots of effort is spent making sure that data is properly refreshed.   Many would also like to see default direct access to the person’s real email address.

It’s difficult to background-spoof a local account using unstable data (which is essentially what most Facebook Connect website integrations attempt to do), especially since a real email address is not always available and, moreover, cannot be relied upon for more than 24 hours. This has caused many sites (see, for example, Huffington Post Social News) to ask for additional information directly from the user (and sometimes a lot of it) at the time someone connects using FB, a requirement that vastly reduces FBC’s utility as an SSO technology but allows the site to permanently retain that data, since it did not originate from FB.

So here, O Facebook, is my wish-list: a real email address by default, and the ability to permanently cache certain data as long as the user remains connected to my app. What else should be on my wish list?

Dear Google, Thanks for Dissing IE6

Ben Parr at Mashable made my heart flutter the other day when he posted about Google’s announcement of the beginning of the end of IE6 support across their product suite.  Good timing for me, since it came on the very day that we here at Grist were scheduled to make a decision about IE6 support for a bunch of new designs and features coming out over the next several months.  So I feel validated, Google — thanks … though I think, based on our audience stats and the general way the wind is blowing, we would have made a similar decision in the end.  It seems that MSFT itself intends to support IE6 until it stops supporting Windows XP, an event that won’t occur until 2014.  Most depressing job: responsibility for that release path.