Annoying Robots: A Solution for Google Analytics

Last month I posted about a surge of illegitimate traffic we’ve experienced on Grist.  Because these visits did things like load JavaScript, the impressions were difficult to distinguish from real traffic, except that they were all from IE and all of very low quality.

A large number of people who run websites are experiencing the same thing, which is only really a problem because it massively distorts analytics (Google Analytics, for example) and skews AdSense to a destructive degree.  While many affected folks have simply removed AdSense from the affected pages, until now I’ve seen no report of anyone excluding this traffic from Google Analytics.

We’ve just begun testing a solution that does this, and I’d like to post about it sooner rather than later so that others may both try it out and potentially benefit from it.

The premise of this solution came from Darrin Ward, who suggested in this thread:

1) For IE users only, serve the page with everything loaded in a JS variable and do a document.write of it only when some mouse cursor movement takes place (GA wouldn’t execute until the document.write).
2) Use the same principle, but only load the GA code when a mouse movement takes place.

While we didn’t exactly do either of these things, we did take the idea of using DOM events that are indicative of a real human (mouse movement, keystroke) to differentiate the zombie traffic from the real.  The good news is that this seems — largely — to work.  Here’s how to do it:

1.  First of all, you must be using Google Analytics’ current (i.e. asynchronous) tracking code for this to make any sense.  If you’re not, you probably should be anyway, so it’s a good time to quickly switch.  Your page loads will improve if you do.
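For reference, the standard asynchronous snippet looks like this (UA-XXXXXXX-1 is a placeholder for your own property ID); it goes near the top of the page, typically just before the closing </head> tag:

```javascript
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXXXX-1']);  // placeholder: your GA property ID
_gaq.push(['_trackPageview']);

// Load ga.js asynchronously so it doesn't block page rendering.
(function() {
	var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
	ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
	var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
```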

2.  As a first step, we recommend implementing some Google Analytics Events to differentiate good traffic from bad.  This will continue tracking impressions on all page loads, but will also fire a special event that separates the good traffic from the bad.  Later, once you are happy that the detection is working properly, you can exclude impression tracking entirely (see below).
To do so, insert this code in your site header after the code that loads Google Analytics:

	// Evil Robot Detection
	// Fires a GA event only when we see evidence of a human (mousemove or
	// keydown) on a suspect visit, i.e. direct traffic from Internet Explorer.

	var category = 'trafficQuality';
	var dimension = 'botDetection';
	var human_events = ['onkeydown','onmousemove'];

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Suspect visit: wait for a human-looking event before firing anything.
		for(var i = 0; i < human_events.length; i++){
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	}else{
		// Non-IE browser or referred visit: mark it clean immediately.
		_gaq.push( [ '_trackEvent', category, dimension, 'botExcluded', 1, true ] );
	}

	function ourEventPushOnce(ev) {

		// A human showed up: record which event proved it.  The final "true"
		// (opt_noninteraction) keeps the event from distorting bounce rates.
		_gaq.push( [ '_trackEvent', category, dimension, 'on' + ev.type, 1, true ] );

		// Unhook the listeners so we fire at most once per page view.
		for(var i = 0; i < human_events.length; i++){
			document.detachEvent(human_events[i], ourEventPushOnce);
		}

	} // end ourEventPushOnce()

	// End Evil Robot Detection
This code pushes a GA event of category “trafficQuality” and dimension “botDetection” to Google Analytics, with a label that, whenever possible, records the type of DOM event that fired.  For any non-IE browser, or any page view that arrives with a referrer, it instead pushes a “botExcluded” event with the same category and dimension.  The result: the only page views that produce no event at all are direct IE impressions with no mousemove or keydown, which is exactly what we want.

3.  So how does this help you?  Well, now in Google Analytics you’ll be able to tell the good traffic from the bad: the good will have an event, the bad won’t.  The easiest way to check this is under Content -> Events -> Events Overview.  Within a few hours of pushing the above code, you should see events begin to accumulate there.

4.  To restore more sanity to your Google Analytics, you can also define a goal.  (Under Admin, go to Goals and define a new goal like this:)

5.  Once you implement this goal, Google Analytics will know which traffic has achieved it and which hasn’t; in other words, you’ve defined a conversion.  This means that on any report in Google Analytics, you can restrict the view to only those visits that converted.  This is done in the Advanced Segments menu:

6.  Note that this affects only new data entering Google Analytics; unfortunately, it does not scrub old data.  In our case, it restored Google Analytics to its normal self after a couple of months of frustration.

7.  Eventually, you may want to stop Google Analytics from even recording an impression in the case of bad traffic.  To do that, just remove the

_gaq.push( [ '_trackEvent', ...

lines above and replace them with

_gaq.push(['_trackPageview']);

Of course, don’t forget to remove the call to _trackPageview from its normal place outside the conditional.
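To see the whole of step 7 in one place, here is a sketch of the modified block. It is written as a function that takes document, navigator, and the _gaq queue as parameters so the logic can be exercised outside a browser; wireBotGate and that parameterization are my own framing for illustration, not code from the post.

```javascript
// Sketch of step 7: on a suspect visit (direct traffic from IE), hold the
// pageview back until a human-looking event fires; otherwise record it now.
// wireBotGate and its parameter names are illustrative, not from the post.
function wireBotGate(doc, nav, gaq, humanEvents) {
	function pushOnce(ev) {
		gaq.push(['_trackPageview']);  // a human showed up: count the impression
		for (var i = 0; i < humanEvents.length; i++) {
			doc.detachEvent(humanEvents[i], pushOnce);  // fire at most once
		}
	}
	if (nav.appName == 'Microsoft Internet Explorer' && !doc.referrer) {
		for (var i = 0; i < humanEvents.length; i++) {
			doc.attachEvent(humanEvents[i], pushOnce);
		}
	} else {
		gaq.push(['_trackPageview']);  // clean visit: count it immediately
	}
}
```

In the page itself you would call wireBotGate(document, navigator, _gaq, human_events) in place of the normal _trackPageview push.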

I’d love to hear any ideas for improving this.  We don’t use AdSense, but if you do, you could use the same technique to conditionalize the insertion of the ad code into the DOM.
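A minimal sketch of that AdSense variant, assuming the same IE/direct-traffic test as the detection code; gateAdInsertion and insertAd are hypothetical names, and the insertAd callback would contain your actual ad-code injection:

```javascript
// Hypothetical sketch (gateAdInsertion and insertAd are my names, not from
// the post): on a suspect visit, defer ad insertion until a human event fires.
function gateAdInsertion(doc, nav, insertAd, humanEvents) {
	function insertOnce() {
		insertAd();  // inject the ad code into the DOM
		for (var i = 0; i < humanEvents.length; i++) {
			doc.detachEvent(humanEvents[i], insertOnce);  // run at most once
		}
	}
	if (nav.appName == 'Microsoft Internet Explorer' && !doc.referrer) {
		for (var i = 0; i < humanEvents.length; i++) {
			doc.attachEvent(humanEvents[i], insertOnce);
		}
	} else {
		insertAd();  // clean visit: insert the ad code immediately
	}
}
```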

Good luck bot killers!

[UPDATE May 8, 2012] Added the final argument to _trackEvent to prevent distortion of bounce rates. Thanks Chase!

8 Responses

  1. Pingback: Annoying cyborgs attack, distort analytics [UPDATED][SOLVED-ish] | StkyWll

  2. Hi Matt,
    I came across a comment you made in an old Neiman Lab article about how you quantify user engagement at Grist.org. I am trying to better understand the elusive metric of “user engagement” and how online publishers in the media world are measuring it and was hoping that if you had some time I might be able to contact you for a quick call.

    Cheers,
    Tim Cavanaugh

  3. Matt,

    Great job figuring out a solution to this issue and bringing some sanity back to GA numbers. I implemented the suggested code about a week ago (with minor adjustments according to my situation), and it seems to be working correctly.

    The only issue I’m seeing is that the bounce rate is no longer accurate. It instantly dropped from 35-40% to 6-7% once the code was implemented. My assumption is that once any human event is detected, that user is no longer considered a “bounce” regardless of subsequent action such as moving the mouse to click on back or close the page. Other items that suggest this is the issue include the bounce rate for “Visits with Conversions” is 0%, and total traffic appears roughly 8% higher than traffic from conversions which leads me to assume that bots are making up that 6-7% overall bounce rate that remains.

    Have you been experiencing this issue with bounce rate as well?

    Thanks,
    Chase

    • Chase — as a matter of fact, we have! Thanks for bringing this up. This was just an oversight on my part, and you are right that an event does indeed make Google consider the visit a non-bounce. However, Google gives us a way to deal with this in the form of a 5th argument to _trackEvent:

      _trackEvent(category, action, opt_label, opt_value, opt_noninteraction)

      if you set opt_noninteraction to true, then the visit with this event will not immediately become a non-bounce.

      I’ve just edited the examples above to reflect this — thanks for bringing this up and good luck!

  4. Thanks for posting all this! I’ve been looking for answers and was glad to see you had posted to Google Groups. I only wish I had found it sooner…

    I’ve implemented the code today and am hoping for normalcy to return to my site.

    • No problem Frank — let me know if it helps … happy to answer questions if you have any. Eventually I’d love to know what’s causing this stupid phenomenon … but no luck on that front yet.
