Annoying Robots: A Solution for Google Analytics

Last month I posted about a surge of illegitimate traffic we’ve experienced on Grist.  Given that these visits did things like load JavaScript, the impressions were difficult to distinguish from real traffic, except that they all came from IE and were all of very low quality.

A large number of people who run websites are experiencing the same problem, which is only really a problem because it can massively distort analytics (Google Analytics, for example) and skew AdSense to a destructive degree.  While many affected folks have simply removed AdSense from the affected pages, until now I’ve seen no report of anyone excluding the traffic from Google Analytics.

We’ve just begun testing a solution that does this, and I’d like to post about it sooner rather than later so that others may both try it out and potentially benefit from it.

The premise of this solution came from Darrin Ward in this thread, who suggested:

1) For IE users only, serve the page with everything loaded in a JS variable and do a document.write of it only when some mouse cursor movement takes place (GA wouldn’t execute until the doc.write).
2) Use the same principle, but only load the GA code when a mouse movement takes place.

While we didn’t exactly do either of these things, we did take the idea of using DOM events that are indicative of a real human (mouse movement, keystroke) to differentiate the zombie traffic from the real.  The good news is that this seems — largely — to work.  Here’s how to do it:


1.  First of all, you must be using Google Analytics’ current (i.e., asynchronous) tracking method for this to make any sense.  If you’re not, you probably should be anyway, so it’s a good time to switch.  Your page loads will improve if you do.

2.  We recommend as a first step that you implement some Google Analytics Events to differentiate good traffic from bad.  This will continue tracking impressions on all page loads, but will fire a special event only for traffic that looks legitimate.  Later, once you are happy that the detection is working properly, you can actually exclude impression tracking for the bad traffic (see below).
To do so, insert this code in your site header after the code that loads Google Analytics:

	//Evil Robot Detection

	var category = 'trafficQuality';
	var dimension = 'botDetection';
	var human_events = ['onkeydown','onmousemove'];

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer ) {
		for ( var i = 0; i < human_events.length; i++ ) {
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	} else {
		_gaq.push( [ '_trackEvent', category, dimension, 'botExcluded', 1 ] );
	}

	function ourEventPushOnce(ev) {

		ev = ev || window.event; // IE's attachEvent doesn't pass the event object

		_gaq.push( [ '_trackEvent', category, dimension, 'on' + ev.type, 1 ] );

		for ( var i = 0; i < human_events.length; i++ ) {
			document.detachEvent(human_events[i], ourEventPushOnce);
		}

	} // end ourEventPushOnce()

	//End Evil Robot Detection
This code pushes a GA event of category “trafficQuality” and dimension “botDetection” to Google Analytics, with a label that will, whenever possible, contain the type of DOM event.  It will also push a “botExcluded” event with the same category and dimension for any non-IE browser or any page view with a referrer.  This means the only case in which you won’t get a Google Analytics event is a direct IE impression with no mousemove or keydown, which is exactly what we want.

3.  So how does this help you?  Well, now in Google Analytics you’ll be able to tell the good traffic from the bad: the good will have an event, the bad won’t.  The easiest way to check this is under Content -> Events -> Events Overview.  Within a few hours of pushing the above code you should see events begin to accumulate there.

4.  To restore more sanity to your Google Analytics, you could also define a goal.  (Under Admin, go to Goals and define a new goal based on the event above.)

5.  Once you implement this goal, Google Analytics will know which traffic has achieved the goal and which hasn’t; based on this, you’ve defined a conversion.  This means that on any report in Google Analytics, you can restrict the view to only those visits that converted.  This is done in the Advanced Segments menu.

6.  Note that this affects only new data that enters Google Analytics; unfortunately, it does not scrub old data.  In our case, it’s restored Google Analytics to its normal self after a couple of months of frustration.

7.  Eventually, you may want to stop Google Analytics from even recording an impression for the bad traffic.  To do that, just take the

_gaq.push( [ '_trackEvent', ...

lines above and replace them with

_gaq.push( [ '_trackPageview' ] );

Of course, don’t forget to remove the call to _trackPageview from its normal place outside the conditional.
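To make step 7 concrete, here’s a sketch of what the fully conditional tracking ends up looking like.  The helper names (isSuspectVisit, trackConditionally) and the doc/nav parameters are illustrative, not part of the snippet above; they’re factored out so the logic can be exercised without a browser.

```javascript
// Sketch of step 7: _trackPageview is deferred for suspect (direct IE)
// visits and fires only once a human-looking event occurs.
var _gaq = _gaq || [];

function isSuspectVisit(appName, referrer) {
  // Suspect: reports itself as IE and arrived with no referrer.
  return appName === 'Microsoft Internet Explorer' && !referrer;
}

function trackConditionally(doc, nav) {
  if (isSuspectVisit(nav.appName, doc.referrer)) {
    var fired = false;
    var onHuman = function () {
      if (fired) return;   // count the page view at most once
      fired = true;
      _gaq.push(['_trackPageview']);
    };
    doc.attachEvent('onmousemove', onHuman);
    doc.attachEvent('onkeydown', onHuman);
  } else {
    // Everyone else is tracked immediately, as usual.
    _gaq.push(['_trackPageview']);
  }
}
```

In a real page you’d simply call trackConditionally(document, navigator) right after the GA loader, in place of the normal _trackPageview push.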

I’d love to hear about any ideas for improvement.  We don’t use AdSense, but if you do, you could use this same technique to conditionalize the insertion of ad code into the DOM.
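For the AdSense case, a minimal sketch of conditionalizing ad insertion might look like the following.  injectAd is a stand-in for whatever actually writes your ad tags into the DOM; nothing here is AdSense-specific API, and the doc/nav parameters are just there so the logic is testable.

```javascript
// Keep the ad code out of the DOM for suspect (direct IE) visits until
// a human-looking event fires; everyone else gets ads immediately.
function deferAdsForSuspects(doc, nav, injectAd) {
  var suspect = nav.appName === 'Microsoft Internet Explorer' && !doc.referrer;
  if (!suspect) {
    injectAd();            // normal visitors: ads right away
    return;
  }
  var done = false;
  var onHuman = function () {
    if (done) return;      // inject at most once
    done = true;
    injectAd();
  };
  doc.attachEvent('onmousemove', onHuman);
  doc.attachEvent('onkeydown', onHuman);
}
```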

Good luck bot killers!

Style Tiles

Catchy name for a good idea.  “Style Guide” always seemed too vague and formal.

How to get up and running on Amazon EC2 quickly (for OSX people)

So I needed to set up my OSX rig to access AWS, spin up and configure an Ubuntu instance, install Apache, PHP, MongoDB and do various other tasks. Good thing I found these two great resources:

First, here’s Robert Sosinski with a great guide on how to get set up with the EC2 command line tools on Mac OSX.  Really clear and well done.

Next, here’s a quick guide from RSM on how to turn that brand new instance into a full LAMP (that’s Linux, Apache, Mongo, PHP) stack … though really you could install whatever packages you need.

CanopyEngine — Grist’s Knight News Challenge entry

Check out this project I’m involved with at Grist — in fact I’m starting serious work on it next week while I’m in Buenos Aires.  It’s all about building an open source platform for realtime/algorithmic news.  If you want to be really really nice you could even “like” the project over here.

What flavor of NoSQL are you?

… this is a pretty good summary (really good actually) of several such schemaless DB engines:

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison

Arianna Quotes Jess Zimmerman

Congrats to my Grist colleague Jess Zimmerman, who was quoted heavily today in the Huffington Post by Arianna in a piece about the fetishization of social media.

“See, no matter what some social media guru told you,” concludes Zimmerman, “Twitter is not just a marketing amplification engine. It’s a bunch of people, sharing things they think are worth sharing. Trying to start a McDonald’s appreciation hashtag is like the smelly, creepy kid running a write-in campaign for Prom King — not gonna work, and probably gonna backfire. People don’t start liking you just because you suggest a way to express their admiration.”

This was actually from a really good GristList post back in January.  Better late than never.  Now I wonder when people will begin to realize that Twitter has not quite matured into a medium (or at least expressed an interface) that fulfills that higher goal: “a bunch of people, sharing things that they think are worth sharing.”  If the point of the whole thing is the things being shared, how many of those things are we all missing now due to the fire-hose nature of Twitter itself, the lack of good discovery tools and the poor interface?  I say this not to be a jerk, but rather because it’s becoming more and more obvious (to me anyway) that Twitter, or the developer community around Twitter, needs to do a better job of solving some of these problems so that Twitter does not end up going the way of RSS: a technology with obvious merits but an unclear use case and purpose.

Annoying cyborgs attack, distort analytics [UPDATED][SOLVED-ish]

Over the last couple of weeks, I’ve been dealing with a strange phenomenon: a substantial (but not crippling) amount of traffic suddenly came our way.  The characteristics of this traffic are:

  • it’s direct (i.e. — no referrer and not search traffic)
  • it’s all from IE browsers
  • it’s nearly all to the homepage
  • it’s widely distributed in terms of geography, network etc.
  • it’s of very poor quality — low time on site, very high bounce, very low engagement
  • it’s real — confirmed in multiple analytics packages
  • it flies under DDoS radar because it is less intense than a DDoS burst, yet rather indistinguishable from real traffic.

This traffic simply started one day, and has gone up or down a little since.  Here’s what I’ve been able to conclude:

  • it’s likely not bot traffic in the traditional sense.  Assets such as the JavaScript and ads for the page are getting loaded along with the DOM.
  • it’s likely not human either — the pattern is too uniform and the quality universally crappy.

This traffic has characteristics consistent with both bot and human behavior — I think we should call it cyborg traffic!  The pattern is consistent with a voluntary browser-net of some sort (people whoring out their OS’s to a central service — see Roger Dooley’s proposition below) or some kind of malware that is involuntarily opening windows in users’ browsers (less likely).  If this behavior did not seem to include older IE browsers, I’d also speculate that it could be related to prerendering, but that seems unlikely given the facts.

Others have noticed it too, some positing causes:

  • This thread on webmasterworld contains lots of people reporting and reflecting on the problem
  • Roger Dooley (the fellow who started that thread) has proposed with some good evidence that the whole thing is due to a shady entity called Gomez from a company called Compuware.  Roger currently seems to be waiting to hear back from these guys — I hope he does soon, and posts the results of any conversations.
  • A post appeared on the Google Analytics product forums reporting the same behavior
  • A response to the webmasterworld thread by @incredibill seems to indicate that he’s found a way, via the request headers, to distinguish this sort of traffic from human traffic.  Any chance you could share Bill?

For updates on this situation, see Roger’s Post, or check back here — I’ll update when more info comes to light.

[UPDATE March 5th, 2012]

More consensus that this is a botnet, but little specific additional clarity about the nature of the traffic involved.  Good additional discussion appears here.

Some affected soul even posted a rollup of their logs, with user agents:


[UPDATE March 7, 2012]
Here’s the first potentially reasonable mitigation I’ve come across (from the Google product group thread, above).

“BB_CCIT” Says:

We have been getting the same kind of traffic to our homepage now for 17 days. Slow enough that it doesn’t do anything but ruin our analytics and advertising impressions.

One way that we started filtering things out was…

1) If it is an internet explorer user
2) It has no referrer (direct traffic)

If so we mark the IP on our blacklist at the bottom of our fully loaded page. If we detect a mouse movement or click event using javascript, we then update our database and mark their IP address as a verified user via an ajax call. This filtering system basically allows the bot to visit our site once and after we blacklist them any re-visits to our site will receive a 404 page for them.

Even if a blacklist were not used, one could conditionally load analytics packages in this way … I think.
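A sketch of what that conditional loading could look like, assuming the same IE-plus-no-referrer test used in the quoted approach.  loadScript stands in for the usual async script injection and is a placeholder, not a real GA API; doc/nav are parameters so the logic can be exercised without a browser.

```javascript
// Suspect (direct IE) visits don't get ga.js at all unless a
// human-looking event fires; everyone else loads analytics immediately.
function loadAnalyticsConditionally(doc, nav, loadScript) {
  var gaUrl = '//www.google-analytics.com/ga.js';
  var suspect = nav.appName === 'Microsoft Internet Explorer' && !doc.referrer;
  if (!suspect) {
    loadScript(gaUrl);     // normal visitors: load analytics right away
    return;
  }
  var loaded = false;
  var onHuman = function () {
    if (loaded) return;    // load the script at most once
    loaded = true;
    loadScript(gaUrl);
  };
  doc.attachEvent('onmousemove', onHuman);
  doc.attachEvent('onclick', onHuman);
}
```

The same onHuman hook is also the natural place for the AJAX “verified human” ping described in the quote above.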

Additional update:  Google seems to be investigating.  A Google staffer posted:

We’re still investigating this issue and I’ll keep you posted when there are further updates. We appreciate your patience.

[UPDATE April 27, 2012]  We’ve found a workable way to exclude this stuff from Analytics. Check it out here.

New blog design on Grist today

Today we rolled out a new blog design at work.  There are a number of reasons I’m happy about this:

  1. This is our first work with Mignon Khargie … she’s the designer we’ve been working with at Grist.  She’s not only a spectacular talent, but a pleasure to work with.
  2. This is our first test of a new set of iterative practices we’re using for design.
  3. This is just the first step in a larger series of changes we’ll be making to the site in the coming months.  I’m really excited about this process, and the team we have in place to do this work.