Annoying Robots: A Solution for Google Analytics

Last month I posted about a surge of illegitimate traffic we’ve experienced on Grist. Because these visits did things like load JavaScript, the impressions were difficult to distinguish from real traffic, except that they all came from IE and were all of very low quality.

A large number of people who run websites are experiencing the same problem. It’s only really a problem because it can massively distort analytics (Google Analytics, for example) and skew AdSense to a destructive degree. While many affected folks have simply removed AdSense from the affected pages, until now I’ve seen no report of anyone excluding this traffic from Google Analytics.

We’ve just begun testing a solution that does this, and I’d like to post about it sooner rather than later so that others may both try it out and potentially benefit from it.

The premise of this solution came from this thread, where Darrin Ward suggested:

1) For IE users only, serve the page with everything loaded in a JS variable and do a document.write of it only when some mouse cursor movement takes place (GA wouldn’t execute until the document.write).
2) Use the same principle, but only load the GA code when a mouse movement takes place.

While we didn’t exactly do either of these things, we did take the idea of using DOM events that indicate a real human (mouse movement, keystroke) to differentiate the zombie traffic from the real. The good news is that this seems, largely, to work. Here’s how to do it:

1.  First of all, you must be using Google Analytics’ current (i.e. — asynchronous) tracking method for this to make any sense.  If you’re not, you probably should be anyway, so it’s a good time to switch.  Your page loads will improve if you do.

2.  As a first step, we recommend implementing some Google Analytics Events to differentiate good traffic from bad.  This will continue tracking impressions on all page loads, but will fire off a special event only for the good traffic.  Later, once you’re happy that the detection is working properly, you can actually exclude impression tracking for the bad traffic (see step 7 below).
To do so, insert this code in your site header, after the code that loads Google Analytics:

	//Evil Robot Detection

	var category = 'trafficQuality';
	var dimension = 'botDetection';
	var human_events = ['onkeydown', 'onmousemove'];

	if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Suspicious case: a direct IE visit.  Wait for a human event.
		for (var i = 0; i < human_events.length; i++) {
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	} else {
		// Non-IE browser, or a visit with a referrer: flag as good immediately.
		_gaq.push(['_trackEvent', category, dimension, 'botExcluded', 1]);
	}

	function ourEventPushOnce(ev) {

		ev = ev || window.event; // older IE exposes the event globally, not as an argument

		_gaq.push(['_trackEvent', category, dimension, 'on' + ev.type, 1]);

		// Fire only once: remove both handlers after the first human event.
		for (var i = 0; i < human_events.length; i++) {
			document.detachEvent(human_events[i], ourEventPushOnce);
		}

	} // end ourEventPushOnce()

	//End Evil Robot Detection
This code pushes a Google Analytics event of category “trafficQuality” and dimension “botDetection”, with a label that, whenever possible, contains the type of the triggering event.  It also pushes a “botExcluded” event with the same category and dimension for any non-IE browser or any page view with a referrer.  The result: the only page views with no event at all are direct IE impressions with no mousemove or keydown, which is exactly the traffic we want to isolate.

3.  So how does this help you?  Well, now in Google Analytics you’ll be able to tell the good traffic from the bad: the good will have an event; the bad won’t.  The easiest way to check this is under Content -> Events -> Events Overview.  Within a few hours of pushing the above code, you should see events begin to accumulate there.

4.  To restore more sanity to your Google Analytics, you can also define a goal.  (Under Admin, go to Goals and define a new goal based on the event above.)

5.  Once you implement this goal, Google Analytics will know which traffic has achieved the goal and which hasn’t; in other words, you’ve defined a conversion.  This means that on any report in Google Analytics, you can restrict the view to only those visits that converted, via the Advanced Segments menu.

6.  Note that this affects only new data entering Google Analytics; unfortunately, it does not scrub old data.  In our case, it restored Google Analytics to its normal self after a couple of months of frustration.

7.  Eventually, you may want to stop Google Analytics from even recording an impression in the case of bad traffic.  To do that, just remove the

_gaq.push( [ '_trackEvent', ...

lines above and replace them with

_gaq.push(['_trackPageview']);

Of course, don’t forget to remove the call to _trackPageview from its normal place outside the conditional.
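Put together, the conditional from step 2 with this change applied would look roughly like the sketch below. The stub objects at the top are only there so the snippet is self-contained outside a browser; on a real page, navigator, document, and the _gaq queue already exist and the stubs should be dropped.

```javascript
// Minimal stand-ins for browser objects (drop these on a real page).
var handlers = {};
var navigator = { appName: 'Microsoft Internet Explorer' };
var document = {
    referrer: '',
    attachEvent: function (name, fn) { handlers[name] = fn; },
    detachEvent: function (name, fn) { delete handlers[name]; }
};
var _gaq = [];

// Step 7 applied to the earlier snippet: for suspicious (direct IE)
// traffic, _trackPageview is deferred until a human event fires;
// all other traffic is tracked immediately.
var human_events = ['onkeydown', 'onmousemove'];

function ourEventPushOnce(ev) {
    _gaq.push(['_trackPageview']);
    // Fire only once: remove both handlers after the first human event.
    for (var i = 0; i < human_events.length; i++) {
        document.detachEvent(human_events[i], ourEventPushOnce);
    }
}

if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
    for (var i = 0; i < human_events.length; i++) {
        document.attachEvent(human_events[i], ourEventPushOnce);
    }
} else {
    _gaq.push(['_trackPageview']);
}
```

With this in place, zombie impressions never reach Google Analytics at all, rather than merely being distinguishable by the absence of an event.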

I’d love to hear any ideas for improving this.  We don’t use AdSense, but if you do, you could use the same technique to conditionalize the insertion of the ad code into the DOM.
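That AdSense idea might be sketched like this. This is a hedged sketch, not tested against AdSense: insertAdCode is a placeholder for whatever actually injects your ad markup, and the browser stubs exist only so the snippet runs anywhere.

```javascript
// Browser stubs so the sketch is self-contained (drop on a real page).
var handlers = {};
var navigator = { appName: 'Microsoft Internet Explorer' };
var document = {
    referrer: '',
    attachEvent: function (name, fn) { handlers[name] = fn; },
    detachEvent: function (name, fn) { delete handlers[name]; }
};

// Placeholder: on a real page this would append the ad <script>
// markup to its container instead of flipping a flag.
var adLoaded = false;
function insertAdCode() { adLoaded = true; }

// The same human-event gate, applied to ad insertion: direct IE
// traffic only gets ads after a mousemove or keydown.
var human_events = ['onkeydown', 'onmousemove'];

function insertAdsOnce() {
    insertAdCode();
    for (var i = 0; i < human_events.length; i++) {
        document.detachEvent(human_events[i], insertAdsOnce);
    }
}

if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
    for (var i = 0; i < human_events.length; i++) {
        document.attachEvent(human_events[i], insertAdsOnce);
    }
} else {
    insertAdCode();
}
```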

Good luck bot killers!

Chartbeat API PHP Library Sketch

I wasn’t quite sure whether a PHP library existed for the awesome Chartbeat API (Chartbeat is a realtime analytics engine that we use at Grist, and one I’ve been playing with just now), so I decided to just start writing one. So far I’ve only gotten as far as erecting the framework and the one call that I happen to need now — namely quickstats().

I might keep adding to this if I have time.

/*
 * Chartbeat Library
 *
 * requires cURL
 *
 * by Matt Perry
 */

class Chartbeat {

	public $apikey = '';
	public $host = '';
	public $timeout = 10; // request timeout, in seconds

	public function __construct($apikey, $host) {

		if (!$apikey || !$host) {
			throw new Exception('Missing API key or Host');
		}

		// for now we require cURL ... if it's not compiled into PHP, raise an exception
		if (!function_exists('curl_init')) {
			throw new Exception('cURL not installed');
		}

		$this->apikey = $apikey;
		$this->host = $host;
	}

	public function quickstats($path = '') {

		$url = "http://api.chartbeat.com/live/quickstats/?host={$this->host}&apikey={$this->apikey}";
		if ($path) $url .= "&$path";
		return $this->_do_request($url);
	}

	// makes request, decodes json
	private function _do_request($url) {
		$result = $this->_make_call($url);
		return json_decode($result);
	}

	// makes call with curl
	private function _make_call($url) {

		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
		curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // don't worry about cert problems

		$output = curl_exec($ch);
		curl_close($ch);

		return $output;
	}

}

IronCache: Caching Expression Engine with memcached [UPDATED]

Update, Jan 2012:  Here’s an idea: ignore everything below and visit the IronCache page on GristLabs.  There, you can download IronCache rather than just reading about it!

I’m going to describe a general method of using memcached to implement whole-page caching for EE 1.6.x. It is somewhat complex, requires a small core change to EE and a fair amount of configuration, and is therefore not suitable for most EE users. However, it may be of interest to anyone running a large EE site or one that is subject to intense traffic spikes. This method was suggested to us by the folks at Automattic, makers of WordPress. It is based on the Batcache plugin for WordPress, which they use in a massive way to cache the blogs at wordpress.com.

The goal: We would like to use in-memory caching to store  ENTIRE EE PAGES (not just templates) in order to quickly serve these pages in the event of a traffic spike. We would like this caching to kick in only when necessary (ie — when a particular URL is subject to a spike) and only when the request is not associated with a logged-in session. We would like only certain types of pages to be cached (ie — not the control panel or any other page we want to be free from caching.)

Requirements: Before going any further, I’ll describe the infrastructure elements necessary for this. There are three:

1. A memcached server.
2. The PHP memcache extension.
3.  A small change to EE’s core (see below)

General Method:

We will define two types of cache objects, each of which are associated with a single URL.

COUNT: This object tracks how many times a URL has been accessed in the last N (configurable) seconds.
DATA: This object stores the page content … it has a separate (also configurable) cache expiration time.

SESSIONS_END:

Very early in EE’s process (namely at the sessions_end hook) we check to see if several conditions are true:

- is this an anonymous session?
- does the requested URI match a defined list of patterns (configurable)
- does a non-empty (non-expired) DATA cache object exist for this URL?

If all three of these conditions are met, EE immediately fetches the complete page content from the DATA cache object, sends the appropriate headers followed by the content to the browser, and terminates. This means we get a complete page displayed with very little effort from either PHP or the database.  Yay!

However, if there is a logged in session, control is returned to the normal EE process with no changes — this means that logged in users have the same experience as always.  Also, if the URI does not match one of the defined patterns, we also proceed with vanilla EE — this prevents non-whitelisted pages from being cached.  In the case of an anonymous session and a cache-eligible URI but no (or expired) DATA object, the COUNT object is consulted. If the count is at or above a configurable threshold, then a flag is set in the $SESS->session_cache object so that the page will be cached later (more on this in a second)

Finally, whenever we encounter a cache eligible page, we increment the COUNT cache object associated with that URI.
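The sessions_end decision path above can be summarized in a short sketch. IronCache itself is PHP; this is written in JavaScript purely for illustration, with plain objects standing in for the memcached COUNT and DATA objects, and the flag name cacheThisPage chosen here for clarity (the real extension sets a flag in $SESS->session_cache).

```javascript
var counts = {};      // COUNT: hits per URI in the current counter window
var data = {};        // DATA: cached page content per URI
var threshold = 2;    // plays the role of ironcache_threshold

// Returns cached page content to serve, or null to fall through to the
// normal EE request cycle.  When a URI is hot but uncached, it flags
// the session so the cache is populated at output_start.
function sessionsEnd(uri, isAnonymous, uriIsEligible, sess) {
    if (!isAnonymous || !uriIsEligible) return null;  // vanilla EE path
    counts[uri] = (counts[uri] || 0) + 1;             // bump COUNT on every eligible hit
    if (data[uri]) return data[uri];                  // fresh DATA object: serve it
    if (counts[uri] >= threshold) sess.cacheThisPage = true;
    return null;                                      // fall through to EE
}
```

Note the ordering: COUNT is incremented before DATA is consulted, so every cache-eligible request counts toward the threshold, cached or not.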

OUTPUT_START:
note: this bit requires a small EE core change … see below.

The cache is populated at the beginning of the core.output class, just before the regular headers are sent. At this point, our extension checks for the flag in $SESS->session_cache. If it’s present, it stores the page it is about to send to the browser in the DATA object for that URI.   The result, of course, is that the next request for URI will have this cache available (subject to expiration) at sessions_end.

Components and Configuration:

This method is implemented in an extension called IronCache, written by our team here at Grist.  A number of configuration settings are required:

$conf['ironcache_enable_cache'] = 'y'; // 'y' means the cache is on; anything else, like 'n', means it's not
$conf['ironcache_cache_time'] = '300'; // page cache lifetime, in seconds
$conf['ironcache_counter_reset'] = '10'; // number of seconds it takes for the counter to reset
$conf['ironcache_threshold'] = '2'; // if the page gets this many hits in a given counter period, the page is cached
$conf['ironcache_patterns'] = 'pattern1|pattern2|pattern3'; // pipe-separated list of patterns for detecting cache-eligible pages
$conf['ironcache_cache_homepage'] = 'y'; // whether or not to cache '/'
$conf['ironcache_prefix'] = 'dev'; // prefix added to cache keys; lets multiple applications share one memcached server without name collisions
$conf['ironcache_memcache_host'] = 'localhost'; // memcached server host
$conf['ironcache_memcache_port'] = '11211'; // memcached server port

Core Change:

A small core change is required.  In core.output.php, at the beginning of the display_final_output method, add the following code:

// -------------------------------------------
// 'output_start' hook.
//  - override output behavior
//  - implement whole-page output caching
//
$edata = $EXT->universal_call_extension('output_start', $this);
if ($EXT->end_script === TRUE) return;
//
// -------------------------------------------

If anyone has any suggestions about how to avoid this core change, I’d love to hear them!

The Extension:

I’m planning to post a copy of the actual extension here sometime in the coming week … if you’d like a copy before then (minus some cleanup) I’d be fine emailing it to you … matt0perry [att] gmail [d0t] c0m

Ideas?  Comments?  Suggestions for Improvement?

Comment away.

reCaptcha for ExpressionEngine Member Registration

Well, due to some recent drama involving blog spam at Grist, I had the opportunity to cook up an ExpressionEngine extension that implements reCaptcha for EE signup.

EE provides a convenient hook for overriding the (rather weak) native EE captcha, so getting the captcha to appear on the signup form is simply an exercise in taking the convenient reCaptcha PHP library and invoking its get_html method at the appropriate time. The resulting HTML simply overrides the result of the usual captcha method from EE. Here are the relevant lines from the extension:

require_once($PREFS->ini('system_folder_path').'extensions/mp_recaptcha/recaptchalib.php');
$EXT->end_script = TRUE;
return recaptcha_get_html($PREFS->ini('recaptcha_public_key'));

Now the captcha appears on the signup form, and it’s time to turn our attention to processing. There are three requirements: 1) we need to invoke a check of the reCaptcha at the appropriate moment; 2) we need to cleanly pass an error back to the signup process on failure; and 3) we need to override or at least mask the native captcha check. Here’s my approach:

EE provides the member_member_register_start hook at the beginning of the registration routine. At that point it’s quite easy to do the reCaptcha check:

require_once($PREFS->ini('system_folder_path').'extensions/mp_recaptcha/recaptchalib.php');
$resp = recaptcha_check_answer($reCaptchaPrivateKey,
	$IN->IP,
	$_POST["recaptcha_challenge_field"],
	$_POST["recaptcha_response_field"]);

But how to handle the response? member_member_register_start only allows injection of logic into the registration process (ie — you can’t affect the return value of the method in which it appears.) You can, however, affect the session and any globals. So here’s the trick I used.

In the registration form, I added:

<input type="hidden" name="captcha" value="1">

And in the method invoked when the captcha is created, I did the following:

$DB->query("INSERT INTO exp_captcha (date, ip_address, word) VALUES (UNIX_TIMESTAMP(), '".$IN->IP."', '1')");

This means that EE will always be expecting a captcha response of “1”, and will always get it UNLESS some outside force intervenes. This is where the result of the reCaptcha web service check comes in. If the result is successful, we do nothing, and allow EE to think its native captcha check went perfectly. If the result indicates failure, then I do the following in the method invoked at member_member_register_start:

$_POST['captcha'] = '';

This little change will cause EE’s native captcha check to think that it has failed, and produce its normal errors upon a captcha failure.

I’d be happy to provide the entire extension to anyone who is interested, but I feel like it needs a little cleaning, documenting and generalization in order to stand on its own two feet. Perhaps I’ll post it here soon. Until then, let me know if you’d like a copy.