Annoying Robots: A Solution for Google Analytics

Last month I posted about a surge of illegitimate traffic we've experienced on Grist. Because these visits did things like load JavaScript, the impressions were difficult to distinguish from real traffic, except that they all came from IE and were all of very low quality.

A large number of people who run websites are experiencing the same problem. It is mainly a problem because it can massively distort analytics (Google Analytics, for example) and skew AdSense to a destructive degree. While many affected folks have simply removed AdSense from the affected pages, until now I've seen no report of anyone excluding the traffic from Google Analytics.

We’ve just begun testing a solution that does this, and I’d like to post about it sooner rather than later so that others may both try it out and potentially benefit from it.

The premise of this solution came from this thread, where Darrin Ward suggested:

1) For IE users only, serve the page with everything loaded in a JS variable and do a document.write of it only when some mouse cursor movement takes place (GA wouldn't execute until the doc.write).
2) Use the same principle, but only load the GA code when a mouse movement takes place.
 

While we didn't do either of these things exactly, we did take the idea of using DOM events that indicate a real human (mouse movement, keystroke) to differentiate the zombie traffic from the real. The good news is that this seems, largely, to work. Here's how to do it:

 
 

1.  First of all, you must be using Google Analytics' current (i.e. asynchronous) tracking method for this to make any sense. If you're not, you probably should be anyway, so it's a good time to make the switch. Your page loads will improve if you do.
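If you haven't made the switch yet, the standard asynchronous snippet looks roughly like this (swap your own property ID in for the placeholder):

	// Standard ga.js asynchronous snippet, placed near the top of the page.
	// 'UA-XXXXX-X' is a placeholder for your own property ID.
	var _gaq = _gaq || [];
	_gaq.push(['_setAccount', 'UA-XXXXX-X']);
	_gaq.push(['_trackPageview']);

	(function() {
		var ga = document.createElement('script');
		ga.type = 'text/javascript';
		ga.async = true;
		ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
		var s = document.getElementsByTagName('script')[0];
		s.parentNode.insertBefore(ga, s);
	})();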

2.  We recommend as a first step that you implement some Google Analytics events to differentiate good traffic from bad. This will continue tracking impressions on all page loads, but will fire off a special event that marks the good traffic. Later, once you are happy that the detection is working properly, you can actually exclude impression tracking for the bad traffic (see below).
To do so, insert this code in your site header after the code that loads Google Analytics:

	//Evil Robot Detection

	// Event category/action used for the Google Analytics event
	var category = 'trafficQuality';
	var dimension = 'botDetection';
	// DOM events that indicate a real human (IE's attachEvent wants the 'on' prefix)
	var human_events = ['onkeydown','onmousemove'];

	if ( navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Direct IE traffic is suspect: only fire the event once a human-like
		// event (keystroke or mouse move) actually occurs.
		for(var i = 0; i < human_events.length; i++){
			document.attachEvent(human_events[i], ourEventPushOnce);
		}
	}else{
		// Everything else is treated as human right away.
		_gaq.push( [ '_trackEvent', category, dimension, 'botExcluded', 1, true ] );
	}

	function ourEventPushOnce(ev) {

		ev = ev || window.event; // belt and braces for older IE

		// Push one event labeled with the triggering event type ...
		_gaq.push( [ '_trackEvent', category, dimension, 'on' + ev.type, 1, true ] );

		// ... then detach the handlers so the event fires only once per page view.
		for(var i = 0; i < human_events.length; i++){
			document.detachEvent(human_events[i], ourEventPushOnce);
		}

	} // end ourEventPushOnce()

	//End Evil Robot Detection
This code pushes a Google Analytics event with category "trafficQuality" and action "botDetection" (the dimension variable above), with a label that whenever possible contains the type of DOM event that triggered it. It also pushes a "botExcluded" event with the same category and action for any non-IE browser or any page view with a referrer. The only case in which no event is pushed at all is a direct IE impression with no mousemove or keydown, which is exactly what we want.
 
 

3.  So how does this help you? Well, now in Google Analytics you'll be able to tell the good traffic from the bad: the good will have an event; the bad won't. The easiest way to check this in Google Analytics is under Content -> Events -> Events Overview. Within a few hours of pushing the above code you should see events begin to accumulate there.

4.  To restore more sanity to your Google Analytics, you can also define a goal: under Admin, go to Goals and create a new Event goal that matches the category used above (trafficQuality).

5.  Once you implement this goal, Google Analytics will know which visits have achieved the goal and which haven't; in other words, you have defined a conversion. This means that on any report in Google Analytics, you can restrict the view to only those visits that converted, using the Advanced Segments menu (for example, the built-in "Visits with Conversions" segment).

6.  Note that this affects only new data entering Google Analytics; unfortunately, it does not scrub old data. In our case, it has restored Google Analytics to its normal self after a couple of months of frustration.

7.  Eventually, you may want to stop Google Analytics from even recording an impression in the case of bad traffic.  To do that, just remove the

_gaq.push( [ '_trackEvent', ...

lines above and replace them with

_gaq.push(['_trackPageview']);

Of course, don't forget to remove the call to _trackPageview from its normal place outside the conditional.
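We haven't battle-tested this variant ourselves yet, but the modified block might look roughly like the sketch below (same detection logic as above, with the pageview itself deferred until a human-like event fires):

	//Evil Robot Detection (pageview-gating variant -- a sketch, not battle-tested)

	var human_events = ['onkeydown','onmousemove'];

	function ourPageviewPushOnce(ev) {
		// Record the pageview only once a human-like event has occurred,
		// then detach the handlers so it fires only once per page load.
		_gaq.push(['_trackPageview']);
		for (var i = 0; i < human_events.length; i++) {
			document.detachEvent(human_events[i], ourPageviewPushOnce);
		}
	}

	if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		// Suspect traffic: defer the pageview until a keystroke or mouse move.
		for (var i = 0; i < human_events.length; i++) {
			document.attachEvent(human_events[i], ourPageviewPushOnce);
		}
	} else {
		// Everything else: record the pageview immediately.
		_gaq.push(['_trackPageview']);
	}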

I'd love to hear about any ideas for improvement anyone has for this. We don't use AdSense, but if you do, you could use the same technique to conditionalize the insertion of the ad code into the DOM, as sketched below.
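We haven't tried this ourselves, so treat the following as a rough sketch only: loadAdCode and the 'ad-slot' id are hypothetical placeholders, and it assumes an ad tag that tolerates being injected after page load (document.write-based tags generally don't).

	// Hypothetical sketch: only inject ad code once a human-like event fires.
	// loadAdCode() and 'ad-slot' are illustrative names, not a real ad API.
	var human_events = ['onkeydown','onmousemove'];

	function loadAdCode() {
		var s = document.createElement('script');
		s.async = true;
		s.src = '/path/to/your/ad-tag.js'; // assumes an async-friendly ad tag
		document.getElementById('ad-slot').appendChild(s);
	}

	function loadAdsOnce(ev) {
		loadAdCode();
		for (var i = 0; i < human_events.length; i++) {
			document.detachEvent(human_events[i], loadAdsOnce);
		}
	}

	if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer) {
		for (var i = 0; i < human_events.length; i++) {
			document.attachEvent(human_events[i], loadAdsOnce);
		}
	} else {
		loadAdCode();
	}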

Good luck bot killers!

[UPDATE May 8, 2012] Added the final argument to _trackEvent to prevent distortion of bounce rates. Thanks, Chase!


34 Comments on “Annoying Robots: A Solution for Google Analytics”

  1. […] April 27, 2012]  We’ve found a workable way to exclude this stuff from Analytics. Check it out here. […]

  2. Darrin Ward says:

    Matt,

    Awesome!!! Thank you for taking the time to develop on this idea and share your ideas and code with the community.

    Darrin.

  3. Tim Cavanaugh says:

    Hi Matt,
    I came across a comment you made in an old Nieman Lab article about how you quantify user engagement at Grist.org. I am trying to better understand the elusive metric of “user engagement” and how online publishers in the media world are measuring it, and I was hoping that, if you had some time, I might be able to contact you for a quick call.

    Cheers,
    Tim Cavanaugh

  4. Chase says:

    Matt,

    Great job figuring out a solution to this issue and bringing some sanity back to GA numbers. I implemented the suggested code about a week ago (with minor adjustments according to my situation), and it seems to be working correctly.

    The only issue I’m seeing is that the bounce rate is no longer accurate. It instantly dropped from 35-40% to 6-7% once the code was implemented. My assumption is that once any human event is detected, that user is no longer considered a “bounce,” regardless of subsequent actions such as moving the mouse to click back or close the page. Other things suggest this is the issue: the bounce rate for “Visits with Conversions” is 0%, and total traffic appears roughly 8% higher than traffic from conversions, which leads me to assume that bots make up the 6-7% overall bounce rate that remains.

    Have you been experiencing this issue with bounce rate as well?

    Thanks,
    Chase

    • Matt says:

      Chase, as a matter of fact we have! Thanks for bringing this up. This was just an oversight on my part, and you are right that an event does indeed make Google consider the visit a non-bounce. However, Google gives us a way to deal with this in the form of a fifth argument to _trackEvent:

      _trackEvent(category, action, opt_label, opt_value, opt_noninteraction)

      If you set opt_noninteraction to true, then the visit with this event will not automatically become a non-bounce.
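      For example, with the category, action, and label used above, the call becomes:

      // the trailing 'true' is opt_noninteraction, so the event won't affect bounce rate
      _gaq.push( [ '_trackEvent', 'trafficQuality', 'botDetection', 'onmousemove', 1, true ] );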

      I’ve just edited the examples above to reflect this — thanks for bringing this up and good luck!

  5. Thanks for posting all this! I’ve been looking for answers and was glad to see you had posted to Google Groups. I only wish I had found it sooner…

    I’ve implemented the code today and am hoping for normalcy to return to my site.

    • Matt says:

      No problem Frank — let me know if it helps … happy to answer questions if you have any. Eventually I’d love to know what’s causing this stupid phenomenon … but no luck on that front yet.

  6. This might be a dumb question, but does this only target the IE browser? The reason I ask is because I have the same bad traffic, but it’s all coming from the Chrome browser (and I’m 100% confident that the traffic I’m seeing is not human – it looks like a scraper or price checker).

    • Matt says:

      Andrew — seems like a reasonable question to me. Is your traffic all direct? In our case, the user agent was always IE, but user agents can be spoofed, so there’s no reason why this couldn’t report itself as some other browser. The solution outlined here should work regardless. Even if you don’t know the browser for sure, or if you suspect multiple agents, you could still use the technique and just not narrow it down by browser.

    • Matt says:

      Also, interesting theory about it being a scraper or price checker – this might be the case, but in that case why hit the same URLs over and over and over again? Our bad traffic occurs on exactly 3 distinct URLs … odd. Is yours more varied? The problem as we’re experiencing it feels a lot like a crawler or bot gone awry, or some kind of browser-based toolbar … but I still haven’t figured out the whys of the situation.

    • Hi Matt, thanks for the fast reply! Yes, the suspect traffic is all direct. It targets a single PHP page used for product searches, running through a list of products we sell (and some we don’t) in A-Z order (as seen by watching live analytics) by changing the keyword(s) in the URL with each page load.

      On another note… how would mobile browser analytics be affected by this? For example, for someone using the Chrome browser on an iPhone/iPad, would the script not detect any keyboard/mouse movements and consider the visit a bad bot?

    • Matt says:

      Good question about mobile. We don’t have any significant traffic from mobile IE, so I didn’t include a check for mobile devices in the script here, but it’s a good idea. Probably none of the bots would show a mobile user agent, so you could just check for those and consider any traffic with a mobile user agent to be “not bot” (just like all non-direct traffic).

      Here’s a function that does that:

      http://www.abeautifulsite.net/blog/2011/11/detecting-mobile-devices-with-javascript/
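      For a rough idea of how such a check could be folded into the script above (the regex here is just an illustration, not the function from that link):

      // Illustrative only: a crude mobile user-agent check, not the linked function.
      function isProbablyMobile() {
          return /Android|iPhone|iPad|iPod|BlackBerry|IEMobile|Opera Mini/i.test(navigator.userAgent);
      }

      // Treat mobile user agents as "not bot", just like non-direct traffic:
      if (navigator.appName == 'Microsoft Internet Explorer' && !document.referrer && !isProbablyMobile()) {
          for (var i = 0; i < human_events.length; i++) {
              document.attachEvent(human_events[i], ourEventPushOnce);
          }
      } else {
          _gaq.push(['_trackEvent', category, dimension, 'botExcluded', 1, true]);
      }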

    • Mate, you’re a gentleman and a scholar! I think a combination of that mobile check and your script should do the trick. I’ll take your original advice and test that the data is being flagged correctly before excluding it completely. Thanks again!

    • Sheldon says:

      Andrew, I meant to reply to your comment thread, but instead accidentally replied to the original post. See the next post for details.

      Yes, I’m seeing Chrome on Linux or PC as the source of some of my bad traffic. I’m also seeing Safari on Linux. By far, MSIE is the biggest culprit. The only platform that doesn’t appear to be hitting me is anything on a Mac.

  7. Sheldon says:

    I’ve got some bad news. It’s not just IE browsers, though they’re by far the biggest offenders. I’m having the exact same problem described here, but for me, just filtering my Analytics data isn’t a good enough solution, because the amount of bot traffic is grinding my server to a halt. I’ve had to resort to displaying fake 404 pages for all direct traffic to the affected URLs, with a click-through that will load the page for real in the event that it’s a real person.

    After doing all this, my bounce rate is within 10% of a typical number, but I’m having a really hard time figuring out where the rest of the bad traffic is coming from.

    I’m custom logging everything that’s triggering my fake 404 pages and I see a ton of MSIE, but I’m also seeing a fair amount of Linux based agents too. The only things I never see as bad agents are Macintosh based ones.

    Here’s a sampling of some hits against one of my target pages in the span of one minute:

    23.22.75.60 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    23.22.75.60 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    23.22.75.60 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    220.233.120.9 / Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
    70.176.0.142 / Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
    23.22.75.60 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    23.22.75.60 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
    107.20.4.29 / Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

    In this case as in some other examples, the Linux agent IP addresses are tracing back to an AWS cloud server. This one sample is atypical in that the bad agents are mostly Linux based. I chose this sequence specifically because it was different than what people were reporting.

    I’m also seeing some Safari on Linux agents in my bad robots log. Unlike the Linux/Firefox IPs, these don’t trace back to AWS cloud servers.

    It’s some kind of botnet for sure, but I have no idea what its intention could be. I’ve noticed a modest number of the IP addresses have been caught in honeypots as comment spammers so it’s possible that at least some of them are coming as the result of spamming attempts.

    • Matt says:

      Wow, interesting. Thanks for the report. In a way this seems dissimilar to our issue, in that your IP range looks (at least from this sample) pretty uniform. Our bad traffic was coming (I say was because I haven’t checked recently) from a highly diverse set of IPs. It was very widely distributed and generally trended toward consumer ISP networks like Comcast, which suggested (at least to me) the possibility of infected browsers or Windows machines. In any case, we did not see large numbers of requests coming from the same IP.

      What does your analytics tell you about the uniqueness of this bad traffic?

      If it’s not very unique, or if it tends to use the same set of IPs over and over again, you could blacklist certain IPs (perhaps in an automated way). This wasn’t a workable strategy for us for a few reasons, but if the pattern you present here is typical, it might work for you.

      Also, I’m assuming you’re sure that the bots load both the DOM and associated assets in a browser-like environment (or at least one that can execute JavaScript)? In fact, how do you know that? Is it because you’re seeing polluted Google Analytics data, for example?

      Finally, regarding the amount of traffic. Our bot traffic is significant, but really not that high — on the order of 7-8,000 visits/day. Do you have more than that?

      (Sorry for the barrage of questions — I find this sort of thing interesting.)

      • Sheldon says:

        Hi Matt,

        No worries. I’m happy to pass back whatever helpful information I can because I’m also really interested in this kind of stuff.

        I’m actually seeing a pretty wide range of IPs on my end too, and most of the traffic is coming from consumer ISP networks all over the world. I chose that one-minute sample with the uniform IPs just because I was scanning through my logs to find something interesting to use as my example, and I found a tightly clustered group of visits from the same IPs. I had just assumed that the distribution was entirely random. It wasn’t until I started to custom-log this data myself that I noticed there were a number of heavy repeat offenders.

        Remember that I’m serving up 404 responses to help reduce the load on my server so it’s possible that the only reason they keep coming back is because they’re instructed to keep trying until they succeed. I’m now gathering IP data in a database to make it easier for me to look for any patterns in the bot visits and help me figure out if IP blocking would be effective. I’ve only got 8 hours of data so far, but here’s what I’m seeing in those 8 hours.

        – About 1500 IP addresses triggering my “bad robot” fake 404 response
        – 150 IP addresses seen 10+ times
        – 40 IPs seen 20+ times
        – 5 IPs seen 100+ times
        – 302 from one IP address

        The visits from the biggest offenders do seem to come in bursts. When I started this post, my biggest offender was at 258 trips to the sin bin where it had been for a while. By the time I got to describing it, I rechecked my numbers and found it had jumped up to 302. I’ll post some updates as I accumulate more data to crunch.

        I’m seeing a lot more than 7-8k visits a day. At its peak, we were seeing around 40k visits a day which is over 10x the typical daily visit volume for the site in question. Since I started feeding the bots fake 404s some of the volume has dropped off, but it was really consistent for the two months in between the time we first noticed it and when we were able to do something about it.

        What I found most interesting is that the distribution of visits to the target pages was fairly uniform for the hardest-hit pages. In all, 10 pages were being targeted, and I knew immediately something was fishy because a few of these pages had really ugly URLs that would be unlikely to generate a lot of direct traffic, yet they were suddenly getting around 5,000 direct visits a day. 5 of the 10 affected pages were consistently getting 5,000 direct visits a day; the remainder got between 1k and 3k each.

        I didn’t think of testing to see whether the bots were loading JavaScript or not. My method for determining whether a visit is a bad bot is actually really simple. Luckily, they are mostly targeting pages with ugly URLs that are not likely to generate much direct traffic, so anytime I see a direct visit with MSIE or Linux in its agent string on those pages, I feed it a fake 404 page that says we’re having temporary server problems and invites the visitor to click on a link to retry the page. In reality, clicking that link verifies the person as human and sets a cookie, so if they pay another direct visit to one of the affected pages, we won’t have to trouble them again.

        Also, anytime anyone visits us as the result of a referral or has clicked through a few pages of the website, we set a cookie to let them bypass the fake 404 page. While this is hardly an ironclad approach, the bots I’m concerned about never or rarely follow any links, so it’s working out OK.
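        (For anyone wanting to try something similar, here is a very rough sketch of that gate written in Express-style JavaScript purely for illustration; my actual setup is PHP on WordPress, and all the names, paths, and thresholds below are placeholders.)

        // Rough sketch of the fake-404 gate; illustrative only, not my real code.
        var express = require('express');
        var app = express();

        var SUSPECT_PATHS = ['/some/ugly-target-page'];   // the handful of targeted URLs
        var SUSPECT_AGENTS = /MSIE|Linux/;                 // agents seen misbehaving

        app.get(SUSPECT_PATHS, function (req, res, next) {
            var direct   = !req.get('Referer');
            var suspect  = SUSPECT_AGENTS.test(req.get('User-Agent') || '');
            var verified = /(^|;\s*)human=1/.test(req.get('Cookie') || '');

            if (req.query.verify === '1') {
                // Clicking the "retry" link marks the visitor as human for 30 days
                res.cookie('human', '1', { maxAge: 30 * 24 * 3600 * 1000 });
                verified = true;
            }

            if (direct && suspect && !verified) {
                // Fake "temporary problem" page; the bots never click through
                return res.status(404).send('<p>Temporary server problem.</p>' +
                    '<p><a href="' + req.path + '?verify=1">Click here to retry the page</a></p>');
            }

            next(); // hand off to whatever actually renders the page
        });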

        I did try the trafficQuality event that you blogged about here, and it’s helping, but even between the measures I’ve already taken and the trafficQuality filter, I’m still seeing between 5 and 10% of visits that are suspect. We’re obviously getting a lot less traffic than Grist, so it’s pretty obvious when something funny is skewing our analytics data.

      • Matt says:

        Hey Sheldon,

        Thanks for the informative reply! Sounds like we do in fact have similar problems. A few things come to mind:

        – we’ve noticed the same thing as far as even distribution among target pages. That leads me to believe that there’s some sort of round-robin-ing going on.
        - the info you’ve gleaned from your server logs is very interesting, but baffling. The question of what the heck this is and why it’s doing what it’s doing is really confusing. For many sites it wouldn’t even be an effective DDoS attack … so um … why? I still have no idea.
        - this stuff appears in Google Analytics for you, so I can conclude that the bots are loading JavaScript (unless you use an ancient method for loading GA)

        Regarding the user-agent thing, you may have already done this, but you could remove the check for IE user agents from the JS that implements the traffic quality event … that should apply the traffic quality check to all user agents and tighten things up further.

        In general, good luck!

    • Sheldon says:

      Well, I had previously said that the only bad user agent I haven’t seen is one involving “Macintosh” in the string. That is no longer true and these visits might be responsible for the final bit of the botnet that I can’t seem to quash. The “bad Macs” are actually coming from Amazon Web Services and not from residential IPs so I assume that they’re being forged.

  8. Sheldon says:

    For those of you who (like me) need to apply Matt’s event-tracking trick beyond just Internet Explorer, you’ll need to modify the code a bit. The JavaScript that Matt used works fine if you’re just targeting MSIE browsers, but .attachEvent() is not a valid method in WebKit- or Gecko-based browsers.

    I had to modify the code like this to get it to work. If you don’t modify the code, your “trafficQuality” filtered stats are going to look lower than they should.

    I’m assuming that you’re using jQuery on your WordPress site. If you’re not, then this won’t work for you either.

    //Matt's Evil Robot Detection Script in jQuery
    var category = 'trafficQuality';
    var dimension = 'botDetection';
    var human_events = ['keydown', 'mousemove']; // no 'on' prefix needed with jQuery

    if (!document.referrer) {
        for (var i = 0; i < human_events.length; i++) {
            $(document).bind(human_events[i], ourEventPushOnce);
        }
    } else {
        _gaq.push( [ '_trackEvent', category, dimension, 'botExcluded', 1, true ] );
    }

    function ourEventPushOnce(ev) {
        _gaq.push( [ '_trackEvent', category, dimension, 'on' + ev.type, 1, true ] );
        for (var i = 0; i < human_events.length; i++) {
            // unbind only this handler so other scripts' handlers stay attached
            $(document).unbind(human_events[i], ourEventPushOnce);
        }
    } // end ourEventPushOnce()
    //End Evil Robot Detection

  9. bidderboy says:

    OMG! There must be something wrong.
    I am writing this today, on 25th October 2012.
    I am also getting sudden huge traffic on my website.
    All visitors are from SAN JOSE (USA),
    all visitors are DIRECT and accessing different pages for an average of 5-6 minutes,
    and all are using the APPLE SAFARI browser.
    I am sure this is not real traffic. This is totally unexpected.

    I am monitoring this on Google Analytics’ Real-Time monitoring tool. It started on 25th Oct 2012.

    I suspect there is something abnormal; maybe a competitor of mine has done something to increase my site’s bounce rate, which will affect my SEO. Can anyone help me PLEASE to find the root cause and fix the issue??

  10. Cindy says:

    Matt,
    I would like to talk to you about this bot issue. How can I contact you?

  11. George says:

    Hi guys, very good article. Could anyone advise on a better log-analyzing tool vs. Google Analytics? I want to use something for Nginx. We need to be able to see traffic from advertisers, its quality, etc.; GA is not collecting this well. Thank you!

  12. Rebecca says:

    Hi everyone. Thanks Matt for posting this article – great insight into this issue. I’m wondering if anyone has seen this issue specifically from a campaign they’re running? We’ve been doing a display campaign and the only URLs affected are those with a UTM code attached. We have 5 codes for one creative and 1 code is being targeted – 1,500-2x hits in one day. Any thoughts on a way to rectify the issue with a campaign – other than dropping the UTM code? All of the faulty traffic is coming from Safari>Linux out of Los Angeles.

    Thank you!

