
The Complete Google Analytics Referral Spam Filter Guide

Summary: So, your Google Analytics reports are suffering from an overwhelming amount of BS traffic data lately, huh? You're not alone. This "referral spam" has been plaguing most GA users for months now, and sadly many of the purported solutions currently online are anything but; at best, they address only part of the problem. Today we show you how to clean up your future reports with Google Analytics filters and sanitize your historical data with corresponding segments, so you'll be back to analyzing *real* traffic data in no time.

If you're reading this post, then you're probably at your wits' end trying to figure out why your Google Analytics reports have become so polluted with garbage traffic data lately. Like most webmasters, you've probably seen inexplicable spikes in your traffic data between late last fall and now, but any elation you felt was short-lived once you realized that there wasn't a concurrent uptick in conversions or quality leads.

[Screenshot: Google Analytics Referral Spam Sources]

The more intrepid among you likely sought answers in your traffic source/medium reports and, after some digging, discovered that these low-quality visits were coming from mysterious sources like 4webmasters.org and semalt.semalt.com (among dozens of others). And that's the story of how you arrived here, at the "Now What?" stage of your journey.

What is Referral Spam?

Like AA, the first step in your GA rehab is admitting you have a problem - we can't block/blacklist these spammy referrals if we don't understand and identify them. And with referral spam representing between 10% and 50% of traffic for many website owners, a solution is long overdue. Without getting too technical or any more loquacious than is typically expected of me, the problem you're facing is that fake visits are being logged in your traffic reports, typically as a result of a combination of two spam tactics - ghost referral spam and crawler referral spam.

Ghost Referral Spam

The more common of the two, ghost referrals are most likely what's plaguing your data. Essentially these are fake visits - not even a crawler actually hitting your site - that show up in your GA reports as a result of an automated tool that hits individual accounts at random using the accounts' UA tracking ID numbers. This is done using the Google Analytics Measurement Protocol, which is a GA feature created to allow us to pass offline data back into our accounts. As always, the spammers have managed to corrupt something that was once good. New ghost referral spammers crop up all of the time, so it's really impossible to come up with a "master list" to target for filtering, but there is one quick way to scrub this filth from your reports going forward, which we'll discuss shortly.

Crawler Referral Spam or Bot Spam

The other type of referral spam you've likely experienced comes from "bots", "crawlers" or "spiders" that actually do visit your website - but they aren't real, human visitors. Unlike ghost referral spam, these bots/spiders/crawlers must be filtered individually, and it could be an ongoing battle. Still, there are lists online of the more common offenders, which are a great start and should allow you to filter out most of the current crawler spam. It's worth noting that not all bots are bad bots. In fact, there are many good bots, including arguably the most important crawler: Googlebot. This, as the name implies, is Google's own bot that crawls sites all over the web to discover and index new and updated pages. In short: it's Googlebot that makes it possible for your site to appear in Google search results. Most good bots also don't execute JavaScript and therefore don't skew your GA reports like the bad spam bots do.

How to Filter Out Spam Referrers in Google Analytics

Now that you understand where your nonsense traffic data comes from, what are you going to do about it? Well, if you're here then you've either just begun your search for a solution to referral spam (and you've come to the right place!) or you've struck out with some of the other ostensible panaceas that are now cluttering up the web with misinformation, and you're desperate for real answers.

Blocking Referral Spam with .htaccess?

Well-intentioned though they might be, there are a lot of blog posts out there claiming that blocking visits from spam bots using your .htaccess is the most effective way to combat referral spam. Unfortunately, this method doesn't do anything to filter out ghost referrals, because it only blocks "real" visits to the site. In addition to not being particularly effective, messing around with your .htaccess file is incredibly risky, and only those confident in their technical abilities and understanding of this file should even consider it. Did you know that if even one character is out of place in your .htaccess file, you can take down your entire website? A much safer way to remove crawler spam is to use filters in Google Analytics. If you want to cut it off at the pass, and only if you're confident in your technical abilities, you can certainly add records to your .htaccess file, but don't for one second think that this will solve all of your spam referrer problems. Let's get back to real solutions...

Include Hostname Filter to Block Ghost Referrals

I mentioned earlier that "ghost" referrals were the most prevalent in GA reports for many webmasters, and that there was one relatively easy way to mitigate the data corruption: a valid hostname filter. This may sound counterintuitive because our concern is with bogus traffic sources, but that's precisely why we need to filter based on hostname - the referral sources aren't real to begin with. The key to getting hostname filtering right is to ensure you use the right Regular Expression (Regex) characters and include the hostnames of any sites you might use your GA code on, such as third-party checkout systems.

[Screenshot: Valid Hostname Filter in Google Analytics]

In the screenshot above, you can see that I've chosen the view I want to apply my new filter to, navigated to the "Filters" section and chosen a "Custom" filter. I then selected the "Include" radio button (because we want to include only certain traffic) and chose "Hostname" from the drop-down. When setting up a valid hostname filter, the filter pattern is the only area that can be a little tricky if you aren't familiar with Regex, so we recommend you experiment with it on your "Test" view before rolling it out to your "Master".

*Pro Tip*: Don't have a "Test" view yet? This Analytics Academy lesson explains why you should have 3 views for each property and teaches you how to set them up.

I see some other tutorials leaving in the "www" and not including the backslash before the ".com", and depending on how their site is set up, that may work fine. I chose to leave off any subdomain (www., etc.), so that if we ever did use any others, the string would match them all. I chose to put a backslash in front of the period to turn it (the period) into an "everyday character" instead of a wildcard. Typically, swapping in your domain for ours in the example above should be sufficient, but if you have subdomains, a site live at both the www and non-www URL, third-party shopping carts, etc., you're going to want to do your Regex homework before applying this filter. Analytics Edge also offers this great tip on identifying a list of valid hostnames:
Start with a multi-year report showing just hostnames (Audience > Technology >Network > hostname), then identify the valid ones — the servers where you have real pages being tracked. (hint: google.com is not one of them)
The setup of the valid hostnames include filter should only take you a few minutes, but it should do a lot to clean up your reports going forward. As I mentioned, ghost referrals represent the majority of referral spam I've seen for many of our larger accounts, so this filter has had a major impact.
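
If you'd like to sanity-check a hostname pattern before pasting it into your "Test" view, a few lines of Python can stand in as a rough rehearsal. This is only a sketch: example.com and examplecheckout.com are made-up hostnames, and the use of an unanchored, case-insensitive search is my assumption about how the filter pattern behaves, so swap in your own hostnames and confirm the results in GA itself.

```python
import re

# Hypothetical valid hostnames; replace them with the ones from your own
# hostname report. The backslash makes each period a literal period.
VALID_HOSTNAME_PATTERN = r"example\.com|examplecheckout\.com"


def is_valid_hostname(hostname: str) -> bool:
    """Return True if the hostname would pass the include pattern.

    re.search matches anywhere in the string (assumed here to mirror a
    partial-match filter), so "www.example.com" and "shop.example.com"
    both pass without being listed separately.
    """
    return re.search(VALID_HOSTNAME_PATTERN, hostname, re.IGNORECASE) is not None


if __name__ == "__main__":
    samples = [
        "www.example.com",
        "example.com",
        "shop.example.com",
        "(not set)",
        "google.com",
        "examplexcom.net",  # only matches if the period is left unescaped
    ]
    for host in samples:
        verdict = "keep" if is_valid_hostname(host) else "drop"
        print(f"{host:20} {verdict}")
```

Because the pattern isn't anchored, any hostname that merely contains your domain will also pass, which is worth keeping in mind when you do your Regex homework.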

Exclude Filters for Crawler Bot Referral Spam

So, what about the trickier referral spam data that is the result of bots actually hitting your site? To deal with these spammers, you'll need specific exclude filters on the offending domains. You can gather your own list by looking at your "Referrals" report in the "Acquisition > All Traffic" section of your standard reports, but rather than spend the time digging through and trying to determine which referrals are legitimate and which are spammers, wouldn't it be nice if someone was keeping an updated list online for you to use? Voilà! That's right, a comprehensive list of referrer spammers is being maintained by GitHub users, which makes your filtering a heck of a lot easier - just follow that link and click on the "spammers.txt" link for the full list. Once you have it, here's what you do with it:

[Screenshot: Exclude Campaign Source Filter in Google Analytics]

Like with the other filter, you want to choose your view, navigate to the filters section and choose "Custom", but this time you're going to choose the "Exclude" radio button and then select "Campaign Source" from the drop-down. Then your "Filter Pattern" should be the list of domains from GitHub, separated by pipes (hint: the vertical line you get when you press Shift+backslash). Sadly, Google limits the filter pattern field to 255 characters, so if you're using the full GitHub list, you'll need to create 15+ exclude filters like I did. This might seem like overkill if you've only been attacked by a few, but why run the risk of having to go back later and create additional filters? There's always the chance that others will crop up in the future, but I think it's easier to knock out as many as you can up front.

Like the include filter, you can use your own understanding of Regex to set up your filter patterns. I chose to use all of the domains, again turning each period from a wildcard into an actual period by preceding it with a backslash, but dropping the TLDs (e.g. .com, .org, etc.). This way, if a spammer were to suddenly stop using the .com for their domain and start using the .net, they'd still be excluded by my filter.
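
Splitting a long spammer list into 255-character chunks by hand is tedious, so here's a rough Python sketch of the approach described above: drop the TLD, put a backslash before any remaining period, join with pipes, and start a new pattern whenever the next domain would push you past the limit. The function names are my own, and the TLD stripping is deliberately naive (it just removes everything after the last period), so treat the output as a starting point to review rather than something to paste in blindly.

```python
MAX_PATTERN_LENGTH = 255  # filter pattern limit mentioned in this guide


def to_pattern(domain: str) -> str:
    """Drop the last label (the TLD) and escape the remaining periods."""
    name = domain.strip().lower().rsplit(".", 1)[0]
    return name.replace(".", r"\.")


def build_filter_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Group domains into pipe-separated patterns of at most max_length chars.

    A single entry longer than max_length would still get its own pattern;
    that should not happen with ordinary domain names.
    """
    patterns = []
    current = ""
    seen = set()
    for domain in domains:
        piece = to_pattern(domain)
        if not piece or piece in seen:
            continue  # skip blank lines and duplicates
        seen.add(piece)
        candidate = piece if not current else current + "|" + piece
        if current and len(candidate) > max_length:
            patterns.append(current)  # current pattern is full, start a new one
            current = piece
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


if __name__ == "__main__":
    # A tiny sample; in practice, read the lines of spammers.txt instead.
    sample = ["4webmasters.org", "semalt.semalt.com", "trafficmonetiz.org"]
    for number, pattern in enumerate(build_filter_patterns(sample), start=1):
        print(f"Filter {number} ({len(pattern)} chars): {pattern}")
```

Each pattern the script prints goes into its own "Exclude" filter on "Campaign Source".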

Bot Filtering - Exclude all hits from known bots and spiders

[Screenshot: Google Analytics Bot Filtering]

Now you're probably wondering why I didn't tell you about this feature before the last step, but there's a good explanation, I swear - the bots and spiders we just filtered manually aren't "known" to Google. In fact, we don't, er... know which ones are known. So, it's unlikely that this option is nearly as effective as the other custom filters we've set up, but it's probably a good step to take either way if you're being diligent. It's as simple as navigating to your "View Settings" and checking the "Bot Filtering" box. Now that your GA reports will be free of referral spam, you can get back to business! Oh wait, we forgot about your historical data, didn't we?

Use Segments to Clean Up Historical Google Analytics Data

Yes, sadly Google Analytics filters only apply to future traffic data. To declutter your historical data, you'll need to set up a pretty simple segment. The only downside is that you'll have to apply the segment each time you view your reports, but it's a matter of a couple of clicks.

[Screenshot: Referral Spam Segment in Google Analytics]

You can set "Segments" up either while viewing your reports or from the "Admin" area. The screenshot comes from the Segments section of the Admin area, but the settings used to filter out your historical referral spam data are the same regardless of where you set it up, and they essentially mirror the "Filter" settings we just went over:
  • Name your segment something clear and meaningful
  • Create an "Include" filter for "Hostnames" using the list of hostnames you entered in the hostname filter we discussed earlier, using the condition "matches regex"
  • Add an "Exclude" filter for the traffic "Sources" covered by the 15 exclude filters we created earlier, using the condition "matches regex". The beauty here is that there doesn't appear to be a character limit on the segment field, so you can include all spam sources at once.
  • When setting up your segment, be sure to mimic the same regex patterns you used when creating your custom filters (e.g. pipe separators, backslashes before periods, etc.)
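
To make the segment logic concrete, here's a small Python sketch of what the two conditions do together: a session stays in the segment only if its hostname matches the include pattern and its source does not match the exclude pattern. The hostname pattern and the sample sessions are invented for illustration, and the exclude pattern uses just the three spam sources named in this post rather than the full list.

```python
import re

# Hypothetical include pattern; use the one from your own hostname filter.
HOSTNAME_INCLUDE = r"example\.com"
# Shortened exclude pattern; a segment can hold every spam source at once.
SOURCE_EXCLUDE = r"4webmasters|semalt\.semalt|trafficmonetiz"


def in_segment(hostname: str, source: str) -> bool:
    """Keep a session only if the hostname is valid AND the source is not spam."""
    valid_host = re.search(HOSTNAME_INCLUDE, hostname, re.IGNORECASE) is not None
    spam_source = re.search(SOURCE_EXCLUDE, source, re.IGNORECASE) is not None
    return valid_host and not spam_source


# Made-up (hostname, source) pairs standing in for rows of historical data.
sessions = [
    ("www.example.com", "google"),
    ("www.example.com", "4webmasters.org"),  # crawler spam on a real hostname
    ("(not set)", "trafficmonetiz.org"),     # ghost spam with no valid hostname
    ("example.com", "(direct)"),
]

clean = [session for session in sessions if in_segment(*session)]

if __name__ == "__main__":
    for hostname, source in clean:
        print(hostname, source)
```

Notice that the crawler hit on a valid hostname is only caught by the source exclusion, which is why the segment needs both conditions.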
Now all you have to do is add the segment by clicking the "+" button at the top of any of your reports.

[Screenshot: Adding Segments in Google Analytics]

After adding your new segment, you can then click the drop-down next to the default "All Sessions" segment and remove it if you'd like, but for now let's leave both segments on so you can see the contrast in the data for one of our most affected clients:

[Screenshot: Traffic Sources Report with Segment Applied in Google Analytics]

As you can see, the new segment ripped out over 67% of the traffic, all of which was bogus referral spam, and the majority of that was attributable to just 2 sources - trafficmonetiz.org and 4webmasters.org.

Conclusion

If you've been struggling lately to separate your website's real traffic data from referral spam data, there are steps you can take to clean your reports up. Filters will prevent most referral spam going forward, and you can set up an advanced segment any time you want to go back and review filtered historical data. Keep in mind you'll have to reapply the segment each time you view historical reports, but it only takes a few seconds and a couple of clicks. Try not to be overwhelmed by the lengthy explanations in this post - the actual setup of these filters and segments should only take you a few moments.

Google's Referral Spam Solution Forthcoming?

The SEM Post reported a few weeks ago that Google is working on its own universal solution that would remove the need for all of these workarounds, but for the time being they should do the trick and at least allow you to analyze real website traffic data. Please let us know in the comments below what you think of these solutions, or if you have alternatives.
