Your web analytics are important. It’s critical that you are able to measure the activity on your website accurately. I’ve said it from the very beginning: your analytics package is your most valuable marketing tool. That’s one reason I’ve been urging people to upgrade to Universal Analytics from their old Google Analytics implementation. There is a great deal of added functionality with UA.
But there’s an unexpected issue that comes along with Universal Analytics: Referral Spam. This isn’t a reason not to upgrade to Universal Analytics (I’ll talk about how to eliminate most referral spam in a minute), but referral spam can seriously affect the usefulness of your analytics, especially for small and medium-sized sites.
What Is Referral Spam?
Referral Spam is the presence of non-human visits showing up in your analytics. It’s probably easiest just to show you. Go ahead and follow along with me in your own Google Analytics account. Open it up for any given time frame of at least a month or two and pull up the Source/Medium report under Acquisition > All Traffic. You probably see several different referring sites, including Google, Bing, and some websites that are relevant to your business.
But I bet you’re also seeing referrals from sites such as 4webmasters.org, semalt.com, buttons-for-website.com, darodar.com, and many others you haven’t heard of, right? Those are prime examples of Referral Spam.
The thing is, most or all of these visits never really occurred. At least not by human beings. These visits are caused either by robots or spiders crawling your site, or robots just sending Universal Analytics the code to log a visit to your site.
Why This Is A Problem And Why You Care
These fake visits create real problems for you. They are almost always one-page visits, and always by a non-human. Adding a bunch of visits that never really occurred can seriously distort what your analytics should be telling you. Your boss may be happy that traffic is growing (even though it really isn’t). But your bounce rate shows much higher than it should (because all these visits look like bounces), and your conversion rate is understated.
Worse, this problem keeps growing. Each month, we see more and more of this in our clients’ analytics. And it’s especially hurtful for small and medium-sized businesses. If your site averages 4,000 visits per month, and all of a sudden you get 600 Referral Spam visits, that looks like a 15% traffic increase. That looks great! Except it’s fake. It didn’t really happen. And your other stats, like Conversion Rate, Bounce Rate, and Time on Site, will all look worse than they really are.
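To put rough numbers on that, here is a small Python sketch. The 4,000 and 600 figures come from the example above; the 50% bounce rate and 2% conversion rate are assumptions picked purely for illustration, and the function name is made up for this post.

```python
def with_spam(real_sessions, real_bounces, real_conversions, spam_sessions):
    """Return the metrics a report would show once spam sessions are mixed in."""
    total = real_sessions + spam_sessions
    return {
        "sessions": total,
        "traffic_increase": spam_sessions / real_sessions,
        # Every spam session is a single-page visit, so it counts as a bounce.
        "bounce_rate": (real_bounces + spam_sessions) / total,
        # Spam never converts, so conversions stay flat while sessions grow.
        "conversion_rate": real_conversions / total,
    }

# Assumed figures: 4,000 real sessions, 50% real bounce rate, 2% real conversion rate.
report = with_spam(real_sessions=4000, real_bounces=2000,
                   real_conversions=80, spam_sessions=600)
for name, value in report.items():
    print(name, round(value, 4))
```

Under those assumptions, the reported bounce rate climbs from 50% to about 56.5%, and the conversion rate slips from 2% to about 1.74%, even though nothing about your real visitors changed.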
For huge sites with millions or even hundreds of thousands of visits per month, a few hundred robot visits are not even a blip on the radar. Perhaps that’s why this problem isn’t discussed more than it is. But the vast majority of sites are small enough that Referral Spam throws off their numbers significantly. Is yours one of those?
What’s The Point? Why Do They Even Do This?
Like most spam, the objective is to get you to visit their website. If you are a webmaster or someone marketing your site, and you see something like Semalt.com sending you a bunch of traffic in the Referrals report, it’s only natural to want to see what site is linking to you and sending so many visitors (that all seem to bounce). Do not visit sites you don’t recognize in your referrals report.
Like other kinds of spam, most of these sites are relatively harmless, just trying to get you to their site, either in hopes of selling you something, or just so they can show the owner that they generated a great deal of traffic. But again like other kinds of spam, sometimes the links are to malicious sites. Don’t visit sites you don’t know anything about.
Different Kinds of Referral Spam
Before you can stop Referral Spam, you need to understand the different flavors of it. There are essentially two kinds of Referral Spam: crawlers that actually do visit your site, and robots that just send data straight to Google (now being called “Ghost Referrers”). They act differently, and therefore must be treated differently.
Referral Spam crawlers, such as semalt.com, are much like legitimate crawlers such as Google’s. They are software programs that follow links to crawl the web. They come to your site and use your links to go to many or all of the pages on your site. In the case of legitimate crawlers, their purpose is to get information about your site that makes the web easier to use. For these shady crawlers, the purpose is mostly to leave their website as the referrer in your analytics in hopes that you’ll follow the link back to their site. And I repeat: don’t do that.
Currently, the best way to stop these from affecting your analytics is either to block their access to your site entirely via your .htaccess file, or to set filters or segments in Google Analytics that keep those visits out of the data you’re looking at (more on this below).
Ghost Referrers are software programs too, but they differ from Crawlers in how they operate. One of the features of Universal Analytics is the “Measurement Protocol”, which enables you to use Universal Analytics to measure & track offline activities (see some cool examples).
But unethical individuals are exploiting that feature to send data to random Google Analytics account IDs, much as telemarketers use software to dial sequential blocks of numbers, knowing that some will be live and some will not. These robots work the same way. They send data to random UA-AccountNumbers knowing that some will be real accounts. They don’t know which websites they’re infecting with their data. They just throw out as much as they can and hope that some of it hits.
When it does hit, it looks like a visit on your site. And the data includes a referral source in hopes that some people will follow that source and visit the site.
A new breed of the Ghost Referrers has surfaced recently. Instead of sending visit information to Google via the Measurement Protocol, some new bots are sending Analytics Event information. Do you have any “Ghost Events” showing up in your analytics? Open your Behavior > Events > Top Events report and see. Since early May, we’ve seen consistent instances of fake events showing up with Category, Action, and Label all being set to “to use this feature visit Event-Tracking.com”. (I’ll repeat it again, don’t go visit that site.)
I think this is an attempt to fool the novice Analytics user into visiting their site, just as the more “traditional” referral spam operators do. Fortunately, these can be handled the same way as Ghost Referrers.
How To Fight Referral Spam
Unfortunately, since Ghost Referrers (which includes “Ghost Event” spam) don’t actually visit your site, trying to block them in your .htaccess file will do nothing. You have to filter the data out of Google Analytics, either with a custom Filter or a custom Segment. Undesired Crawlers can be addressed via the .htaccess file, but it’s wise to create a Filter and/or Segment for them too.
So how do you create these filters?
Filters for Ghost Referrers
Right now, I see a lot more Ghost Referrer (both regular and those inserting Ghost Events) issues, so let’s start with that. There is one characteristic that is common to all Ghost Referrers. Since they don’t actually visit your site, they don’t know what your site is. Remember, they just use random Google Analytics account numbers hoping that many of them are for real sites. We can capitalize on the fact that they don’t know what your site is. (Thanks to Mike Sullivan of Analytics Edge who was the first person I saw come up with this method to keep Ghost Referrers out!)
There is a dimension in Google Analytics called Hostname. This is essentially the domain the visitor used to visit your site, either by typing it in, or from the link they clicked. For all visits to your site by humans or by crawlers, some version of your site’s domain will be the Hostname that is passed into Google Analytics. But for Ghost Referrers, since they don’t actually visit your site, the Hostname will either be listed as (not set) or it will be some other website, such as amazon.com.
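To see why the Hostname gives Ghost Referrers away, it helps to look at what a hit roughly looks like on the wire. The Python sketch below parses three made-up payloads written in Measurement Protocol style, where `dh` carries the hostname and `dr` the referrer. The account ID, client IDs, and domains are placeholders, and `hostname_of` is a helper invented for this illustration, not part of any Google library.

```python
from urllib.parse import parse_qs

# Hypothetical hit payloads in Measurement Protocol style.
# "dh" is the document hostname and "dr" is the referrer; the sender fills in both.
hits = [
    # A real pageview: the tracking snippet reports your own domain.
    "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview&dh=www.example.com&dp=%2Fpricing"
    "&dr=https%3A%2F%2Fwww.google.com%2F",
    # A ghost hit that claims some unrelated hostname.
    "v=1&tid=UA-XXXXX-Y&cid=777&t=pageview&dh=amazon.com&dp=%2F"
    "&dr=http%3A%2F%2Fspam-referrer.example%2F",
    # A ghost hit that sends no hostname at all.
    "v=1&tid=UA-XXXXX-Y&cid=888&t=pageview&dp=%2F"
    "&dr=http%3A%2F%2Fspam-referrer.example%2F",
]

def hostname_of(hit):
    """Return the hostname a hit claims, or "(not set)" when it sends none."""
    fields = parse_qs(hit)
    return fields.get("dh", ["(not set)"])[0]

hostnames = [hostname_of(hit) for hit in hits]
print(hostnames)  # ['www.example.com', 'amazon.com', '(not set)']
```

Only the first hit carries a hostname you would recognize as your own, and that is exactly the signal the filter relies on.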
So we can create a filter to only allow visits that have a Hostname that you will accept. But you want to make sure to get ALL valid hostnames. Almost all of them will be your domain, but in some instances, you may want other hostnames to count as well. For example, if your shopping cart exists on another domain, you definitely want to include that domain in your valid Hostname list.
The basic rule of thumb is that any website where you have installed the Universal Analytics tracking snippet is a valid Hostname. A good way to identify everything is to look at a report in Google Analytics that shows the hostnames of visitors to your website. Set your date range to something big, like the past two years. Then, in the Audience section of the left-hand navigation, click on Technology, and then Network. At the top of that report, click on “Hostname” as the Primary Dimension.
This will show you the Hostnames for all your visitors in the past two years. Different versions of your own domain should be at the top. As you look at others in the list that sound familiar, think about whether you have ever configured any Google Analytics for something you might be doing on those sites. For example, some people host embedded videos on YouTube and enter their site’s account number into YouTube for tracking purposes. In this case, you’d want to include YouTube.com as a valid hostname. But entries like amazon.com, huffingtonpost.com, etc. would not be valid hostnames.
You might see translate.googleusercontent.com in the list. If you have many foreign visitors who use Google to translate your content, this might be a valid host. You’ll have to make that call. I wouldn’t be surprised to see some spammers start masquerading as this hostname in hopes of getting through a filter in the future.
Setting Up The Filter
Once you have your list of Hostnames you will allow, setting up the filter is simple. In your Admin section of Google Analytics, select the View you want to apply this filter to and click on Filters. Then click on the New Filter button. Give this new filter a name, such as “Valid Hosts” and choose Custom as the Filter type. Select Include, and for the Filter Field, select Hostname. For the pattern, enter each valid host separated by a vertical bar (no commas, no spaces). The pattern is a regular expression, so you can wildcard each domain name to cover all subdomains. For example, instead of entering www.websiteoptimizers.com, I can enter .*websiteoptimizers\.com (the backslash makes the dot match a literal dot rather than any character).
Leave the Case Sensitive box unchecked and save your filter. From this point forward, all of your Ghost Visits should be excluded from Google Analytics.
In the example above, I included websiteoptimizers.com as the only valid host. If you have other valid hosts (like third-party shopping carts, etc.) as described above, add them to the regex, separated by vertical bars.
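If you’d like to sanity-check a hostname pattern before putting it into a View, you can try it out locally. This Python sketch is only an approximation: it assumes the filter matches the pattern anywhere in the hostname and ignores case (as with the Case Sensitive box unchecked), and the shopping cart domain is a made-up placeholder.

```python
import re

# Hosts you decided are valid; the second entry is a hypothetical third-party cart.
valid_hosts = ["websiteoptimizers.com", "cart.example-store.com"]

# Escape dots so "." matches a literal dot, then join with vertical bars.
pattern = "|".join(host.replace(".", r"\.") for host in valid_hosts)

def is_valid_host(hostname):
    """Mimic an Include filter: keep the hit if the pattern matches anywhere, ignoring case."""
    return re.search(pattern, hostname, re.IGNORECASE) is not None

for hostname in ["www.websiteoptimizers.com", "WebsiteOptimizers.com",
                 "amazon.com", "(not set)"]:
    print(hostname, is_valid_host(hostname))
```

The first two hostnames pass and the last two are rejected. Because the match can land anywhere in the hostname, subdomains such as www are covered without listing them one by one.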
Two things to keep in mind when setting up this filter (and the next one as well). First, it’s wise to have a separate View set up as a “Test” view, apply this filter there first, and let it run for a few days to make sure the results are about what you’d expect. Second, it’s always a very wise move to keep a separate View set up with no filters on it at all, so that if you do mess anything up, you’ll always have that unfiltered view to fall back on.
Filters For Crawlers
Crawlers are a little trickier than Ghost Referrers. The most reliable way right now to filter them out is to keep a list of the crawlers you know of and want to exclude. You set this up in much the same way as the filter for Ghost Referrers, except that here you choose Exclude (instead of Include) and set the Filter Field to Campaign Source. Enter a list of referral sources, separated by a vertical bar, with no commas or spaces.
As of this writing, here is the list we use for most sites:
This is based on crawlers we have seen with many of our clients. You may have others, and you will undoubtedly start to see others in the future, which means you’ll have to add them to this filter as you discover them.
Since we are excluding these instead of including them above, we can set up the filter to be a little broader. It may be semalt.com today, but tomorrow it could be semalt.net. So in my filter string, I’m just going to use the root of the domain (semalt), and I’m going to do it that way for each entry. This will also help me save space on my filter string, which has a limit of 255 characters. So our string, based on the spammers listed above, is:
If your string does grow beyond 255 characters as you add more sources over time, you’ll probably be able to remove some old ones that are no longer sending traffic to you. Use your unfiltered view to identify which ones those are.
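Here is a rough Python sketch of how you might assemble an exclude string from domain roots and keep an eye on the 255-character limit. The three roots are just spam domains named earlier in this post, not our full list, and the helper names are invented for the example. It also assumes the filter matches the pattern anywhere in the source, ignoring case.

```python
import re

MAX_PATTERN_LENGTH = 255  # the filter string limit mentioned above

# Roots of spam domains named earlier in this post; extend with your own finds.
spam_roots = ["semalt", "darodar", "4webmasters"]

def build_exclude_pattern(roots):
    """Join domain roots with vertical bars and refuse strings over the limit."""
    pattern = "|".join(roots)
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError(
            f"pattern is {len(pattern)} characters; limit is {MAX_PATTERN_LENGTH}"
        )
    return pattern

pattern = build_exclude_pattern(spam_roots)

def is_excluded(source):
    """Mimic an Exclude filter on Campaign Source: drop the hit if any root appears."""
    return re.search(pattern, source, re.IGNORECASE) is not None

print(pattern, len(pattern))
for source in ["semalt.com", "semalt.net", "google", "bing"]:
    print(source, is_excluded(source))
```

Matching on the root catches semalt.net as well as semalt.com, while google and bing pass through untouched. The trade-off is that a short root can also match a legitimate source containing the same letters, so choose roots that are distinctive.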
When you see a referral from a site you don’t recognize, how can you tell if it’s a crawler or not? As I’ve said, don’t go visit the site to find out, as it could be a malicious site. Generally, however, there are some clear indicators that it is not a human looking at your site.
In most cases, each page a crawler looks at is recorded as its own session. This means that these misleading crawlers typically show 100% Bounce Rates and exactly 1.00 pages per session. They also generally show 100% New Users (that is, the number of New Users is the same as the number of Sessions). If you see referrals from sources with numbers like these, or close to them (99% Bounce Rate, 1.01 pages per session, etc.), go google the referrer. If they are spam, you’ll see plenty of search results telling you as much.
Remember, if it’s a referral for which you only have a couple of sessions, a 100% bounce rate is not so out of the ordinary. That’s more reason to google it to learn more.
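Those warning signs are easy to turn into a quick screening check over an exported referrals report. The Python sketch below uses the “close to 100%” thresholds mentioned above (99% Bounce Rate, 1.01 pages per session). The minimum of 10 sessions is my own assumption to avoid flagging tiny samples, and the rows are invented sample data.

```python
def looks_like_crawler(sessions, bounce_rate, pages_per_session, new_users,
                       min_sessions=10):
    """Flag referral sources whose numbers match the crawler pattern described above."""
    if sessions < min_sessions:
        return False  # too few sessions to judge; look it up manually instead
    return (
        bounce_rate >= 0.99
        and pages_per_session <= 1.01
        and new_users / sessions >= 0.99
    )

# Invented sample rows: (source, sessions, bounce rate, pages/session, new users)
rows = [
    ("semalt.com", 240, 1.00, 1.00, 240),
    ("partner-blog.example", 3, 1.00, 1.00, 3),
    ("google", 1800, 0.42, 3.10, 1100),
]

suspects = [source for source, *stats in rows if looks_like_crawler(*stats)]
print(suspects)  # ['semalt.com']
```

Treat anything this flags as a candidate to google, not as proof. A small legitimate referrer can produce the same numbers by chance.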
Filters vs Segments
You’ll note that above, I started talking about creating filters and/or segments, and then went into detail only on filters. It’s important to note that a filter keeps the unwanted data out of that view entirely. But it only works from the time you create it onward. It won’t affect data collected before you created the filter.
So if you want to analyze older data, you’ll have to create a segment instead. See our earlier post on creating advanced segments, and apply the same logic above.
Clean Analytics Are Critical
The smaller your site is in terms of monthly visitors, the bigger the impact will be from Referral Spam. But even if your traffic is measured in the millions, it’s smart to filter out all of the inaccurate spam you can. These two simple filters can do that for you.
What other causes of inaccurate analytics are you experiencing? Are there other spammers infiltrating your analytics not in the list above? Leave a comment and help the entire community minimize this problem!