What web bots are, how they affect e-commerce, and an extremely simple machine-learning recipe to fight them.
What are bots?
In the context of websites, bots are computer programs that “visit” a website in a way similar to that of a human using a browser. The intents of bots are varied. Helpful bots are sent by search engines to index and crawl content, or to render useful previews of the website in social networks when somebody shares your content. Unfortunately not all bots are helpful, and many of them have less than noble purposes, like taking control of your site and ransoming your data or your business.
As a website operator, why should you care about bots?
As pointed out in a CNBC article from February 2017, cybercrime accounted for 450 billion USD in losses during 2016, and bad bots are often the forerunners and vectors of cybercrime, especially when it comes to e-commerce operators.
Furthermore, bots make up a considerable fraction of the traffic of many websites, and by tying up resources that would otherwise serve your human visitors, they can force you to pay more for hosting hardware than you otherwise would need to. Just for illustration, a typical website interacts with 400 different bots every day. The number of hits those bots generate varies, but roughly speaking it is on the order of 10,000 per day.
For lightweight blogs with mostly static content, a few tens of thousands of daily bot hits are little more than an annoyance. However, for complex e-commerce shops that run hefty database workloads to render each page, traffic from bots may actually degrade loading times for human visitors. Things get more complicated when HTTP/2 enters the equation, since some of the benefits of this most recent version of the protocol, e.g. long-lived connections, are good for human users but unnecessary and detrimental when serving bots.
There is yet a third type of bot, which we call “copyright ransomware”, and which has a more unpleasant behavior. These little beasts do a thorough scan of all the images and videos in a site (which can be a considerable amount in a big e-commerce shop full of product images and videos) looking for copyright violations in the media. If they find any, the company behind the bot contacts the website owner, issuing a cease-and-desist and asking for compensation. While no serious shop owner would publish unlawful media content on her site, there is still no reason to allow these bots unwarranted roving.
The conflict between website owners and bots has been raging for years, and both parties are locked in an arms race. Bots have been getting more sophisticated, and very basic approaches to block or control them are no longer effective. We have compiled some of those approaches below, in a not-to-do list. Still, it is relatively straightforward to create your own bot firewall, and we will explain our recipe below, one that you can implement at home in your “virtual garage”.
Things not-to-do when it comes to bots
Do not block-and-forget
Bots, of course, need to access your site from some IP address, and such addresses can be blocked using a firewall. The tricky part is identifying the addresses themselves. One way is to procure lists of IP addresses for known bots and threats, with a service like “abuseipdb.com”. If you are using a list of abuse addresses from the Internet, be sure that you can trust its contents and that there is a way for you to obtain updated copies regularly. This is because IP addresses used by bots tend to change often and bots are always looking for new ways to reach their prey.
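As a minimal sketch of blocklist handling, the snippet below loads a plain-text list with one IP address per line and checks visitor addresses against it. The file format, the helper names, and the sample addresses are our assumptions for illustration, not the actual format of any particular blocklist service:

```python
import ipaddress

def load_blocklist(lines):
    """Parse a plain-text blocklist, one IP address per line.
    Malformed lines are skipped rather than crashing the firewall."""
    blocked = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        try:
            blocked.add(ipaddress.ip_address(line))
        except ValueError:
            pass  # skip garbage lines defensively
    return blocked

def is_blocked(ip, blocklist):
    """True if the visitor's address is on the blocklist."""
    return ipaddress.ip_address(ip) in blocklist

# Example: a tiny, made-up blocklist using documentation-reserved addresses.
blocklist = load_blocklist(["# abusive hosts", "203.0.113.7", "198.51.100.23"])
print(is_blocked("203.0.113.7", blocklist))  # True: listed address
print(is_blocked("192.0.2.1", blocklist))    # False: not listed
```

Because bot addresses change often, the point of the parsing step is that you can re-run it on a freshly downloaded copy of the list without touching the rest of your firewall logic.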
A second way is to examine the log messages in your site itself to identify those addresses. Our recipe will be actually based on that, but let’s not get ahead of ourselves.
Do not block beneficial bots, but do not allow them unfettered rampaging
Some bots are beneficial (web crawlers and the like) and should not be stopped. That said, beneficial bots tend to generate a lot of traffic, so it's a good idea to give preference to human visitors and to mark traffic coming from the good bots so that it doesn't confuse your analytics and reports.
Do not tamper with robots.txt
The so-called robots.txt is a file at the root of your website that can tell bots not to crawl your site. Unfortunately, compliance is voluntary on the part of the bot: good bots will follow it, and bad bots won't. The end result is that robots.txt can restrict good bots, but not bad bots. Counter-productive. Therefore, it is better to leave robots.txt alone, or not to have it at all.
Do not rely on the user-agent string
The “user-agent string” is an identification provided by browsers and other web clients to the website. It is used by browsers to declare their version, and by good bots to identify themselves. Unfortunately, it is very easy to fake, and bad bots won't miss the opportunity. In our sample, 30% of visits coming from IP addresses used by bots did not self-report as bots in the user-agent string, but instead masked themselves with a user-agent string similar to that of a normal browser.
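To see why user-agent filtering is brittle, here is the naive check one might be tempted to write (the token list and sample strings are illustrative). An honest bot is caught, but a bot that simply copies a browser's user-agent string sails right through:

```python
BOT_TOKENS = ("bot", "crawler", "spider", "curl", "wget")

def naive_is_bot(user_agent):
    """Flag a visit as a bot if the UA string contains a known bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)

# An honest bot announces itself in the UA string.
honest_bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# A bad bot can send any string it likes, e.g. a copy of a desktop browser's.
masked_bot = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")

print(naive_is_bot(honest_bot))  # True: the honest bot is caught
print(naive_is_bot(masked_bot))  # False: the masked bot slips through
```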
The two-step solution
If IP addresses, user-agent strings, and robots.txt are not to be trusted, what's left? Rather than focusing on a single piece of data, we shall combine multiple pieces of information to decide whether a visitor is a bot or a human. All in all, the solution comes in two steps: first, find a way to identify bots reliably; then, use that information to block or deprioritize the bots visiting your site. We will give below a general recipe for the first part, and leave the second part as homework or for a second blog post.
The actual method leverages machine learning, but to keep things easy to understand, we will avoid using technical lingo.
Putting visitor data in a table
Bots usually don’t behave like humans, which is why we want to identify bots by their behavior. We understand behavior better by looking at several independent characteristics of visits. Let’s call these characteristics “features”, borrowing the term from the discipline of machine learning. Depending on your software stack, there are many features you can use to identify bots; we have dozens of them in the ShimmerCat stack. Some features are better than others, though, and with careful selection, two to three features can identify a bot with 95% confidence.
Features are divided into “learning features” and “high-confidence features”. High-confidence features allow you to identify some visitors with high confidence as either humans or bots. For example, the Google bot identifies itself as a bot, and so do many other bots. On the other hand, humans are the only ones with credit cards placing orders in your system.
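A high-confidence labeler can be sketched as below, assuming each visit record carries a user-agent string and a flag for whether the visitor placed an order (the field names and bot tokens are our invention for illustration):

```python
def high_confidence_label(visit):
    """Return 'bot', 'human', or None when neither rule fires."""
    ua = visit.get("user_agent", "").lower()
    if "bot" in ua or "crawler" in ua or "spider" in ua:
        return "bot"    # self-identified bots, e.g. Googlebot
    if visit.get("placed_order"):
        return "human"  # only humans place orders
    return None         # undecided: left for the trained classifier

visits = [
    {"user_agent": "Googlebot/2.1",  "placed_order": False},
    {"user_agent": "Mozilla/5.0 ...", "placed_order": True},
    {"user_agent": "Mozilla/5.0 ...", "placed_order": False},
]
print([high_confidence_label(v) for v in visits])  # ['bot', 'human', None]
```

Note that the third visit stays unlabeled: that undecided majority is exactly what the learning features and the classifier are for.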
In the Venn diagram, we have represented the classifications that high-confidence features allow us to make. The caveat is that high-confidence features only decide for a small subset of the visitors: many bots don’t identify themselves as bots, and many human visitors never actually place an order in the system. Still, the high-confidence features can be used to learn about how bots and humans behave with regard to other features, which we conveniently call “learning features”. Just apply the high-confidence features to filter your visitors of, say, last week; for the visitors that you can definitively classify as either human or bot, collect their learning features as well.
Examples of learning features are fragments of the requested URL, the user-agent string, the reported browser language, protocol fingerprints, the variance of the inter-request time, the origin country of the visitor, etc. This is the part where you are better off making your own pick and keeping it secret, since you don’t want bots to adapt to your defense mechanisms.
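Extracting learning features from a visit record might look like the sketch below. The specific features, field names, and thresholds are illustrative only; in practice you would make your own selection and keep it private:

```python
from urllib.parse import urlsplit
import statistics

def learning_features(visit):
    """Turn a raw visit record into a small feature dictionary."""
    path = urlsplit(visit["url"]).path
    gaps = visit["inter_request_seconds"]
    return {
        "path_prefix": "/".join(path.split("/")[:2]),  # URL fragment
        "ua_length": len(visit["user_agent"]),         # crude UA-string feature
        "language": visit.get("accept_language", ""),  # reported browser language
        "gap_variance": statistics.pvariance(gaps) if len(gaps) > 1 else 0.0,
        "country": visit.get("country", "??"),         # origin country
    }

visit = {
    "url": "https://shop.example/products/shoes?page=3",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "accept_language": "en-US",
    "inter_request_seconds": [0.5, 0.5, 0.5],  # suspiciously regular timing
    "country": "SE",
}
print(learning_features(visit))
```

A zero variance of the inter-request time, as in the sample visit, is a typical machine-like signal: humans rarely click at perfectly regular intervals.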
Once you have decided on the learning features, you just need to train an ML classifier on them, using the high-confidence classifications for training and validating the classifier. This last part is very standard routine, and almost any of the hundreds of readily available classifiers will do.