How to solve a problem like scraping

It’s been a while since I last blogged, but that doesn’t mean I’ve disappeared. See it as me being deep in thought.

I work for a large web site operating in EMEA that has lots of invaluable data available to the public. This is great, but other people want to take that data wholesale without going through the proper authorised channels. This is known as scraping – effectively “Site Content Raping” to coin a not so nice phrase.

Scraping is very easy to do. There are tools out there that, in a few clicks, will spider your site and download the content. After all, the data is public and the hyperlinks are designed to take you through it. The web search engine bots effectively scrape our site too, but the difference is that they report back the links. A scraper, by contrast, downloads the data itself, and that is what causes the legal issues. In truth, scraping is a legal issue, but the legal route to stopping it is hit by two problems: it’s a lengthy process, and it needs evidence of the scraping activity to show that it’s breaking the terms and conditions of your site.

The problem with scraping is being able to identify it in the first place. Some scrapes are relatively benign and easy to spot. Ironically, they’re not usually an issue unless the lazy way they’ve been implemented causes site capacity issues, and well designed site infrastructure should be able to cope with any surge in demand however it is presented. Most scraping activity remains under the radar though. Spotting the trail involves understanding how the site can be scraped in the first place and the methods used to evade detection. The hardest challenge of all is distinguishing scraping from the millions of legitimate requests hitting the site at the same time.

There are a number of ways to tackle the problem:

– Employ a third party to monitor and report on the scraping activity in real-time on your behalf, paid for as a monthly operational expense
– Implement ingress filtering of your data to report on activities in real-time using equipment maintained and set up by teams internally
– Implement log analysis after the event

I’m looking at log analysis to tackle the issue. This involves processing large data sets using Hadoop and custom scripts to slice and dice the information, with the aim of producing reports that provide evidence of scraping activity.
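To give a flavour of the kind of slicing involved, here is a minimal sketch in Python of a map and reduce step that counts requests per client per hour and lists the heaviest clients. It is an illustration only, not the scripts I actually run. It assumes the web logs are in the common Apache/NCSA log format, and the function names and the threshold are made up for the example. The steps are written as plain functions so they can be tried locally on a handful of lines; in a Hadoop job the same logic would be split between a mapper and a reducer.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Key = Tuple[str, str]  # (client host, hour bucket)

# Assumption: Common Log Format, e.g.
# 203.0.113.5 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+'
)


def parse_line(line: str) -> Optional[Key]:
    """Return (host, hour bucket) for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Keep only the date and hour part of the timestamp: 10/Oct/2000:13
    return match.group("host"), match.group("time")[:14]


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((host, hour), 1) for every line that parses."""
    for line in lines:
        key = parse_line(line)
        if key is not None:
            yield key, 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reduce step: sum the counts for each (host, hour) key."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def heavy_clients(totals: Dict[Key, int], threshold: int) -> List[Key]:
    """Return the keys whose request count reaches the (hypothetical) threshold."""
    return sorted(key for key, total in totals.items() if total >= threshold)


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
        '203.0.113.5 - - [10/Oct/2000:13:55:37 -0700] "GET /b HTTP/1.0" 200 1200',
        '198.51.100.7 - - [10/Oct/2000:13:56:01 -0700] "GET /a HTTP/1.0" 200 2326',
    ]
    totals = reduce_counts(map_lines(sample))
    print(heavy_clients(totals, threshold=2))
```

A raw request count like this only catches the lazy scrapes. Anything that spreads its requests across many addresses or over a long period will need other keys and other thresholds, which is where the real work lies.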

Over the next few months I aim to track my successes and failures in combating this problem.