Detecting Scraping: Apache, Squid using Hadoop and Pig

Architecture

Squid operates at the front end as a reverse proxy-cache behind a hardware load balancer, and in turn gets its content from a large load-balanced pool of Apache (with JBoss) servers.  The Squid logs contain some useful information (the IP of the remote client) and the Apache logs contain some equally important information (User-Agent, session ID).

The first challenge: Stitching Squid logs to Apache

The Squid logs are in Squid’s native log format, which is optimised for speed.  The Squid logs rotate hourly and the Apache logs rotate daily.  This arrangement fits with front-end speed and the convenience of rotating logs at one of the quietest times of the day (4am).

The Apache logs are in a custom log format that includes response time and cookie information:

CustomLog   logs/access.log "%h %D %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

To stitch the two together, the first thing to realise is that you only need to look at Squid MISSes – logically, these are the ones that had to go to Apache to get content.

The only identifying pieces of information common to both Squid and Apache are the timestamp and the URL, so immediately there is room for improvement here, and a small percentage of hits don’t get stitched together.  This is fine, as we’re after a picture of potential scrapers rather than accurate accounting information.

So the first Pig script does a merge of the data based on timestamp and URL.  This takes a fair amount of time, and in the end a log file is produced with a lot of useful information in it – external IP, user-agent, cookie information, etc.  It does this by streaming the data through some Perl, as we need to convert the Unix timestamp format in the Squid log to the same format as Apache.  I end up with the following:

03/Oct/2010:04:02:14	44	xxx.xxx.xxx.xxx TCP_MISS/200 365	GET	/static/js/bundle.js	-	172.22.25.155	54905	03/Oct/2010:04:02:14	GET	/static/js/bundle.js	1.1	200	289939	http://mysite.co.uk/some/resource Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; InfoPath.1)	34AA0D711456868AAF7C871F8190C72F	1243622207600.672376.715	LU11LY

Note the final three fields.  These are a session ID, a user ID and a postcode [something specific to this particular site].  We can extract whatever meaningful information we need from the log entry as we stream the data through the Perl script.
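
To give a flavour of the approach, here is a minimal Pig sketch of the stitching step – the file paths, field layouts and the two Perl helper names are placeholder assumptions rather than the production script:

-- Hypothetical Perl helpers: one normalises the Squid log (and converts the
-- epoch timestamp to Apache's dd/MMM/yyyy:HH:mm:ss format), the other splits
-- the Apache custom log and pulls session id, user id and postcode from the cookie.
DEFINE normalise_squid  `perl normalise_squid.pl`  SHIP('normalise_squid.pl');
DEFINE normalise_apache `perl normalise_apache.pl` SHIP('normalise_apache.pl');

-- Load the raw log lines
squid_raw  = LOAD '/logs/squid/access.log'  USING TextLoader() AS (line:chararray);
apache_raw = LOAD '/logs/apache/access.log' USING TextLoader() AS (line:chararray);

-- Stream each log through its helper to get tab-separated, typed fields
squid = STREAM squid_raw THROUGH normalise_squid
        AS (ts:chararray, elapsed:int, client_ip:chararray, code_status:chararray,
            bytes:long, method:chararray, url:chararray);

apache = STREAM apache_raw THROUGH normalise_apache
         AS (ts:chararray, method:chararray, url:chararray, status:int, bytes:long,
             referer:chararray, user_agent:chararray,
             session_id:chararray, user_id:chararray, postcode:chararray);

-- Only Squid MISSes ever reached Apache, so only they can be stitched
misses = FILTER squid BY code_status MATCHES '.*MISS.*';

-- Stitch the two sources on the only keys they share: timestamp and URL
stitched = JOIN misses BY (ts, url), apache BY (ts, url);

STORE stitched INTO '/output/stitched';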

The crux of the challenge: What is scraping?

There are a number of issues with identifying scraping. Some scrapes are very easy to spot – sequentially increasing page numbers, single IPs, the same identifiable information used in the search strings. The more complex ones mix IP addresses with different User-Agent strings. Even a single IP address with multiple User-Agents poses a problem: mega-proxies, such as AOL’s, funnel many, many genuine home users through the same addresses.

The first steps: Grouping IP with Session ID and User-Agent

The first step towards identifying the slightly less basic scrapers (ones with multiple IPs and user-agents) was to group the IP with Session ID and User-Agent.  This is then counted and ordered by the count to show how often that combination from that IP is seen… notice we don’t need to see the URL at all.  The assumption is that a high count for such a combination is unusual and worthy of further investigation.
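
Carrying on from the stitched relation in the sketch above, the grouping and ranking might look something like this (again a sketch rather than the production script):

-- Group the stitched records by client IP, session id and user-agent
by_combo = GROUP stitched BY (misses::client_ip, apache::session_id, apache::user_agent);

-- Count each combination and rank by the count, highest first
counted = FOREACH by_combo GENERATE
              FLATTEN(group) AS (client_ip, session_id, user_agent),
              COUNT(stitched) AS hits_seen;
ranked  = ORDER counted BY hits_seen DESC;

STORE ranked INTO '/output/ip_session_ua_counts';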

This job takes a few hours to run on a relatively small amount of data on the current PoC Hadoop cluster, so there is room for optimisation here.

Note that this script highlights the bots too – which, by the same token, show up as a high number of hits from single or multiple IPs…

The result is something like the following:

I’ll post the work-in-progress code here to share the findings and work on a way of improving this.

How to solve a problem like scraping

It’s been a while since I last blogged, but that doesn’t mean I’ve disappeared. See it as me being deep in thought.

I work for a large web site operating in EMEA that has lots of invaluable data available to the public. This is great, but other people want to take that data wholesale without going through the proper authorised channels. This is known as scraping – effectively “Site Content Raping”, to coin a not-so-nice phrase.

Scraping is very easy to do. There are tools out there that, in a few clicks, will spider your site and download the content – after all, the data is public and the hyperlinks are designed to take you through it. The web search engine bots effectively scrape our site, but the difference is that they report back the links. Scraping involves downloading the data itself, and that is what causes legal issues. In truth, scraping is a legal matter – but the legal route to stopping it is hampered by two things: it’s a lengthy process, and it needs evidence of the scraping activity to show it breaks the terms and conditions of your site.

The problem with scraping is being able to identify it in the first place. Some scrapes are relatively benign and easy to spot. Ironically, they’re not usually an issue unless a lazily implemented scrape causes site capacity problems – and well-designed site infrastructure should be able to cope with any surge in demand, however it is presented. Most scraping activity stays under the radar, though, and spotting the trail means understanding how the site can be scraped in the first place and the methods used to evade detection. The hardest challenge of all is distinguishing this from the millions of legitimate requests hitting the site at the same time.

There are a number of ways to tackle the problem:

– Employ a 3rd party to monitor and report on scraping activity in real time on your behalf, as a monthly operational expense
– Implement ingress filtering of your data to report on activity in real time, using equipment set up and maintained by internal teams
– Implement log analysis after the event

I’m looking at log analysis to tackle the issue. This involves processing large data sets with Hadoop, plus custom scripts to slice and dice the information, to help form the conclusions that will go into the reports evidencing the scraping activity.

Over the next few months I aim to track my successes and failures in combating this problem.