Detecting Scraping: Apache, Squid using Hadoop and Pig

Architecture

Squid operates at the front end as a reverse proxy cache behind a hardware load balancer, which in turn gets its content from a large load-balanced pool of Apache (with JBoss) servers.  The Squid logs contain some useful information (the IP of the remote client) and the Apache logs contain some equally important information (the User-Agent and Session ID).

The first challenge: Stitching Squid logs to Apache

The Squid logs are in Squid’s native log format, which is optimised for speed.  The Squid logs rotate hourly and the Apache logs rotate daily.  This arrangement suits front-end speed and the convenience of rotating logs at one of the quietest times of the day (4am).

The Apache logs use a custom log format that includes response time and cookie information:

CustomLog   logs/access.log "%h %D %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
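
As a rough illustration of getting these lines into Pig, a sketch along the following lines would pull out just the two fields needed later for the stitch (this assumes a Pig version with the REGEX_EXTRACT built-in; the HDFS path, relation names and regexes are my assumptions, not the production script):

-- The quoted fields mean a simple space split is unreliable, so extract only the
-- timestamp (without timezone) and the request URL for the stitch.
apache_raw  = LOAD '/logs/apache/access.log' USING TextLoader() AS (line:chararray);
apache_keys = FOREACH apache_raw GENERATE
                REGEX_EXTRACT(line, '\\[([^ \\]]+)', 1)     AS ts:chararray,
                REGEX_EXTRACT(line, '"\\S+ (\\S+) HTTP', 1) AS url:chararray,
                line;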

To stitch the two together, the first thing to realise is that you only need to look at the Squid MISSes – logically, these are the requests that had to go to Apache to fetch content.
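
A sketch of that filtering step (again, the HDFS path and field names are assumptions):

squid_raw  = LOAD '/logs/squid/access.log' USING TextLoader() AS (line:chararray);
-- Only the MISSes were answered by Apache, so only they can be stitched to Apache entries.
squid_miss = FILTER squid_raw BY line MATCHES '.*TCP_MISS.*';
-- Squid's native format pads fields with spaces, so split on runs of whitespace.
squid_flds = FOREACH squid_miss GENERATE FLATTEN(STRSPLIT(line, '\\s+'))
               AS (ts:chararray, elapsed:chararray, client_ip:chararray, action:chararray,
                   bytes:chararray, method:chararray, url:chararray, ident:chararray,
                   hierarchy:chararray, mimetype:chararray);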

The only identifying pieces of information common to both Squid and Apache are the timestamp and the URL, so there is immediately room for improvement here, and a small percentage of hits don’t get stitched together.  This is fine, as we’re after a picture of potential scrapers rather than accurate accounting information.

So the first Pig script merges the data based on timestamp and URL.  This takes a fair amount of time, and in the end it produces a log file with a lot of useful information in it: external IP, user-agent, cookie information, etc.  The data is streamed through some Perl along the way, because the Unix timestamp format in the Squid log has to be converted to the same format as Apache’s.  I end up with the following:

03/Oct/2010:04:02:14	44	xxx.xxx.xxx.xxx TCP_MISS/200 365	GET	/static/js/bundle.js	-	172.22.25.155	54905	03/Oct/2010:04:02:14	GET	/static/js/bundle.js	1.1	200	289939	http://mysite.co.uk/some/resource Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; InfoPath.1)	34AA0D711456868AAF7C871F8190C72F	1243622207600.672376.715	LU11LY

Note the final three fields.  These are a session ID, a user ID and a postcode [something specific to this particular site].  We can extract any meaningful information from the log entry when we stream the data through the Perl script.
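
Continuing the sketches above (so still with hypothetical relation names, and with to_apache_ts.pl standing in for the real Perl helper that rewrites Squid’s Unix epoch timestamp into Apache’s %d/%b/%Y:%H:%M:%S form), the stitch itself might look roughly like this:

DEFINE to_apache_ts `perl to_apache_ts.pl` SHIP('to_apache_ts.pl');

-- Rewrite the Squid timestamp so it matches Apache's, keeping the fields we care about.
squid_norm = STREAM squid_flds THROUGH to_apache_ts
               AS (ts:chararray, elapsed:chararray, client_ip:chararray,
                   action:chararray, bytes:chararray, method:chararray, url:chararray);

-- Join on the only fields the two logs share. Requests that land either side of a
-- second boundary simply don't stitch, which is acceptable for this purpose.
stitched = JOIN squid_norm BY (ts, url), apache_keys BY (ts, url);
STORE stitched INTO '/output/stitched';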

The crux of the challenge: What is scraping?

There are a number of issues with identifying scraping. Some scrapes are very easy to spot: sequentially increasing page numbers, single IPs, the same identifiable information used in the search strings. The more complex ones mix IP addresses with different User-Agent strings, and even a single IP address with multiple User-Agents poses a problem: mega-proxies (AOL, etc.) through which many, many genuine home users are funnelled.

The first steps: Grouping IP, with Session ID and User-Agent

The first step towards identifying the slightly less basic scrapers (ones with multiple IPs and user-agents) was to group the IP with the Session ID and User-Agent.  Each combination is then counted and ordered by the count to show how often that combination from that IP is seen… notice that we don’t need to see the URL.  The assumption is that a high count for such a combination is unusual and worthy of further investigation.
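
A sketch of that grouping step (the column order of the stitched file, and the paths, are assumptions on my part):

stitched = LOAD '/output/stitched' USING PigStorage('\t')
             AS (ts:chararray, client_ip:chararray, url:chararray,
                 user_agent:chararray, session_id:chararray);
-- Count every (IP, Session ID, User-Agent) combination and rank by frequency;
-- no URL is needed at this stage.
by_combo = GROUP stitched BY (client_ip, session_id, user_agent);
counted  = FOREACH by_combo GENERATE
             FLATTEN(group) AS (client_ip, session_id, user_agent),
             COUNT(stitched) AS hits;
ranked   = ORDER counted BY hits DESC;
STORE ranked INTO '/output/scrape-candidates';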

This job takes a few hours to run on a relatively small amount of data on the current PoC Hadoop cluster, so there is room for optimisation here.

Note that this script highlights the bots too, which by the same token show up as a high number of hits from single or multiple IPs…

The result is a list of IP / Session ID / User-Agent combinations ranked by how many hits each has made.

I’ll post the work-in-progress code here to share the findings and work on a way of improving this.

Hadoop, Pig, Apache and Squid Log Processing

I’ve been experimenting with Hadoop to help process the gigabytes of logs generated by Apache and Squid where I work. Currently we have a very small proof-of-concept cluster of five nodes that churns through Squid logs hourly to produce GnuPlot graphs of traffic over the last hour and the last day.
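
For flavour, that hourly graphing job boils down to something like the following sketch; the paths, field handling and per-minute bucketing are my assumptions rather than the actual script:

raw     = LOAD '/logs/squid/access.log' USING TextLoader() AS (line:chararray);
fields  = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, '\\s+'))
            AS (ts:chararray, elapsed:chararray, client_ip:chararray, action:chararray,
                bytes:chararray, method:chararray, url:chararray, ident:chararray,
                hierarchy:chararray, mimetype:chararray);
-- Squid's native timestamp is Unix seconds (with milliseconds); truncate to the minute.
per_min = FOREACH fields GENERATE (long)((double)ts / 60.0) * 60 AS minute;
grouped = GROUP per_min BY minute;
counts  = FOREACH grouped GENERATE group AS minute, COUNT(per_min) AS hits;
-- The (minute, hits) pairs feed straight into the GnuPlot graphs.
STORE counts INTO '/output/traffic-per-minute';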

I’ll not cover Hadoop in any detail (there are many places to look for that, for example Yahoo! and Cloudera), but I will document the scripts used to get Hadoop processing Squid and Apache logs here.